
Conversation

@RobertAgee (Contributor) commented Apr 29, 2025

Title

40% Less VRAM Usage, Slightly Faster Inference, and Improved Voice Consistency – Optimized Text Autochunking

Important

Changes

  • ➕ Added smart autochunking of user text input based on effective character length (~48–64 chars); see the sketch below this list
  • Supports Tensor Core acceleration:
    • Autochunking ensures sequence lengths are hardware-aligned (~64 tokens)
    • Batching improves GPU occupancy and compute efficiency
  • 🧠 Refactored model.generate() and _prepare_generation():
    • Now accepts audio_prompt_text as a clean, separate parameter
    • Internally joins it with input text to match previous behavior but with cleaner architecture
  • ➕ Autocasts generation to float16 (uses torch.cuda.amp.autocast)
  • 🧹 Includes internal memory cleanup after generation for more stable long-running usage
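
For reviewers, here is a minimal, hypothetical sketch of how autochunking, float16 autocast, and post-generation cleanup could fit together. The `autochunk` helper, the chunk-size constant, and the exact `model.generate()` signature are illustrative assumptions based on the bullets above, not the code in this PR:

```python
import gc

import torch

# Hypothetical sketch: `autochunk`, MAX_CHARS, and the `model.generate()`
# signature are illustrative assumptions, not this PR's actual code.
MAX_CHARS = 64  # effective character budget per chunk (~48-64)


def autochunk(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole words into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_chunked(model, text, audio_prompt=None, audio_prompt_text=""):
    """Generate each chunk under float16 autocast and clean up between chunks."""
    outputs = []
    for chunk in autochunk(text):
        # float16 autocast shrinks activations, cutting VRAM on T4-class GPUs
        with torch.cuda.amp.autocast(dtype=torch.float16):
            outputs.append(
                model.generate(
                    chunk,
                    audio_prompt=audio_prompt,
                    audio_prompt_text=audio_prompt_text,
                )
            )
        # release cached allocations so long-running sessions stay stable
        gc.collect()
        torch.cuda.empty_cache()
    return outputs
```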

Why

  • ~40% less VRAM usage (~4GB vs ~7GB) on T4 GPUs.
  • Slightly faster inference (~0.3 RTF vs ~0.33 RTF) due to smarter batching and Tensor Core activation.
  • Improved voice consistency when using audio prompts, even across multiple chunks.
  • Cleaner internal design (separated audio prompt vs user text, easier future upgrades).

Notes


Thanks for reviewing this PR! 🚀

@jaehong21 added the enhancement (New feature or refactor) label on Apr 29, 2025
@znraznra left a comment

This works well in my case.

@buttercrab (Contributor) left a comment

Can you fix the lint & formatting?

@journeytosilius

I get really weird results when using expressions with symbols like $140 ... is this common, and what are the guidelines to get it right?

@RobertAgee (Contributor, Author) commented May 4, 2025

Hey @buttercrab, the linting and formatting should now be fixed. I also went ahead and tested the merge, and everything seems to work.

A couple notes from testing the merge:
1. Good news: there are far fewer weird gaps of silence in the generations 🚀

2. Voice consistency still needs to be improved. 🔧

  • Currently, each chunk of audio references the uploaded audio+text, with the hope of capturing the voice. I've also tried referencing each previous chunk+text as it generates, but that causes it to gradually morph away from the speaker. It is possible to tune the settings to mitigate this to a small degree, but it's not a workable solution. I have a couple of other ideas I want to try, like inserting speaker tags, but I assume simply finetuning a model will be the best and easiest option regardless. (A small sketch of the two prompting strategies follows this list.)

3. Model is much slower now? 🐢

  • While we have preserved the VRAM savings (~4.2GB baseline, 6.9GB peak on a T4), it seems the other PR changes over the last couple of days have made the generations much slower, anywhere from 10-30%, especially on longer audio. Specifically, I'm talking about the generations after the initial compilation. Previously, the rate was consistent for all lengths of generated audio.
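
As a rough, hypothetical illustration of the two prompting strategies mentioned above (the `model.generate()` signature and helpers are assumptions, not this PR's code):

```python
# Hypothetical sketch of the two chunk-prompting strategies.
def generate_fixed_prompt(model, chunks, ref_audio, ref_text):
    """Every chunk conditions on the original uploaded reference (more stable)."""
    return [
        model.generate(c, audio_prompt=ref_audio, audio_prompt_text=ref_text)
        for c in chunks
    ]


def generate_rolling_prompt(model, chunks, ref_audio, ref_text):
    """Each chunk conditions on the previous chunk's output (tends to drift)."""
    outputs, prompt_audio, prompt_text = [], ref_audio, ref_text
    for c in chunks:
        audio = model.generate(c, audio_prompt=prompt_audio, audio_prompt_text=prompt_text)
        outputs.append(audio)
        # rolling the prompt forward gradually morphs the voice away from the speaker
        prompt_audio, prompt_text = audio, c
    return outputs
```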

@harmlessman

1.	How long did it take to generate around 10 seconds of audio from text?
2.	Did you implement streaming functionality for the TTS output?

@RobertAgee (Contributor, Author)

1.	How long did it take to generate around 10 seconds of audio from text?
2.	Did you implement streaming functionality for the TTS output?

@harmlessman Typically, PRs are not the place to ask these questions, just FYI. However, the information may be useful to @buttercrab.

  1. It depends on your GPU and setup. For reference, for a 10 s audio generation on a T4, the original Dia repo took ~35 s on average, the new compile Dia takes ~40 s on average, and my original PR took ~30 s on average.

  2. Streaming outputs are not currently supported. A much better card (or several) would be needed to generate faster than real time. We could implement buffered outputs (a rough sketch follows), but there would still be some initial gap between the prompt and the first audio, depending on chunk size.
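
As a rough illustration of what buffered output could look like (hypothetical; `autochunk` and `model.generate()` are the same assumed helpers as in the sketch in the PR description):

```python
def generate_buffered(model, text, **generate_kwargs):
    """Yield audio for each text chunk as soon as it is generated (hypothetical sketch)."""
    for chunk in autochunk(text):
        yield model.generate(chunk, **generate_kwargs)


# Playback could begin after the first chunk finishes, so the prompt-to-audio
# gap is roughly one chunk's generation time instead of the whole utterance's.
```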

@buttercrab (Contributor) left a comment

Can you fix the lint error?

@V12Hero (Contributor) commented May 13, 2025

@RobertAgee to get around the lint error, just use ruff like:
ruff check path/to/file --fix --unsafe-fixes
and
ruff format path/to/file

It should help you get around any formatting errors.
