feat: TTS with RealtimeModel #781
base: main
Conversation
🦋 Changeset detected. Latest commit: f962dd4. The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages.
`output_modalities` (array): The set of modalities the model can respond with. It defaults to `["audio"]`, indicating that the model will respond with audio plus a transcript. `["text"]` can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
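To illustrate the quoted docs, here is a minimal sketch of a session payload selecting the text-only modality. Only `output_modalities` comes from the quote above; the surrounding field names (`type`, `session`) are assumptions about the event shape, not a confirmed API.

```typescript
// Hypothetical session-update payload; only `output_modalities` is taken
// from the quoted documentation, the rest of the shape is an assumption.
const sessionUpdate = {
  type: 'session.update',
  session: {
    // default is ['audio'] (audio plus a transcript);
    // ['text'] makes the model respond with text only.
    // Requesting both text and audio at once is not supported.
    output_modalities: ['text'] as const,
  },
};
```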
I added some missing events that are required for the text modality; I finally get a valid text response from the LLM.
Force-pushed from d148422 to 1f170bf
```typescript
// Determine if we need to tee the text stream for both text output and TTS
const needsTextOutput = !!textOutput && !!trNodeResult;
const needsTTSSynthesis =
  audioOutput &&
  this.llm instanceof RealtimeModel &&
  !this.llm.capabilities.audioOutput &&
  this.tts;
const needsBothTextAndTTS = needsTextOutput && needsTTSSynthesis;

// Tee the stream if we need it for both purposes
let textStreamForOutput = trNodeResult;
let textStreamForTTS = trNodeResult;
if (needsBothTextAndTTS && trNodeResult) {
  const [stream1, stream2] = trNodeResult.tee();
  textStreamForOutput = stream1;
  textStreamForTTS = stream2;
}
```
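For context on the tee pattern above: a standard `ReadableStream` allows only one reader at a time, so a single text stream cannot feed both the text output and TTS directly. `tee()` splits it into two branches that each see the full data. The sketch below demonstrates this with the standard Web Streams API; the stream contents and helper names are illustrative, not part of the agents framework.

```typescript
// Drain a ReadableStream of strings into a single string.
async function collect(stream: ReadableStream<string>): Promise<string> {
  const reader = stream.getReader();
  let out = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    out += value;
  }
  reader.releaseLock();
  return out;
}

// A stand-in for the transcription node's text stream.
const source = new ReadableStream<string>({
  start(controller) {
    for (const chunk of ['Hello', ', ', 'world']) controller.enqueue(chunk);
    controller.close();
  },
});

// Both branches independently receive every chunk.
const [forTextOutput, forTTS] = source.tee();
const [a, b] = await Promise.all([collect(forTextOutput), collect(forTTS)]);
// a and b are both 'Hello, world'
```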
Let's make sure the implementation mirrors the Python implementation when doing this refactoring: https://github.com/livekit/agents/blob/a9bc03562f498f3666978ad008fc93b2cbbd22a9/livekit-agents/livekit/agents/voice/agent_activity.py#L2011-L2102
```typescript
let textStreamForOutput = trNodeResult;
let textStreamForTTS = trNodeResult;
if (needsBothTextAndTTS && trNodeResult) {
  const [stream1, stream2] = trNodeResult.tee();
```
We need to be super careful whenever doing a tee operation. It will lock the source stream until both tee'd streams have finished or been cancelled. In case of an interruption, we would have to release the reader lock on both sub-streams to release resources; we cannot simply cancel `trNodeResult`.
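The reviewer's point can be demonstrated with the standard Web Streams API: after `tee()`, the source stream is locked, so cancelling it directly rejects with a `TypeError`; cleanup on interruption has to go through the two branches instead. This is a hedged sketch with illustrative names, not the agent framework's actual interruption path.

```typescript
// A stand-in for an in-flight LLM text stream (left open on purpose).
const source = new ReadableStream<string>({
  start(controller) {
    controller.enqueue('partial response...');
  },
});

const [branchA, branchB] = source.tee();

// tee() has locked `source`; cancelling it directly now fails.
let cancelOnLockedSourceFailed = false;
try {
  await source.cancel('interrupted');
} catch {
  cancelOnLockedSourceFailed = true;
}

// Correct cleanup on interruption: cancel each tee branch, which
// releases the internal reader on the source once both are done.
await Promise.all([
  branchA.cancel('interrupted'),
  branchB.cancel('interrupted'),
]);
```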
Added some comments
Description
fix #772
Changes Made
For the last part (TTS forwarding), I needed to copy the stream; I'm not sure if this is the best way. It works, and I only do it in this mode; otherwise I had issues with text and TTS stream forwarding. But I couldn't see any of this in the Python library; maybe a Python stream can be consumed more than once?
Pre-Review Checklist
Testing
restaurant_agent.ts and realtime_agent.ts work properly (for major changes)
Additional Notes
I only tested it with the OpenAI Realtime Model; the Gemini model is definitely missing some parts. We should also check whether we need a backwards-compatibility mode, because right now I check for the new flag audioOutput (which I added in google and openai). But I'm not sure whether we are missing it somewhere else.
Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.