feat: TTS with RealtimeModel #781
base: main
Conversation
🦋 Changeset detected. Latest commit: f962dd4. The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages.
`output_modalities` (array): The set of modalities the model can respond with. It defaults to `["audio"]`, indicating that the model will respond with audio plus a transcript. `["text"]` can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
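To illustrate the quoted docs, here is a minimal sketch of a session payload selecting the text-only modality. Only `output_modalities` comes from the quote above; the surrounding field names (`type`, `session`) are assumptions about the event shape, not a confirmed API.

```typescript
// Hypothetical session-update payload; only `output_modalities` is taken
// from the quoted documentation, the rest of the shape is an assumption.
const sessionUpdate = {
  type: 'session.update',
  session: {
    // default is ['audio'] (audio plus a transcript);
    // ['text'] makes the model respond with text only.
    // Requesting both text and audio at once is not supported.
    output_modalities: ['text'] as const,
  },
};
```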
I added some missing events that are required for the text modality; I finally get a valid text response from the LLM.
Force-pushed from d148422 to 1f170bf
```typescript
// Determine if we need to tee the text stream for both text output and TTS
const needsTextOutput = !!textOutput && !!trNodeResult;
const needsTTSSynthesis =
  audioOutput &&
  this.llm instanceof RealtimeModel &&
  !this.llm.capabilities.audioOutput &&
  this.tts;
const needsBothTextAndTTS = needsTextOutput && needsTTSSynthesis;

// Tee the stream if we need it for both purposes
let textStreamForOutput = trNodeResult;
let textStreamForTTS = trNodeResult;
if (needsBothTextAndTTS && trNodeResult) {
  const [stream1, stream2] = trNodeResult.tee();
  textStreamForOutput = stream1;
  textStreamForTTS = stream2;
}
```
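For context on the tee pattern above: a standard `ReadableStream` allows only one reader at a time, so a single text stream cannot feed both the text output and TTS directly. `tee()` splits it into two branches that each see the full data. The sketch below demonstrates this with the standard Web Streams API; the stream contents and helper names are illustrative, not part of the agents framework.

```typescript
// Drain a ReadableStream of strings into a single string.
async function collect(stream: ReadableStream<string>): Promise<string> {
  const reader = stream.getReader();
  let out = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    out += value;
  }
  reader.releaseLock();
  return out;
}

// A stand-in for the transcription node's text stream.
const source = new ReadableStream<string>({
  start(controller) {
    for (const chunk of ['Hello', ', ', 'world']) controller.enqueue(chunk);
    controller.close();
  },
});

// Both branches independently receive every chunk.
const [forTextOutput, forTTS] = source.tee();
const [a, b] = await Promise.all([collect(forTextOutput), collect(forTTS)]);
// a and b are both 'Hello, world'
```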
Let's make sure the implementation mirrors the Python implementation when doing this refactoring: https://github.com/livekit/agents/blob/a9bc03562f498f3666978ad008fc93b2cbbd22a9/livekit-agents/livekit/agents/voice/agent_activity.py#L2011-L2102
```typescript
let textStreamForOutput = trNodeResult;
let textStreamForTTS = trNodeResult;
if (needsBothTextAndTTS && trNodeResult) {
  const [stream1, stream2] = trNodeResult.tee();
```
We need to be super careful whenever doing a tee operation. It will lock the source stream until both tee'd streams have finished or been cancelled. In case of an interruption, we would have to release the reader lock on both sub-streams to release resources; we cannot simply cancel `trNodeResult`.
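The reviewer's point can be demonstrated with the standard Web Streams API: after `tee()`, the source stream is locked, so cancelling it directly rejects with a `TypeError`; cleanup on interruption has to go through the two branches instead. This is a hedged sketch with illustrative names, not the agent framework's actual interruption path.

```typescript
// A stand-in for an in-flight LLM text stream (left open on purpose).
const source = new ReadableStream<string>({
  start(controller) {
    controller.enqueue('partial response...');
  },
});

const [branchA, branchB] = source.tee();

// tee() has locked `source`; cancelling it directly now fails.
let cancelOnLockedSourceFailed = false;
try {
  await source.cancel('interrupted');
} catch {
  cancelOnLockedSourceFailed = true;
}

// Correct cleanup on interruption: cancel each tee branch, which
// releases the internal reader on the source once both are done.
await Promise.all([
  branchA.cancel('interrupted'),
  branchB.cancel('interrupted'),
]);
```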
Added some comments
Description
fix #772
Changes Made
For the last part (TTS forwarding), I needed to copy the stream; I'm not sure if this is the best way. It works, and I only do it in this mode; otherwise I had issues with text and TTS stream forwarding. But I couldn't see any of this in the Python library; maybe a Python stream can be consumed more than once?
Pre-Review Checklist
Testing
restaurant_agent.ts and realtime_agent.ts work properly (for major changes)
Additional Notes
I only tested it with the OpenAI Realtime Model; the Gemini model is definitely missing some parts. We should also check whether we need a backwards-compatibility mode, because right now I check for the new flag audioOutput (which I added in google and openai). But I'm not sure whether we are missing it somewhere else.
Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.