
Conversation


@simllll simllll commented Oct 22, 2025

Description

fix #772

Changes Made

  • Added an audioOutput capability flag, based on the Python library.
  • Added/fixed the modalities parameter in OpenAI.
  • Implemented missing (text) events in the OpenAI realtime implementation.
  • Implemented TTS forwarding in the agent for non-realtime audio output (TTS output).

For the last part (TTS forwarding), I needed to copy the stream. I'm not sure if this is the best way, but it works, and I only do it in this mode; otherwise I had issues forwarding both the text and TTS streams. I couldn't find anything like this in the Python library, though. Maybe a Python stream can be consumed more than once?
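
For reference, a minimal sketch of the "copy" using the standard Web Streams tee() API (the stream name here is illustrative, not the PR's actual identifier):

declare const llmTextStream: ReadableStream<string>; // illustrative LLM text source

// tee() is the standard way to consume one ReadableStream twice in JS:
// it returns two independent branches and locks the source until both
// branches are fully read or cancelled.
const [textBranch, ttsBranch] = llmTextStream.tee();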

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Removed unnecessary comments and ensured code quality
  • Changes explained: All changes are properly documented and justified above
  • Scope appropriate: All changes relate to the PR title, or explanations provided for why they're included

Testing

  • Automated tests added/updated (if applicable)
  • All tests pass
  • Make sure both restaurant_agent.ts and realtime_agent.ts work properly (for major changes)
  • Manual tests succeeded

Additional Notes

I only tested it with the OpenAI Realtime Model; the Gemini model is definitely missing some parts. We should also check whether we need a backwards-compatibility mode, because right now I check for the new audioOutput flag (which I added in google and openai), but I'm not sure if we are missing it somewhere else.
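
A hedged sketch of one possible backwards-compatibility default (the capabilities.audioOutput shape matches the diff quoted later in this thread; the defaulting logic itself is only an assumption, not what this PR does):

declare const llm: { capabilities: { audioOutput?: boolean } }; // illustrative shape

// If a provider hasn't been updated to declare audioOutput yet, treating
// undefined as true preserves the old behavior (realtime model emits audio).
const audioOutputSupported = llm.capabilities.audioOutput ?? true;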


Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.


changeset-bot bot commented Oct 22, 2025

🦋 Changeset detected

Latest commit: f962dd4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 14 packages
Name Type
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-openai Patch
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugins-test Patch



CLAassistant commented Oct 22, 2025

CLA assistant check
All committers have signed the CLA.

output_modalities (array)

The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
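
For context, a minimal sketch of how this parameter could be sent in a Realtime session update (the payload shape is an assumption based on the docs quoted above, not code from this PR, and assumes an already-open WebSocket to the Realtime endpoint):

declare const ws: WebSocket; // illustrative: an open Realtime API connection

// Request text-only responses so a separate TTS plugin can synthesize audio.
ws.send(
  JSON.stringify({
    type: 'session.update',
    session: {
      // per the docs above: ["text"] or ["audio"], never both
      output_modalities: ['text'],
    },
  }),
);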

simllll commented Oct 23, 2025

I added some missing events that are required for the text modality, and I finally get a valid text response from the LLM. Unfortunately something was still missing: my ElevenLabs TTS wasn't doing anything, as if it was waiting for something or not getting triggered. Found the issue: 978613b

@simllll simllll changed the title fix modalities parameter for openai implement TTS with realtime Oct 23, 2025
@simllll simllll changed the title implement TTS with realtime implement TTS with RealtimeModel Oct 23, 2025
@simllll simllll changed the title implement TTS with RealtimeModel feat: TTS with RealtimeModel Oct 23, 2025
Comment on lines +1523 to +1541

// Determine if we need to tee the text stream for both text output and TTS
const needsTextOutput = !!textOutput && !!trNodeResult;
const needsTTSSynthesis =
audioOutput &&
this.llm instanceof RealtimeModel &&
!this.llm.capabilities.audioOutput &&
this.tts;
const needsBothTextAndTTS = needsTextOutput && needsTTSSynthesis;

// Tee the stream if we need it for both purposes
let textStreamForOutput = trNodeResult;
let textStreamForTTS = trNodeResult;
if (needsBothTextAndTTS && trNodeResult) {
const [stream1, stream2] = trNodeResult.tee();
textStreamForOutput = stream1;
textStreamForTTS = stream2;
}


We need to be super careful whenever doing a tee operation. It will lock the source stream until both tee'd streams have finished or been cancelled. In case of an interruption, we would have to release the reader locks for both sub-streams to free resources; we cannot simply cancel trNodeResult.
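
A minimal sketch of that cleanup path, relying on the Web Streams behavior described above (the function name and shape are illustrative, not from this PR):

// Hypothetical interruption handler: after tee(), the source is locked,
// so cancel both branches instead of the original stream. Cancelling both
// branches propagates the cancellation back to the shared source.
// (If a branch has an active reader, cancel via that reader instead.)
async function abortTeedStreams(
  branches: [ReadableStream<string>, ReadableStream<string>],
  reason: unknown,
): Promise<void> {
  await Promise.all(branches.map((b) => b.cancel(reason)));
}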

@toubatbrian toubatbrian left a comment

Added some comments



Development

Successfully merging this pull request may close these issues.

Realtime with custom tts
