A round of focused improvements to the orchestration core, driven by what every consumer (speech-swift, speech-android) currently re-implements on top of VoicePipeline and what hurts the perceived quality of the loop. Each item is described functionally — implementation TBD on a per-PR basis.
1. Move mic gating and cooldown into the pipeline
Today push_audio accepts every sample callers feed it, regardless of pipeline state. While the agent is speaking, in cooldown after speaking, or in a force-cut recovery window, callers must substitute silence themselves to keep VAD time-aligned (returning early creates a discontinuity that strands VAD). Every consumer is independently learning this and writing the same gating logic.
Proposal: the pipeline owns the gate. push_audio internally substitutes zero-fill (preserving the VAD timeline) when in speaking, post_playback_guard, or interruption_recovery. Optional sc_pipeline_set_capture_mode(), taking either SC_CAPTURE_RAW or SC_CAPTURE_GATED, for callers that need the raw path (e.g. AEC reference signal).
Why: removes ~20 lines of subtle workaround from every consumer, eliminates a class of "VAD goes deaf after TTS" bugs, makes push_audio honest about its contract.
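A minimal sketch of the gating rule described above, not the pipeline's actual implementation. The state enum, the `sc_gate_closed` / `sc_gate_audio` helpers, and the choice of Float32 input samples are all assumptions made for illustration; only the state names and the two capture modes come from this issue. The point it shows is that a closed gate still emits the same number of samples, so the VAD clock never skips.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state and mode enums; the real values are not specified here. */
typedef enum {
    SC_STATE_IDLE,
    SC_STATE_LISTENING,
    SC_STATE_SPEAKING,
    SC_STATE_POST_PLAYBACK_GUARD,
    SC_STATE_INTERRUPTION_RECOVERY
} sc_state_t;

typedef enum { SC_CAPTURE_GATED, SC_CAPTURE_RAW } sc_capture_mode_t;

/* True when microphone input should be replaced by silence. */
static bool sc_gate_closed(sc_state_t state, sc_capture_mode_t mode) {
    if (mode == SC_CAPTURE_RAW) {
        return false; /* raw path: the caller opted out of gating */
    }
    return state == SC_STATE_SPEAKING ||
           state == SC_STATE_POST_PLAYBACK_GUARD ||
           state == SC_STATE_INTERRUPTION_RECOVERY;
}

/* Writes exactly n samples to out: a copy of in, or zeros when gated.
 * The sample count is never reduced, which keeps the VAD timeline continuous. */
static size_t sc_gate_audio(sc_state_t state, sc_capture_mode_t mode,
                            const float *in, float *out, size_t n) {
    if (sc_gate_closed(state, mode)) {
        memset(out, 0, n * sizeof(float)); /* all-zero bits are 0.0f in IEEE 754 */
    } else {
        memcpy(out, in, n * sizeof(float));
    }
    return n;
}
```

With this shape, a consumer would feed every captured buffer to push_audio unconditionally and drop its own silence-substitution code.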
2. Emit state and lifecycle events callers currently back-calculate
The pipeline has a state machine but only exposes it via a sc_pipeline_state() poll accessor. Consumers either poll on a timer or maintain a parallel state machine driven by inference from speech_started / transcription_completed / response_done. The same is true for force-cut: callers detect it by measuring elapsed wall time and comparing against max_utterance_duration.
Proposal: add events:
SC_EVENT_STATE_CHANGED — payload = new state.
SC_EVENT_MAX_UTTERANCE_REACHED — fired when the recording is force-cut at max_utterance_duration, before speech_ended.
SC_EVENT_TRANSCRIPTION_PARTIAL — interim text from streaming STT (when wired; see item 4).
SC_EVENT_RESPONSE_TEXT_DELTA — token-level text from the LLM stream, separate from audio deltas, so UIs can show streaming captions.
SC_EVENT_PLAYBACK_STARTED / SC_EVENT_PLAYBACK_FINISHED — distinct from response_created / response_done, with sample count + duration in payload, so consumers don't track durations themselves.
Why: eliminates parallel state machines in consumer code, removes brittle elapsed-time back-calculation, unlocks streaming-caption UIs that are table stakes for modern voice agents.
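To illustrate what consumer code could look like once these events exist, here is a hedged sketch of an event handler. The event names come from the list above; the `demo_event_t` payload layout, the numeric enum values, and the `ui_model_t` struct are hypothetical, since the issue leaves the payload design to the individual PRs.

```c
#include <stddef.h>
#include <string.h>

/* Event names from the proposal; numeric values are assumptions. */
typedef enum {
    SC_EVENT_STATE_CHANGED,
    SC_EVENT_MAX_UTTERANCE_REACHED,
    SC_EVENT_RESPONSE_TEXT_DELTA,
    SC_EVENT_PLAYBACK_STARTED,
    SC_EVENT_PLAYBACK_FINISHED
} demo_event_type_t;

/* Hypothetical payload layout, for illustration only. */
typedef struct {
    demo_event_type_t type;
    int         state;        /* STATE_CHANGED: the new state */
    const char *text;         /* RESPONSE_TEXT_DELTA: token text */
    size_t      sample_count; /* PLAYBACK_*: samples played */
    double      duration_s;   /* PLAYBACK_*: duration in seconds */
} demo_event_t;

/* Everything the UI needs, filled purely from events (no polling, no timers). */
typedef struct {
    int    state;
    int    force_cut;
    char   caption[256];
    size_t caption_len;
    double last_playback_s;
} ui_model_t;

static void on_event(const demo_event_t *ev, void *user) {
    ui_model_t *ui = (ui_model_t *)user;
    switch (ev->type) {
    case SC_EVENT_STATE_CHANGED:
        ui->state = ev->state; /* replaces the parallel state machine */
        break;
    case SC_EVENT_MAX_UTTERANCE_REACHED:
        ui->force_cut = 1;     /* replaces wall-clock back-calculation */
        break;
    case SC_EVENT_RESPONSE_TEXT_DELTA: {
        /* Append the delta to the caption, truncating if the buffer is full. */
        size_t n = ev->text ? strlen(ev->text) : 0;
        size_t room = sizeof ui->caption - 1 - ui->caption_len;
        if (n > room) n = room;
        if (n > 0) memcpy(ui->caption + ui->caption_len, ev->text, n);
        ui->caption_len += n;
        ui->caption[ui->caption_len] = '\0';
        break;
    }
    case SC_EVENT_PLAYBACK_STARTED:
        break;
    case SC_EVENT_PLAYBACK_FINISHED:
        ui->last_playback_s = ev->duration_s; /* no consumer-side timing */
        break;
    }
}
```

The handler holds no timers and derives nothing by inference, which is the simplification the proposal is after.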
3. Float32 audio across the C ABI
SC_EVENT_RESPONSE_AUDIO_DELTA currently carries PCM16-LE bytes. TTS implementations natively produce Float32; consumers playing through CoreAudio/AudioTrack also want Float32. The pipeline converts Float32 → PCM16 to emit the event; the consumer immediately converts PCM16 → Float32 to play.
Proposal: add a const float* audio_f32 + size_t audio_f32_length to sc_event_t, document audio_data as a legacy PCM16 view kept for binary compatibility. New consumers read the Float32 field directly.
Why: removes two allocations + two passes per audio chunk on the hot path, drops a precision round-trip, simplifies bridge code.
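The sketch below shows how a consumer might read the proposed Float32 field while still handling events that only carry the legacy PCM16 view. The field names audio_f32, audio_f32_length and audio_data are from the proposal; the `demo_audio_event_t` struct, the `audio_data_length` byte count, the assumption that audio_f32_length counts samples, and the conversion scale factors are illustrative guesses, not the library's definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The two conversions the current path performs back to back. */
static int16_t f32_to_pcm16(float x) {
    if (x > 1.0f) x = 1.0f;
    if (x < -1.0f) x = -1.0f;
    /* Round to nearest without needing libm. */
    return (int16_t)(x * 32767.0f + (x >= 0.0f ? 0.5f : -0.5f));
}

static float pcm16_to_f32(int16_t s) {
    return (float)s / 32768.0f;
}

/* Hypothetical slice of the event struct carrying both views. */
typedef struct {
    const uint8_t *audio_data;        /* legacy PCM16-LE bytes */
    size_t         audio_data_length; /* assumed: length in bytes */
    const float   *audio_f32;         /* proposed Float32 view */
    size_t         audio_f32_length;  /* assumed: length in samples */
} demo_audio_event_t;

/* Copies up to cap samples into dst and returns how many were written.
 * Prefers the Float32 view; decodes PCM16-LE only when it is absent. */
static size_t consume_audio(const demo_audio_event_t *ev, float *dst, size_t cap) {
    if (ev->audio_f32 != NULL) {
        size_t n = ev->audio_f32_length < cap ? ev->audio_f32_length : cap;
        memcpy(dst, ev->audio_f32, n * sizeof(float));
        return n;
    }
    size_t n = ev->audio_data_length / 2;
    if (n > cap) n = cap;
    for (size_t i = 0; i < n; i++) {
        uint16_t u = (uint16_t)(ev->audio_data[2 * i] |
                                (ev->audio_data[2 * i + 1] << 8));
        dst[i] = pcm16_to_f32((int16_t)u);
    }
    return n;
}
```

Passing a sample such as 0.1f through `f32_to_pcm16` and back does not return the original value, which is the precision round-trip the Float32 field would avoid.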
4. Use the streaming-STT vtable in the orchestrator
sc_stt_vtable_t already declares begin_stream / push_chunk / flush_stream / end_stream / cancel_stream, but the pipeline only invokes batch transcribe after VAD-offset. STT implementations that support streaming (e.g. Parakeet streaming) cannot be exercised through the pipeline today.
Proposal: if the STT vtable provides streaming entry points, the pipeline:
calls begin_stream on speech_started,
forwards audio chunks via push_chunk while listening,
emits transcription_partial events from push_chunk return values,
finalises via flush_stream on speech_ended,
cancels via cancel_stream on interruption.
Falls back to batch transcribe when streaming hooks are null, preserving today's behaviour.
Why: removes the "STT blocks worker thread for 2–3 s" item from known limitations; enables partial-transcript UIs; aligns the orchestrator with the API surface it already advertises.
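The dispatch described above can be sketched as follows. The entry-point names match those listed for sc_stt_vtable_t, but every signature here is an assumption (the issue does not give them), as are `demo_pipeline_t`, the fixed-size utterance buffer, and the convention that returned strings stay valid until the next call into the backend.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signatures; only the entry-point names come from the issue. */
typedef struct {
    void *ctx;
    const char *(*transcribe)(void *ctx, const float *audio, size_t n);
    int         (*begin_stream)(void *ctx);   /* 0 on success */
    const char *(*push_chunk)(void *ctx, const float *audio, size_t n);
    const char *(*flush_stream)(void *ctx);
    void        (*end_stream)(void *ctx);
    void        (*cancel_stream)(void *ctx);
} demo_stt_vtable_t;

enum { DEMO_MAX_SAMPLES = 16000 };

typedef struct {
    const demo_stt_vtable_t *stt;
    bool   streaming;
    float  utterance[DEMO_MAX_SAMPLES]; /* used by the batch path only */
    size_t utterance_len;
    void (*on_partial)(const char *text); /* stands in for transcription_partial */
} demo_pipeline_t;

/* Streaming is used only when every streaming hook is present. */
static bool stt_can_stream(const demo_stt_vtable_t *v) {
    return v->begin_stream && v->push_chunk && v->flush_stream &&
           v->end_stream && v->cancel_stream;
}

static void on_speech_started(demo_pipeline_t *p) {
    p->utterance_len = 0;
    p->streaming = stt_can_stream(p->stt) && p->stt->begin_stream(p->stt->ctx) == 0;
}

static void on_audio(demo_pipeline_t *p, const float *audio, size_t n) {
    if (p->streaming) {
        const char *partial = p->stt->push_chunk(p->stt->ctx, audio, n);
        if (partial && p->on_partial) p->on_partial(partial);
        return;
    }
    /* Batch fallback: buffer the utterance, dropping overflow in this sketch. */
    size_t room = DEMO_MAX_SAMPLES - p->utterance_len;
    if (n > room) n = room;
    memcpy(p->utterance + p->utterance_len, audio, n * sizeof(float));
    p->utterance_len += n;
}

static const char *on_speech_ended(demo_pipeline_t *p) {
    if (p->streaming) {
        const char *final_text = p->stt->flush_stream(p->stt->ctx);
        p->stt->end_stream(p->stt->ctx);
        p->streaming = false;
        return final_text;
    }
    if (!p->stt->transcribe) return NULL;
    return p->stt->transcribe(p->stt->ctx, p->utterance, p->utterance_len);
}

static void on_interruption(demo_pipeline_t *p) {
    if (p->streaming) {
        p->stt->cancel_stream(p->stt->ctx);
        p->streaming = false;
    }
    p->utterance_len = 0;
}
```

Requiring all five hooks before entering streaming mode is one possible policy; a real implementation might accept a subset, for example treating end_stream as optional.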
5. Truncate ConversationContext to the spoken prefix on interruption
When the user interrupts mid-response, the LLM has already generated (and often the TTS has already synthesized) more text than the user actually heard. The assistant turn currently stored in ConversationContext is the full generated text, so on the next turn the model believes it said things the user never heard. This is a well-known issue in cloud frameworks (analogue: Pipecat #2791) and we get to avoid it by construction.
Proposal: track how many TTS samples were actually played versus synthesized. On interruption, map sample count back to a text position (TTS implementations expose alignment in some backends; word-level approximation is acceptable otherwise) and truncate the stored assistant message to that point. Append a marker (configurable) like …[interrupted] so the LLM understands the turn was cut.
Why: keeps multi-turn conversations coherent after barge-in; prevents the assistant from referencing things the user never heard; one of the few correctness issues that gets worse with longer sessions.
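As a rough sketch of the word-level approximation mentioned above (for backends without alignment data), the code below maps the played/synthesized sample ratio linearly onto the text and snaps back to the previous word boundary. The function names, the linear mapping, and the assumption that `text` is exactly the portion that was synthesized are illustrative choices, not the proposed implementation.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns how many bytes of text are assumed to have been heard.
 * Assumes speech progresses linearly through the text, which is only an
 * approximation; real alignment data should be preferred when available. */
static size_t spoken_prefix_len(const char *text, size_t played, size_t synthesized) {
    size_t len = strlen(text);
    if (synthesized == 0) return 0;
    if (played >= synthesized) return len;
    size_t cut = (size_t)((double)len * (double)played / (double)synthesized);
    /* Snap back to the previous space so no partial word is kept.
     * A space is ASCII, so this never splits a multi-byte UTF-8 sequence. */
    while (cut > 0 && text[cut] != ' ') cut--;
    return cut;
}

/* Writes the truncated assistant turn plus marker into out.
 * Returns 0 on success, -1 if out_cap is too small. */
static int truncate_assistant_turn(const char *text, size_t played, size_t synthesized,
                                   const char *marker, char *out, size_t out_cap) {
    size_t keep = spoken_prefix_len(text, played, synthesized);
    if (keep == strlen(text)) marker = ""; /* fully heard: nothing was cut */
    int n = snprintf(out, out_cap, "%.*s%s", (int)keep, text, marker);
    return (n >= 0 && (size_t)n < out_cap) ? 0 : -1;
}
```

For example, with the text "one two three four", 50 of 100 samples played, and the marker "…[interrupted]", the stored turn becomes "one two…[interrupted]".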
Non-goals for this issue
Wiring AEC into the pipeline — that's a consumer-side decision via EnhancerInterface / EchoCancellerInterface, no orchestration change needed.
Auto-resume on response_done — explicit resume_listening() keeps semantics simple; deferred.
New model implementations — out of scope.
Suggested PR sequencing
Items are mostly independent and can land in any order, but the natural sequence is:
1, 2 first (consumer-facing API, no behavioural change), then 4 (streaming STT), then 3 (audio path), then 5 (context truncation — slightly more involved alignment work).