Pipeline orchestration improvements: gating, events, audio path, streaming STT, post-interrupt context

A round of focused improvements to the orchestration core, driven by what every consumer (speech-swift, speech-android) currently re-implements on top of `VoicePipeline` and by what hurts the perceived quality of the loop. Each item is described functionally; implementation details are TBD per PR.

---

### 1. Move mic gating and cooldown into the pipeline

Today `push_audio` accepts every sample callers feed it, regardless of pipeline state. While the agent is speaking, in cooldown after speaking, or in a force-cut recovery window, callers must substitute silence themselves to keep VAD time-aligned; returning early instead creates a gap in the sample stream that leaves VAD stranded. Every consumer independently rediscovers this and writes the same gating logic.

**Proposal:** the pipeline owns the gate. `push_audio` internally substitutes zero-fill (preserving the VAD timeline) when the pipeline is in `speaking`, `post_playback_guard`, or `interruption_recovery`. An optional `sc_pipeline_set_capture_mode`, taking either `SC_CAPTURE_RAW` or `SC_CAPTURE_GATED`, covers callers that need the raw path (e.g. AEC reference signal).

**Why:** removes ~20 lines of subtle workaround from every consumer, eliminates a class of "VAD goes deaf after TTS" bugs, makes `push_audio` honest about its contract.
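
As a rough illustration of the gate described above, the sketch below zero-fills the buffer handed to VAD while keeping the sample count unchanged. The enum values, the function names, and the idea of a separate output buffer are all assumptions made for this example; only the state names and the raw/gated distinction come from the proposal.

```c
#include <stddef.h>

/* Hypothetical state enum; the names mirror the states listed in the proposal. */
typedef enum {
    DEMO_STATE_IDLE,
    DEMO_STATE_LISTENING,
    DEMO_STATE_SPEAKING,
    DEMO_STATE_POST_PLAYBACK_GUARD,
    DEMO_STATE_INTERRUPTION_RECOVERY
} demo_state_t;

/* Hypothetical capture modes standing in for SC_CAPTURE_GATED / SC_CAPTURE_RAW. */
typedef enum { DEMO_CAPTURE_GATED, DEMO_CAPTURE_RAW } demo_capture_mode_t;

/* Returns 1 when mic input should be replaced by silence in this state. */
static int demo_gate_is_closed(demo_state_t state) {
    return state == DEMO_STATE_SPEAKING ||
           state == DEMO_STATE_POST_PLAYBACK_GUARD ||
           state == DEMO_STATE_INTERRUPTION_RECOVERY;
}

/*
 * Fills `out` (the buffer that would be handed to VAD) with `count` samples.
 * In gated mode with the gate closed, zeros are written instead of `in`.
 * The sample count never changes, so the VAD timeline stays continuous.
 * Returns 1 if the chunk was zero-filled, 0 if it was passed through.
 */
static int demo_gate_samples(demo_state_t state, demo_capture_mode_t mode,
                             const float *in, float *out, size_t count) {
    int gated = (mode == DEMO_CAPTURE_GATED) && demo_gate_is_closed(state);
    for (size_t i = 0; i < count; i++) {
        out[i] = gated ? 0.0f : in[i];
    }
    return gated;
}
```

The point of returning a full-length buffer in both branches is that the consumer no longer has to choose between dropping samples (which breaks VAD timing) and writing its own silence substitution.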

---

### 2. Emit state and lifecycle events callers currently back-calculate

The pipeline has a state machine but only exposes it via a `sc_pipeline_state()` poll accessor. Consumers either poll on a timer or maintain a parallel state machine driven by inference from `speech_started` / `transcription_completed` / `response_done`. The same is true for force-cut: callers detect it by measuring elapsed wall time and comparing against `max_utterance_duration`.

**Proposal:** add events:

- `SC_EVENT_STATE_CHANGED` — payload = new state.
- `SC_EVENT_MAX_UTTERANCE_REACHED` — fired when the recording is force-cut at `max_utterance_duration`, before `speech_ended`.
- `SC_EVENT_TRANSCRIPTION_PARTIAL` — interim text from streaming STT (when wired; see #4).
- `SC_EVENT_RESPONSE_TEXT_DELTA` — token-level text from the LLM stream, separate from audio deltas, so UIs can show streaming captions.
- `SC_EVENT_PLAYBACK_STARTED` / `SC_EVENT_PLAYBACK_FINISHED` — distinct from `response_created` / `response_done`, with sample count + duration in payload, so consumers don't track durations themselves.

**Why:** eliminates parallel state machines in consumer code, removes brittle elapsed-time back-calculation, unlocks streaming-caption UIs that are table stakes for modern voice agents.
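
To make the consumer side concrete, here is a minimal sketch of a handler that mirrors pipeline state and builds a streaming caption purely from events, with no timers or inferred state. The event payload struct (`demo_event_t`), the `ui_model_t` type, and all field names are hypothetical; only the `SC_EVENT_*` names are taken from the list above, and they are redeclared locally so the example compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Event names from the proposal, redeclared locally for a standalone example. */
typedef enum {
    SC_EVENT_STATE_CHANGED,
    SC_EVENT_MAX_UTTERANCE_REACHED,
    SC_EVENT_TRANSCRIPTION_PARTIAL,
    SC_EVENT_RESPONSE_TEXT_DELTA,
    SC_EVENT_PLAYBACK_STARTED,
    SC_EVENT_PLAYBACK_FINISHED
} demo_event_type_t;

/* Hypothetical payload layout; the real sc_event_t may look different. */
typedef struct {
    demo_event_type_t type;
    int new_state;          /* STATE_CHANGED */
    const char *text;       /* TRANSCRIPTION_PARTIAL, RESPONSE_TEXT_DELTA */
    uint64_t sample_count;  /* PLAYBACK_STARTED / PLAYBACK_FINISHED */
} demo_event_t;

/* Consumer-side model that is updated only from events. */
typedef struct {
    int state;
    int force_cut_seen;
    size_t caption_len;
    char caption[256];
    uint64_t played_samples;
} ui_model_t;

static void ui_on_event(const demo_event_t *ev, void *user_data) {
    ui_model_t *ui = (ui_model_t *)user_data;
    switch (ev->type) {
    case SC_EVENT_STATE_CHANGED:
        ui->state = ev->new_state;          /* no parallel state machine */
        break;
    case SC_EVENT_MAX_UTTERANCE_REACHED:
        ui->force_cut_seen = 1;             /* no wall-clock back-calculation */
        break;
    case SC_EVENT_RESPONSE_TEXT_DELTA: {
        if (ev->text == NULL) break;
        /* Append the delta to the caption, clamped to the buffer size. */
        size_t n = strlen(ev->text);
        size_t room = sizeof(ui->caption) - 1 - ui->caption_len;
        if (n > room) n = room;
        memcpy(ui->caption + ui->caption_len, ev->text, n);
        ui->caption_len += n;
        ui->caption[ui->caption_len] = '\0';
        break;
    }
    case SC_EVENT_PLAYBACK_FINISHED:
        ui->played_samples += ev->sample_count;
        break;
    case SC_EVENT_TRANSCRIPTION_PARTIAL:
    case SC_EVENT_PLAYBACK_STARTED:
        break;                              /* not used in this sketch */
    }
}
```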

---

### 3. Float32 audio across the C ABI

`SC_EVENT_RESPONSE_AUDIO_DELTA` currently carries PCM16-LE bytes. TTS implementations natively produce Float32; consumers playing through CoreAudio/AudioTrack also want Float32. The pipeline converts Float32 → PCM16 to emit the event; the consumer immediately converts PCM16 → Float32 to play.

**Proposal:** add `const float* audio_f32` and `size_t audio_f32_length` fields to `sc_event_t`, and document `audio_data` as a legacy PCM16 view kept for binary compatibility. New consumers read the Float32 fields directly.

**Why:** removes two allocations + two passes per audio chunk on the hot path, drops a precision round-trip, simplifies bridge code.
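
A consumer that wants to support both old and new producers could prefer the Float32 view and fall back to decoding the legacy PCM16 bytes. The sketch below assumes a stand-in struct (`demo_audio_event_t`) carrying only the audio fields, assumes `audio_f32_length` counts samples rather than bytes, and invents an `audio_length` byte count for the legacy view; none of these details are confirmed by the proposal.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the audio-related fields of sc_event_t (layout is an assumption). */
typedef struct {
    const uint8_t *audio_data;       /* legacy PCM16-LE bytes */
    size_t         audio_length;     /* hypothetical: length in bytes */
    const float   *audio_f32;        /* proposed Float32 view, NULL if absent */
    size_t         audio_f32_length; /* assumed here to be a sample count */
} demo_audio_event_t;

/*
 * Copies up to `cap` playback samples into `out` as Float32.
 * Prefers the Float32 view (one memcpy, no precision loss); otherwise
 * decodes little-endian PCM16 and scales to [-1.0, 1.0).
 * Returns the number of samples written.
 */
static size_t copy_playback_samples(const demo_audio_event_t *ev,
                                    float *out, size_t cap) {
    if (ev->audio_f32 != NULL) {
        size_t n = ev->audio_f32_length < cap ? ev->audio_f32_length : cap;
        memcpy(out, ev->audio_f32, n * sizeof(float));
        return n;
    }
    size_t n = ev->audio_length / 2;
    if (n > cap) n = cap;
    for (size_t i = 0; i < n; i++) {
        /* Assemble the 16-bit value, then sign-extend it portably. */
        int32_t v = (int32_t)ev->audio_data[2 * i] |
                    ((int32_t)ev->audio_data[2 * i + 1] << 8);
        if (v >= 32768) v -= 65536;
        out[i] = (float)v / 32768.0f;
    }
    return n;
}
```

The fallback branch is exactly the second conversion pass that the proposal aims to remove from the hot path for new consumers.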

---

### 4. Use the streaming-STT vtable in the orchestrator

`sc_stt_vtable_t` already declares `begin_stream` / `push_chunk` / `flush_stream` / `end_stream` / `cancel_stream`, but the pipeline only invokes batch `transcribe` after VAD-offset. STT implementations that support streaming (e.g. Parakeet streaming) cannot be exercised through the pipeline today.

**Proposal:** if the STT vtable provides streaming entry points, the pipeline:

- calls `begin_stream` on `speech_started`,
- forwards audio chunks via `push_chunk` while listening,
- emits `transcription_partial` events from `push_chunk` return values,
- finalises via `flush_stream` on `speech_ended`,
- cancels via `cancel_stream` on interruption.

When the streaming hooks are null, the pipeline falls back to batch `transcribe`, preserving today's behaviour.

**Why:** removes the "STT blocks worker thread for 2–3 s" item from known limitations; enables partial-transcript UIs; aligns the orchestrator with the API surface it already advertises.
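
The dispatch logic proposed above might be sketched as follows. The vtable member names come from the description of `sc_stt_vtable_t`, but every signature here is an assumption, as are the `demo_*` types, the fake STT backends, and the choice to call `end_stream` right after `flush_stream`.

```c
#include <stddef.h>

/* Hypothetical signatures; only the member names come from the proposal. */
typedef struct {
    const char *(*transcribe)(void *ctx, const float *pcm, size_t n);
    void        (*begin_stream)(void *ctx);
    const char *(*push_chunk)(void *ctx, const float *pcm, size_t n); /* partial or NULL */
    const char *(*flush_stream)(void *ctx);                           /* final text */
    void        (*end_stream)(void *ctx);
    void        (*cancel_stream)(void *ctx);
} demo_stt_vtable_t;

typedef struct {
    const demo_stt_vtable_t *vt;
    void *ctx;
    int stream_open; /* 1 while a streaming session is active */
} demo_orch_t;

static int stt_can_stream(const demo_stt_vtable_t *vt) {
    return vt->begin_stream && vt->push_chunk && vt->flush_stream &&
           vt->end_stream && vt->cancel_stream;
}

/* speech_started: open a stream only if every streaming hook is present. */
static void orch_speech_started(demo_orch_t *o) {
    if (stt_can_stream(o->vt)) { o->vt->begin_stream(o->ctx); o->stream_open = 1; }
}

/* While listening: returns partial text to emit as an event, or NULL. */
static const char *orch_audio(demo_orch_t *o, const float *pcm, size_t n) {
    return o->stream_open ? o->vt->push_chunk(o->ctx, pcm, n) : NULL;
}

/* speech_ended: flush the stream, or run batch transcribe as the fallback. */
static const char *orch_speech_ended(demo_orch_t *o, const float *utt, size_t n) {
    if (!o->stream_open) return o->vt->transcribe(o->ctx, utt, n);
    const char *final_text = o->vt->flush_stream(o->ctx);
    o->vt->end_stream(o->ctx);
    o->stream_open = 0;
    return final_text;
}

/* Interruption: discard any in-flight stream. */
static void orch_interrupted(demo_orch_t *o) {
    if (o->stream_open) { o->vt->cancel_stream(o->ctx); o->stream_open = 0; }
}

/* Fake backends used only to exercise both paths. */
typedef struct { int chunks; int cancelled; } fake_stt_t;
static const char *fake_transcribe(void *c, const float *p, size_t n) {
    (void)c; (void)p; (void)n; return "final (batch)";
}
static void fake_begin(void *c) { ((fake_stt_t *)c)->chunks = 0; }
static const char *fake_push(void *c, const float *p, size_t n) {
    (void)p; (void)n; ((fake_stt_t *)c)->chunks++; return "partial";
}
static const char *fake_flush(void *c) { (void)c; return "final (streamed)"; }
static void fake_end(void *c) { (void)c; }
static void fake_cancel(void *c) { ((fake_stt_t *)c)->cancelled = 1; }

static const demo_stt_vtable_t FAKE_STREAMING_STT = {
    fake_transcribe, fake_begin, fake_push, fake_flush, fake_end, fake_cancel
};
static const demo_stt_vtable_t FAKE_BATCH_STT = {
    fake_transcribe, NULL, NULL, NULL, NULL, NULL
};
```

Because the decision is made once per utterance in `orch_speech_started`, a batch-only backend never sees a streaming call, which is what keeps existing implementations working unchanged.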

---

### 5. Truncate `ConversationContext` to the spoken prefix on interruption

When the user interrupts mid-response, the LLM has already generated (and often the TTS has already synthesized) more text than the user actually heard. The assistant turn currently stored in `ConversationContext` is the full generated text, so on the next turn the model believes it said things the user never heard. This is a well-known issue in cloud frameworks (analogue: Pipecat #2791) and we get to avoid it by construction.

**Proposal:** track how many TTS samples were actually played versus synthesized. On interruption, map sample count back to a text position (TTS implementations expose alignment in some backends; word-level approximation is acceptable otherwise) and truncate the stored assistant message to that point. Append a marker (configurable) like `…[interrupted]` so the LLM understands the turn was cut.

**Why:** keeps multi-turn conversations coherent after barge-in; prevents the assistant from referencing things the user never heard; one of the few correctness issues that gets worse with longer sessions.
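
For backends without alignment data, the word-level approximation could look roughly like the sketch below: estimate the text position proportionally from played versus synthesized samples, back up to the previous word boundary, and append the marker. The function name and signature are hypothetical, and the proportional mapping is an assumption that treats speech rate as uniform across the response.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes the part of `full` that was plausibly heard into `out`, followed by
 * `marker`. If everything was played, `full` is copied without a marker.
 * Returns what snprintf returns: the length the result needs, excluding '\0'.
 *
 * Assumption: text position scales linearly with played audio. Cutting only
 * at ASCII spaces also keeps multi-byte UTF-8 sequences intact.
 */
static int truncate_to_spoken(const char *full,
                              uint64_t played_samples,
                              uint64_t synthesized_samples,
                              const char *marker,
                              char *out, size_t cap) {
    size_t len = strlen(full);

    /* The whole response was heard: store it untouched. */
    if (synthesized_samples > 0 && played_samples >= synthesized_samples) {
        return snprintf(out, cap, "%s", full);
    }

    /* Proportional estimate of how many bytes of text were spoken. */
    size_t cut = 0;
    if (synthesized_samples > 0) {
        cut = (size_t)((uint64_t)len * played_samples / synthesized_samples);
    }

    /* Conservative: keep a word only if the estimate covers all of it. */
    while (cut > 0 && full[cut] != ' ') cut--;
    /* Drop the space(s) before the cut so the marker attaches to a word. */
    while (cut > 0 && full[cut - 1] == ' ') cut--;

    return snprintf(out, cap, "%.*s%s", (int)cut, full, marker);
}
```

For example, with the text `one two three four`, 10 of 18 samples played, and the marker `[interrupted]`, the stored assistant turn becomes `one two[interrupted]`. Rounding down to a word boundary errs on the side of storing less than the user heard, which seems the safer failure mode for the next turn.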

---

### Non-goals for this issue

- Wiring AEC into the pipeline — that's a consumer-side decision via `EnhancerInterface` / `EchoCancellerInterface`, no orchestration change needed.
- Auto-resume on `response_done` — explicit `resume_listening()` keeps semantics simple; deferred.
- New model implementations — out of scope.

### Suggested PR sequencing

Items are mostly independent and can land in any order, but the natural sequence is:
1, 2 first (consumer-facing API, no behavioural change), then 4 (streaming STT), then 3 (audio path), then 5 (context truncation — slightly more involved alignment work).
