feat(channels): voice transcription hook in Chat SDK bridge #2458
Closed
mtichikawa wants to merge 2 commits into
Add an opt-in voice transcription pass to the shared Chat SDK bridge. When `WHISPER_BIN` is set and an inbound attachment looks like audio (mimeType `audio/*` or coarse type `audio`/`voice`), the bridge runs whisper.cpp on the buffer after `fetchData()` and appends the transcript to the message content as `[Voice: <transcript>]`. Skipped silently when `WHISPER_BIN` is unset, so no behavior changes for existing installs. The hook lives in `chat-sdk-bridge.ts` (shared by Discord, Slack, Teams, Webex, etc.) rather than per-adapter, mirroring how the bridge already handles attachment download centrally. Pairs with the SKILL.md trio added in a sibling PR on main.

- `src/transcription.ts`: `transcribeAudioBuffer(Buffer)` + `isAudioAttachment`. Channel-agnostic. Uses node:child_process to shell out to ffmpeg (input normalization → 16 kHz mono WAV) and whisper-cli. Returns null on any error or empty output; the bridge only injects `[Voice: ...]` when the transcript is non-empty.
- `src/channels/chat-sdk-bridge.ts`: gated transcription pass after `fetchData()`. Stores the transcript on the attachment entry as well, so any downstream consumer (e.g., formatter, agent-runner) can use it.
- `src/transcription.test.ts`: 8 tests covering isAudioAttachment, the env-gate, trimming, empty output, and execFile failure paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses multi-line execFileAsync calls to match the project's formatter. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Closing — wrong target. After re-reading CONTRIBUTING.md, feature-skill PRs should be a single PR. Folded both changes into #2459.
This was referenced May 14, 2026
Summary
Adds an opt-in voice transcription pass to the shared Chat SDK bridge. When `WHISPER_BIN` is set and an inbound attachment looks like audio, the bridge runs whisper.cpp on the buffer after `fetchData()` and appends the transcript to the message content as `[Voice: <transcript>]`.

When `WHISPER_BIN` is unset (the default), the new code path is a no-op — zero behavior change for existing installs.

Why one bridge hook instead of per-adapter
The Chat SDK bridge is shared by Discord, Slack, Teams, Webex, Google Chat, and every future Chat SDK-supported platform. Hooking transcription here means all bridge-based channels gain voice support from a single integration point. The bridge already handles attachment download centrally (`messageToInbound` → `att.fetchData()`), so the transcription pass slots in naturally right after the buffer is in hand.

Sibling effort: @ira-at-work's #2317 (`/add-voice-transcription-free-whisper`) patches the Signal/Telegram/WhatsApp adapters directly (they don't go through the Chat SDK bridge). That PR plus this one would together cover every channel.

Demand signal: #2426 (just opened by @b1ek) — "LLM cant see the image in discord" — voice transcription is the same kind of gap.
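A minimal sketch of how the gated pass described above could sit in the bridge. The attachment shape, the `maybeTranscribe` helper, and the exact field names are assumptions for illustration, not the actual diff:

```typescript
// Hypothetical sketch of the env-gated transcription pass.
// Attachment shape and helper names are assumed, not taken from the diff.
type Attachment = {
  mimeType?: string;
  type?: string;       // coarse type, e.g. 'audio' | 'voice'
  data?: Buffer;       // buffer already fetched by att.fetchData()
  transcript?: string; // stored for downstream consumers
};

function isAudioAttachment(att: Attachment): boolean {
  if (att.mimeType?.startsWith("audio/")) return true;
  return att.type === "audio" || att.type === "voice";
}

async function maybeTranscribe(
  att: Attachment,
  transcribe: (buf: Buffer) => Promise<string | null>,
): Promise<string | null> {
  // Opt-in gate: without WHISPER_BIN this is a no-op for every install.
  if (!process.env.WHISPER_BIN) return null;
  if (!isAudioAttachment(att) || !att.data) return null;
  const transcript = await transcribe(att.data);
  if (!transcript) return null;    // null on failure or empty output
  att.transcript = transcript;     // downstream consumers can reuse it
  return `[Voice: ${transcript}]`; // caller appends this to message content
}
```

The point of returning `null` everywhere instead of throwing is that a transcription failure degrades to today's behavior (plain audio placeholder) rather than breaking message delivery.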
Files
- `src/transcription.ts` (new, ~95 lines): `transcribeAudioBuffer(Buffer): Promise<string | null>` plus an `isAudioAttachment(att)` helper. Channel-agnostic. Shells out to `ffmpeg` for input normalization (any container → 16 kHz mono WAV) and `whisper-cli` for the transcript. Returns null on any failure or empty output; the bridge only injects `[Voice: ...]` when the transcript is non-empty.
- `src/channels/chat-sdk-bridge.ts` (+15 lines): gated transcription pass inside `messageToInbound()`. Runs only when `process.env.WHISPER_BIN` is set and the attachment passes `isAudioAttachment` (`mimeType: audio/*` or coarse `type: 'audio'`/`'voice'`).
- `src/transcription.test.ts` (new, 8 tests): `isAudioAttachment` truth table; `transcribeAudioBuffer` env-gate, trim, empty-output, and execFile-failure paths.

Total: 3 files, +236 lines.
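A hedged sketch of the ffmpeg → whisper-cli pipeline the first bullet describes. The whisper-cli flags (`-f`, `-otxt`, `-of`) follow common whisper.cpp builds but may differ per install, and the temp-file layout is an assumption:

```typescript
// Hypothetical sketch of transcribeAudioBuffer: normalize with ffmpeg,
// then transcribe with whisper-cli. Flags and paths are assumptions.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

const execFileAsync = promisify(execFile);

// Pure helper: ffmpeg args for "any container -> 16 kHz mono WAV".
function ffmpegArgs(input: string, output: string): string[] {
  return ["-y", "-i", input, "-ar", "16000", "-ac", "1", "-f", "wav", output];
}

async function transcribeAudioBuffer(buf: Buffer): Promise<string | null> {
  const whisperBin = process.env.WHISPER_BIN;
  if (!whisperBin) return null; // opt-in gate: no-op by default

  const dir = await mkdtemp(join(tmpdir(), "voice-"));
  try {
    const inPath = join(dir, "in.bin");
    const wavPath = join(dir, "in.wav");
    await writeFile(inPath, buf);
    await execFileAsync("ffmpeg", ffmpegArgs(inPath, wavPath));
    // Assumed whisper.cpp-style flags: -otxt with -of <prefix> writes <prefix>.txt
    await execFileAsync(whisperBin, ["-f", wavPath, "-otxt", "-of", wavPath]);
    const text = (await readFile(`${wavPath}.txt`, "utf8")).trim();
    return text.length > 0 ? text : null;
  } catch {
    return null; // any ffmpeg/whisper failure degrades to "no transcript"
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

Using `execFile` rather than `exec` avoids shell interpolation of attachment-derived paths, and the single `catch` collapses every failure mode into the null contract the bridge relies on.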
Pairs with
A sibling PR against `main` adds the user-facing skill metadata: `.claude/skills/add-discord-voice-transcription/{SKILL.md, REMOVE.md, VERIFY.md}`, following the @ddaniels Signal template (#1953). It instructs users to fetch the modified `chat-sdk-bridge.ts` and the new `transcription.ts` from `origin/channels` after this PR lands.

Test plan
- `pnpm run build` — clean (3 pre-existing errors in deltachat/slack/telegram, unrelated to this change)
- `npx vitest run src/transcription.test.ts` — 8/8 passing
- `npx vitest run src/channels/chat-sdk-bridge.test.ts` — 7/7 passing (unchanged)
- With `WHISPER_BIN=whisper-cli` set, confirm the agent receives `[Voice: <transcript>]` inline. Will run this once the channels-branch checkout is up.
- With `WHISPER_BIN` unset, confirm voice attachments arrive as plain audio placeholders (no regression).

🤖 Generated with Claude Code