
feat(channels): voice transcription hook in Chat SDK bridge#2458

Closed
mtichikawa wants to merge 2 commits into nanocoai:channels from mtichikawa:feat/voice-transcription-bridge

Conversation

@mtichikawa

Summary

Adds an opt-in voice transcription pass to the shared Chat SDK bridge. When WHISPER_BIN is set and an inbound attachment looks like audio, the bridge runs whisper.cpp on the buffer after fetchData() and appends the transcript to the message content as [Voice: <transcript>].

When WHISPER_BIN is unset (the default), the new code path is a no-op — zero behavior change for existing installs.
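The gating described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the `Attachment` shape and the `shouldTranscribe` helper name are assumptions; `isAudioAttachment` mirrors the check described in the Files section (mimeType `audio/*` or coarse type `'audio'`/`'voice'`).

```typescript
// Illustrative sketch only; attachment shape and helper names are assumptions.
interface Attachment {
  mimeType?: string;
  type?: string;
}

// Matches the described check: mimeType audio/* or coarse type 'audio'/'voice'.
function isAudioAttachment(att: Attachment): boolean {
  if (att.mimeType?.startsWith("audio/")) return true;
  return att.type === "audio" || att.type === "voice";
}

function shouldTranscribe(att: Attachment): boolean {
  // No-op unless WHISPER_BIN is set: zero behavior change by default.
  return Boolean(process.env.WHISPER_BIN) && isAudioAttachment(att);
}
```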

Why one bridge hook instead of per-adapter

The Chat SDK bridge is shared by Discord, Slack, Teams, Webex, Google Chat, and every future Chat SDK-supported platform. Hooking transcription here means all bridge-based channels gain voice support from a single integration point. The bridge already handles attachment download centrally (messageToInbound → att.fetchData()), so the transcription pass slots in naturally right after the buffer is in hand.

Sibling effort: @ira-at-work's #2317 (/add-voice-transcription-free-whisper) patches Signal/Telegram/WhatsApp adapters directly (they don't go through the Chat SDK bridge). That PR + this one would together cover every channel.

Demand signal: #2426 (just opened by @b1ek) — "LLM cant see the image in discord" — voice transcription is the same kind of gap.

Files

  • src/transcription.ts (new, ~95 lines): transcribeAudioBuffer(Buffer): Promise<string | null> plus isAudioAttachment(att) helper. Channel-agnostic. Shells out to ffmpeg for input normalization (any container → 16 kHz mono WAV) and whisper-cli for the transcript. Returns null on any failure or empty output; the bridge only injects [Voice: ...] when the transcript is non-empty.
  • src/channels/chat-sdk-bridge.ts (+15 lines): gated transcription pass inside messageToInbound(). Runs only when process.env.WHISPER_BIN is set and the attachment passes isAudioAttachment (mimeType: audio/* or coarse type: 'audio'/'voice').
  • src/transcription.test.ts (new, 8 tests): isAudioAttachment truth table; transcribeAudioBuffer env-gate, trim, empty-output, and execFile-failure paths.

Total: 3 files, +236 lines.
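The ffmpeg → whisper-cli flow in src/transcription.ts might look roughly like this. A sketch under stated assumptions: the whisper-cli flags shown (`-f`, `--no-timestamps`) and the temp-file handling are illustrative, not the PR's actual invocation; only the contract matches the description (null on any failure or empty output, env-gated on WHISPER_BIN).

```typescript
// Illustrative sketch of the described pipeline; flags and temp handling are assumptions.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { mkdtemp, writeFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

const execFileAsync = promisify(execFile);

async function transcribeAudioBuffer(buf: Buffer): Promise<string | null> {
  const whisperBin = process.env.WHISPER_BIN;
  if (!whisperBin) return null; // env gate: no-op by default
  const dir = await mkdtemp(join(tmpdir(), "transcribe-"));
  try {
    const input = join(dir, "input");
    const wav = join(dir, "audio.wav");
    await writeFile(input, buf);
    // Normalize any container to 16 kHz mono WAV, the format whisper.cpp expects.
    await execFileAsync("ffmpeg", ["-i", input, "-ar", "16000", "-ac", "1", wav]);
    // whisper-cli flags here are assumed; adjust for your build.
    const { stdout } = await execFileAsync(whisperBin, ["-f", wav, "--no-timestamps"]);
    const text = stdout.trim();
    return text.length > 0 ? text : null; // empty output maps to null
  } catch {
    return null; // any ffmpeg/whisper failure yields null, never throws
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

The null-on-failure contract keeps the bridge side trivial: it only injects `[Voice: ...]` on a non-null result, so a missing binary, a corrupt buffer, or silent audio all degrade to the existing plain-attachment behavior.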

Pairs with

A sibling PR against main adds the user-facing skill metadata: .claude/skills/add-discord-voice-transcription/{SKILL.md, REMOVE.md, VERIFY.md} following the @ddaniels Signal template (#1953). It instructs users to fetch the modified chat-sdk-bridge.ts and the new transcription.ts from origin/channels after this PR lands.

Test plan

  • pnpm run build — clean apart from 3 pre-existing errors in deltachat/slack/telegram, unrelated to this change
  • npx vitest run src/transcription.test.ts — 8/8 passing
  • npx vitest run src/channels/chat-sdk-bridge.test.ts — 7/7 passing (unchanged)
  • Live test: send a Discord voice memo with WHISPER_BIN=whisper-cli set, confirm the agent receives [Voice: <transcript>] inline. Will run this once the channels-branch checkout is up.
  • Live test: same with WHISPER_BIN unset, confirm voice attachments arrive as plain audio placeholders (no regression).
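The injection step the live tests exercise can be sketched as a pure helper. The helper name and the newline placement are assumptions for illustration; the `[Voice: <transcript>]` format is from the PR description.

```typescript
// Hypothetical helper showing how a non-empty transcript is appended
// to the message content as [Voice: <transcript>]; names are illustrative.
function appendVoiceTranscript(content: string, transcript: string | null): string {
  const text = transcript?.trim();
  if (!text) return content; // null or empty transcript: content unchanged
  const tag = `[Voice: ${text}]`;
  return content.length > 0 ? `${content}\n${tag}` : tag;
}
```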

🤖 Generated with Claude Code

Add an opt-in voice transcription pass to the shared Chat SDK bridge. When
WHISPER_BIN is set and an inbound attachment looks like audio (mimeType
audio/* or coarse type audio/voice), the bridge runs whisper.cpp on the
buffer after fetchData() and appends the transcript to the message content
as [Voice: <transcript>]. Skipped silently when WHISPER_BIN is unset, so
no behavior changes for existing installs.

The hook lives in chat-sdk-bridge.ts (shared by Discord, Slack, Teams,
Webex, etc.) rather than per-adapter, mirroring how the bridge already
handles attachment download centrally. Pairs with the SKILL.md trio added
in a sibling PR on main.

- src/transcription.ts: transcribeAudioBuffer(Buffer) + isAudioAttachment.
  Channel-agnostic. Uses node:child_process to shell out to ffmpeg (input
  normalization → 16 kHz mono WAV) and whisper-cli. Returns null on any
  error or empty output; the bridge only injects [Voice: ...] when the
  transcript is non-empty.
- src/channels/chat-sdk-bridge.ts: gated transcription pass after
  fetchData(). Stores transcript on the attachment entry as well so any
  downstream consumer (e.g., formatter, agent-runner) can use it.
- src/transcription.test.ts: 8 tests covering isAudioAttachment, env-gate,
  trim, empty output, and execFile failure paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses multi-line execFileAsync calls to match the project's formatter.
No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mtichikawa
Author

Closing — wrong target. After re-reading CONTRIBUTING.md, feature-skill PRs should be a single PR against main containing both the SKILL.md trio and the source code, with maintainers extracting the code to a skill/<name> branch on merge. Also src/channels/chat-sdk-bridge.ts exists on main (not just channels), so the modification belongs there.

Folded both changes into #2459 against main — that's now a single self-contained feature-skill PR following the template you used for @ddaniels' merged Signal PR (#1953).

