Scribe ships five speech-to-text backends. They are all picked through
the same --backend / --model CLI flags (or the Model submenu in
the tray / terminal frontend). Whether a transcription is streaming
(text appears live as you speak) or batch (text arrives at end of
recording) depends on the model chosen — not the backend.
| Backend | --backend |
Default model | Streaming model(s) | Requires |
|---|---|---|---|---|
| Groq (cloud) | groq |
whisper-large-v3-turbo |
— | GROQ_API_KEY |
| OpenAI (cloud) | openai |
gpt-4o-mini-transcribe |
gpt-realtime-whisper |
OPENAI_API_KEY |
| Whisper FUTO (local) | whisper-futo |
small |
— | pip install scribe-cli[whisper-futo] |
| Whisper (local) | whisper |
small |
— | pip install scribe-cli[whisper] |
| Vosk (local) | vosk |
language-dependent | all Vosk models | pip install scribe-cli[vosk] |
Run scribe without arguments and it picks the first backend whose
dependency / API key is present, preferring cloud over local and the
faster local option first:
groq → openai → whisper-futo → whisper → vosk.
Runs locally via whisper.cpp
(through pywhispercpp)
using FUTO's ACFT-finetuned models.
ACFT (Audio Context Fine-Tuning) lets the encoder run on the actual
audio length instead of always padding to 30 s — a meaningful speedup
on short dictations, which is the typical scribe workload.
The available models offered in the tray menu are
tiny / base / small. FUTO has not released ACFT weights for
medium / large / turbo; for those sizes use the whisper backend.
With --language en (or -l en) scribe auto-substitutes the
English-only variant (e.g. small → small.en) when it exists.
Models are auto-downloaded on first use from voiceinput.futo.org
to $XDG_CACHE_HOME/whisper-futo/ (override with
--download-folder-whisper-futo).
For audio ≥ 30 s the ACFT speedup tapers off and the encoder window
collapses to the standard 30 s; quality and speed in that regime are
similar to the whisper backend. Pick whisper-futo if most of your
dictations are short, the whisper backend if you regularly do
multi-minute recordings or need medium / large / turbo.
Runs locally via
faster-whisper and
defaults to the small model. Excellent at full-utterance
transcription in
many languages,
but it does not stream — text appears at end-of-recording — and
execution time depends on model size and hardware.
The available models offered in the tray menu are
tiny / base / small / medium / large-v3 / large-v3-turbo. Larger
models trade speed for accuracy.
With --language en (or -l en) scribe auto-substitutes the
English-only variant (e.g. small → small.en) when it exists.
Vosk transcribes in real time and is very good at one language at a time, but tends to make more mistakes than Whisper and does not produce punctuation. It becomes really useful in longer, interactive sessions where the live "appears as you speak" UX matters — see output.md for how the keystroke mode interacts with streaming models.
There are many Vosk models
available; a handful are pre-mapped to common languages (en, fr,
de, it) in
scribe/models.toml. Pick one with
-l <lang> or browse the full list interactively from the menu.
The OpenAI backend supports three models:
gpt-4o-mini-transcribe(default) — fast, low-cost batch transcription.gpt-4o-transcribe— higher-quality batch transcription.gpt-realtime-whisper(streaming) — partial transcripts arrive as you speak. Same UX as Vosk but using OpenAI's cloud model.
All three share the same OPENAI_API_KEY and the [openai] extra; no
extra dependencies. Set the key once:
export OPENAI_API_KEY=YOURAPIKEY
scribe --backend openai # default: gpt-4o-mini-transcribe
scribe --model gpt-4o-transcribe # batch, higher quality
scribe --model gpt-realtime-whisper # streaming--model alone auto-routes to the openai backend for any of the
three models above, so --backend openai is optional.
The streaming model has a latency-vs-accuracy knob exposed as
--realtime-delay {minimal,low,medium,high,xhigh} (default medium).
Lower values emit partial transcripts sooner — at the cost of more
revisions arriving in the focused window. Higher values batch tokens
into longer chunks so what gets pasted is more stable.
See OpenAI's gpt-realtime-whisper model card for the full picture.
Talks to Groq's OpenAI-compatible API and defaults to
whisper-large-v3-turbo. Extremely fast thanks to Groq's
inference hardware — the recommended cloud backend by default, and
the natural pick for --stream mode where per-chunk roundtrip
latency dominates perceived speed:
export GROQ_API_KEY=YOURAPIKEY
scribe --backend groq # Clip mode (default)
scribe --backend groq --stream # live transcription, per-chunkThe groq backend reuses the openai Python client under the hood, so
installing [openai] is enough for both.
For batch models (Whisper local, Whisper-via-API, Groq, gpt-4o-*) the
recording continues for up to 2 minutes until you stop it manually
(Stop in the tray, Ctrl+C in the terminal) — the transcription happens
once when you stop.
Streaming models (Vosk, gpt-realtime-whisper) emit partials as you
speak and stop on the same Stop / Ctrl+C action.
--prompt TEXT and --words W [W ...] (plus the --prompt-file /
--words-file companions) bias the recogniser toward a particular
style, domain, or word list. The concept is generic across the
whisper-family backends but each backend exposes it slightly
differently:
| Backend | --prompt |
--words |
--language |
|---|---|---|---|
whisper (faster-whisper, local) |
passed as initial_prompt= |
passed as hotwords= — a dedicated biasing channel separate from the prompt |
passed as language= (ISO 639-1); -l en also auto-substitutes small.en etc. |
whisper-futo (pywhispercpp, local) |
passed as initial_prompt= |
joined onto the prompt string (no separate hotwords channel here) | passed as language= (ISO 639-1); -l en auto-substitutes small.en etc. |
openai batch (gpt-4o*-transcribe) |
passed as prompt= |
joined onto the prompt string | passed as language= hint (ISO 639-1) |
groq (whisper-large-v3-turbo) |
passed as prompt= |
joined onto the prompt string | passed as language= hint (ISO 639-1) |
openai realtime (gpt-realtime-whisper) |
silently ignored — the model rejects the prompt parameter server-side (HTTP 400 "The 'prompt' parameter is not supported for this model."). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt | passed as language= (ISO 639-1) |
vosk |
ignored (no soft prompt) | ignored (Vosk only supports a hard grammar allowlist; not yet exposed) |
picks a per-language model from scribe/models.toml; no runtime parameter |
The whisper-family APIs cap the prompt around ~224 tokens; longer
hints are silently truncated. Faster-whisper's hotwords channel is
the one place a separate "dictionary" really exists — everywhere else
--words is just a convenience to keep your word list out of the
prompt string in the CLI.
Whisper mirrors the style of whatever prompt it receives. A
prompt like "Tierney Comet" (a bare wordlist) biases the model
toward unpunctuated, list-style output — sentences come out without
periods. A prompt like "Tierney, Comet." (or any prose ending in a
period) biases it toward punctuated output. Two practical
consequences:
--promptis yours to control. If yourprompt.txtends with a period and looks like a sentence, your transcripts will be punctuated. If it ends with a bare keyword, they probably won't. This effect is most visible in Stream mode, where Whisper sees short audio chunks and leans more heavily on the prompt for style cues.--wordsis auto-formatted by scribe. For backends that fold words into the prompt (whisper-futo,openai,groq), scribe renders the word list as"word1, word2, …, wordN."— comma- separated with a single terminal period — so yourwords.txtcan stay a bare list with no special formatting and the bias still comes out punctuated. Stray punctuation on individual entries is stripped first, sowords.txtcontent is normalised regardless of layout. Onwhisper(faster-whisper, local), words go to the dedicatedhotwordschannel and bypass the prompt entirely.
Both flags read from the corresponding *-file argument when present.
Inline + file inputs are combined.
# Inline
scribe --prompt "ML systems infra: K8s, etcd, Envoy." \
--words kubectl envoyproxy etcdctl
# From files (handy for long-lived glossaries)
scribe --prompt-file ~/.config/scribe/prompt.txt \
--words-file ~/.config/scribe/words.txtWhen no prompt/words flag is given, scribe also auto-loads
prompt.txt and words.txt from the platform user-config dir
(~/.config/scribe/ on Linux, ~/Library/Application Support/scribe/
on macOS, %LOCALAPPDATA%\scribe\ on Windows — resolved via
platformdirs) if they exist. To suppress the default for one
invocation, pass an explicit empty value: --prompt "" (or
--prompt-file "") suppresses the prompt default; --words with no
arguments (or --words-file "") suppresses the words default. Each
side is independent.
-l / --language LANG tells the backend which language to expect.
What that means in practice varies by backend (see the per-backend
column in the table above):
- Whisper-family (
whisper,whisper-futo,openaibatch + realtime,groq) — the language is passed to the model as a hard lock: the decoder generates that language regardless of what it hears acoustically. Accepts any ISO 639-1 short code Whisper recognises (~99 languages). When unset, Whisper auto-detects per chunk. - English-only model variants — for
whisperandwhisper-futo,-l enalso auto-substitutes the English-only model when one exists (small→small.en, etc.). These variants trade multilingual coverage for English accuracy. - Vosk — language isn't a runtime parameter; vosk ships a
separate model per language.
-l frlooks up the vosk model pre-mapped to French inscribe/models.tomland instantiates that one. Vosk has no auto-detect path, so the Language menu'sAutoentry on vosk falls back to a sensible default — the tray showsAuto (🇬🇧 en)to make this explicit without mutating the storedlanguage=None.
The tray's Language submenu exposes the four curated languages
(en / fr / de / it) with origin-country flag prefixes
(🇬🇧 / 🇫🇷 / 🇩🇪 / 🇮🇹). The CLI accepts these plus any other ISO 639-1
code the active backend recognises.
--stream (or Mode: Stream in the tray) emits transcribed text
live as you speak, regardless of which backend you picked. This
is the headline v1.0.0 improvement: scribe abstracts over the two
different mechanisms that backends use to deliver live output, so
--stream works uniformly across every supported backend.
- Native streaming backends (Vosk,
gpt-realtime-whisper) push partial results from the server as audio is received — scribe just forwards them to the chosen output (focused window / clipboard / terminal / file). These backends are always in Stream mode; the Mode toggle reads "Mode: Stream (native)" for them and is read-only. - Batch backends (Whisper local, Whisper FUTO, OpenAI
gpt-4o-*-transcribe, Groqwhisper-large-v3-turbo) don't accept partial audio. scribe instead cuts the recording buffer on detected silence and issues a separate transcription request for each chunk — internally called pseudo-streaming. The user sees the same live experience.
scribe --stream # any backend, live transcription
scribe --stream --backend groq # Groq + Stream is the sweet spot
scribe --stream --backend whisper # local, live, no API keyOnce the buffer has grown to at least --stream-chunk-min (default
1.5 s), silence of at least --stream-chunk-silence-break (default
0.6 s) triggers a chunk cut. A force-cut fires at --stream-chunk-max
(default 10 s) regardless of silence, to cap latency. The session
continues until you stop it manually.
The first chunk of a streaming thread uses a different floor:
--stream-first-chunk-min (default 3 s). The bootstrap chunk has no
prior text to bias Whisper's punctuation/casing, so a longer audio
window lets the model produce a properly-punctuated transcript whose
tail then seeds the rolling prompt for every chunk after it.
Subsequent chunks fall back to --stream-chunk-min. The override
also re-engages right after a context-reset silence (i.e. when a long
pause cleared the rolling tail — see Cross-chunk prompt context
below). Set --stream-first-chunk-min equal to --stream-chunk-min
to disable the override. It's automatically inactive when
--stream-context-length 0 (Patient profile), where there is no
rolling context to bootstrap. Internally clamped to ≤ --stream-chunk-max so a misconfigured pair can't deadlock the
chunker.
For cloud backends, going from one big transcription to N chunked requests does not normally change the bill:
- Groq (
whisper-large-v3-turbo) is billed per second of audio. Total audio is unchanged → same cost. - OpenAI
whisper-1(legacy) is billed per minute of audio. Same logic, same cost. - OpenAI
gpt-4o-transcribe/gpt-4o-mini-transcribeare token- billed (audio-in + text-out + prompt-in). Audio and output stay identical; the only delta is the rolling cross-chunk prompt context (~200 chars ≈ 50–60 tokens per chunk after the first). At gpt-4o-mini-transcribe input rates this is negligible — well under a cent per long session.
That said, your real cost depends on your usage and your account's pricing tier — verify on your provider's billing dashboard if cost is a hard constraint.
Two special values for --stream-chunk-silence-break (set via the
tray's Silence break picker or --stream-chunk-silence-break 0
at the CLI):
- Auto (
0) — disables the fixed-threshold trigger. At force-cut time scribe picks the longest silence interval within the window whose start position is at least--stream-chunk-mininto the chunk, re-cutting there for a more natural word boundary. Falls back to a brute force-cut if no qualifying silence is found. - Max — disables silence-based cuts entirely; only the force-cut at
--stream-chunk-maxfires. Useful when you want uniform chunk sizes regardless of speech patterns. (Only selectable from the tray picker.)
Stream mode is off by default — the default Clip mode transcribes the
whole recording at end (--clip). The tray menu surfaces the same
toggle as the top-level Mode: Stream / Clip item. Native
streamers (vosk, gpt-realtime-whisper) are always streaming and the
menu shows Mode: Stream (native) for them.
In Stream mode (pseudo-streaming) scribe automatically augments
each chunk's prompt with the trailing ~200 characters of the
previous chunk's transcription. This rolling tail is concatenated
onto whatever static --prompt / --words you configured and
reaches the backend through the same channel as the static prompt
(the vocabulary biasing table above). The motivation is cross-chunk
continuity:
- Capitalization drift — without context, a chunk that starts right after a period might come back lowercased.
- Article gender (FR/IT/ES/…) —
"la nouveau"→"le nouveau"once the prior chunk has established the noun. - Language lock —
whisper.cppauto-detects language per call; feeding the previous chunk's tokens keeps the language stable across cuts.
Whisper's prompt window is capped at ~224 tokens; 200 chars of French sits well under that and leaves room for your static prompt + words list.
The rolling tail is dropped when the silence between two
utterances exceeds --stream-context-reset-silence ×
--stream-chunk-silence-break (default 3 × 0.6 s = 1.8 s) — a long
pause is treated as a new sentence/idea boundary, where carrying a
possibly-bad prior chunk forward biases the next one more than it
helps. Use --stream-context-reset-silence inf to keep context across
arbitrarily long pauses.
Short pauses (mid-sentence punctuation) keep the context; the cut at the start of every new recording also clears it.
The defaults stream phrases in as you talk; the Patient profile waits for natural pauses and transcribes one utterance at a time. They make opposite trade-offs around the same fundamental tension: short audio windows give Whisper less to work with, so cross-chunk context matters more in Balanced, less in Patient.
scribe --streamPhrases commit every ~10 s or on a 0.6 s pause, with a 200-char rolling prompt carrying earlier text forward as context for each new chunk. Whisper sees short audio windows in isolation; the rolling context partially compensates by telling the model what was just said. Good live-feel, small per-chunk accuracy hit vs. Patient.
scribe --stream \
--stream-chunk-min 0.5 \
--stream-chunk-max 300 \
--stream-chunk-silence-break 2 \
--stream-context-length 0Each utterance is a complete self-contained sentence. scribe waits
for a 2 s pause, transcribes the whole thing at once, then waits for
the next one. No rolling context (context-length 0) because each
chunk is already a full utterance — there's nothing short to
compensate for. Highest per-chunk accuracy; no text appears until
you finish talking.