A CPU-first, event-driven, local voice assistant built with a turn-aware multiprocessing architecture for low-latency, interruption-safe interaction.
Voice Assistant V2 is a local, edge-oriented voice assistant designed to run reliably on consumer laptops using CPU-only inference. Instead of relying on cloud APIs or sequential pipelines, it uses process isolation, streaming-first design, and turn-based cancellation to deliver responsive, interruption-safe voice interactions.
The system is intentionally optimized to run as a background service without monopolizing system resources, while still supporting more aggressive speculative execution when deployed on dedicated edge hardware.
Most DIY voice assistants prioritize feature demos over architectural rigor—often relying on blocking API calls, monolithic pipelines, or GPU-heavy inference. Voice Assistant V2 is engineered with a different goal: predictable, low-latency local interaction on CPUs, even under constrained system conditions.
The architecture uses quantized ONNX/GGUF models and a Hybrid Hub-and-Spoke multiprocessing design that cleanly separates:
- Control flow (conversation state, turn lifecycle, routing)
- Data flow (audio frames, tokens, PCM streams)
- Compute-heavy inference (STT, LLMs, TTS)
This allows Wake Word detection, Speech Recognition, Intent Parsing, and Speech Synthesis to execute concurrently, without being blocked by Python’s GIL.
- Local-first execution (no cloud inference)
- CPU-only inference
- Low time-to-first-audio
- Interruption-safe (barge-in) interaction
- Production-style control/data separation
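The hub-and-spoke layout can be sketched with Python's `multiprocessing` primitives. This is a minimal illustration, not the project's actual code: the worker and queue names are assumptions, and real inference is replaced by a placeholder.

```python
import multiprocessing as mp


def stt_worker(audio_q, event_q) -> None:
    """Spoke: consumes audio frames, emits transcription events to the hub."""
    while True:
        frame = audio_q.get()
        if frame is None:  # poison pill -> shut down
            break
        # (real STT inference would happen here)
        event_q.put({"type": "stt_partial", "text": f"heard {len(frame)} samples"})


def run_hub() -> list:
    """Hub: owns the queues and routes events; it never runs inference itself."""
    ctx = mp.get_context("spawn")  # spawn is safe alongside ONNX / PyTorch
    audio_q, event_q = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=stt_worker, args=(audio_q, event_q))
    worker.start()
    audio_q.put([0] * 512)  # one fake Int16 frame
    audio_q.put(None)
    events = [event_q.get(timeout=5)]
    worker.join()
    return events


if __name__ == "__main__":
    print(run_hub())
```

Because each spoke is a separate OS process, model inference in one worker cannot block the others on the GIL.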
The assistant consists of multiple isolated worker processes communicating via IPC queues. At the center is a turn-aware Orchestrator, responsible for conversation lifecycle management—not heavy computation.
- **Audio Input**: Raw `Int16`/`Float32` audio frames are multicast to the Wake Word and STT workers.
- **Pause-Aware VAD**: A two-threshold VAD distinguishes between natural speech pauses and utterance completion.
- **Speech-to-Text (STT)**: Transcriptions are emitted as partial and final events.
- **Intent Parsing**: A local LLM converts finalized text into structured JSON (`chat` vs `tool_use`).
- **Streaming Response Path**: Generated tokens bypass the Orchestrator and stream directly to the TTS worker.
- **Incremental Playback**: Audio is synthesized and played chunk-by-chunk for low perceived latency.
Instead of relying on a single silence timeout, the system distinguishes between:
- Micro-Pause (~0.3s) → Used for partial transcription and UI feedback.
- Macro-Pause (~2.0s) → Commits the utterance and triggers intent execution.
This makes the assistant feel responsive without cutting users off mid-thought.
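The two-threshold logic above can be sketched as a pure function over accumulated trailing silence. Only the 0.3 s / 2.0 s thresholds come from the design; the event names are illustrative.

```python
# Minimal sketch of pause-aware VAD thresholding (event names are assumptions).
MICRO_PAUSE_S = 0.3   # emit a partial transcription / UI feedback
MACRO_PAUSE_S = 2.0   # commit the utterance and trigger intent execution


def classify_silence(silence_s: float) -> str:
    """Map the duration of trailing silence to a pipeline event."""
    if silence_s >= MACRO_PAUSE_S:
        return "commit_utterance"
    if silence_s >= MICRO_PAUSE_S:
        return "emit_partial"
    return "keep_listening"
```

A 0.5 s breath mid-sentence therefore yields only a partial transcript, while 2 s of silence finalizes the turn.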
The Orchestrator is not a simple finite state machine.
It acts as a turn-based control plane, responsible for:
- Managing assistant state (`IDLE → LISTENING → THINKING → SPEAKING`)
- Assigning a `TurnContext` to each user interaction
- Propagating cancellation signals across workers
- Preventing stale or late IPC events from corrupting state
- Owning bounded conversation history and prompt construction
This design mirrors production voice systems, where control flow and data flow are intentionally decoupled.
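A compressed sketch of that control plane, assuming a monotonically increasing turn id is stamped onto every IPC event (the event shapes are illustrative, not the project's actual schema):

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


class Orchestrator:
    """Control-plane sketch: tracks the active turn and drops late IPC events."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.turn_id = 0

    def begin_turn(self) -> int:
        self.turn_id += 1
        self.state = State.LISTENING
        return self.turn_id

    def handle(self, event: dict) -> bool:
        # Stale-event suppression: anything stamped with an old turn id
        # belongs to a cancelled turn and must not mutate state.
        if event["turn_id"] != self.turn_id:
            return False
        if event["type"] == "stt_final":
            self.state = State.THINKING
        elif event["type"] == "tts_started":
            self.state = State.SPEAKING
        return True
```

The key property is that a worker finishing late (e.g. a TTS chunk from an interrupted turn) simply fails the turn-id check instead of corrupting the new turn.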
The system supports speculative intent decoding from partial STT events. However, speculative decoding is intentionally disabled by default, because:
- This assistant is designed to run as a background process on a laptop
- Continuous speculative decoding significantly increases CPU utilization
- Sustained high CPU usage can cause thermal throttling and degrade overall system usability
Instead, speculative decoding is treated as a deployment-time optimization:
- Enabled on dedicated edge devices
- Disabled for background laptop usage
This reflects a deliberate engineering tradeoff, not a missing feature.
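One way to express that tradeoff is as a deployment profile consulted when partial transcripts arrive. This is a hypothetical config sketch; the project's actual configuration surface may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentProfile:
    """Hypothetical config: speculative decoding is a deploy-time switch."""
    name: str
    speculative_intent: bool


LAPTOP_BACKGROUND = DeploymentProfile("laptop", speculative_intent=False)
EDGE_DEDICATED = DeploymentProfile("edge", speculative_intent=True)


def should_decode(profile: DeploymentProfile, stt_event: str) -> bool:
    # Final transcripts always trigger intent decoding; partial transcripts
    # do so only on dedicated hardware that can absorb the extra CPU load.
    return stt_event == "final" or profile.speculative_intent
```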
All inference components are selected and tuned for CPU efficiency:
- Quantized model weights (ONNX Int4) for Moonshine
- ONNX models for Piper and Silero VAD
- LLM token streaming via Ollama (GGUF format)
- Streaming-friendly generation
- No GPU assumptions
- Predictable memory usage
| Component | Technology | Notes |
|---|---|---|
| Orchestration | Python multiprocessing | Turn-aware, event-driven control plane |
| Wake Word | Picovoice Porcupine | High recall, ultra-low CPU usage |
| VAD | Silero VAD v4 (ONNX) | Custom debouncing + padding |
| STT | Moonshine (ONNX Int4) | Sliding buffer with context |
| Intent Engine | Ollama (Qwen3-0.6B) | Structured JSON extraction |
| Response LLM | Ollama (Qwen/Llama) | Token-level streaming |
| TTS | Piper (ONNX) | Faster-than-realtime synthesis |
The system control layer. Coordinates events across isolated processes without handling raw audio or model inference.
Key responsibilities:
- Turn-scoped cancellation for safe concurrency
- Barge-in handling and interruption recovery
- Ghost-event suppression
- Early state transitions to support streaming
- Bounded conversation memory
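Bounded conversation memory can be sketched with a fixed-size `deque`, which keeps prompt length (and therefore CPU cost per turn) predictable. The window size and prompt format here are assumptions for illustration.

```python
from collections import deque


class BoundedHistory:
    """Sketch of bounded conversation memory: only the last N exchanges
    survive, so prompts cannot grow without bound. N=4 is an assumption."""

    def __init__(self, max_turns: int = 4) -> None:
        self._turns = deque(maxlen=max_turns)  # old exchanges fall off the front

    def add(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def build_prompt(self, new_input: str) -> str:
        lines = [f"User: {u}\nAssistant: {a}" for u, a in self._turns]
        lines.append(f"User: {new_input}\nAssistant:")
        return "\n".join(lines)
```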
Combines Silero VAD and Moonshine STT using a sliding audio buffer and pause-aware transcription logic.
Normalizes user speech into structured JSON:
```json
{
  "action_type": "tool_use",
  "refined_query": "Play jazz music on YouTube",
  "tool_calls": [
    {
      "tool": "browser_search",
      "params": { "query": "jazz music youtube" }
    }
  ]
}
```

Optimized for latency:
- Text streams to TTS as soon as sentence boundaries are detected
- Audio streams to playback immediately as PCM chunks are produced
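The sentence-boundary handoff can be sketched as a small chunker over the token stream: text is flushed to TTS the moment a sentence ends, rather than when generation completes. The delimiter set and queue interface are simplified assumptions (the real system would use an `mp.Queue`).

```python
# Sketch: flush sentences to TTS as soon as a boundary is detected.
SENTENCE_ENDS = (".", "!", "?")


def stream_to_tts(tokens, tts_queue) -> None:
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_ENDS):
            tts_queue.append(buf.strip())  # real system: mp.Queue.put(...)
            buf = ""
    if buf.strip():  # flush any trailing fragment at end of generation
        tts_queue.append(buf.strip())
```

With this shape, time-to-first-audio depends on the first sentence only, not the full response length.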
Each user interaction is modeled as an independent turn.
Each turn:
- Has a unique ID
- Carries a shared cancellation token
- Can be aborted safely at any stage
If a wake word is detected while the assistant is speaking or thinking:
- The active turn is cancelled
- All downstream workers discard stale events
- A new turn begins immediately
This enables natural interruptions without overlapping audio or race-condition artifacts.
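The turn/barge-in mechanics above can be sketched with a shared cancellation token. Here `threading.Event` stands in for the cross-process `multiprocessing.Event` a real deployment would need; the names are illustrative.

```python
import threading
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnContext:
    """Each interaction gets a unique id plus a shared cancellation token."""
    turn_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    cancelled: threading.Event = field(default_factory=threading.Event)


def on_wake_word(active: Optional[TurnContext]) -> TurnContext:
    """Barge-in: cancel whatever turn is in flight, then start a fresh one."""
    if active is not None and not active.cancelled.is_set():
        active.cancelled.set()  # workers poll this flag and discard stale work
    return TurnContext()
```

Workers check `ctx.cancelled.is_set()` between chunks of work, so an interrupted turn stops producing audio within one chunk boundary.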
- `spawn` multiprocessing context (safe with ONNX / PyTorch)
- Ring buffers with audio pre-roll
- Dual audio formats (`Int16` for speed, `Float32` for accuracy)
- Direct queue handoff for streaming paths
- Turn-scoped cancellation to avoid wasted compute
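The audio pre-roll idea can be sketched with a fixed-size ring buffer: the last few frames heard *before* VAD triggers are prepended to the utterance, so the first syllable is not clipped. Buffer sizes here are assumptions.

```python
from collections import deque


class PreRollBuffer:
    """Ring-buffer sketch: retains recent frames so speech preceding the VAD
    trigger is recoverable. preroll_frames=10 is an illustrative size."""

    def __init__(self, preroll_frames: int = 10) -> None:
        self._ring = deque(maxlen=preroll_frames)  # overwrites oldest frames
        self._utterance = []
        self._active = False

    def push(self, frame) -> None:
        if self._active:
            self._utterance.append(frame)
        else:
            self._ring.append(frame)

    def start_utterance(self) -> None:
        """Called when VAD fires: seed the utterance with the pre-roll."""
        self._active = True
        self._utterance = list(self._ring)

    def frames(self) -> list:
        return self._utterance
```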
The project includes a deterministic evaluation harness (`eval/run_eval.py`) that simulates a real user interaction with a pre-recorded audio file.
Latest stable run was executed under controlled conditions (no background load, near-max CPU allocation to eval):
```shell
uv run python -m eval.run_eval --wav .\eval\scenarios\audio16.wav --startup-wait-s 3 --idle-window-s 1 --completion-timeout-s 30 --log-level WARNING --disable-intent
```

| Metric | Result | Why it matters |
|---|---|---|
| End-to-End Latency (`e2e_latency_ms`) | 4125.00 ms | Full turn duration from input start to playback completion |
| Time to First Audio (`time_to_first_audio_ms`) | 3844.00 ms | User-perceived response start |
| Time to First Token (`ttft_ms`) | N/A | Intent was intentionally bypassed in eval-only mode (`--disable-intent`) |
| System RTF (`rtf_system`) | 0.38 | Post-STT-final pipeline efficiency (Intent/LLM/TTS/Playback processing window) |
| End-to-End RTF (`rtf_e2e`) | 2.39 | Full UX factor including speech + silence + processing |
| CPU Idle (`cpu_idle_pct`) | 7.86% | Baseline resource footprint |
| CPU Active (`cpu_active_pct`) | 419.03% | Aggregate multi-core utilization during the active turn |
| Peak Memory (`memory_peak_mb`) | 697.27 MB | Peak RSS measured during the active window, excluding Ollama model loading |
| Stage Delta | Result |
|---|---|
| STT final -> first LLM token | 234.00 ms |
| First LLM token -> first TTS audio | 141.00 ms |
| First TTS audio -> playback first audio | 0.00 ms |
These timings confirm efficient downstream streaming once STT finalization is reached.
The assistant runs fully locally after initialization.
- Wake Word detection uses Picovoice Porcupine
- A one-time internet connection may be required for access-key validation
- After validation, all audio processing and inference runs offline
Porcupine was chosen intentionally for its unmatched CPU efficiency, robustness, and ease of custom wake-word creation.
This project integrates and builds upon several high-quality open-source models and tools. The original authors deserve explicit credit for their work.
- **Moonshine (STT)**: Offline speech recognition models used via ONNX for CPU-efficient transcription. Repository: https://github.com/moonshine-ai/moonshine
- **Silero VAD**: Voice activity detection models used with custom debouncing and pause-aware logic. Repository: https://github.com/snakers4/silero-vad
- **Piper TTS**: Offline text-to-speech synthesis with streaming PCM output. Licensed under GPL-3.0 and used as a standalone component. Repository: https://github.com/OHF-Voice/piper1-gpl
- **Picovoice Porcupine**: Wake word detection engine selected for efficiency and robustness on CPUs. Repository: https://github.com/Picovoice/porcupine
- **Ollama**: Local LLM runtime used for CPU-based inference and token-level streaming. Repository: https://github.com/ollama/ollama
All system-level architecture, orchestration logic, multiprocessing design, streaming paths, and turn-based control were designed and implemented in this project.
If you use this project in academic or research work, please consider citing the original papers below.
Moonshine

```bibtex
@misc{jeffries2024moonshinespeechrecognitionlive,
      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
      author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
      year={2024},
      eprint={2410.15608},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.15608}
}
```

Silero VAD

```bibtex
@misc{SileroVAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
```

- Python 3.12+
- Ollama running
- PortAudio (`portaudio`)
```shell
git clone https://github.com/SrabanMondal/voice-assistant-v2.git
cd voice-assistant-v2
uv sync
ollama pull qwen3:0.6b
```

Download Moonshine and Piper ONNX models into the `models/` directory.

```shell
python main.py
```

This repository does not redistribute model weights.
The assistant relies on pre-trained models provided by their respective authors (Moonshine, Silero, Piper, Ollama). Due to licensing, size, and distribution considerations, users are expected to obtain the weights directly from the official repositories.
The inference pipeline, configuration, and integration logic in this project are designed to work with specific ONNX/GGUF model variants.
Note: If you are experimenting with this project and need guidance on compatible model variants or configurations, feel free to reach out.
- RAG-based long-term memory
- Vision-to-Intent (LLaVA)
- MQTT / Home Assistant integration
- Desktop & OS-level automation tools
This project is released under the MIT License.
It integrates third-party components with different licenses:
- Moonshine (MIT)
- Silero VAD (MIT)
- Piper TTS (GPL-3.0, used as a standalone component)
Users are responsible for complying with the licenses of any third-party tools they install.
Built with care, queues, and a deliberate focus on CPU-first, low-latency systems design.