Real-Time Voice AI Gateway
Quick Start • Features • Providers • SDKs • API • Config
WaaV Gateway is a high-performance, real-time voice processing server built in Rust. It provides a unified interface for Speech-to-Text (STT) and Text-to-Speech (TTS) services across multiple cloud providers, with advanced audio processing capabilities including noise suppression and intelligent turn detection. WaaV features a powerful DAG-based pipeline engine for building custom voice processing workflows with conditional routing and multi-provider orchestration.
WaaV eliminates the complexity of integrating with multiple voice AI providers by providing a single WebSocket and REST API that abstracts away provider-specific implementations. Switch between Deepgram, ElevenLabs, Google Cloud, Azure, Cartesia, OpenAI, Amazon Transcribe, Amazon Polly, IBM Watson, Groq, or LMNT with a simple configuration change—no code modifications required.
Key Highlights:
- 70+ Cloud Providers - Global STT/TTS coverage including Deepgram, ElevenLabs, Google Cloud, Azure, OpenAI, plus regional providers for India (Sarvam, Gnani, Bhashini), China (Alibaba, Baidu, Tencent, iFlytek), Southeast Asia (Zalo, FPT, NECTEC), and more
- DAG Pipeline Engine - Build custom voice workflows with conditional routing, multi-provider orchestration, and data transformations
- OpenAI & Hume AI Realtime - Full-duplex audio-to-audio streaming with GPT-4o and Hume EVI
- WebSocket Streaming - Real-time bidirectional audio with sub-second latency
- LiveKit Integration - WebRTC rooms and SIP telephony support
- Advanced Audio Processing - DeepFilterNet noise suppression, ONNX-based turn detection
- Production-Ready - HTTP/2 connection pooling, intelligent caching, rate limiting, JWT auth
- Quick Start
- Audio Processing Pipeline
- Features
- DAG Pipeline Engine
- Providers
- Architecture
- Installation
- Client SDKs
- API Reference
- Configuration
- Performance
- Contributing
- License
Get your first transcription running in under 5 minutes:
```bash
# Clone the repository
git clone https://github.com/bud-foundry/waav.git
cd waav/gateway

# Configure (add your API key)
cp config.example.yaml config.yaml
# Edit config.yaml and set your deepgram_api_key

# Build and run
cargo run --release

# Test health check
curl http://localhost:3001/
# Returns: {"status":"ok"}

# Test TTS
curl -X POST http://localhost:3001/speak \
-H "Content-Type: application/json" \
-d '{"text": "Hello from WaaV!", "tts_config": {"provider": "deepgram"}}' \
--output hello.pcm

# Play the audio (requires sox)
play -r 24000 -e signed -b 16 -c 1 hello.pcm
```

WaaV provides a complete audio processing pipeline with optional pre-processing and post-processing stages:
```mermaid
flowchart LR
subgraph Input
A[Audio Input<br/>16kHz 16-bit PCM]
end
subgraph PreProcess["Pre-Processing"]
B[DeepFilterNet<br/>Noise Filter]
B1[SNR-adaptive]
B2[Echo suppress]
B3[40dB max]
B4[Thread pool]
end
subgraph Transcription["STT Provider"]
C[Multi-Provider]
C1[Deepgram]
C2[Google gRPC]
C3[Azure]
C4[ElevenLabs]
C5[Cartesia]
C6[OpenAI]
C7[AWS Transcribe]
C8[IBM Watson]
C9[Groq]
end
subgraph PostProcess["Post-Processing"]
D[Turn Detection<br/>ONNX Model]
D1[Probability]
D2[Threshold 0.7]
D3["<50ms"]
end
subgraph Output
E[Text Output]
end
A --> B
B --> C
C --> D
D --> E
style A fill:#e1f5fe
style E fill:#e8f5e9
style B fill:#fff3e0
style C fill:#fce4ec
style D fill:#f3e5f5
```
Feature flag: `--features noise-filter`
Advanced noise reduction powered by DeepFilterNet:
| Feature | Description |
|---|---|
| Adaptive Processing | SNR-based analysis—high SNR audio receives minimal filtering to preserve quality |
| Energy Analysis | Automatic silence detection, skips processing after 5 consecutive silent frames |
| Echo Suppression | Post-filter with 0.02 beta for mobile and conference call optimization |
| Attenuation Limiting | 40dB maximum reduction prevents over-processing artifacts |
| Thread Pool | One worker thread per CPU core for parallel processing |
| Short Audio Handling | Light 80Hz high-pass filter for clips under 1 second |
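For intuition, the energy gate in the table can be sketched in a few lines of Python (illustrative only; the -50 dBFS silence threshold is an assumption, and the actual DeepFilterNet integration lives in the Rust gateway):

```python
import numpy as np

# Sketch of the energy-analysis gate described above. Assumes 16-bit PCM
# frames as numpy arrays; the -50 dBFS threshold is an assumption.
def frame_energy_dbfs(frame: np.ndarray) -> float:
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2)) + 1e-12
    return 20 * np.log10(rms / 32768.0)

def should_skip_denoising(frames, silence_dbfs=-50.0, max_silent_frames=5) -> bool:
    """Skip the noise filter once 5 consecutive frames are silent."""
    silent_run = 0
    for frame in frames:
        if frame_energy_dbfs(frame) < silence_dbfs:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return True
        else:
            silent_run = 0
    return False
```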
Feature flag: `--features turn-detect`
Intelligent end-of-turn detection using ONNX Runtime with LiveKit's turn-detector model:
| Feature | Description |
|---|---|
| Model | SmolLM-based from HuggingFace (livekit/turn-detector) |
| Threshold | Configurable (default 0.7), per-language thresholds supported |
| Tokenization | HuggingFace tokenizers with chat template formatting |
| Performance | < 50ms prediction target with warnings for slower inference |
| Quantization | INT8 quantized ONNX model for faster inference |
| Graph Optimization | Level 3 ONNX optimization for maximum performance |
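The decision layer on top of the model is a simple threshold check. A minimal sketch (illustrative; the gateway obtains `prob` from LiveKit's quantized ONNX turn-detector):

```python
# Threshold decision on the model's end-of-turn probability (sketch).
DEFAULT_THRESHOLD = 0.7

def is_end_of_turn(prob: float, language: str = "en",
                   per_language: dict | None = None) -> bool:
    # Per-language thresholds with the 0.7 default described above
    threshold = (per_language or {}).get(language, DEFAULT_THRESHOLD)
    return prob >= threshold

print(is_end_of_turn(0.72, "en"))               # True: above the 0.7 default
print(is_end_of_turn(0.72, "de", {"de": 0.8}))  # False: below the stricter 0.8
```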
- WebSocket Streaming (`/ws`) - Real-time bidirectional audio/text with provider switching
- REST API - TTS synthesis, voice listing, health checks
- LiveKit Integration - WebRTC rooms, SIP webhooks, participant management
- Multi-Provider Support - Unified interface across 70+ global providers
- Audio Caching - Intelligent TTS response caching with XXH3 hashing
- Rate Limiting - Token bucket per-IP rate limiting with configurable limits (see the sketch after this list)
- JWT Authentication - Optional API authentication with external validation
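To make the rate-limiting item concrete, here is a minimal token-bucket sketch in Python using the default limits from the configuration section (60 requests/second, burst 10). It is illustrative only; the gateway itself uses tower-governor in Rust:

```python
import time

# Minimal per-IP token bucket (illustrative sketch, not the gateway's code).
class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens refilled per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict = {}  # one bucket per client IP

def check(ip: str) -> bool:
    return buckets.setdefault(ip, TokenBucket(rate=60, burst=10)).allow()
```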
| Feature | Technology | Benefit |
|---|---|---|
| HTTP/2 Connection Pooling | ReqManager | Reduced latency, connection reuse |
| Audio Caching | moka + XXH3 | Sub-millisecond cache lookups |
| Zero-Copy Pipeline | Bytes crate | 4.1x memory improvement |
| Rate Limiting | tower-governor | Token bucket per-IP protection |
| TLS | rustls | No OpenSSL dependency, cross-compilation support |
| Flag | Description | Use Case |
|---|---|---|
| `dag-routing` | DAG-based pipeline engine | Custom voice workflows, multi-provider orchestration |
| `turn-detect` | ONNX-based turn detection | Conversational AI, voice agents |
| `noise-filter` | DeepFilterNet noise suppression | Noisy environments, mobile apps |
| `openapi` | OpenAPI 3.1 spec generation | API documentation |

```bash
# Enable all optional features
cargo build --release --features dag-routing,turn-detect,noise-filter,openapi
```

View All 70+ Supported Providers - Complete documentation for STT, TTS, and Realtime providers across all regions.
WaaV Gateway supports 27 STT providers, 32 TTS providers, and 2 Realtime providers with global coverage including specialized regional providers.
| Category | Providers |
|---|---|
| Global Leaders | Deepgram, Google Cloud, Azure, OpenAI, ElevenLabs, AssemblyAI, Cartesia, AWS Transcribe, IBM Watson, Groq |
| European | Speechmatics, Gladia, Rev AI, Phonexia, Acapela, Cereproc |
| Russia/CIS | Yandex SpeechKit, Tinkoff VoiceKit, SberDevices |
| India | Sarvam AI, Gnani.ai, Reverie, Bhashini |
| China | iFlytek, Alibaba Cloud, Baidu AI, Tencent Cloud, Huawei Cloud |
| East Asia | NAVER CLOVA (Korea), AmiVoice (Japan) |
| Southeast Asia | Zalo AI, FPT.AI, Viettel AI (Vietnam), Prosa.ai (Indonesia), NECTEC (Thailand) |
| Category | Providers |
|---|---|
| Global Leaders | Deepgram, Google Cloud, Azure, OpenAI, ElevenLabs, Cartesia, AWS Polly, IBM Watson |
| Voice Cloning | Hume AI, LMNT, Play.ht, Murf.ai, WellSaid Labs, Resemble AI, Speechify, Unreal Speech, Smallest.ai |
| Regional | Yandex, Tinkoff, SberDevices, Sarvam AI, Gnani.ai, Reverie, Bhashini, iFlytek, Alibaba, Baidu, Tencent, Huawei, NAVER CLOVA, Zalo, FPT, Viettel, Prosa, NECTEC |
| Provider | Protocol | Features |
|---|---|---|
| OpenAI Realtime | WebSocket | GPT-4o full-duplex streaming, function calling, VAD |
| Hume AI EVI | WebSocket | Empathic voice interface, 48 emotion dimensions, prosody analysis |
Feature flag: `--features dag-routing`
WaaV's DAG (Directed Acyclic Graph) routing system enables flexible, customizable voice processing pipelines with conditional routing, multi-provider orchestration, and parallel processing.
| Feature | Description |
|---|---|
| Custom Pipelines | Chain STT, TTS, LLM, and custom processors in any configuration |
| External Routing | Route to HTTP, gRPC, WebSocket, IPC, and LiveKit endpoints |
| Conditional Logic | Use Rhai expressions or switch patterns for dynamic routing |
| Parallel Processing | Split/Join patterns for concurrent branch execution |
| A/B Testing | Route based on API key identity or custom conditions |
| Low Latency | Pre-compiled graphs with lock-free data passing |
- Input Nodes - `audio_input`, `text_input`
- Provider Nodes - `stt_provider`, `tts_provider`, `llm_provider`
- Processor Nodes - `transform`, `filter`, `aggregate`
- Router Nodes - `switch`, `conditional`, `split`, `join`
- Output Nodes - `text_output`, `audio_output`, `webhook`
- Endpoint Nodes - `http_endpoint`, `grpc_endpoint`, `websocket_endpoint`
```json
{
"dag": {
"id": "voice-bot",
"nodes": [
{ "id": "input", "type": "audio_input" },
{ "id": "stt", "type": "stt_provider", "provider": "deepgram" },
{ "id": "llm", "type": "llm_provider", "provider": "openai" },
{ "id": "tts", "type": "tts_provider", "provider": "elevenlabs" },
{ "id": "output", "type": "audio_output" }
],
"edges": [
{ "from": "input", "to": "stt" },
{ "from": "stt", "to": "llm" },
{ "from": "llm", "to": "tts" },
{ "from": "tts", "to": "output" }
],
"entry_node": "input",
"exit_nodes": ["output"]
}
}
```

See `docs/dag_routing.md` for complete documentation.
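As a mental model, the engine runs nodes in dependency order. A Python sketch of that ordering over the example pipeline (purely illustrative; the real engine is pre-compiled Rust with lock-free data passing):

```python
# Topologically order the example pipeline's nodes (sketch).
from graphlib import TopologicalSorter

edges = [("input", "stt"), ("stt", "llm"), ("llm", "tts"), ("tts", "output")]

# graphlib expects a mapping of node -> set of predecessors
predecessors: dict = {}
for src, dst in edges:
    predecessors.setdefault(src, set())
    predecessors.setdefault(dst, set()).add(src)

for node in TopologicalSorter(predecessors).static_order():
    print("run:", node)  # input, stt, llm, tts, output
```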
```mermaid
graph TB
subgraph Clients["Client Applications"]
TS[TypeScript SDK]
PY[Python SDK]
DASH[Dashboard]
WIDG[Widget]
MOB[Mobile Apps]
end
subgraph Gateway["WaaV Gateway (Rust)"]
subgraph Handlers["Request Handlers"]
WS[WebSocket Handler]
REST[REST API<br/>Axum]
LK[LiveKit Integration]
RL[Rate Limiter<br/>tower-governor]
end
subgraph VM["VoiceManager (Central Coordinator)"]
STT[STT Manager<br/>BaseSTT]
TTS[TTS Manager<br/>TTSProvider]
CACHE[Audio Cache<br/>moka+XXH3]
TURN[Turn Detector<br/>ONNX Runtime]
NF[Noise Filter<br/>DeepFilterNet]
end
subgraph Providers["Provider Layer"]
DG[Deepgram<br/>WS + HTTP]
EL[ElevenLabs<br/>WebSocket]
GC[Google<br/>gRPC]
AZ[Azure<br/>WebSocket]
CA[Cartesia<br/>WebSocket]
OAI[OpenAI<br/>STT/TTS/Realtime]
AWS[AWS<br/>Transcribe/Polly]
IBM[IBM Watson<br/>STT/TTS]
GRQ[Groq<br/>REST]
HUM[Hume AI<br/>TTS/EVI]
LMNT[LMNT<br/>HTTP]
end
end
Clients -->|WebSocket / REST / WebRTC| Handlers
WS --> VM
REST --> VM
LK --> VM
STT --> NF
TTS --> NF
STT --> Providers
TTS --> Providers
style Clients fill:#e3f2fd
style Gateway fill:#f5f5f5
style VM fill:#fff3e0
style Providers fill:#e8f5e9
```
```bash
# Clone and build
git clone https://github.com/bud-foundry/waav.git
cd waav/gateway
cargo build --release

# Run with config
./target/release/waav-gateway -c config.yaml
```

```bash
# Build image
docker build -t waav-gateway .

# Run container
docker run -p 3001:3001 \
-v $(pwd)/config.yaml:/config.yaml \
-e DEEPGRAM_API_KEY=your-key \
waav-gateway
```

```bash
# Enable noise filtering and turn detection
cargo build --release --features turn-detect,noise-filter

# Enable OpenAPI documentation generation
cargo build --release --features openapi
cargo run --features openapi -- openapi -o docs/openapi.yaml
```

If using the turn-detect feature, download the required model and tokenizer:

```bash
cargo run --features turn-detect -- init
```

```bash
npm install @bud-foundry/sdk
```

```typescript
import { BudClient } from '@bud-foundry/sdk';

const bud = new BudClient({
baseUrl: 'http://localhost:3001',
apiKey: 'your-api-key' // Optional if auth not required
});
// Speech-to-Text
const stt = await bud.stt.connect({ provider: 'deepgram' });
stt.on('transcript', (result) => {
console.log(result.is_final ? `Final: ${result.text}` : `Interim: ${result.text}`);
});
await stt.startListening();
// Text-to-Speech
const tts = await bud.tts.connect({ provider: 'elevenlabs' });
await tts.speak('Hello from WaaV!');
// Bidirectional Voice
const talk = await bud.talk.connect({
stt: { provider: 'deepgram' },
tts: { provider: 'elevenlabs' }
});
await talk.startListening();
// OpenAI STT/TTS
const sttOpenAI = await bud.stt.connect({
provider: 'openai',
model: 'whisper-1'
});
const ttsOpenAI = await bud.tts.connect({
provider: 'openai',
model: 'tts-1-hd',
voice: 'nova'
});
await ttsOpenAI.speak('Hello from OpenAI!');
// Hume AI TTS with emotion control
const ttsHume = await bud.tts.connect({
provider: 'hume',
voice: 'Kora',
emotion: 'happy',
emotionIntensity: 0.8,
deliveryStyle: 'cheerful'
});
await ttsHume.speak('Hello from Hume AI with emotion!');
```

Features:
- Full STT/TTS streaming with typed events
- MetricsCollector for latency tracking (TTFT, connection time)
- Automatic reconnection with exponential backoff
- Browser and Node.js support
```bash
pip install bud-foundry
```

```python
from bud_foundry import BudClient

bud = BudClient(base_url="http://localhost:3001", api_key="your-api-key")
# Speech-to-Text
async with bud.stt.connect(provider="deepgram") as session:
async for result in session.transcribe_stream(audio_generator()):
print(f"Transcript: {result.text}")
# Text-to-Speech
async with bud.tts.connect(provider="elevenlabs") as session:
await session.speak("Hello from WaaV!")
# Bidirectional Voice
async with bud.talk.connect(
stt={"provider": "deepgram"},
tts={"provider": "elevenlabs"}
) as session:
async for event in session:
if event.type == "transcript":
print(event.text)
# OpenAI STT/TTS
async with bud.stt.connect(provider="openai", model="whisper-1") as session:
async for result in session.transcribe_stream(audio_generator()):
print(f"Transcript: {result.text}")
async with bud.tts.connect(provider="openai", model="tts-1-hd", voice="nova") as session:
await session.speak("Hello from OpenAI!")
# Hume AI TTS with emotion control
async with bud.tts.connect(
provider="hume",
voice="Kora",
emotion="happy",
emotion_intensity=0.8,
delivery_style="cheerful"
) as session:
await session.speak("Hello from Hume AI with emotion!")
```

Features:
- Async/await native support
- Context manager for automatic cleanup
- Streaming iterators
- Type hints (PEP 484)
A web-based testing interface for development:
```bash
cd clients_sdk/dashboard
npm install && npm run dev
```

Features:
- Real-time transcription display
- TTS synthesis panel with voice selection
- Metrics visualization (latency charts)
- WebSocket message inspector
- Provider switching
Drop-in voice widget for web applications:
```html
<script type="module">
import { BudWidget } from '@bud-foundry/widget';
</script>
<bud-widget
server="ws://localhost:3001/ws"
provider="deepgram"
mode="push-to-talk"
theme="dark">
</bud-widget>
```

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check (returns `{"status":"ok"}`) |
| `/voices` | GET | List available TTS voices for a provider |
| `/speak` | POST | Synthesize speech from text |
| `/livekit/token` | POST | Generate LiveKit participant token |
| `/recording/{stream_id}` | GET | Download recording from S3 |
| `/sip/hooks` | GET/POST | Manage SIP webhook hooks |
| `/realtime` | WebSocket | OpenAI Realtime audio-to-audio streaming |
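For example, listing a provider's voices from Python might look like this (a sketch assuming the `requests` library; the `provider` query parameter is an assumption, not confirmed here):

```python
import requests

# List available TTS voices (the `provider` query parameter is assumed)
resp = requests.get("http://localhost:3001/voices",
                    params={"provider": "deepgram"})
resp.raise_for_status()
for voice in resp.json():
    print(voice)
```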
Connect to `ws://host:3001/ws` for real-time voice processing.
Configuration Message (JSON):
```json
{
"action": "configure",
"provider": "deepgram",
"model": "nova-3",
"stt_config": {
"interim_results": true,
"punctuation": true,
"language": "en-US"
}
}
```

Audio Data: Send raw PCM audio as binary WebSocket frames (16-bit signed LE, mono).
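Putting the configuration message and binary audio together, a minimal raw-client sketch (assumes the Python `websockets` package and a local 16kHz PCM file; the SDKs below wrap all of this for you):

```python
import asyncio
import json
import websockets

async def main():
    async with websockets.connect("ws://localhost:3001/ws") as ws:
        # 1. Send the JSON configuration message
        await ws.send(json.dumps({
            "action": "configure",
            "provider": "deepgram",
            "model": "nova-3",
            "stt_config": {"interim_results": True, "language": "en-US"},
        }))
        # 2. Stream raw PCM (16-bit signed LE, mono) as binary frames
        with open("sample_16khz.pcm", "rb") as f:
            while chunk := f.read(3200):  # ~100ms at 16kHz/16-bit
                await ws.send(chunk)
        # 3. Read responses until a final transcript arrives
        async for msg in ws:
            if isinstance(msg, (bytes, bytearray)):
                continue  # binary frames carry TTS audio
            event = json.loads(msg)
            if event.get("type") == "transcript" and event.get("is_final"):
                print("Final:", event["text"])
                break

asyncio.run(main())
```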
Response Messages:

```jsonc
// Ready message (after configuration)
{"type": "ready", "stream_id": "abc123"}

// Transcript message
{"type": "transcript", "text": "Hello world", "is_final": true}

// TTS audio (binary frame with header)
[binary audio data]

// Error message
{"type": "error", "message": "Provider connection failed"}
```

```bash
curl -X POST http://localhost:3001/speak \
-H "Content-Type: application/json" \
-d '{
"text": "Welcome to WaaV Gateway!",
"tts_config": {
"provider": "deepgram",
"voice": "aura-asteria-en",
"sample_rate": 24000
}
}' \
--output speech.pcm
```

Response Headers:
- `Content-Type: audio/pcm`
- `X-Audio-Format: linear16`
- `X-Sample-Rate: 24000`
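Because the body is raw PCM, you can wrap it into a playable WAV using these headers. A sketch assuming the `requests` library:

```python
import wave
import requests

resp = requests.post(
    "http://localhost:3001/speak",
    json={"text": "Welcome to WaaV Gateway!",
          "tts_config": {"provider": "deepgram"}},
)
rate = int(resp.headers.get("X-Sample-Rate", "24000"))

with wave.open("speech.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit linear PCM
    w.setframerate(rate)
    w.writeframes(resp.content)
```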
WaaV uses YAML configuration with environment variable overrides. Create `config.yaml`:

```yaml
# Server configuration
server:
  host: "0.0.0.0"
  port: 3001
  tls:
    enabled: false
    cert_path: "/path/to/cert.pem"
    key_path: "/path/to/key.pem"

# Security settings
security:
  rate_limit_requests_per_second: 60   # ENV: RATE_LIMIT_REQUESTS_PER_SECOND
  rate_limit_burst_size: 10            # ENV: RATE_LIMIT_BURST_SIZE
  max_connections_per_ip: 100

# Provider API keys
providers:
  deepgram_api_key: ""                 # ENV: DEEPGRAM_API_KEY
  elevenlabs_api_key: ""               # ENV: ELEVENLABS_API_KEY
  google_credentials: ""               # ENV: GOOGLE_APPLICATION_CREDENTIALS
  azure_speech_subscription_key: ""    # ENV: AZURE_SPEECH_SUBSCRIPTION_KEY
  azure_speech_region: "eastus"        # ENV: AZURE_SPEECH_REGION
  cartesia_api_key: ""                 # ENV: CARTESIA_API_KEY
  openai_api_key: ""                   # ENV: OPENAI_API_KEY
  aws_access_key_id: ""                # ENV: AWS_ACCESS_KEY_ID
  aws_secret_access_key: ""            # ENV: AWS_SECRET_ACCESS_KEY
  aws_region: "us-east-1"              # ENV: AWS_REGION
  ibm_watson_api_key: ""               # ENV: IBM_WATSON_API_KEY
  ibm_watson_instance_id: ""           # ENV: IBM_WATSON_INSTANCE_ID
  ibm_watson_region: "us-south"        # ENV: IBM_WATSON_REGION
  groq_api_key: ""                     # ENV: GROQ_API_KEY
  hume_api_key: ""                     # ENV: HUME_API_KEY
  lmnt_api_key: ""                     # ENV: LMNT_API_KEY

# LiveKit configuration (optional)
livekit:
  url: "ws://localhost:7880"           # ENV: LIVEKIT_URL
  public_url: "http://localhost:7880"  # ENV: LIVEKIT_PUBLIC_URL
  api_key: "devkey"                    # ENV: LIVEKIT_API_KEY
  api_secret: "secret"                 # ENV: LIVEKIT_API_SECRET

# Authentication (optional)
auth:
  required: false                      # ENV: AUTH_REQUIRED
  service_url: ""                      # ENV: AUTH_SERVICE_URL
  signing_key_path: ""                 # ENV: AUTH_SIGNING_KEY_PATH

# Caching
cache:
  path: "/var/cache/waav-gateway"      # ENV: CACHE_PATH
  ttl_seconds: 2592000                 # ENV: CACHE_TTL_SECONDS (30 days)

# Recording storage (S3)
recording:
  s3_bucket: "my-recordings"           # ENV: RECORDING_S3_BUCKET
  s3_region: "us-west-2"               # ENV: RECORDING_S3_REGION
```

Priority: Environment Variables > YAML File > Defaults
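A sketch of that precedence for one key (assumes PyYAML; the gateway implements this in Rust, so this is illustrative only):

```python
import os
import yaml

DEFAULT_PORT = 3001

def load_config(path: str = "config.yaml") -> dict:
    # YAML file overrides defaults...
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    cfg.setdefault("server", {}).setdefault("port", DEFAULT_PORT)
    # ...and environment variables override the file
    if key := os.getenv("DEEPGRAM_API_KEY"):
        cfg.setdefault("providers", {})["deepgram_api_key"] = key
    return cfg
```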
Tested with mock providers (0ms provider latency) to measure pure gateway overhead:
| Metric | Value | Target | Status |
|---|---|---|---|
| Peak RPS | 112,528 | 10,000 | 11x exceeded |
| Gateway P50 | 0.343ms | - | Excellent |
| Gateway P99 | 1.384ms | <3-5ms | PASS |
| Memory (RSS) | 38MB | - | Very efficient |
| Max Concurrent Users | 28,000 | 10,000 | 2.8x exceeded |
| Breaking Point | 28,500 VUs | - | Identified |
| Error Rate | 0.00% | <1% | PASS |
Note: Total end-to-end latency = Gateway overhead + Provider latency. Provider latency varies by cloud provider.
| Concurrent Users | RPS | P99 Latency | Error Rate |
|---|---|---|---|
| 50 (optimal) | 104,462 | 1.65ms | 0.00% |
| 1,000 | 41,401 | 28.66ms | 0.00% |
| 5,000 | 32,273 | 130ms | 0.00% |
| 10,000 | 34,253 | 288ms | 0.00% |
| 28,000 | 2,618 | 28.7s | 0.00% |
| Test | Result |
|---|---|
| SIGSTOP/SIGCONT (3s freeze) | Recovered immediately |
| Concurrency Spike (10→500→10 VUs) | 100% success |
| Rapid Connections (1000/sec) | No FD leaks |
| Malformed JSON injection | Properly rejected |
| Oversized payload (1MB) | Properly rejected |
- HTTP/2 Connection Pooling - Persistent connections with automatic warmup
- Audio Response Caching - XXH3 content hashing for intelligent cache keys (see the sketch after this list)
- Zero-Copy Pipeline - `Bytes` crate for 4.1x memory improvement
- Token Bucket Rate Limiting - Per-IP protection with configurable limits
- AtomicU64 Cache Metrics - 2.13x faster under concurrent load
- Release Profile - LTO, single codegen unit, stripped binaries
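To illustrate the caching bullet, a content-hash cache key can be derived like this (Python sketch with the `xxhash` package; the gateway's moka-based Rust cache may key requests differently):

```python
import json
import xxhash

# Deterministic cache key over text + synthesis parameters (sketch).
def tts_cache_key(text: str, tts_config: dict) -> str:
    payload = json.dumps({"text": text, "config": tts_config}, sort_keys=True)
    return xxhash.xxh3_64_hexdigest(payload.encode())

cache: dict = {}

def synthesize_cached(text: str, tts_config: dict, synthesize) -> bytes:
    key = tts_cache_key(text, tts_config)
    if key not in cache:                      # miss: call the provider once
        cache[key] = synthesize(text, tts_config)
    return cache[key]                         # hit: reuse cached audio
```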
```toml
# Cargo.toml release profile
[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 1
strip = true
panic = "abort"
```

```bash
# Clone and setup
git clone https://github.com/bud-foundry/waav.git
cd waav/gateway
# Run development server
cargo run -- -c config.yaml
# Run tests
cargo test
# Code style
cargo fmt && cargo clippy
```

```bash
# Generate OpenAPI spec
cargo run --features openapi -- openapi -o docs/openapi.yaml

# View API docs
open docs/openapi.yaml
```

```text
waav/
├── gateway/              # Rust gateway server
│   ├── src/
│   │   ├── core/         # STT/TTS providers, voice manager
│   │   ├── dag/          # DAG pipeline engine (conditional routing)
│   │   ├── handlers/     # WebSocket and REST handlers
│   │   ├── livekit/      # LiveKit integration
│   │   └── utils/        # Noise filter, caching, HTTP pooling
│   ├── docs/             # API documentation
│   └── tests/            # Integration tests
├── clients_sdk/
│   ├── typescript/       # TypeScript SDK
│   ├── python/           # Python SDK
│   ├── dashboard/        # Testing dashboard
│   └── widget/           # Embeddable widget
└── assets/               # Logo and static assets
```
WaaV Gateway is licensed under the Apache License 2.0.
Built with Rust by Bud Foundry
