Summary
Add comprehensive voice input and output capabilities to the MiniAgent framework using OpenAI's Realtime API (gpt-4o-realtime-preview).
Background
The current framework supports text-based interactions with LLM providers. This feature request adds support for real-time voice conversations, enabling users to interact with agents through speech and receive audio responses.
OpenAI Realtime API Overview
- Model: gpt-4o-realtime-preview-2024-12-17
- Transport: WebRTC (preferred) or WebSocket (see the connection sketch after this list)
- Capabilities: End-to-end speech-to-speech with <1s latency
- Modalities: Configurable text/audio input/output combinations
- Pricing: $20/1M tokens (cached audio in), $40/1M tokens (audio out)
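For orientation, a minimal WebSocket handshake is sketched below. It assumes the ws package on Node and uses the endpoint, headers, and event names from the Realtime beta documentation; these should be re-checked against the current API reference before relying on them.
// Sketch: open a Realtime session over WebSocket and configure it (not production code).
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17',
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1',
    },
  }
);

ws.on('open', () => {
  // Tell the session which modalities and voice to use.
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { modalities: ['text', 'audio'], voice: 'alloy' },
  }));
});

ws.on('message', (data) => {
  const event = JSON.parse(data.toString());
  console.log(event.type); // e.g. 'response.audio.delta', 'response.done'
});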
Architecture Changes Required
1. New Voice Provider Interface
Create a new voice-enabled chat provider alongside existing GeminiChat and OpenAIChatResponse:
// src/chat/openaiVoiceChat.ts
export class OpenAIVoiceChat implements IVoiceChat {
  private realtime: OpenAIRealtimeWS;
  private audioProcessor: AudioProcessor;

  sendAudioStream(
    audioStream: MediaStream,
    config: VoiceConfig
  ): AsyncGenerator<VoiceResponse>;

  sendTextStream(
    text: string,
    config: VoiceConfig
  ): AsyncGenerator<VoiceResponse>;
}
2. Voice Configuration Extensions
Extend existing configuration to support voice parameters:
interface VoiceConfig {
  voice: 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer';
  modalities: ('text' | 'audio')[];
  turnDetection: 'server_vad' | 'manual';
  audioFormat: 'pcm16' | 'opus';
  sampleRate: 24000;
}
3. Audio Processing Layer
Add audio processing utilities:
// src/audio/audioProcessor.ts
class AudioProcessor {
  encodeAudio(buffer: Float32Array): string; // Base64 PCM16 for the API
  decodeAudio(base64Data: string): Float32Array;
  resample(input: Float32Array, fromRate: number, toRate: number): Float32Array;
}
4. Event System Extensions
Extend AgentEventType to include voice-specific events:
enum AgentEventType {
  VoiceStart = 'voice.start',
  VoiceAudioDelta = 'voice.audio.delta',
  VoiceTextDelta = 'voice.text.delta',
  VoiceComplete = 'voice.complete',
  VoiceError = 'voice.error'
}
5. Browser/Node Compatibility
- Browser: Use Web Audio API + WebRTC
- Node: Use native audio modules or external libraries
- Streaming: Implement chunked audio processing (see the capture sketch below)
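As a sketch of the browser-side capture path (assumptions: a 24 kHz AudioContext, the deprecated-but-ubiquitous ScriptProcessorNode for brevity, and a hypothetical sendAudioChunk callback that forwards base64 PCM16 to the provider):
// Sketch: microphone -> Float32 chunks -> base64 PCM16 (browser only).
async function captureMicrophone(sendAudioChunk: (base64: string) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true }); // requires user consent
  const ctx = new AudioContext({ sampleRate: 24000 });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (e) => {
    const samples = e.inputBuffer.getChannelData(0); // Float32Array in [-1, 1]
    const pcm16 = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      const s = Math.max(-1, Math.min(1, samples[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    // btoa expects a binary string, so build one from the PCM16 bytes.
    const bytes = new Uint8Array(pcm16.buffer);
    let binary = '';
    for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
    sendAudioChunk(btoa(binary));
  };

  source.connect(processor);
  processor.connect(ctx.destination); // keeps the processing graph alive
}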
Implementation Plan
Phase 1: Core Voice Integration (Priority: High)
- Create OpenAIVoiceChat class implementing voice streaming
- Add voice configuration to AllConfig interface (a possible shape is sketched after this list)
- Implement basic audio encoding/decoding
- Add voice-specific event types
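A possible shape for that configuration change, aligned with how the usage examples below pass voice options inside chatConfig. Apart from VoiceConfig and AllConfig, which this proposal already names, the field and provider names here are assumptions:
// Sketch: widen the chat config so voice options are accepted for the 'openai-voice' provider.
interface ChatConfig {
  provider: 'openai' | 'gemini' | 'openai-voice'; // existing providers plus the new voice one
  model?: string;                                  // placeholder for existing text options
}

// Voice options become available (but optional) on the same chatConfig object.
type VoiceChatConfig = ChatConfig & Partial<VoiceConfig>;

interface AllConfig {
  chatConfig: VoiceChatConfig;
  // ...other existing sections unchanged
}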
Phase 2: Audio Processing (Priority: High)
- Implement audio stream handling
- Add resampling utilities for different audio formats (a minimal version is sketched after this list)
- Create audio buffer management
- Handle audio device access (microphone/speakers)
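A minimal linear-interpolation version of the resample utility from the Audio Processing Layer above; good enough as a sketch, though a production implementation would add a low-pass filter to avoid aliasing when downsampling:
// Sketch: naive linear-interpolation resampler (e.g. 48000 Hz microphone input -> 24000 Hz).
function resample(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input;
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const next = idx + 1 < input.length ? input[idx + 1] : input[idx];
    output[i] = input[idx] * (1 - frac) + next * frac; // interpolate between neighbours
  }
  return output;
}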
Phase 3: Provider Integration (Priority: Medium)
- Update StandardAgent to support voice providers
- Add voice mode detection in agent factory (see the routing sketch after this list)
- Implement session management for voice sessions
- Add voice-specific error handling
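A routing sketch for the factory change; the import paths for the two existing providers and all constructor signatures are assumptions, while OpenAIVoiceChat and its path come from the architecture section above:
// Sketch: send voice-capable configs to the new provider, leave existing paths untouched.
import { OpenAIVoiceChat } from './chat/openaiVoiceChat';
import { GeminiChat } from './chat/geminiChat';                 // assumed path
import { OpenAIChatResponse } from './chat/openaiChatResponse'; // assumed path

function createChatProvider(config: { provider: string } & Record<string, unknown>) {
  switch (config.provider) {
    case 'openai-voice':
      return new OpenAIVoiceChat(config); // new voice-enabled path
    case 'gemini':
      return new GeminiChat(config);
    default:
      return new OpenAIChatResponse(config);
  }
}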
Phase 4: Testing & Examples (Priority: Medium)
- Create voice-enabled example applications
- Add comprehensive test suite for audio processing
- Documentation for voice integration
- Performance benchmarks for audio streaming
Phase 5: Advanced Features (Priority: Low)
- Voice activity detection (VAD) customization
- Multi-language voice support
- Voice interruption handling
- Audio effects and voice customization
API Usage Examples
Basic Voice Chat
import { StandardAgent } from '@continue-reasoning/mini-agent';
const agent = new StandardAgent(tools, {
  chatConfig: {
    provider: 'openai-voice',
    voice: 'alloy',
    modalities: ['text', 'audio']
  }
});

// Process voice input
for await (const event of agent.processWithVoice(userAudioStream)) {
  if (event.type === 'voice.audio.delta') {
    playAudio(event.data.audio);
  }
}
Text-to-Speech Mode
const agent = new StandardAgent(tools, {
  chatConfig: {
    provider: 'openai-voice',
    modalities: ['audio'] // text in, audio-only responses
  }
});

// Text input, audio output
for await (const event of agent.process('Hello, how are you?')) {
  if (event.type === 'voice.audio.delta') {
    playAudio(event.data.audio);
  }
}
Technical Considerations
Security & Privacy
- Audio data encryption in transit
- Local audio processing where possible
- User consent for microphone access
- Audio data retention policies
Performance
- Audio chunk size optimization (20ms opus frames; see the framing sketch after this list)
- Buffer management for low latency
- Connection lifecycle management (30min session limit)
- Fallback to text mode on connection issues
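To make the chunk-size point concrete: at 24 kHz a 20 ms frame is 480 samples (960 bytes of PCM16); Opus framing would be handled by the codec, but the same slicing applies to raw PCM. A minimal framing helper (sketch only):
// Sketch: split a Float32 buffer into fixed 20 ms frames for low-latency sends.
function frameAudio(samples: Float32Array, sampleRate = 24000, frameMs = 20): Float32Array[] {
  const frameSize = Math.round(sampleRate * (frameMs / 1000)); // 480 samples at 24 kHz
  const frames: Float32Array[] = [];
  for (let offset = 0; offset + frameSize <= samples.length; offset += frameSize) {
    frames.push(samples.subarray(offset, offset + frameSize));
  }
  // Samples shorter than one frame are not returned; a real buffer manager would carry
  // them over into the next call instead of dropping them.
  return frames;
}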
Browser Compatibility
- WebRTC support matrix
- Audio API compatibility
- Mobile browser considerations
- Progressive enhancement for non-voice environments (see the detection sketch below)
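For progressive enhancement, a capability check along these lines could decide whether to offer voice at all; the fallback policy itself is left to the host application (sketch):
// Sketch: detect whether the current environment can support voice before enabling it.
function voiceSupported(): boolean {
  const hasAudioContext = typeof AudioContext !== 'undefined';
  const hasMicAccess =
    typeof navigator !== 'undefined' &&
    !!navigator.mediaDevices &&
    typeof navigator.mediaDevices.getUserMedia === 'function';
  const hasTransport =
    typeof RTCPeerConnection !== 'undefined' || typeof WebSocket !== 'undefined';
  return hasAudioContext && hasMicAccess && hasTransport;
}
// Callers fall back to the existing text interface when this returns false.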
Dependencies
- openai package (already included)
- Web Audio API (browser)
- Node.js audio modules (optional)
- WebRTC libraries (optional)
Breaking Changes
None - this is additive functionality that maintains backward compatibility with existing text-based interfaces.
Testing Strategy
- Unit tests for audio processing utilities (a round-trip sketch follows this list)
- Integration tests with OpenAI Realtime API
- Browser compatibility testing
- Performance benchmarks for latency
- End-to-end voice conversation tests
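For the audio-processing unit tests, an encode/decode round trip is a natural starting point. The sketch below assumes Vitest as the runner and that AudioProcessor (from the Audio Processing Layer above) is exported from src/audio/audioProcessor.ts:
// Sketch: PCM16 round-trip test for the proposed AudioProcessor (Vitest assumed).
import { describe, it, expect } from 'vitest';
import { AudioProcessor } from '../src/audio/audioProcessor';

describe('AudioProcessor', () => {
  it('round-trips audio through encode/decode within quantization error', () => {
    const processor = new AudioProcessor();
    const input = Float32Array.from({ length: 480 }, (_, i) => Math.sin(i / 10));
    const decoded = processor.decodeAudio(processor.encodeAudio(input));
    expect(decoded.length).toBe(input.length);
    for (let i = 0; i < input.length; i++) {
      // One PCM16 quantization step is 1/32768; allow a small extra margin.
      expect(Math.abs(decoded[i] - input[i])).toBeLessThan(1 / 0x7fff + 1e-4);
    }
  });
});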
Documentation Updates
- Voice integration guide
- API reference for voice configuration
- Browser setup instructions
- Troubleshooting audio issues
- Performance optimization tips
Success Criteria
- Sub-second voice response latency
- Support for all OpenAI Realtime voices (alloy, echo, fable, onyx, nova, shimmer)
- Cross-browser compatibility
- Comprehensive test coverage (>80%)
- Production-ready examples
- Complete documentation
Related Issues
- May require updates to session management for audio session state
- Consider integration with existing token tracking for voice usage
- Future enhancement: Multi-provider voice support (Google, Azure, etc.)