Feature: Voice Input/Output Support via OpenAI Realtime API #13

@cyl19970726

Description

Summary

Add comprehensive voice input and output capabilities to the MiniAgent framework using OpenAI's Realtime API (gpt-4o-realtime-preview).

Background

The current framework supports text-based interactions with LLM providers. This feature request extends that to real-time voice conversations, letting users speak to agents and receive spoken responses.

OpenAI Realtime API Overview

  • Model: gpt-4o-realtime-preview-2024-12-17
  • Transport: WebRTC (preferred) or WebSocket (see the connection sketch after this list)
  • Capabilities: End-to-end speech-to-speech with <1s latency
  • Modalities: Configurable text/audio input/output combinations
  • Pricing: $20/1M tokens (cached audio in), $40/1M tokens (audio out)
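
To make the transport option above concrete, here is a minimal Node connection sketch using the ws package and the raw Realtime event protocol. The endpoint, headers, and event names reflect the API documentation at the time of writing and should be re-verified during implementation; nothing here is framework code yet.

// Standalone sketch, not part of the framework.
import WebSocket from 'ws';

const url = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17';
const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'OpenAI-Beta': 'realtime=v1',
  },
});

ws.on('open', () => {
  // Configure the session: output voice, modalities, and audio format.
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { voice: 'alloy', modalities: ['text', 'audio'], output_audio_format: 'pcm16' },
  }));
});

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === 'response.audio.delta') {
    // event.delta is a base64-encoded PCM16 chunk.
  }
});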

Architecture Changes Required

1. New Voice Provider Interface

Create a new voice-enabled chat provider alongside existing GeminiChat and OpenAIChatResponse:

// src/chat/openaiVoiceChat.ts
export class OpenAIVoiceChat implements IVoiceChat {
  private realtime: OpenAIRealtimeWS;
  private audioProcessor: AudioProcessor;

  // Async generators are declared with `async *`; bodies are omitted in this sketch.
  async *sendAudioStream(
    audioStream: MediaStream,
    config: VoiceConfig
  ): AsyncGenerator<VoiceResponse> { /* ... */ }

  async *sendTextStream(
    text: string,
    config: VoiceConfig
  ): AsyncGenerator<VoiceResponse> { /* ... */ }
}

2. Voice Configuration Extensions

Extend existing configuration to support voice parameters:

interface VoiceConfig {
  voice: 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer';
  modalities: ('text' | 'audio')[];
  turnDetection: 'server_vad' | 'manual';
  audioFormat: 'pcm16' | 'opus';
  sampleRate: 24000;
}
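
How VoiceConfig attaches to the existing configuration is still open. One possible shape is sketched below; the AllConfig field names are assumptions, since the current interface is not shown in this issue.

// Placeholder shape only; align with the real AllConfig during implementation.
interface AllConfig {
  chatConfig: {
    provider: 'openai' | 'gemini' | 'openai-voice';
    model?: string;
    voiceConfig?: VoiceConfig; // present only when provider is 'openai-voice'
  };
  // ...existing fields unchanged
}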

3. Audio Processing Layer

Add audio processing utilities:

// src/audio/audioProcessor.ts
export class AudioProcessor {
  // Float32 samples -> base64-encoded PCM16 for the API
  encodeAudio(buffer: Float32Array): string { throw new Error('not implemented'); }
  // base64 PCM16 from the API -> Float32 samples
  decodeAudio(base64Data: string): Float32Array { throw new Error('not implemented'); }
  // Convert between device sample rates and the API's 24 kHz
  resample(input: Float32Array, fromRate: number, toRate: number): Float32Array { throw new Error('not implemented'); }
}
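
A possible implementation of the encode/decode pair, assuming the API's little-endian PCM16 audio format. Node's Buffer is used for base64; a browser build would substitute a typed-array base64 helper.

// Sketch: Float32 samples in [-1, 1] <-> base64-encoded little-endian PCM16.
function encodeAudio(buffer: Float32Array): string {
  const pcm = new Int16Array(buffer.length);
  for (let i = 0; i < buffer.length; i++) {
    const s = Math.max(-1, Math.min(1, buffer[i])); // clamp before scaling
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return Buffer.from(pcm.buffer).toString('base64');
}

function decodeAudio(base64Data: string): Float32Array {
  const bytes = Buffer.from(base64Data, 'base64');
  const pcm = new Int16Array(bytes.buffer, bytes.byteOffset, bytes.length / 2);
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 0x8000;
  }
  return out;
}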

4. Event System Extensions

Extend AgentEventType to include voice-specific events:

enum AgentEventType {
  VoiceStart = 'voice.start',
  VoiceAudioDelta = 'voice.audio.delta',
  VoiceTextDelta = 'voice.text.delta',
  VoiceComplete = 'voice.complete',
  VoiceError = 'voice.error'
}

5. Browser/Node Compatibility

  • Browser: Use Web Audio API + WebRTC (microphone capture sketched after this list)
  • Node: Use native audio modules or external libraries
  • Streaming: Implement chunked audio processing
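
For the browser path above, microphone capture could feed the audio layer roughly as follows. ScriptProcessorNode is deprecated in favor of AudioWorklet but keeps the sketch short; note that browsers may ignore the requested sample rate, which is why resample() exists in the audio layer.

// Browser-only sketch: deliver microphone audio as Float32Array chunks.
async function captureMicrophone(onChunk: (samples: Float32Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 24000 }); // Realtime API expects 24kHz PCM16
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (e) => {
    // Copy the data: the underlying buffer is reused between callbacks.
    onChunk(new Float32Array(e.inputBuffer.getChannelData(0)));
  };

  source.connect(processor);
  processor.connect(ctx.destination);
}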

Implementation Plan

Phase 1: Core Voice Integration (Priority: High)

  • Create OpenAIVoiceChat class implementing voice streaming
  • Add voice configuration to AllConfig interface
  • Implement basic audio encoding/decoding
  • Add voice-specific event types

Phase 2: Audio Processing (Priority: High)

  • Implement audio stream handling
  • Add resampling utilities for different audio formats
  • Create audio buffer management
  • Handle audio device access (microphone/speakers)

Phase 3: Provider Integration (Priority: Medium)

  • Update StandardAgent to support voice providers
  • Add voice mode detection in agent factory
  • Implement session management for voice sessions
  • Add voice-specific error handling

Phase 4: Testing & Examples (Priority: Medium)

  • Create voice-enabled example applications
  • Add comprehensive test suite for audio processing
  • Documentation for voice integration
  • Performance benchmarks for audio streaming

Phase 5: Advanced Features (Priority: Low)

  • Voice activity detection (VAD) customization
  • Multi-language voice support
  • Voice interruption handling
  • Audio effects and voice customization

API Usage Examples

Basic Voice Chat

import { StandardAgent } from '@continue-reasoning/mini-agent';

const agent = new StandardAgent(tools, {
  chatConfig: {
    provider: 'openai-voice',
    voice: 'alloy',
    modalities: ['text', 'audio']
  }
});

// Process voice input
for await (const event of agent.processWithVoice(userAudioStream)) {
  if (event.type === 'voice.audio.delta') {
    playAudio(event.data.audio);
  }
}
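
playAudio is left undefined above. In a browser it could be approximated by scheduling decoded chunks on an AudioContext; the sketch below assumes 24kHz mono PCM16 deltas and the decodeAudio helper from the audio processing layer.

const playbackCtx = new AudioContext({ sampleRate: 24000 });
let playbackCursor = 0; // AudioContext time at which the next chunk should start

function playAudio(base64Chunk: string) {
  const samples = decodeAudio(base64Chunk);
  const buffer = playbackCtx.createBuffer(1, samples.length, 24000);
  buffer.copyToChannel(samples, 0);

  const source = playbackCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(playbackCtx.destination);

  // Queue chunks back-to-back so streamed deltas play gaplessly.
  playbackCursor = Math.max(playbackCursor, playbackCtx.currentTime);
  source.start(playbackCursor);
  playbackCursor += buffer.duration;
}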

Text-to-Speech Mode

const agent = new StandardAgent(tools, {
  chatConfig: {
    provider: 'openai-voice',
    modalities: ['text'] // text input, audio output
  }
});

// Text input, audio output
for await (const event of agent.process('Hello, how are you?')) {
  if (event.type === 'voice.audio.delta') {
    playAudio(event.data.audio);
  }
}

Technical Considerations

Security & Privacy

  • Audio data encryption in transit
  • Local audio processing where possible
  • User consent for microphone access
  • Audio data retention policies

Performance

  • Audio chunk size optimization (20ms Opus frames; see the framing sketch after this list)
  • Buffer management for low latency
  • Connection lifecycle management (30-minute session limit)
  • Fallback to text mode on connection issues
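
For reference on the chunk-size point above: at 24kHz, a 20ms frame is 480 samples. Below is a sketch of slicing captured audio into frames and appending them to the Realtime input buffer; the input_audio_buffer.append event shape is taken from the current API docs and should be re-verified.

const SAMPLE_RATE = 24000;
const FRAME_SAMPLES = (SAMPLE_RATE * 20) / 1000; // 20ms => 480 samples

function appendInFrames(ws: WebSocket, samples: Float32Array) {
  for (let offset = 0; offset < samples.length; offset += FRAME_SAMPLES) {
    const frame = samples.subarray(offset, offset + FRAME_SAMPLES);
    ws.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: encodeAudio(frame), // base64 PCM16, see the audio processing sketch above
    }));
  }
}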

Browser Compatibility

  • WebRTC support matrix
  • Audio API compatibility
  • Mobile browser considerations
  • Progressive enhancement for non-voice environments

Dependencies

  • openai package (already included)
  • Web Audio API (browser)
  • Node.js audio modules (optional)
  • WebRTC libraries (optional)

Breaking Changes

None - this is additive functionality that maintains backward compatibility with existing text-based interfaces.

Testing Strategy

  1. Unit tests for audio processing utilities (example below)
  2. Integration tests with OpenAI Realtime API
  3. Browser compatibility testing
  4. Performance benchmarks for latency
  5. End-to-end voice conversation tests
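
For item 1, an encode/decode round-trip test could look like the sketch below. Vitest syntax is used purely as an example; swap in whatever test runner the repo actually uses, and adjust the import path.

import { describe, expect, it } from 'vitest';
import { AudioProcessor } from '../src/audio/audioProcessor'; // path assumed

describe('AudioProcessor', () => {
  it('round-trips PCM16 encode/decode within quantization error', () => {
    const processor = new AudioProcessor();
    const input = Float32Array.from({ length: 480 }, (_, i) => Math.sin(i / 10));

    const decoded = processor.decodeAudio(processor.encodeAudio(input));

    expect(decoded.length).toBe(input.length);
    for (let i = 0; i < input.length; i++) {
      expect(Math.abs(decoded[i] - input[i])).toBeLessThan(1 / 0x4000);
    }
  });
});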

Documentation Updates

  • Voice integration guide
  • API reference for voice configuration
  • Browser setup instructions
  • Troubleshooting audio issues
  • Performance optimization tips

Success Criteria

  • Sub-second voice response latency
  • Support for all built-in OpenAI voices (alloy, echo, fable, onyx, nova, shimmer)
  • Cross-browser compatibility
  • Comprehensive test coverage (>80%)
  • Production-ready examples
  • Complete documentation

Related Issues

  • May require updates to session management for audio session state
  • Consider integration with existing token tracking for voice usage
  • Future enhancement: Multi-provider voice support (Google, Azure, etc.)
