Real-time video chat with AI — it can see you and hear you, then talks back.
Built on Groq APIs for blazing-fast inference. Single-file server, no frontend frameworks, runs locally.
🎤 You speak → Groq Whisper (STT)
📷 Camera frame → Groq Llama 4 Scout (Vision)
↓ (parallel)
🧠 Groq Llama 3.3 70B (Conversation) → combines what it heard + saw
↓
🔊 edge-tts (Text-to-Speech) → AI speaks back
All processing runs through Groq's API — no local GPU needed. Typical round-trip: 2-4 seconds.
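The STT and vision steps run side by side, so the round trip pays for the slower of the two rather than both. A minimal sketch of that pattern with `asyncio.gather` (the stub coroutines below stand in for the real Groq calls and are assumptions, not the actual server code):

```python
import asyncio

# Stub coroutines standing in for the real Groq STT / vision API calls.
async def transcribe(audio_bytes: bytes) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return "hello, what do you see?"

async def describe_frame(image_bytes: bytes) -> str:
    await asyncio.sleep(0.1)  # runs concurrently with transcribe()
    return "a person sitting at a desk"

async def perceive(audio: bytes, image: bytes) -> list[str]:
    # gather() awaits both coroutines concurrently, so total wall time
    # is max(stt, vision) rather than their sum.
    return await asyncio.gather(transcribe(audio), describe_frame(image))

transcript, scene = asyncio.run(perceive(b"...", b"..."))
```

With both stubs sleeping 100 ms, the pair finishes in roughly 100 ms instead of 200 ms, which is where the parallel arrow in the flow above comes from.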
Sign up at console.groq.com and create an API key.
git clone https://github.com/littleshuai-bot/ai-video-chat.git
cd ai-video-chat
# Set your API key
export GROQ_API_KEY=gsk_your_key_here
# Install dependencies
pip install -r requirements.txt
# Run
python server.py

Go to http://localhost:8765 → allow camera & microphone → click 🎤 to talk.
Copy .env.example to .env and customize:
cp .env.example .env

| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | (required) | Your Groq API key |
| `AGENT_NAME` | AI Assistant | Name displayed on the AI avatar |
| `USER_NAME` | You | Name displayed on your video |
| `PORT` | 8765 | Server port |
| `LANGUAGE` | zh | STT language code (en, zh, ja, ko, es, fr, etc.) |
| `TTS_VOICE` | zh-CN-XiaoxiaoNeural | edge-tts voice (list voices) |
| `LLM_MODEL` | llama-3.3-70b-versatile | Groq LLM model for conversation |
| `VISION_MODEL` | meta-llama/llama-4-scout-17b-16e-instruct | Groq vision model |
| `AGENT_PERSONA` | (auto-generated) | Custom system prompt override |
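One way these variables might be read, with the defaults from the table above baked in (the `load_config` helper is illustrative, not the actual `server.py` implementation):

```python
def load_config(env: dict[str, str]) -> dict:
    """Read settings with the documented defaults; GROQ_API_KEY is required."""
    if "GROQ_API_KEY" not in env:
        raise RuntimeError("GROQ_API_KEY is required — see console.groq.com")
    return {
        "api_key": env["GROQ_API_KEY"],
        "agent_name": env.get("AGENT_NAME", "AI Assistant"),
        "user_name": env.get("USER_NAME", "You"),
        "port": int(env.get("PORT", "8765")),
        "language": env.get("LANGUAGE", "zh"),
        "tts_voice": env.get("TTS_VOICE", "zh-CN-XiaoxiaoNeural"),
        "llm_model": env.get("LLM_MODEL", "llama-3.3-70b-versatile"),
        "vision_model": env.get(
            "VISION_MODEL", "meta-llama/llama-4-scout-17b-16e-instruct"
        ),
    }

# In the real server this would be load_config(dict(os.environ)).
config = load_config({"GROQ_API_KEY": "gsk_example", "LANGUAGE": "en"})
```

Anything not set in `.env` falls back to the Default column, so a bare `GROQ_API_KEY` is enough to start.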
English:

LANGUAGE=en
TTS_VOICE=en-US-AriaNeural

Chinese:

LANGUAGE=zh
TTS_VOICE=zh-CN-XiaoxiaoNeural

Japanese:

LANGUAGE=ja
TTS_VOICE=ja-JP-NanamiNeural

- Python 3.10+
- ffmpeg — for audio conversion (`brew install ffmpeg` / `apt install ffmpeg`)
- Groq API key — free tier at console.groq.com
- Modern browser with camera & microphone support
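ffmpeg's role is converting the browser's recorded audio into something Whisper accepts. A hypothetical sketch of how the server might shell out to it (the exact flags, and 16 kHz mono WAV as the target format, are assumptions):

```python
import subprocess

def ffmpeg_to_wav_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command converting browser audio (e.g. webm) to WAV.

    16 kHz mono PCM is a common Whisper-friendly format — an assumption here,
    not necessarily what server.py uses.
    """
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

cmd = ffmpeg_to_wav_cmd("clip.webm", "clip.wav")
# With ffmpeg installed, the server would then run:
# subprocess.run(cmd, check=True, capture_output=True)
```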
┌─────────────────────────────────────────────────┐
│ Browser (UI) │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ Camera │ │ AI Avatar │ │
│ │ (user) │ │ + Subtitles │ │
│ └──────────┘ └──────────────────┘ │
│ 🎤 Record → POST /api/chat (audio+image) │
│ ← { text, audio_url } │
└─────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────┐
│ Python Server (FastAPI) │
│ │
│ Audio ──→ [ffmpeg] ──→ Groq Whisper (STT) │
│  Image ──→ Groq Llama 4 Scout (Vision) parallel│
│ ↓ │
│ transcript + scene ──→ Groq Llama 3.3 (LLM) │
│ ↓ │
│ reply text ──→ edge-tts (TTS) ──→ MP3 │
└─────────────────────────────────────────────────┘
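The "transcript + scene" merge in the middle of the diagram amounts to building one chat-completion message list from the two perception results. A hypothetical helper showing the idea (the prompt wording is an assumption; the real `server.py` may phrase it differently):

```python
def build_messages(transcript: str, scene: str, persona: str) -> list[dict]:
    """Combine what the AI heard (STT) and saw (vision) into one LLM request."""
    return [
        {"role": "system", "content": persona},
        {
            "role": "user",
            "content": (
                f"[What you can currently see: {scene}]\n"
                f"The user said: {transcript}"
            ),
        },
    ]

messages = build_messages(
    transcript="what am I holding?",
    scene="a person holding a red mug",
    persona="You are a friendly video-chat companion.",
)
```

The resulting list is what gets sent to the conversation model, which is why the reply can reference both what was said and what is on camera.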
The frontend is a single HTML file with no build step. The backend is a single Python file with FastAPI.
- 🎤 Voice Input — press to record, release to send
- 📷 Vision — AI can see your camera feed
- 🔊 Voice Output — AI speaks its replies
- 💬 Subtitles — typewriter-style text animation
- ⏱️ Call Timer — FaceTime-style UI
- 📱 Responsive — works on mobile & desktop
- 🌍 Multi-language — configurable STT language and TTS voice
- 🎭 Custom Persona — fully customizable AI personality
| Component | Technology | Why |
|---|---|---|
| STT | Groq Whisper Large v3 Turbo | Fastest Whisper inference available |
| Vision | Groq Llama 4 Scout | Multimodal understanding |
| LLM | Groq Llama 3.3 70B | Fast, high-quality conversation |
| TTS | edge-tts | Free, many voices, low latency |
| Server | FastAPI + uvicorn | Async Python, minimal overhead |
| Frontend | Vanilla HTML/CSS/JS | No build step, just works |
MIT
Built by ExtraSmall ✨