
🎥 AI Video Chat

Real-time video chat with AI — it can see you and hear you, then talks back.

Built with Groq APIs for blazing-fast inference. Single-file Python server, frontend with no frameworks, runs locally.

How It Works

🎤 You speak → Groq Whisper (STT)
📷 Camera frame → Groq Llama 4 Scout (Vision)
       ↓ (parallel)
🧠 Groq Llama 3.3 70B (Conversation) → combines what it heard + saw
       ↓
🔊 edge-tts (Text-to-Speech) → AI speaks back

All processing runs through Groq's API — no local GPU needed. Typical round-trip: 2-4 seconds.
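The "(parallel)" step above is where most of the latency savings come from: transcription and vision are independent, so the server can await both at once and pay max() rather than sum() of their latencies. A minimal sketch of that pattern with stand-in coroutines (function names and the fake results are illustrative, not the server's actual code):

```python
import asyncio

# Hypothetical stand-ins for the two Groq calls; in the real server these
# would hit the Whisper STT and Llama 4 Scout vision endpoints.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return "hello, what am I holding?"

async def describe_frame(image: bytes) -> str:
    await asyncio.sleep(0.1)
    return "a person holding a red mug"

async def handle_turn(audio: bytes, image: bytes) -> str:
    # STT and vision run concurrently; the turn costs max(), not sum()
    transcript, scene = await asyncio.gather(
        transcribe(audio), describe_frame(image)
    )
    # The combined context is what the conversation model then sees
    return f"User said: {transcript}\nCamera shows: {scene}"

print(asyncio.run(handle_turn(b"...", b"...")))
```

With both stubs sleeping 0.1 s, the gathered turn still completes in roughly 0.1 s rather than 0.2 s.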

Quick Start

1. Get a Groq API Key (free)

Sign up at console.groq.com and create an API key.

2. Install & Run

git clone https://github.com/littleshuai-bot/ai-video-chat.git
cd ai-video-chat

# Set your API key
export GROQ_API_KEY=gsk_your_key_here

# Install dependencies
pip install -r requirements.txt

# Run
python server.py

3. Open in Browser

Go to http://localhost:8765 → allow camera & microphone → click 🎤 to talk.

Configuration

Copy .env.example to .env and customize:

cp .env.example .env
Variable | Default | Description
---------|---------|------------
GROQ_API_KEY | (required) | Your Groq API key
AGENT_NAME | AI Assistant | Name displayed on the AI avatar
USER_NAME | You | Name displayed on your video
PORT | 8765 | Server port
LANGUAGE | zh | STT language code (en, zh, ja, ko, es, fr, etc.)
TTS_VOICE | zh-CN-XiaoxiaoNeural | edge-tts voice (run edge-tts --list-voices for the full list)
LLM_MODEL | llama-3.3-70b-versatile | Groq LLM model for conversation
VISION_MODEL | meta-llama/llama-4-scout-17b-16e-instruct | Groq vision model
AGENT_PERSONA | (auto-generated) | Custom system prompt override
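The table above maps directly onto environment variables. A hedged sketch of how a loader for it could look (variable names and defaults match the table; the dict keys and the load_config name are illustrative, not the server's actual code):

```python
import os

def load_config() -> dict:
    # GROQ_API_KEY has no default; fail fast if it is missing
    api_key = os.environ.get("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is required")
    return {
        "api_key": api_key,
        "agent_name": os.environ.get("AGENT_NAME", "AI Assistant"),
        "user_name": os.environ.get("USER_NAME", "You"),
        "port": int(os.environ.get("PORT", "8765")),
        "language": os.environ.get("LANGUAGE", "zh"),
        "tts_voice": os.environ.get("TTS_VOICE", "zh-CN-XiaoxiaoNeural"),
        "llm_model": os.environ.get("LLM_MODEL", "llama-3.3-70b-versatile"),
    }

# Demo only: set a placeholder key so the loader can run
os.environ.setdefault("GROQ_API_KEY", "gsk_demo")
cfg = load_config()
print(cfg["port"], cfg["language"])
```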

Language Examples

English:

LANGUAGE=en
TTS_VOICE=en-US-AriaNeural

Chinese:

LANGUAGE=zh
TTS_VOICE=zh-CN-XiaoxiaoNeural

Japanese:

LANGUAGE=ja
TTS_VOICE=ja-JP-NanamiNeural

Requirements

  • Python 3.10+
  • ffmpeg — for audio conversion (brew install ffmpeg / apt install ffmpeg)
  • Groq API key — free tier at console.groq.com
  • Modern browser with camera & microphone support
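ffmpeg's job here is converting the browser's recorded audio (typically WebM/Opus) into something Whisper-friendly before transcription. A sketch of that conversion step — the specific formats, sample rate, and flags are assumptions, not the server's verbatim code:

```python
import subprocess

def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    # 16 kHz mono PCM WAV is a safe input format for Whisper-style STT
    return [
        "ffmpeg", "-y",   # overwrite output without prompting
        "-i", src,        # input, e.g. recording.webm from the browser
        "-ar", "16000",   # resample to 16 kHz
        "-ac", "1",       # downmix to mono
        dst,              # output, e.g. recording.wav
    ]

def convert(src: str, dst: str) -> None:
    # Raises CalledProcessError if ffmpeg exits non-zero
    subprocess.run(ffmpeg_cmd(src, dst), check=True, capture_output=True)

print(ffmpeg_cmd("in.webm", "out.wav"))
```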

Architecture

┌────────────────────────────────────────────────┐
│                  Browser (UI)                  │
│  ┌──────────┐              ┌────────────────┐  │
│  │  Camera  │              │   AI Avatar    │  │
│  │  (user)  │              │   + Subtitles  │  │
│  └──────────┘              └────────────────┘  │
│        🎤 Record → POST /api/chat (audio+image) │
│                         ← { text, audio_url }  │
└────────────────────────────────────────────────┘
                        │
┌────────────────────────────────────────────────┐
│             Python Server (FastAPI)            │
│                                                │
│  Audio ──→ [ffmpeg] ──→ Groq Whisper (STT)     │
│  Image ──→ Groq Llama 4 Scout (Vision)         │
│        (STT and vision run in parallel)        │
│                    ↓                           │
│  transcript + scene ──→ Groq Llama 3.3 (LLM)   │
│                    ↓                           │
│  reply text ──→ edge-tts (TTS) ──→ MP3         │
└────────────────────────────────────────────────┘

The frontend is a single HTML file with no build step. The backend is a single Python file with FastAPI.
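For a sense of the wire format, here is a hedged sketch of one /api/chat exchange. Only the endpoint path and the { text, audio_url } response keys come from the diagram above; the request field names and base64 encoding are assumptions for illustration:

```python
import base64
import json

def build_request(audio_bytes: bytes, image_bytes: bytes) -> bytes:
    # One JSON body carrying both captured media, base64-encoded
    return json.dumps({
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }).encode("utf-8")

# Reply shape per the diagram: the AI's text plus a URL to the
# synthesized MP3 that the browser plays back
example_reply = {"text": "Nice mug!", "audio_url": "/audio/reply-0001.mp3"}

body = json.loads(build_request(b"\x00opus", b"\xffjpeg"))
print(sorted(body))
```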

Features

  • 🎤 Voice Input — press to record, release to send
  • 📷 Vision — AI can see your camera feed
  • 🔊 Voice Output — AI speaks its replies
  • 💬 Subtitles — typewriter-style text animation
  • ⏱️ Call Timer — FaceTime-style UI
  • 📱 Responsive — works on mobile & desktop
  • 🌍 Multi-language — configurable STT language and TTS voice
  • 🎭 Custom Persona — fully customizable AI personality

How It's Built

Component | Technology | Why
----------|------------|----
STT | Groq Whisper Large v3 Turbo | Fastest Whisper inference available
Vision | Groq Llama 4 Scout | Multimodal understanding
LLM | Groq Llama 3.3 70B | Fast, high-quality conversation
TTS | edge-tts | Free, many voices, low latency
Server | FastAPI + uvicorn | Async Python, minimal overhead
Frontend | Vanilla HTML/CSS/JS | No build step, just works

License

MIT


Built by ExtraSmall