AnupamKumar-1/Hoovik


Hoovik — Video meetings that think. In real time.





If Hoovik has been useful, please give it a ⭐ — it takes 2 seconds and means the world.


Live Demo



Hoovik demo


✨ What Makes Hoovik Different

Most video tools record a meeting and leave you with a wall of audio. Hoovik goes further — every meeting is automatically transcribed, each speaker's emotional tone is tracked per segment, and an LLM generates a structured summary with a discrepancy report that flags where what someone said didn't match how they felt.


| Feature | Details |
| --- | --- |
| 🎥 P2P Video & Audio | WebRTC — streams never touch the backend |
| 😮 Live Emotion AI | Facial landmarks + audio → ~300–500 ms P50 latency |
| 📝 Auto Transcription | Whisper ASR per speaker, delivered post-meeting |
| 🤖 AI Meeting Summary | Groq LLM summary + NLP-vs-live emotion discrepancy detection |
| 🔍 RAG Q&A | Transcripts indexed as Nomic vector chunks; MMR-reranked semantic search; Groq LLM answers with session history |
| Distributed Backend | 3 pm2 processes unified by Redis pub/sub |
| 🔒 Auth & Rate Limiting | JWT + refresh rotation, Redis Lua locks, account lockout |

🔬 Flagship Services

The Emotion Service runs live during the meeting; the Transcript Service runs after it ends.


😮 Real-Time Emotion Service — live inference during your meeting

Per-participant multimodal emotion pipeline running at ~300–500 ms P50 latency (load-tested with 10 concurrent participants, 2026-05-07).

How it works

Video frame (JPEG)          Audio chunk (Float32 PCM)
        │                              │
        ▼                              ▼
MediaPipe face landmarks         Wav2Vec2 embedding
(136 landmarks + 51 blendshapes  (audeering/wav2vec2-large-
 + head pose)                     robust-12-ft-emotion-msp-dim)
        │                              │
        └──────────────┬───────────────┘
                       ▼
              Z-score normalisation
               (norm_stats.npz)
                  │           │
                  ▼           ▼
           Ensemble        Anomaly detection
      (EmotionTransformer  (per-modality IsolationForest
         + XGBoost,         + PCA; flags suspect cycles)
       temp-calibrated)
                  │           │
                  └─────┬─────┘
                        ▼
               EMA smoothing (α=0.65, TTL=2 s)
                        ▼
              emotion.result → frontend overlay
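
The final smoothing stage of the diagram can be illustrated with a minimal sketch. It assumes that α weights the newest observation and that the 2 s TTL means "discard smoothed state that has not been updated for more than 2 s"; neither detail is stated above, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EmaSmoother:
    """Exponential moving average over per-class probabilities with a staleness TTL."""

    alpha: float = 0.65    # weight of the newest observation (assumed interpretation)
    ttl_sec: float = 2.0   # state older than this is discarded (assumed interpretation)
    _state: dict = field(default_factory=dict)  # participant_id -> (timestamp, probs)

    def update(self, participant_id: str, probs: dict, now: float) -> dict:
        previous = self._state.get(participant_id)
        if previous is None or now - previous[0] > self.ttl_sec:
            # No usable history: start again from the raw model output.
            smoothed = dict(probs)
        else:
            old = previous[1]
            smoothed = {
                label: self.alpha * p + (1.0 - self.alpha) * old.get(label, 0.0)
                for label, p in probs.items()
            }
        self._state[participant_id] = (now, smoothed)
        return smoothed
```

With these assumptions a single outlier cycle moves the displayed probabilities by at most 65% of the gap, while a participant who goes silent for more than 2 s starts from a clean state.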

What you get — live, during the call

  • Per-participant emotion overlay — updated every ~300–500 ms directly in the video UI
  • Modality fallback — degrades gracefully from both to audio_only or video_only when a modality drops
  • Anomaly flagging — IsolationForest + PCA detects and suppresses suspect inference cycles
  • Live stats dashboard — GET /stats (browser) and GET /stats/json expose P50 / P90 / P95 per modality + active participant count
  • Backpressure — server throttles clients automatically when face queue depth hits 3
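
The modality fallback in the list above could be decided along these lines. This is a sketch under stated assumptions: a modality counts as available when its device is enabled and its last input is newer than a staleness threshold. The function name, its parameters, and the 2-second default are hypothetical; the real threshold is whatever the service reports as modalityStaleSec.

```python
from typing import Optional


def select_modality(
    now: float,
    last_audio_ts: Optional[float],
    last_video_ts: Optional[float],
    mic_enabled: bool = True,
    camera_enabled: bool = True,
    stale_sec: float = 2.0,  # hypothetical default for modalityStaleSec
) -> Optional[str]:
    """Return "both", "audio_only", "video_only", or None when nothing is usable."""

    def fresh(ts: Optional[float], enabled: bool) -> bool:
        return enabled and ts is not None and (now - ts) <= stale_sec

    audio_ok = fresh(last_audio_ts, mic_enabled)
    video_ok = fresh(last_video_ts, camera_enabled)
    if audio_ok and video_ok:
        return "both"
    if audio_ok:
        return "audio_only"
    if video_ok:
        return "video_only"
    return None
```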

Socket.IO events

| Event | Direction | Payload |
| --- | --- | --- |
| emotion.frame | Frontend → Service | { meetingId, participantId, buffer: Uint8Array } — JPEG frame |
| audio_chunk | Frontend → Service | Uint8Array — 1600-sample Float32 PCM at 16 kHz |
| participant.media_state | Frontend → Service | { participantId, micEnabled, cameraEnabled } — immediate modality update on mic/cam toggle |
| emotion.result | Service → Frontend | { participantId, result: { emotion, confidence, modality, anomaly } } |
| server.status | Service → Frontend | { targetFps, modalityStaleSec } — server capability negotiation |
| backpressure | Service → Frontend | { suggestedFps } — server requests reduced capture rate |
| emotion.error | Service → Frontend | { code } — inference or auth error for a participant |
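
As an illustration of the audio_chunk payload, the sketch below decodes one chunk on the receiving side. It assumes the bytes are little-endian Float32 samples, which is what browsers produce on common hardware but is not stated above; the function and constant names are hypothetical.

```python
import struct

SAMPLES_PER_CHUNK = 1600
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 4  # Float32


def decode_audio_chunk(payload: bytes) -> tuple:
    """Unpack one audio_chunk payload into 1600 float samples."""
    expected = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    # "<" forces little-endian regardless of the server's native byte order.
    return struct.unpack(f"<{SAMPLES_PER_CHUNK}f", payload)


def chunk_duration_ms() -> float:
    """Duration covered by one chunk at the documented sample rate."""
    return 1000.0 * SAMPLES_PER_CHUNK / SAMPLE_RATE_HZ
```

At 16 kHz, 1600 samples cover 100 ms of audio, so each chunk is 6400 bytes on the wire.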

Model stack

| Layer | Model / Library |
| --- | --- |
| Face landmarks | MediaPipe (136 landmarks + 51 blendshapes + head pose) |
| Audio embedding | audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim |
| Fusion classifier | Custom EmotionTransformer (PyTorch) + XGBoost (temp-calibrated) |
| Anomaly filter | Per-modality IsolationForest + PCA |
| Smoothing | EMA α=0.65, TTL=2 s |

Health endpoints

GET /health   → 200 OK if all models loaded
GET /ready    → 200 OK if service is accepting connections
GET /stats    → live performance dashboard (browser)
GET /stats/json → machine-readable P50/P90/P95 + participant count

📝 Transcript Service — AI summaries & insights delivered post-meeting

Hoovik's post-meeting pipeline — every meeting is automatically transcribed, per-segment emotion is classified, and a Groq LLM generates a structured summary with a discrepancy report that flags where what someone said didn't match how they felt.

How it works

Meeting ends
    │
    ▼
Browser uploads audio blob(s) → Transcript Service (HTTP multipart)
    │
    ▼  HTTP 202 returned immediately; pipeline runs in background
    ▼
ffmpeg → mono 16 kHz WAV
    │
    ▼
Whisper (small) transcribes → raw segments
    │
    ▼
Consecutive segments merged (gap ≤ 2 s, ≤ 60 words)
    │
    ▼
DistilRoBERTa classifies emotion per segment
    │
    ▼
Cross-speaker segments merged + sorted by timestamp
    │
    ▼
build_intelligent_summary() → summary, key points, emotion distribution,
                               top topics, per-speaker stats, WPM
    │
    ▼
HTTP POST callback → Backend (3 retries: 5 s → 15 s → 30 s on network/5xx)
    │
    ▼
Backend stores transcript + metadata in MongoDB
    │
    ▼
POST /api/v1/transcripts/:id/summary  (rate-limited: 2× per 2 hours)
    │
    ▼
Groq LLM annotates Whisper segments with live facial/audio emotion per speaker
    + detects NLP-vs-live discrepancies
    → returns structured summary + `discrepancies[]` array
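
The segment-merging step in the pipeline above (gap ≤ 2 s, ≤ 60 words) might look like the following sketch. It assumes the 60-word limit applies to the combined text of the merged segment; the Segment type and function name are hypothetical, not the service's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_segments(
    segments: List[Segment], max_gap_sec: float = 2.0, max_words: int = 60
) -> List[Segment]:
    """Merge consecutive segments that are close in time and short enough combined."""
    merged: List[Segment] = []
    for seg in segments:
        if merged:
            last = merged[-1]
            gap = seg.start - last.end
            words = len(last.text.split()) + len(seg.text.split())
            if gap <= max_gap_sec and words <= max_words:
                merged[-1] = Segment(last.start, seg.end, f"{last.text} {seg.text}")
                continue
        merged.append(Segment(seg.start, seg.end, seg.text))
    return merged
```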

What you get — after every meeting

  • Full transcript — timestamped, speaker-attributed segments
  • Per-segment emotion — what the speaker's tone was at each moment (DistilRoBERTa)
  • AI Summary — structured LLM-generated recap of the whole meeting
  • Discrepancy Report — segments where a speaker's detected live emotion contradicted their spoken sentiment (e.g. saying "that's fine" with stressed vocal tone and a tense facial expression)

Pipeline entry point

POST /process_meeting
Content-Type: multipart/form-data

audio_files[]: <blob>        # one file per speaker
meeting_code: "ABC123"
speaker_map: {"alice": "Alice"}   # filename-base → display name
x-host-secret: <secret>
x-user-token: <jwt>          # optional

Returns HTTP 202 immediately. Processing happens in the background.

AI Summary API

POST /api/v1/transcripts/:id/summary
Content-Type: application/json

{ "emotionData": {...}, "emotionNames": {...} }

Response:

{
  "summary": "...",
  "key_points": ["..."],
  "discrepancies": [
    {
      "speaker": "Alice",
      "segment": "That timeline works for me.",
      "nlp_emotion": "positive",
      "live_emotion": "stressed"
    }
  ],
  "insights": {
    "dominant_emotion": "neutral",
    "emotion_distribution": { "neutral": 60, "joy": 25, "anger": 15 },
    "speaker_stats": {
      "Alice": { "turns": 12, "dominant_emotion": "neutral", "word_count": 342 }
    },
    "top_topics": ["deadline", "budget", "Q3"],
    "speaking_pace_wpm": 148
  }
}
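
To make the insights fields concrete, here is one way such numbers could be derived from emotion-labelled segments. This is only a sketch: it assumes the distribution is a percentage of segments, that one segment counts as one turn, and that pace is total words over total spoken time. The function name and input shape are hypothetical and may differ from build_intelligent_summary().

```python
from collections import Counter


def build_insights(segments):
    """segments: dicts with speaker, text, emotion, start and end (seconds)."""
    emotions = Counter(seg["emotion"] for seg in segments)
    total = sum(emotions.values())
    words = sum(len(seg["text"].split()) for seg in segments)
    spoken_sec = sum(max(0.0, seg["end"] - seg["start"]) for seg in segments)

    per_speaker = {}
    for seg in segments:
        stats = per_speaker.setdefault(
            seg["speaker"], {"turns": 0, "word_count": 0, "emotions": Counter()}
        )
        stats["turns"] += 1  # assumption: one segment counts as one turn
        stats["word_count"] += len(seg["text"].split())
        stats["emotions"][seg["emotion"]] += 1

    return {
        "dominant_emotion": emotions.most_common(1)[0][0] if total else None,
        "emotion_distribution": {e: round(100 * n / total) for e, n in emotions.items()},
        "speaker_stats": {
            name: {
                "turns": s["turns"],
                "dominant_emotion": s["emotions"].most_common(1)[0][0],
                "word_count": s["word_count"],
            }
            for name, s in per_speaker.items()
        },
        "speaking_pace_wpm": round(words / (spoken_sec / 60)) if spoken_sec > 0 else 0,
    }
```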

Note: After the meeting ends, the frontend polls for transcript availability using exponential backoff — delays of 5 s → 10 s → 20 s → 40 s (±20% jitter), then repeating at 40-second intervals up to a 10-minute wall clock cap. No fixed polling interval or fixed attempt count is used. Summary generation is rate-limited to 2 requests per 2 hours per transcript.
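
The polling schedule in the note can be sketched as a delay generator. The frontend itself is a React app, so this Python version only illustrates the arithmetic; the function name and parameters are hypothetical, and it assumes the 10-minute cap is enforced by stopping before a wait would cross it.

```python
import random
from typing import Callable, Iterator


def poll_delays(
    base: float = 5.0,
    factor: float = 2.0,
    max_delay: float = 40.0,
    jitter: float = 0.2,
    cap_sec: float = 600.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield wait times: 5 s, 10 s, 20 s, 40 s, then 40 s repeating, each with ±20% jitter."""
    elapsed = 0.0
    delay = base
    while True:
        # rng() is in [0, 1); map it to a multiplier in [1 - jitter, 1 + jitter).
        jittered = delay * (1.0 + jitter * (2.0 * rng() - 1.0))
        if elapsed + jittered > cap_sec:
            return  # the next wait would cross the wall-clock cap
        yield jittered
        elapsed += jittered
        delay = min(delay * factor, max_delay)
```

Without jitter the schedule makes 17 waits totalling 595 s before the cap stops it.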

Retry & delivery guarantees

| Scenario | Behaviour |
| --- | --- |
| Network error or 5xx from backend | Retried at 5 s → 15 s → 30 s (3 attempts) |
| 4xx from backend | Not retried — logged and dropped |
| Empty merged-segment result | No callback sent (silent skip) |
| timeout=None on POST | Hung backend blocks thread indefinitely — known limitation |
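
The retry rules in the table can be expressed as a small loop. The sketch below is not the service's code: it assumes one initial attempt followed by up to three retries after 5 s, 15 s, and 30 s, and it takes a hypothetical send callable that returns the HTTP status or raises a built-in ConnectionError or TimeoutError on network failure.

```python
import time
from typing import Callable, Sequence

RETRY_DELAYS_SEC = (5, 15, 30)


def deliver_with_retry(
    send: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    delays: Sequence[float] = RETRY_DELAYS_SEC,
) -> bool:
    """Return True once a 2xx is received, False if delivery is abandoned."""
    for attempt in range(len(delays) + 1):
        try:
            status = send()
        except (ConnectionError, TimeoutError):
            status = None  # network failure: retryable
        if status is not None:
            if 200 <= status < 300:
                return True
            if status < 500:
                return False  # 4xx (or any other non-5xx status): not retried
        if attempt < len(delays):
            sleep(delays[attempt])
    return False
```

A real send would wrap the HTTP client call and translate its exceptions into these built-ins. Passing a finite timeout to that call (for example requests.post(..., timeout=10), where the value 10 is only illustrative) would let a hung backend surface as a retryable failure instead of blocking the thread.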

🏗 Architecture

System diagram
graph TD
    Browser["Browser (React SPA)"]

    subgraph Backend ["Backend — Node.js (pm2: ports 8000–8002)"]
        SIO_B["Socket.IO · signalling · chat · room state"]
        REST["REST API · /api/v1/..."]
    end

    subgraph EmotionSvc ["Emotion Service — Python (port 5002)"]
        SIO_E["Socket.IO · per-participant inference"]
    end

    subgraph TranscriptSvc ["Transcript Service — Python (port 5001)"]
        HTTP_T["POST /process_meeting → HTTP 202"]
    end

    subgraph Persistence
        Mongo[("MongoDB")]
        Redis[("Redis · ephemeral + locks + pub/sub")]
    end

    Browser -- "WebRTC (peer-to-peer)" --> Browser
    Browser -- "Socket.IO" --> SIO_B
    Browser -- "REST" --> REST
    Browser -- "Socket.IO" --> SIO_E
    Browser -- "HTTP multipart" --> HTTP_T

    SIO_B --> Redis
    REST --> Mongo & Redis
    SIO_B --> Mongo
    HTTP_T -- "HTTP POST callback" --> REST
    Redis -- "pub/sub adapter" --> SIO_B
Deployment topology
graph TD
    FE["Frontend SPA (static / CDN)"]
    LB["Reverse Proxy — sticky sessions required"]

    subgraph PM2 ["Backend — pm2"]
        B0["hoovik-backend-8000"]
        B1["hoovik-backend-8001"]
        B2["hoovik-backend-8002"]
    end

    subgraph Python ["Python Services — uvicorn"]
        ES["Emotion Service :5002"]
        TS["Transcript Service :5001"]
    end

    subgraph Data
        Mongo[("MongoDB")]
        Redis[("Redis")]
    end

    FE --> LB --> B0 & B1 & B2
    FE --> ES
    FE --> TS
    B0 & B1 & B2 --> Mongo & Redis
    Redis -- "pub/sub" --> B0 & B1 & B2
    TS -- "HTTP callback" --> LB

State map

| Store | What lives there |
| --- | --- |
| MongoDB | Users, rooms, meetings, chat history (cap: 500), transcripts, AI summaries |
| Redis | Participant maps, socket-ID arrays, join locks, rate limit counters, account lock flags, TTL caches |
| In-process — Backend | Nothing — all room state is in Redis |
| In-process — Emotion Service | Embedding buffers, EMA state, pump coroutine handles |
| Browser localStorage | JWT, host:<code> secret, emotion data for AI summary |

🔧 Key Technical Highlights

| Area | What was built |
| --- | --- |
| WebRTC signalling | SDP/ICE relay over Socket.IO; Redis adapter fans events across 3 pm2 processes; distributed join lock (SET NX PX 10000 + Lua CAS) serialises concurrent joins |
| Multimodal emotion inference | MediaPipe (136 landmarks + blendshapes + head pose) + Wav2Vec2 → EmotionTransformer + XGBoost (temp-calibrated) + per-modality IsolationForest anomaly detection → EMA (α=0.65); graceful both/audio_only/video_only modality fallback; ~300–500 ms P50 |
| Browser media pipeline | AudioWorklet + AnalyserNode for RMS-gated noise detection; MediaRecorder per participant; SSRC-based active speaker with RMS fallback |
| Async transcript pipeline | HTTP 202 immediately; background: ffmpeg → Whisper (small) → segment merging → DistilRoBERTa per-segment emotion → build_intelligent_summary → HTTP POST callback (3 retries: 5 s → 15 s → 30 s on network/5xx; 4xx not retried) |
| Multi-process backend | 3 pm2 instances via @socket.io/redis-adapter; participant map as Redis Hash (HSET/HDEL per event); no in-process room state |
| Auth & rate limiting | JWT + HttpOnly refresh token rotation; Redis Lua INCR+EXPIRE per-IP and per-username; account lockout after 10 failed logins (900 s TTL); uniform 401 prevents username enumeration |
| AI summary | generateAiSummaryService accepts emotionData/emotionNames from browser; buildGroqPrompt annotates each Whisper segment with matched live facial/audio emotion via buildSpeakerLiveMap; returns discrepancies[] and live_dominant_emotion per speaker; Groq model llama-3.1-8b-instant; rate-limited 2× per 2 hours |
| RAG pipeline | Transcripts chunked (segment-based or sliding-window, 600 tokens, 100 overlap) → Nomic nomic-embed-text-v1.5 embeddings cached in Redis (7-day TTL) → BullMQ background indexing → MongoDB $vectorSearch + MMR reranking (λ=0.6, top-5) → Groq llama-3.3-70b-versatile with 30-message session history; SSE streaming |
| Redis test suite | 25 tests covering distributed cache, locks, rate limiting, pub/sub, batch ops, reconnection recovery; CI runs 20 via npm run test:redis:ci |
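
The MMR reranking step in the RAG pipeline (λ=0.6, top-5) can be illustrated with a dependency-free sketch. It uses the standard maximal-marginal-relevance score, λ · relevance minus (1 − λ) · redundancy, with cosine similarity for both terms. The backend itself is Node.js, so the function names and data shapes here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(
    query_vec: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    lam: float = 0.6,
    top_k: int = 5,
) -> List[str]:
    """Greedily pick chunks that are relevant to the query yet dissimilar to earlier picks."""
    remaining = list(candidates)
    selected: List[Tuple[str, Sequence[float]]] = []

    def score(item: Tuple[str, Sequence[float]]) -> float:
        relevance = cosine(query_vec, item[1])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]
```

Compared with ranking by relevance alone, a near-duplicate of an already selected chunk is pushed down in favour of a chunk that covers different content.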

🚀 Services

| Service | Runtime | Hosted on | Role |
| --- | --- | --- | --- |
| Frontend | React SPA | Render | UI, WebRTC, emotion capture, chat, transcript viewer |
| Backend | Node.js / Express + Socket.IO | Render | Signalling, auth, room management, transcript storage |
| Emotion Service | Python / FastAPI + Socket.IO | Azure | Real-time multimodal emotion inference |
| Transcript Service | Python / FastAPI | Azure | Post-meeting ASR, per-segment emotion, callback delivery |

Transports

| Transport | Between | Purpose |
| --- | --- | --- |
| WebRTC | Browser ↔ Browser | Live audio/video — never proxied |
| Socket.IO / WS | Frontend ↔ Backend | SDP/ICE relay, chat, participant state, transcript-request notifications (transcript-request-received, transcript-request-update), emotion status relay (emotion-status) |
| Socket.IO / WS | Frontend ↔ Emotion Service | emotion.frame, audio_chunk, emotion.result |
| HTTP multipart | Frontend → Transcript Service | Audio blob upload after meeting ends |
| HTTP REST | Frontend ↔ Backend | Auth, rooms, transcripts, transcript-requests, meeting history, RAG Q&A |

⚡ Running Locally

Quick start

chmod +x dev.sh   # one-time
./dev.sh          # starts all 4 services with colour-coded output

| Prefix | Service | Port |
| --- | --- | --- |
| FRONTEND | React SPA | 3000 |
| BACKEND | Node.js / Express | 8000 |
| EMOTION | FastAPI emotion inference | 5002 |
| TRANSCRIPT | FastAPI transcription | 5001 |

Start MongoDB and Redis first. Python venvs must exist at emotion_service/venv and transcript_service/venv; dev.sh invokes them directly via ./emotion_service/venv/bin/python and ./transcript_service/venv/bin/python. Ctrl+C sends SIGINT and kills all child processes cleanly.

Windows: dev.sh is a bash script. Use WSL2 (recommended), Git Bash, or start each service manually in four separate terminals — see docs/CONTRIBUTING.md for the PowerShell commands.

Step by step

1 — MongoDB + Redis

mongod --dbpath /data/db
redis-server

2 — Backend

cd backend && npm install
cp .env.example .env             # fill in JWT_SECRET, MONGO_URI, GROQ_API_KEY etc.
npm run dev                      # single-process dev (nodemon)
npm run prod                     # production: pm2 start ecosystem.config.cjs (3 processes)

Redis tests:

npm run test:redis      # 25 tests (kills + restarts local Redis)
npm run test:redis:ci   # 20 tests (no recovery tests — safe for CI)

3 — Emotion Service

cd emotion_service
python3.12 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
uvicorn app:app --host 0.0.0.0 --port 5002 --reload

models/ must contain best_modal.pt, xgb_model.joblib, weights.json, anomaly detectors, and embeddings/face_landmarker.task. The server refuses to start if any model fails to load.

4 — Transcript Service

cd transcript_service
python3.13 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
uvicorn app:app --host 0.0.0.0 --port 5001

ffmpeg must be in PATH — validated at startup. Whisper + DistilRoBERTa download from HuggingFace on first run. Do not use python app.py — invoke via uvicorn app:app directly. (dev.sh runs this without --reload; the emotion service uses --reload.)

5 — Frontend

cd frontend && npm install
npm start        # dev
npm run build    # production

🧩 Engineering Challenges

1 — Multi-process Socket.IO fan-out — @socket.io/redis-adapter delivers events across all 3 pm2 instances via Redis pub/sub. All room state lives in Redis so any process can serve any client.

2 — Concurrent join races — A Redis distributed lock (SET NX PX 10000, Lua CAS release) serialises participant state mutations within a 10-second window per room.

3 — CPU-bound inference without blocking — The emotion service offloads PyTorch and MediaPipe to a thread-pool executor. Backpressure events throttle the client when the face queue depth hits 3.

4 — Async transcript delivery with no shared state — Services share no DB or queue. The transcript service delivers via HTTP POST callback. The frontend polls using exponential backoff (5 s → 10 s → 20 s → 40 s with ±20% jitter, repeating at 40 s up to a 10-minute cap) — fully decoupled.

5 — Parallel media capture in the browser — Host simultaneously captures frames for emotion, records audio for transcription, and plays WebRTC video via three independent tap points.

6 — Reconnect state gap — Backend reconstructs participant records from Redis on reconnect; the emotion service holds per-participant state in process memory. The two stores are not reconciled — stale buffers may persist after reconnect.
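
The join lock from challenge 2 follows a common Redis pattern: SET with NX and PX to acquire, and a Lua compare-and-delete to release only a lock you still own. The sketch below shows that pattern in Python against a redis-py style client (set with nx/px keyword arguments, and eval). The backend is Node.js, so the key format and function names are hypothetical.

```python
import uuid
from typing import Optional

# Delete the key only if it still holds our token (compare-and-delete).
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


def lock_key(room_code: str) -> str:
    return f"lock:join:{room_code}"  # hypothetical key naming


def acquire_join_lock(client, room_code: str, ttl_ms: int = 10_000) -> Optional[str]:
    """Try to take the per-room lock; return an ownership token or None if it is held."""
    token = uuid.uuid4().hex
    acquired = client.set(lock_key(room_code), token, nx=True, px=ttl_ms)
    return token if acquired else None


def release_join_lock(client, room_code: str, token: str) -> bool:
    """Release the lock only if this caller still owns it."""
    return client.eval(RELEASE_SCRIPT, 1, lock_key(room_code), token) == 1
```

The token check matters because the 10-second expiry can fire while a slow holder is still working; without it, that holder would delete a lock that now belongs to another process.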


⚠️ Known Limitations

| Area | Limitation |
| --- | --- |
| BullMQ worker in-process | The RAG indexing indexWorker runs in the same Node.js process as the HTTP/WebSocket server. Under sustained indexing load, event-loop contention may increase WebSocket and HTTP response latency because indexing workers are not isolated into a dedicated process. |
| Inference scaling | Emotion service in-process state cannot be horizontally scaled without externalising to Redis. Transcript service model singletons have the same constraint. |
| Transcript delivery | An empty merged-segment result causes a silent no-callback. A 4xx response from the backend also causes silent loss (only network errors and 5xx are retried). |
| NODE_API timeout | requests.post(..., timeout=None) — a hung backend blocks the transcript background thread indefinitely across all retry attempts. |
| Cleanup timer | cleanupOldMeetings runs in all 3 pm2 processes independently every hour — no distributed leader election. |
| Transcription language | Whisper hardcoded to language="en" — multilingual meetings produce degraded output. |
| Orchestration | No unified supervisor across 4 services. Only the emotion service exposes GET /health and GET /ready. |
| CORS | Backend allows localhost:3000 + one CLIENT_ORIGIN. Additional origins require a code change. |
| Chat history | Capped at 500 messages — no archival or export. |
| Inference concurrency | Shared asr_model and emotion_pipeline singletons in the transcript service have no explicit locking; thread safety depends on upstream library internals. |

📦 Dataset

The EmotionTransformer + XGBoost ensemble was trained on RAVDESS and CREMA-D datasets with actor-disjoint train/val/test splits (strict speaker-independent evaluation). Hyperparameters were tuned via Optuna separately for the Transformer and XGBoost models.
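
An actor-disjoint split means every recording by a given actor lands in exactly one of train, val, or test. The sketch below shows the idea by splitting on actor IDs rather than on individual samples. The 70/15/15 ratios, the seed, and the function name are hypothetical; the actual training pipeline may differ.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def actor_disjoint_split(
    samples: Sequence[Tuple[Hashable, object]],
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),  # hypothetical ratios
    seed: int = 0,
) -> Dict[str, List[Tuple[Hashable, object]]]:
    """Split (actor_id, sample) pairs so that no actor appears in two splits."""
    by_actor = defaultdict(list)
    for actor_id, sample in samples:
        by_actor[actor_id].append(sample)

    actors = sorted(by_actor)              # deterministic order before shuffling
    random.Random(seed).shuffle(actors)

    n_train = int(len(actors) * ratios[0])
    n_val = int(len(actors) * ratios[1])
    groups = {
        "train": actors[:n_train],
        "val": actors[n_train:n_train + n_val],
        "test": actors[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [(actor, s) for actor in ids for s in by_actor[actor]]
        for name, ids in groups.items()
    }
```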

Download: dataset.npz — Google Drive

Place under emotion_service/extracted_dataset/ before running the training pipeline. Pre-trained model files under models/ are all that's needed to run the inference server. See docs/emotion-service.md for the full training procedure.


🤝 Contributing

See docs/CONTRIBUTING.md — covers prerequisites, local setup, env configuration, load testing, and PR checklist.


📚 Documentation

| File | Contents |
| --- | --- |
| docs/frontend.md | Hook architecture, WebRTC lifecycle, emotion pipeline, event contracts, error handling |
| docs/backend.md | Routes, Socket.IO handlers, Redis lock design, pm2 config, RAG pipeline, API contracts, security |
| docs/realTimeEmotionService.md | Inference pipeline, model training, configuration schema, performance |
| docs/transcript_service.md | ASR pipeline, segment merging, callback schema, error handling |
| docs/CONTRIBUTING.md | Setup guide, prerequisites, contribution workflow |

License

MIT — see LICENSE.



Built something cool with Hoovik? Open a PR — contributions are welcome.


Star this repo if it's useful — it helps more people find it.



About

Hoovik: Distributed meeting intelligence platform with WebRTC peer-to-peer video, real-time multimodal emotion inference, speaker-aware transcription, RAG-powered transcript search, AI-generated meeting insights, and Redis/BullMQ-backed background processing.
