Extract transcriptions, visual descriptions, and smart summaries from videos. Run 100% locally (Whisper + BLIP + Ollama) or via APIs (Groq + Gemini). Designed for long clips, block-by-block summaries, and a customizable final overview.
- 🎙️ Audio transcription in blocks (FFmpeg + local Whisper or Groq Whisper).
- 🖼️ Visual description of representative frames (local BLIP or Gemini Vision).
- 🧠 Multimodal summarization (combines speech + visuals) with configurable size, language, and persona.
- 🧩 Two execution modes:
  - Local: no API keys required (faster-whisper + BLIP + Ollama).
  - API: Groq (STT + LLM) + Google Gemini (image description).
- 🧱 Block processing (`BLOCK_DURATION`) with an aggregated final summary (see the sketch after this list).
- 🌐 Accepts a local file or a URL (via `utils/download_url.py`).
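
A rough sketch of the block loop, assuming OpenCV for frame grabs; the function and variable names here are illustrative, not the repo's actual code:

```python
import cv2  # OpenCV is already a project dependency

BLOCK_DURATION = 30  # seconds per block

def iter_blocks(video_path: str, block_duration: int = BLOCK_DURATION):
    """Yield (start, end, representative_frame) for each block of the video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    duration = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) / fps

    start = 0.0
    while start < duration:
        end = min(start + block_duration, duration)
        # Grab one frame from the middle of the block to describe visually.
        cap.set(cv2.CAP_PROP_POS_MSEC, (start + end) / 2 * 1000)
        ok, frame = cap.read()
        yield start, end, (frame if ok else None)
        start = end
    cap.release()
```
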
```
video-analysis/
├─ api-models/
│  └─ main.py             # Pipeline using Groq + Gemini
├─ local-models/
│  └─ main.py             # Pipeline using Whisper/BLIP + Ollama
├─ utils/
│  ├─ __init__.py
│  └─ download_url.py     # Download from URLs (yt-dlp)
├─ downloads/             # Downloaded videos / temp artifacts
└─ .gitignore
```
Note: folder/file names may vary. Keep them as above to follow this guide verbatim.
- Python 3.10+
- FFmpeg (required to extract audio)
- OpenCV, Pillow
Install FFmpeg:
- Windows (winget): `winget install Gyan.FFmpeg`
- Windows (choco): `choco install ffmpeg`
- macOS (brew): `brew install ffmpeg`
- Ubuntu/Debian: `sudo apt update && sudo apt install -y ffmpeg`
Verify: `ffmpeg -version`
```bash
git clone https://github.com/Ga0512/video-analysis.git
cd video-analysis

python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

python -m pip install -U pip
pip install opencv-python pillow python-dotenv yt-dlp
```
Install API-specific deps:

```bash
pip install groq google-genai
```
Create `.env` at the project root:

```env
GROQ_API_KEY=put_your_groq_key_here
GOOGLE_API_KEY=put_your_gemini_key_here
```
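
A minimal sketch of how those keys get picked up, assuming the standard `python-dotenv`, `groq`, and `google-genai` client setup; the audio file path below is just an example, and the real flow in `api-models/main.py` may differ:

```python
import os
from dotenv import load_dotenv
from groq import Groq
from google import genai

load_dotenv()  # reads .env from the current working directory

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
gemini_client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

# Example: transcribe one audio block with Groq's hosted Whisper.
with open("block_000.mp3", "rb") as f:
    transcription = groq_client.audio.transcriptions.create(
        file=("block_000.mp3", f.read()),
        model="whisper-large-v3",
    )
print(transcription.text)
```
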
Set the video:
- Edit `VIDEO_PATH` in `api-models/main.py` to a local file or a URL (YouTube/Instagram/etc.).
  If it's a URL, the script downloads it automatically via `utils/download_url.py`.
(Optional) Tune parameters at the end of `api-models/main.py`:

```python
BLOCK_DURATION = 30      # seconds per block
LANGUAGE = "english"     # "auto-detect" | "portuguese" | ...
SIZE = "large"           # "short" | "medium" | "large"
PERSONA = "Expert"
EXTRA_PROMPTS = "Write the summary as key bullet points."
```
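
These knobs end up in the summarization prompt. The mapping below is only illustrative; the actual prompt template and `SIZE_TO_TOKENS` values live in the script:

```python
SIZE_TO_TOKENS = {"short": 150, "medium": 400, "large": 800}  # illustrative values

def build_summary_prompt(transcription, frame_description, language, size, persona, extra):
    """Combine speech and visuals into one prompt for the LLM."""
    return (
        f"You are a {persona} video analyst. Answer in {language}.\n"
        f"Keep the summary within roughly {SIZE_TO_TOKENS[size]} tokens.\n\n"
        f"Speech transcript:\n{transcription}\n\n"
        f"Visual description:\n{frame_description}\n\n"
        f"{extra}"
    )
```
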
Run (from the repo root):

```bash
python -m api-models.main
```

Important: run from the repo root so `from utils.download_url import download` resolves correctly.
Install local-specific deps:

```bash
pip install faster-whisper transformers

# PyTorch - pick the variant (CPU/CUDA) that matches your machine
# Example (CPU):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
Prepare Ollama & an LLM:

```bash
# Install Ollama on your system, then:
ollama pull llama3.2:3b
```
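
The local pipeline sends each block's combined prompt to this model. If you want to poke at Ollama directly, a minimal call against its default REST endpoint looks like the sketch below (standard library only; `local-models/main.py` may use a client library instead):

```python
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "llama3.2:3b") -> str:
    """Send a single non-streaming prompt to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default port
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]

print(ollama_generate("Summarize: a cat chases a laser pointer."))
```
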
Set the video:
- Edit `VIDEO_PATH` in `local-models/main.py` to a local file or a URL.
  If it's a URL, the script downloads it automatically via `utils/download_url.py`.
(Optional) Tune parameters at the end of `local-models/main.py`:

```python
BLOCK_DURATION = 30
LANGUAGE = "english"
SIZE = "large"
PERSONA = "Funny"
EXTRA_PROMPTS = "Write the summary as key bullet points."
```
(Optional) Enable GPU for Whisper in `initialize_models()`:

```python
WhisperModel("medium", device="cuda", compute_type="float16")
```
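
A hedged sketch of what an `initialize_models()` with automatic CPU fallback could look like; the model names and sizes here are assumptions and the repo's version may differ:

```python
import torch
from faster_whisper import WhisperModel
from transformers import BlipForConditionalGeneration, BlipProcessor

def initialize_models():
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"

    # Whisper: float16 on GPU, int8 on CPU keeps memory reasonable.
    whisper = WhisperModel(
        "medium",
        device=device,
        compute_type="float16" if use_cuda else "int8",
    )

    # BLIP captioning model for the per-block frame descriptions.
    blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    blip_model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    ).to(device)

    return whisper, blip_processor, blip_model
```
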
Run (from the repo root):

```bash
python -m local-models.main "video.mp4" --block_duration 30 --language english --size medium --persona Expert --extra_prompts "Do it in key points"
```
```env
# API mode
GROQ_API_KEY=...
GOOGLE_API_KEY=...

# Optional
# HTTP(S)_PROXY=http://...
# CUDA_VISIBLE_DEVICES=0
```
Never commit your `.env` to Git.
- Downloads videos from URLs (YouTube, Instagram, etc.) using yt-dlp.
- Saves to `downloads/` and returns the local path to feed the pipeline.
- If you need guaranteed MP4 with AAC audio, adjust the yt-dlp/ffmpeg options there.
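
For reference, a minimal helper in this spirit using yt-dlp's Python API; the real `utils/download_url.py` may use different options or naming:

```python
from pathlib import Path
import yt_dlp

def download(url: str, out_dir: str = "downloads") -> str:
    """Download a video from a URL and return its local path."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    opts = {
        "outtmpl": f"{out_dir}/%(title)s.%(ext)s",
        # Prefer MP4 video + M4A (AAC) audio when available.
        "format": "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best",
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```
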
For each block:
- `start_time`, `end_time`
- `transcription` (speech for the segment)
- `frame_description` (visual description of the frame)
- `audio_summary` (multimodal summary for the block)
Final:
- Final video summary (aggregates all blocks).
Currently printed to the terminal. It’s straightforward to extend to JSON, SRT, or Markdown exports.
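
For example, a small export helper could take the per-block dicts (field names as listed above) and write JSON plus a simple SRT; this is a sketch, not something the repo ships:

```python
import json

def export_blocks(blocks, json_path="summary.json", srt_path="transcript.srt"):
    """blocks: list of dicts with start_time, end_time, transcription, ..."""
    # JSON: dump the block dicts as-is.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(blocks, f, ensure_ascii=False, indent=2)

    # SRT: one subtitle entry per block, using the block transcription.
    def ts(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02}:{m:02}:{s:02},000"

    with open(srt_path, "w", encoding="utf-8") as f:
        for i, b in enumerate(blocks, start=1):
            f.write(f"{i}\n{ts(b['start_time'])} --> {ts(b['end_time'])}\n")
            f.write(f"{b['transcription'].strip()}\n\n")
```
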
- Function signatures & param order: ensure calls to `final_video_summary(...)` match the function signature (API and Local).
- Image MIME for Gemini: if you saved PNG, pass `mime_type='image/png'` (see the snippet after this list).
- Audio in Opus (Windows): if needed, re-encode to AAC with FFmpeg:
  `ffmpeg -i input.ext -c:v libx264 -c:a aac -movflags +faststart output.mp4`
- `ModuleNotFoundError: No module named 'utils'`: run scripts from the repo root and ensure `utils/__init__.py` exists.
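
For the MIME point above, a sketch with the `google-genai` SDK; the frame path and model name are illustrative:

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up GOOGLE_API_KEY from the environment

with open("frame_000.png", "rb") as f:
    frame_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=[
        # mime_type must match the format you actually saved the frame in.
        types.Part.from_bytes(data=frame_bytes, mime_type="image/png"),
        "Describe what is happening in this frame.",
    ],
)
print(response.text)
```
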
- GPU recommended: `WhisperModel(..., device="cuda", compute_type="float16")`.
- Adjust `BLOCK_DURATION` (shorter = finer captions; longer = faster processing).
- Tune `SIZE_TO_TOKENS` according to your LLM.
- For longer videos, cache per-block results to safely resume (see the sketch after this list).
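
One simple way to do that caching, sketched with one JSON file per block index (not something the repo does today):

```python
import json
from pathlib import Path

CACHE_DIR = Path("downloads/cache")

def process_block_cached(index: int, process_fn):
    """Run process_fn() for a block unless a cached result already exists."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"block_{index:04}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = process_fn()  # e.g. transcription + frame description + block summary
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```
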
- Export JSON/SRT/Markdown (per block and final).
- CLI: `klipmind --video <path|url> --mode api|local --lang en --size large ...`
- Web UI (FastAPI/Streamlit) with upload/URL and progress bar.
- Multi-frame sampling per block.
- Model selection (Whisper tiny/base/…; BLIP variants; different LLMs).
- Unit tests for `utils/download_url.py` and parsers.
Contributions are welcome!
Open an issue with suggestions/bugs or submit a PR explaining the change.
MIT
- Whisper (faster-whisper), BLIP (Salesforce), Ollama (local models)
- Groq (STT + Chat Completions-compatible LLM)
- Gemini 2.0 Flash-Lite for vision (frame description)
- FFmpeg, OpenCV, Pillow