
ChatVid - Memvid Dataset Management CLI

A complete command-line tool for managing document datasets with AI-powered embeddings and interactive chat.

Powered by Memvid v1 - Turn millions of text chunks into a single, searchable video file. 🎬

Features

  • ✅ Self-contained virtual environment management
  • ✅ Automatic dependency installation
  • ✅ Multiple dataset support
  • ✅ Support for 11 file formats: Documents (PDF, DOCX, RTF, TXT, MD), Spreadsheets (XLSX, XLS, CSV), Presentations (PPTX), E-books (EPUB), Web (HTML)
  • ✅ Automatic embedding generation with source attribution
  • ✅ Interactive AI chat with your documents
  • ✅ Dataset versioning (append/rebuild)
  • ✅ Simple, intuitive CLI with interactive menu mode
  • ✅ Improved context retrieval (10 chunks, up from 5)
  • ✅ Source file tracking to prevent data mixing
  • ✅ Full environment variable configuration (chunk size, LLM model, temperature, etc.)
  • NEW: Interactive menu system with numbered selection
  • NEW: File management interface
  • NEW: Comprehensive help system

Recent Updates

[1.6.0] - 2025-11-10

Major improvement to text chunking quality with sentence-boundary-aware splitting:

Semantic Chunker (chatvid/chunking.py)
  • SemanticChunker class: Respects sentence boundaries instead of arbitrary character splits
    • Benefits: +25% quality improvement over fixed chunking
    • Preserves complete sentences (no mid-sentence splits)
    • Better semantic coherence per chunk
    • Smart sentence-based overlap for continuity
    • Size range control: min_chunk_size (300) to max_chunk_size (700)
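
For illustration, here is a minimal sketch of sentence-boundary-aware chunking under the same size constraints. This is not the actual SemanticChunker from chatvid/chunking.py, just the idea it describes:

import re

def chunk_by_sentences(text, min_size=300, max_size=700, overlap_sentences=1):
    """Greedily pack whole sentences into chunks of roughly min_size-max_size characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())    # naive sentence splitter
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_size and length >= min_size:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]           # sentence-based overlap for continuity
            length = sum(len(s) + 1 for s in current)
        current.append(sentence)
        length += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks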

Full version history: See CHANGELOG.md

Supported File Formats

ChatVid supports 11 file formats across 5 categories:

Category | Formats | Extensions | Features
Documents | PDF, Word, RTF, Text, Markdown | .pdf, .docx, .doc, .rtf, .txt, .md, .markdown | Full text extraction, metadata
Spreadsheets | Excel, CSV | .xlsx, .xls, .csv | Multi-sheet support, markdown tables, configurable row limits
Presentations | PowerPoint | .pptx | Slides, speaker notes, tables
E-books | EPUB | .epub | EPUB2/EPUB3, chapter extraction, metadata
Web Content | HTML | .html, .htm | Clean text extraction, script/style removal

Format-Specific Notes

  • Spreadsheets: Configurable row limit via MAX_SPREADSHEET_ROWS (default: 10,000) prevents memory issues
  • PowerPoint: Only .pptx supported (not legacy .ppt format)
  • EPUB: DRM-protected files not supported
  • All Formats: Source attribution included automatically for accurate LLM responses
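
Conceptually, extraction is a dispatch on file extension using the libraries listed under Dependencies. The sketch below shows only a few of the formats and is illustrative, not the actual ChatVid code:

from pathlib import Path

def extract_text(path: Path) -> str:
    """Return plain text for a handful of the supported formats."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        from PyPDF2 import PdfReader
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if suffix == ".docx":
        from docx import Document
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix in (".html", ".htm"):
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "lxml")
        for tag in soup(["script", "style"]):                # strip scripts and styles
            tag.decompose()
        return soup.get_text(separator="\n")
    return path.read_text(encoding="utf-8")                  # .txt, .md, and other plain-text formats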

System Requirements

Python Version

ChatVid requires Python 3.10, 3.11, 3.12, or 3.13.

The CLI script automatically detects and validates your Python installation:

  • First run: Detects suitable Python command (python3 or python) and saves it
  • Subsequent runs: Uses saved command for fast startup (~10ms overhead)
  • Auto-recovery: Re-detects if saved command becomes invalid
  • Smart detection: Prefers python3 over python for better cross-platform compatibility
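
For reference, the version requirement expressed as a Python guard (illustrative only; the shipped detection lives in cli.sh):

import sys

if not ((3, 10) <= sys.version_info[:2] <= (3, 13)):
    sys.exit(f"ChatVid requires Python 3.10-3.13, found {sys.version.split()[0]}")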

Manual Override

If you need to use a specific Python executable (e.g., for pyenv/asdf users):

export CHATVID_PYTHON_CMD=/path/to/python3.12
./cli.sh

Checking Your Python Version

python3 --version  # Should show 3.10.x, 3.11.x, 3.12.x, or 3.13.x
# OR
python --version   # If python3 is not available

Installing Python

If you don't have a compatible Python version:

  • macOS: brew install python@3.12
  • Ubuntu/Debian: sudo apt install python3.12
  • Windows: winget install Python.Python.3.12
  • Any OS: Download from python.org
  • Version managers: Use pyenv, asdf, or similar tools

Quick Start

New in v1.2.0: Interactive menu mode! Simply run ./cli.sh and follow the numbered menus - no commands to memorize!

Interactive Mode (Recommended for Beginners)

cd ChatVid
./cli.sh

The interactive menu will guide you through:

  1. Setup - Configure your API key
  2. Create Dataset - Name your dataset
  3. Build - Select dataset and process documents
  4. Chat - Select dataset and start asking questions
  5. File Management - View and manage files
  6. Help - Comprehensive documentation

Benefits:

  • No command memorization needed
  • Numbered selection (just type 1, 2, 3, etc.)
  • Visual dataset status indicators
  • Guided workflows with validation
  • Built-in help and troubleshooting

Command-Line Mode (For Advanced Users)

1. First-Time Setup

cd ChatVid
./cli.sh setup

This will:

  • Create a virtual environment (venv/)
  • Install all dependencies
  • Prompt for your OpenAI or OpenRouter API key

2. Create a Dataset

./cli.sh create my-project

This creates:

datasets/my-project/
├── documents/     # Add your files here
└── metadata.json  # Dataset tracking

3. Add Your Documents

# Copy your files to the documents folder
cp ~/my-documents/*.pdf datasets/my-project/documents/
cp ~/my-documents/*.txt datasets/my-project/documents/

Supported formats:

  • Documents: .pdf, .docx, .doc, .rtf, .txt, .md, .markdown
  • Spreadsheets: .xlsx, .xls, .csv
  • Presentations: .pptx
  • E-books: .epub
  • Web: .html, .htm
4. Build Embeddings

./cli.sh build my-project

This will:

  • Extract text from all documents
  • Add source attribution to prevent data mixing
  • Generate semantic embeddings
  • Create searchable knowledge base (knowledge.mp4)
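
Conceptually, the build step does something like the following with the Memvid v1 encoder. This is a simplified sketch based on the upstream Memvid quick-start API; the exact calls in memvid_cli.py may differ, and `documents` here is a hypothetical list of (filename, extracted_text) pairs:

from memvid import MemvidEncoder

encoder = MemvidEncoder()
for filename, text in documents:                       # documents: hypothetical, produced by text extraction
    # Source attribution is prepended so the LLM can tell files apart
    encoder.add_text(f"[Source: {filename}]\n{text}", chunk_size=300, overlap=50)
# Produces knowledge.mp4 plus the knowledge_index.json / knowledge_index.faiss pair described above
encoder.build_video("knowledge.mp4", "knowledge_index.json")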

5. Start Chatting!

./cli.sh chat my-project

Ask questions about your documents and get AI-powered answers! The AI now correctly distinguishes between different source files.

Complete Command Reference

Interactive Menu Mode

./cli.sh (no arguments)

Start interactive menu with numbered selection

./cli.sh

Menu Options:

  1. Setup / Configure API
  2. Create New Dataset
  3. Build Dataset (Process Documents)
  4. Chat with Dataset
  5. Append Documents to Dataset
  6. Rebuild Dataset
  7. List All Datasets
  8. Dataset Info
  9. Manage Dataset Files - NEW!
  10. Delete Dataset
  11. Help & Documentation - NEW!
  12. Exit

Features:

  • Dataset selection from numbered list
  • File management (view, remove, open folder)
  • Built-in help and tutorials
  • Progress tracking and validation

Setup & Configuration

./cli.sh setup

Configure your API key (first-time setup)

./cli.sh setup

Prompts for your provider choice (OpenAI or OpenRouter) and the corresponding API key.

Saves configuration to .env file.

./cli.sh help

Show comprehensive help documentation - NEW!

./cli.sh help

Displays:

  • Command reference with examples
  • Configuration variable guide
  • Workflow tutorials
  • Troubleshooting tips
  • Configuration presets

Dataset Management

./cli.sh create <name>

Create a new dataset

./cli.sh create research-papers

Creates folder structure at datasets/research-papers/

./cli.sh list

List all datasets with statistics

./cli.sh list

Shows:

  • Dataset names
  • Creation dates
  • Build status
  • Number of chunks and files

./cli.sh info <name>

Show detailed dataset information

./cli.sh info research-papers

Displays:

  • Document list with sizes
  • Build timestamps
  • Embedding statistics
  • File paths

./cli.sh delete <name>

Delete a dataset

./cli.sh delete old-project

Requires confirmation by typing the dataset name.


Building Embeddings

./cli.sh build <name>

Build embeddings from documents

./cli.sh build research-papers

Processes all files in datasets/<name>/documents/ and creates:

  • knowledge.mp4 - QR code video with embeddings
  • knowledge_index.json - Metadata index
  • knowledge_index.faiss - Vector search index

./cli.sh append <name>

Add new documents to existing dataset

# 1. Add new files to documents/
cp new-file.pdf datasets/research-papers/documents/

# 2. Append to embeddings
./cli.sh append research-papers

Note: Currently rebuilds entire dataset (Memvid limitation)

./cli.sh rebuild <name>

Rebuild embeddings from scratch

./cli.sh rebuild research-papers

Deletes existing embeddings and rebuilds from all documents.

When to rebuild:

  • After updating to v1.0.2 (adds source attribution)
  • When chat responses seem inaccurate
  • After changing chunk size settings

Interactive Chat

./cli.sh chat <name>

Start interactive chat session

./cli.sh chat research-papers

Features:

  • Context-aware responses with 10-chunk retrieval window
  • Semantic search across all documents
  • Source attribution prevents data mixing
  • Conversation history
  • Type quit or exit to end

Example session:

You: What did Company A offer in their proposal?
Assistant: [Source: Proposal_CompanyA_3.11.pdf]
Based on the Company A proposal, they offered...

You: What about B pricing?
Assistant: [Source: Proposal_CompanyB_3.11.pdf]
According to the Company B proposal, their pricing structure...

You: quit

Note: The chat now correctly distinguishes between different source files!
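
Under the hood, each turn amounts to a semantic search followed by an LLM call, roughly as in the sketch below. Assumptions: Memvid's retriever API as documented upstream and the OpenAI Python client; ChatVid's actual chat loop also manages conversation history and the configurable settings from .env:

from memvid import MemvidRetriever
from openai import OpenAI

retriever = MemvidRetriever("knowledge.mp4", "knowledge_index.json")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, context_chunks: int = 10) -> str:
    chunks = retriever.search(question, top_k=context_chunks)   # 10-chunk retrieval window
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.7,
        messages=[
            {"role": "system", "content": "Answer from the provided context and cite the [Source: ...] tags."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content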


Complete Workflow Example

Example: Research Paper Analysis

# 1. Setup (first time only)
./cli.sh setup
# Enter your OpenAI API key

# 2. Create dataset
./cli.sh create quantum-research

# 3. Add research papers
cp ~/Downloads/quantum-*.pdf datasets/quantum-research/documents/

# 4. Build embeddings
./cli.sh build quantum-research
# Processing 15 documents...
# Build complete!

# 5. Start chatting
./cli.sh chat quantum-research
You: What are the key breakthroughs in quantum computing?
Assistant: The research papers highlight several key breakthroughs...

# 6. Later: Add more papers
cp new-paper.pdf datasets/quantum-research/documents/
./cli.sh append quantum-research

# 7. Chat with updated knowledge
./cli.sh chat quantum-research

Project Structure

ChatVid/
├── cli.sh                      # Main CLI entry point
├── memvid_cli.py              # Python implementation
├── requirements.txt           # Python dependencies
├── .env.example              # API key template
├── .env                      # Your API key (created by setup)
├── README.md                 # This file
├── venv/                     # Virtual environment (auto-created)
└── datasets/                 # All your datasets
    ├── research-papers/
    │   ├── documents/        # Your PDF, TXT, MD files
    │   ├── metadata.json     # Dataset tracking
    │   ├── knowledge.mp4     # Embeddings (QR video)
    │   ├── knowledge_index.json
    │   └── knowledge_index.faiss
    └── meeting-notes/
        └── documents/

API Keys

OpenAI

  1. Get key: https://platform.openai.com/api-keys
  2. Run ./cli.sh setup
  3. Choose option 1 (OpenAI)
  4. Enter your key: sk-...

OpenRouter

  1. Get key: https://openrouter.ai/keys
  2. Run ./cli.sh setup
  3. Choose option 2 (OpenRouter)
  4. Enter your key: sk-or-v1-...

Or manually edit .env file:

# For OpenAI
OPENAI_API_KEY=sk-your-key

# For OpenRouter
OPENAI_API_BASE=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-v1-your-key

Configuration

ChatVid is fully configurable via environment variables in the .env file.

Available Settings

Chunking Configuration (Build Phase)

Variable | Range | Default | Description
CHUNK_SIZE | 100-1000 | 300 | Size of text chunks in characters
CHUNK_OVERLAP | 20-200 | 50 | Overlap between consecutive chunks

Example: For technical documents with complex topics:

CHUNK_SIZE=400
CHUNK_OVERLAP=80

LLM Configuration (Chat Phase)

Variable | Range | Default | Description
LLM_MODEL | - | gpt-4o-mini-2024-07-18 (OpenAI) or openai/gpt-4o (OpenRouter) | Model to use for chat
LLM_TEMPERATURE | 0.0-2.0 | 0.7 | Response creativity level
LLM_MAX_TOKENS | 100-4000 | 1000 | Maximum response length
CONTEXT_CHUNKS | 1-20 | 10 | Chunks retrieved per query
MAX_HISTORY | 1-50 | 10 | Conversation turns remembered

Note: Setup command automatically uses the correct model based on provider choice.
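
These settings are plain environment variables, typically loaded with python-dotenv. A sketch of how they map to code, with the defaults from the table above (the actual handling in memvid_cli.py may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the current directory

LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini-2024-07-18")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "1000"))
CONTEXT_CHUNKS = int(os.getenv("CONTEXT_CHUNKS", "10"))
MAX_HISTORY = int(os.getenv("MAX_HISTORY", "10"))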

Example: For cost optimization:

LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=500
CONTEXT_CHUNKS=7
MAX_HISTORY=5

Example: For maximum quality:

LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=2000
CONTEXT_CHUNKS=15
MAX_HISTORY=20

Configuration Presets

For Technical Documentation

CHUNK_SIZE=400
CHUNK_OVERLAP=80
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=0.3
CONTEXT_CHUNKS=12

For Creative Content

CHUNK_SIZE=300
CHUNK_OVERLAP=50
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=1.0
CONTEXT_CHUNKS=10

For Cost Optimization

CHUNK_SIZE=300
CHUNK_OVERLAP=40
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_MAX_TOKENS=500
CONTEXT_CHUNKS=7

How to Configure

  1. During setup (recommended):

    ./cli.sh setup

    Creates .env with all default values

  2. Manual editing:

    nano .env  # or use any text editor
  3. Per-project configuration:

    • Copy ChatVid to different directories
    • Each directory can have its own .env file
    • Different settings for different use cases

When to Rebuild

After changing chunking settings (CHUNK_SIZE, CHUNK_OVERLAP), rebuild your datasets:

./cli.sh rebuild <dataset-name>

LLM settings (LLM_MODEL, LLM_TEMPERATURE, etc.) take effect immediately; no rebuild is needed.


File Format Support

Format | Extension | Support | Notes
PDF | .pdf | ✅ Full | Via PyPDF2
Text | .txt | ✅ Full | Plain text
Markdown | .md | ✅ Full | Plain text
Word | .docx | ✅ Full | Via python-docx
Word (old) | .doc | ⚠️ Limited | May require conversion
HTML | .html, .htm | ✅ Full | Via BeautifulSoup4, strips tags/scripts

Troubleshooting

"Module not found" errors

# Reinstall dependencies
./cli.sh setup

"API key not configured"

# Run setup to configure
./cli.sh setup

# Or check .env file exists and has OPENAI_API_KEY=...
cat .env

"No documents found"

# Make sure files are in the right place
ls datasets/my-project/documents/

# Supported extensions include .pdf, .docx, .txt, .md, .html and the rest listed under Supported File Formats

"Embeddings not built"

# Build embeddings first
./cli.sh build my-project

# Then try chat again
./cli.sh chat my-project

Chat mixing up data from different files

Solution: Rebuild your dataset to add source attribution (v1.0.2+)

./cli.sh rebuild my-project

Advanced Usage

Custom Chunk Sizes

Adjust document chunk sizes in the .env file:

CHUNK_SIZE=1000
CHUNK_OVERLAP=200

After modifying these values, rebuild datasets:

./cli.sh rebuild <dataset_name>

Model Configuration

Edit .env to define API source and model:

OPENAI_API_KEY=sk-your-key

For OpenRouter integration:

OPENAI_API_BASE=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-v1-your-key
LLM_MODEL=openai/gpt-4o

Examples of alternative models:

LLM_MODEL=anthropic/claude-haiku-4.5
LLM_MODEL=google/gemini-2.5-pro
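
For reference, this is the standard pattern the OpenAI Python client uses to reach OpenRouter; ChatVid presumably wires this up from the .env values, so you normally never write it yourself:

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1"),
    api_key=os.environ["OPENAI_API_KEY"],
)
response = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "openai/gpt-4o"),
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)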

Multiple Projects

Each dataset is independent:

./cli.sh create work-docs
./cli.sh create personal-notes
./cli.sh create research-papers

# Each has its own embeddings and chat
./cli.sh chat work-docs
./cli.sh chat personal-notes

How It Works

  1. Text Extraction: Reads all supported formats (PDF, Word, RTF, text, Markdown, spreadsheets, PowerPoint, EPUB, HTML) and extracts text
  2. Source Attribution: Prepends [Source: filename.pdf] to each document (v1.0.2+)
  3. Chunking: Splits text into overlapping chunks (~300 chars with 50 char overlap)
  4. Embeddings: Generates semantic vectors using sentence-transformers (all-MiniLM-L6-v2)
  5. QR Encoding: Encodes chunks as QR codes
  6. Video Creation: Creates MP4 video where each frame is a QR code
  7. Vector Index: Builds FAISS index for fast similarity search
  8. Chat: Retrieves 10 most relevant chunks and sends to LLM for contextual answers
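
Steps 4, 7, and 8 are roughly equivalent to the following sketch (simplified; Memvid manages the embedding model and FAISS index for you):

import faiss
from sentence_transformers import SentenceTransformer

chunks = ["[Source: a.pdf] First chunk of text...", "[Source: b.pdf] Second chunk..."]  # from steps 1-3

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)     # step 4: semantic vectors

index = faiss.IndexFlatIP(vectors.shape[1])                   # inner product = cosine on normalized vectors
index.add(vectors)                                            # step 7: vector index

query_vec = model.encode(["What did Company A offer?"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 10)                     # step 8: the 10 most relevant chunks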

Performance

Build Times (approximate)

  • 10 pages: ~30 seconds
  • 100 pages: ~3 minutes
  • 1000 pages: ~20 minutes

Chat Response Times

  • Search: <2 seconds for 1M chunks
  • LLM response: 2-5 seconds (depends on model)

Storage

  • 10K chunks: ~20MB video + ~15MB index
  • Text compression: ~10:1 ratio

Tips & Best Practices

  1. Organize by topic: Create separate datasets for different subjects
  2. Rebuild after updates: Run ./cli.sh rebuild <name> after updating ChatVid
  3. Clean documents: Remove headers/footers for better results
  4. Chunk size: Use larger chunks (400-500) for technical docs, smaller (200-300) for mixed content
  5. API costs: Use GPT-4o-mini (current default) for cost efficiency
  6. Backup: Keep original documents separate from datasets/
  7. File naming: Use descriptive filenames - they appear in source attribution
  8. Context window: 10 chunks is a good default; raise CONTEXT_CHUNKS in .env if you need more

Limitations & Known Issues

  • Append: Currently rebuilds entire dataset (Memvid API limitation)
  • Binary files: Only text-based formats supported
  • OCR: Scanned PDFs require pre-processing with Tesseract (see TODO.md)
  • Large files: Very large PDFs (>100MB) may be slow to process
  • API costs: Chat requires API key with credits
  • Metadata: memvid doesn't support chunk metadata - workaround: source attribution in text
  • Page numbers: Not yet tracked for PDFs (planned in TODO.md)

Dependencies

All automatically installed by ./cli.sh setup:

Core:

  • memvid>=0.1.3 - Memvid v1 - Core embedding storage and semantic search engine

Document Processing:

  • PyPDF2>=3.0.1 - PDF text extraction
  • python-docx>=0.8.11 - Word document support
  • beautifulsoup4>=4.12.0 - HTML/web content parsing
  • lxml>=4.9.0 - HTML parser backend

API & Configuration:

  • openai>=1.0.0 - LLM integration (OpenAI and OpenRouter)
  • python-dotenv>=1.0.0 - Environment variable management

Support

Get API Keys: OpenAI (https://platform.openai.com/api-keys) · OpenRouter (https://openrouter.ai/keys)

Memvid Documentation: https://github.com/Olow304/memvid

Issues:

  • Check ./cli.sh list to verify datasets
  • Check ./cli.sh info <name> for details
  • Ensure API key is set in .env
  • Run ./cli.sh setup to reconfigure

Built With Memvid

ChatVid is powered by Memvid v1 - an innovative library that turns millions of text chunks into a single, searchable video file.

What is Memvid?

Memvid compresses an entire knowledge base into MP4 files while keeping millisecond-level semantic search. Think of it as SQLite for AI memory - portable, efficient, and self-contained.

Key Features:

  • 📦 50-100× smaller storage than traditional vector databases
  • 🎬 Encodes text as QR codes in video frames
  • 🚀 Zero infrastructure required - just video files
  • 🔍 Millisecond-level semantic search
  • 💾 Portable and self-contained

Learn more: https://github.com/Olow304/memvid

Why Memvid?

ChatVid leverages Memvid's unique approach to storing embeddings as video files, making your document knowledge bases portable, efficient, and requiring zero database infrastructure. The entire dataset fits in a single .mp4 file!


Acknowledgments

  • Memvid v1 by Olow304 - The core technology that makes ChatVid possible
  • OpenAI - API for embeddings and chat completions
  • OpenRouter - Alternative API provider supporting multiple models

License

MIT License

Copyright (c) 2025 Esmaabi (ChatVid)

This project is built upon and complies with the MIT License of the Memvid library:

  • Memvid v1: Copyright (c) 2025 Olow304

See LICENSE for full details.


Quick Reference Card

# Interactive Menu (Recommended for beginners)
./cli.sh                       # Start interactive menu - NEW in v1.2.0!

# Help & Documentation
./cli.sh help                  # Comprehensive help - NEW!
./cli.sh --help                # Quick command reference

# Setup
./cli.sh setup                 # First-time configuration

# Datasets
./cli.sh create <name>         # Create new
./cli.sh list                  # Show all
./cli.sh info <name>           # Details
./cli.sh delete <name>         # Remove

# Documents
# → Add files to: datasets/<name>/documents/

# Embeddings
./cli.sh build <name>          # Initial build
./cli.sh append <name>         # Add more docs
./cli.sh rebuild <name>        # Start fresh

# Chat
./cli.sh chat <name>           # Interactive Q&A

# Configuration
# → Edit .env to change: chunk size, model, temperature, etc.

Ready to get started?

  • Beginners: Run ./cli.sh for interactive menu 🎯
  • Advanced: Run ./cli.sh setup for command-line mode 🚀
