A Retrieval-Augmented Generation (RAG) chatbot that processes English PDF documents and answers user questions in either English or Hindi. The system uses intelligent query translation and language enforcement to provide accurate, grounded responses regardless of the query language.
- Multilingual Query Support: Ask questions in English or Hindi
- Language-Specific Responses: Choose your preferred response language (English/Hindi)
- PDF Document Ingestion: Upload and process multiple PDF files
- Smart Query Translation: Hindi queries are automatically translated to English for better retrieval
- Accurate Retrieval: Uses dense embeddings + cross-encoder reranking for precise context matching
- Source Citations: Every answer includes document references (filename, page number)
- Grounded Responses: System abstains when answers aren't found in the documents
- Persistent Index: Uploaded documents are indexed once and reused across sessions
User Query (EN/HI) → Language Detection → Query Translation (if needed) → Vector Retrieval → Reranking → LLM Generation → Language Enforcement → Response
1. PDF Processing Pipeline (`rag_utils.py`)
   - Text extraction using `pypdf`
   - Semantic chunking with overlap for context preservation
   - Metadata tracking (filename, page, chunk index)
2. Multilingual Query Handling
   - Language detection using `langdetect`
   - Hindi→English translation using `deep-translator`
   - English embeddings for consistent retrieval
3. Retrieval System
   - Dense embeddings: `sentence-transformers/all-MiniLM-L6-v2`
   - Vector similarity search with FAISS
   - Cross-encoder reranking: `cross-encoder/ms-marco-MiniLM-L-6-v2`
4. Generation & Language Control
   - OpenAI GPT-4o-mini for answer generation
   - Strict prompt engineering for language enforcement
   - Post-generation language verification and correction
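The four components above can be tied together in a single orchestration function. This is a minimal sketch, assuming a `pipeline` object whose method names are hypothetical stand-ins for the real implementations in `rag_utils.py`:

```python
# Hypothetical orchestration of the four components; each pipeline method
# stands in for the corresponding piece of rag_utils.py.

def answer_query(query: str, target_lang: str, pipeline) -> str:
    lang = pipeline.detect_language(query)                  # langdetect
    if lang == "hi":
        query = pipeline.translate(query, "en")             # deep-translator
    hits = pipeline.vector_search(query, k=12)              # FAISS over MiniLM embeddings
    context = pipeline.rerank(query, hits, k=6)             # cross-encoder
    reply = pipeline.generate(query, context, target_lang)  # GPT-4o-mini
    # Language enforcement: translate if the model ignored the instruction
    if pipeline.detect_language(reply) != target_lang:
        reply = pipeline.translate(reply, target_lang)
    return reply
```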
| Component | Technology | Why Chosen |
|---|---|---|
| UI Framework | Streamlit | Rapid prototyping, built-in file upload, minimal code |
| PDF Processing | pypdf + langchain | Reliable text extraction, metadata preservation |
| Text Chunking | RecursiveCharacterTextSplitter | Semantic-aware splitting, configurable overlap |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 | Fast, lightweight, good English performance |
| Vector Store | FAISS | High-performance similarity search, local deployment |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 | Improved precision over pure embedding similarity |
| Translation | deep-translator | Reliable Google Translate API, no async issues |
| Language Detection | langdetect | Fast, accurate language identification |
| LLM | OpenAI GPT-4o-mini | Cost-effective, good multilingual capabilities |
Decision: Translate Hindi queries to English before retrieval
Rationale:
- English PDFs work best with English embeddings
- Avoids multilingual embedding complexity
- Maintains retrieval accuracy
- Simple implementation
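The detect-then-translate step can be sketched in a few lines. Here a Devanagari-range check is a simplified stand-in for `langdetect`, and the `translate` callable stands in for `deep_translator.GoogleTranslator(source="hi", target="en")`:

```python
# Simplified stand-in for the langdetect + deep-translator step: a query
# containing Devanagari characters is treated as Hindi.

def is_hindi(text: str) -> bool:
    """Heuristic: any character in the Devanagari block (U+0900-U+097F)."""
    return any("\u0900" <= ch <= "\u097F" for ch in text)

def normalize_query(query: str, translate) -> str:
    """Translate Hindi queries to English so retrieval stays monolingual."""
    return translate(query) if is_hindi(query) else query
```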
Decision: Use all-MiniLM-L6-v2 instead of multilingual models
Rationale:
- Smaller model size (80MB vs 400MB+)
- Faster inference
- Better English performance
- Translation handles cross-language queries
Decision: Dense retrieval + cross-encoder reranking
Rationale:
- Dense retrieval: Fast, semantic understanding
- Reranking: Improved precision, better relevance scoring
- Best of both worlds for accuracy
Decision: Streamlit over FastAPI/React
Rationale:
- Rapid development
- Built-in components (file upload, chat interface)
- Easy deployment
- Focus on functionality over UI polish
- Python: 3.10 or higher
- Internet Connection: Required for model downloads and translation
- OpenAI API Key: For answer generation (get from OpenAI Platform)
- Memory: ~2GB RAM for model loading
- Storage: ~500MB for models and index files
```
# Create virtual environment
python -m venv .venv

# Activate environment
# Windows PowerShell:
.venv\Scripts\Activate.ps1
# Windows Command Prompt:
.venv\Scripts\activate.bat
# macOS/Linux:
source .venv/bin/activate
```

Install dependencies:

```
pip install -r requirements.txt
```

What gets installed:
- `streamlit`: Web UI framework
- `sentence-transformers`: Embedding models
- `faiss-cpu`: Vector similarity search
- `deep-translator`: Hindi↔English translation
- `langdetect`: Language identification
- `openai`: GPT API client
- `langchain-community`: PDF processing utilities
- `pypdf`: PDF text extraction
Create `.env` file:
```
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4o-mini
RAG_TOP_K=12
RAG_RERANK_K=6
SIMILARITY_ABSTAIN_THRESHOLD=0.25
```

Configuration Parameters:
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `OPENAI_MODEL`: GPT model to use (gpt-4o-mini recommended for cost/performance)
- `RAG_TOP_K`: Number of chunks to retrieve initially (12 = good recall)
- `RAG_RERANK_K`: Number of chunks kept after reranking (6 = focused context)
- `SIMILARITY_ABSTAIN_THRESHOLD`: Minimum similarity required to answer (0.25 = conservative)
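These settings might be read at startup roughly as follows. This is a sketch, not the project's actual loading code; it assumes something like `python-dotenv` has already populated `os.environ` from the `.env` file:

```python
# Hypothetical config loader mirroring the .env parameters above,
# with the documented defaults used when a variable is unset.
import os

def load_rag_config() -> dict:
    return {
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "top_k": int(os.getenv("RAG_TOP_K", "12")),
        "rerank_k": int(os.getenv("RAG_RERANK_K", "6")),
        "abstain_threshold": float(os.getenv("SIMILARITY_ABSTAIN_THRESHOLD", "0.25")),
    }
```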
```
python test_models.py
```

This verifies:
- Embedding model downloads correctly
- Reranker model loads
- FAISS vector operations work
- Translation functionality works
- All dependencies are properly installed
Expected output:

```
Testing model loading...
Loading embedding model...
✓ Embedding model loaded successfully
Loading reranker model...
✓ Reranker model loaded successfully
Testing FAISS...
✓ FAISS working correctly
Testing translation...
✓ Translation working: 'मशीन लर्निंग क्या है?' -> 'What is machine learning?'
🎉 All models loaded successfully! You can now run 'streamlit run app.py'
```
```
streamlit run app.py
```

Open the displayed URL (typically http://localhost:8501).
- Upload PDFs: Click "Upload English PDFs" and select your documents
- Wait for Indexing: System processes and indexes your documents (one-time per file)
- Select Response Language: Choose English or Hindi for responses
- Ask Questions: Type questions in English or Hindi
- Review Answers: Get grounded responses with source citations
English Queries:
- "What is machine learning?"
- "How does neural network training work?"
- "What are the benefits of cloud computing?"
Hindi Queries:
- "मशीन लर्निंग क्या है?" (What is machine learning?)
- "न्यूरल नेटवर्क कैसे काम करता है?" (How do neural networks work?)
- "क्लाउड कंप्यूटिंग के फायदे क्या हैं?" (What are the benefits of cloud computing?)
- 📚 Indexed files: Shows which documents are currently processed
- 🔍 Detected Hindi query: Appears when system translates your query
- Sources: Expandable section showing document references
- Clear: Removes current answer
- Clear Index: Removes all indexed documents (forces re-upload)
PDF → Text Extraction → Chunking (2000 chars, 250 overlap) → Metadata Tagging

Why this approach:
- 2000 character chunks: Balance between context and precision
- 250 character overlap: Prevents information loss at boundaries
- Metadata tracking: Enables accurate source citations
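The sliding window can be illustrated in a few lines. This is a simplified fixed-size chunker; the project itself uses LangChain's `RecursiveCharacterTextSplitter`, which additionally tries to split on paragraph and sentence boundaries:

```python
# Simplified illustration of the 2000/250 sliding window.

def chunk_text(text: str, size: int = 2000, overlap: int = 250) -> list[str]:
    step = size - overlap  # each new chunk starts 1750 chars after the last
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Because consecutive chunks share 250 characters, a sentence that straddles a boundary appears whole in at least one chunk.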
User Query → Language Detection → Translation (if Hindi) → Embedding → Vector Search

Process details:
- Detects query language using statistical models
- Translates Hindi to English for consistent retrieval
- Converts query to 384-dimensional vector
- Searches against indexed document vectors
Vector Search (top 12) → Cross-Encoder Reranking (top 6) → Context Assembly

Two-stage approach:
- Stage 1: Fast vector similarity (semantic matching)
- Stage 2: Precise cross-encoder scoring (relevance ranking)
- Result: Most relevant 6 chunks for answer generation
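The two-stage funnel can be sketched as follows. Cosine similarity over raw vectors is a toy stand-in for FAISS, and the caller-supplied `score_fn` stands in for the cross-encoder; the real system runs the expensive scorer on only the `top_k` stage-1 survivors, which is the whole point of the funnel:

```python
# Toy two-stage retrieval: stage 1 is cheap similarity over the whole index,
# stage 2 is a precise (slower) re-scoring of the stage-1 survivors only.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_and_rerank(query_vec, doc_vecs, docs, score_fn, top_k=12, rerank_k=6):
    # Stage 1: rank every document by vector similarity, keep top_k
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)[:top_k]
    # Stage 2: re-score only the survivors, keep rerank_k
    return sorted((docs[i] for i in ranked), key=score_fn, reverse=True)[:rerank_k]
```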
Context + Query → GPT Prompt → Raw Answer → Language Enforcement → Final Response

Language control:
- System prompt specifies target language
- Post-generation language verification
- Translation if language enforcement fails
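The enforcement step itself is a small guard. In this sketch, `detect_lang` and `translate` are hypothetical callables standing in for `langdetect` and `deep-translator`:

```python
# Post-generation language enforcement: if the LLM ignored the language
# instruction in the system prompt, translate the answer as a fallback.

def enforce_language(answer: str, target: str, detect_lang, translate) -> str:
    if detect_lang(answer) != target:
        return translate(answer, target)
    return answer
```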
For better recall (find more relevant content):

```
RAG_TOP_K=20
RAG_RERANK_K=8
SIMILARITY_ABSTAIN_THRESHOLD=0.20
```

For better precision (more focused answers):

```
RAG_TOP_K=8
RAG_RERANK_K=4
SIMILARITY_ABSTAIN_THRESHOLD=0.35
```

For better multilingual support, replace in `rag_utils.py`:

```
self.embed_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
```

For better English performance:

```
self.embed_model_name = "sentence-transformers/all-mpnet-base-v2"
```

Project structure:

```
QA RAG/
├── app.py              # Main Streamlit application
├── rag_utils.py        # RAG pipeline implementation
├── test_models.py      # Installation verification script
├── requirements.txt    # Python dependencies
├── .env                # Environment template
├── README.md           # This file
├── uploads/            # Uploaded PDF storage
└── data/
    └── index/          # Vector index and metadata storage
```
Models fail to download:

```
pip install --upgrade sentence-transformers transformers torch
pip install --no-cache-dir sentence-transformers
```

Translation errors:
- Ensure internet connection for Google Translate API
- Check if query is properly detected as Hindi

No answers found:
- Lower `SIMILARITY_ABSTAIN_THRESHOLD` in `.env`
- Try rephrasing your question
- Ensure PDFs contain relevant content

Out of memory:
- Reduce `RAG_TOP_K` and `RAG_RERANK_K`
- Use a smaller embedding model
- Process fewer documents at once
Run locally:

```
streamlit run app.py
```

Or deploy with a Dockerfile like:

```
FROM python:3.10-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address", "0.0.0.0"]
```

Performance characteristics:
- Query processing: ~2-3 seconds
- Document indexing: ~1 second per page
- Memory usage: ~1.5GB with models loaded
- Storage: ~10MB per 100-page document
- Support for more languages (Spanish, French, etc.)
- Integration with local LLMs (Llama, Mistral)
- Advanced chunking strategies
- Conversation memory
- Batch query processing