Production-grade Retrieval-Augmented Generation system with hybrid semantic + keyword search and re-ranking
Traditional keyword search fails when users phrase questions differently from the source text. Pure vector search misses exact term matches. Neither approach alone is reliable for document Q&A.
RAG Document Intelligence solves this with a hybrid retrieval pipeline that combines both strategies and fuses the results using Reciprocal Rank Fusion (RRF), the same technique used by production search engines, to deliver consistently relevant answers grounded in your documents.
```
                     RAG Document Intelligence

 ┌───────────────────────┐   ┌─────────────────────────────────────┐
 │       Ingestion       │   │            Query Pipeline           │
 │                       │   │                                     │
 │  PDF Upload / Dir     │   │  User Question                      │
 │        │              │   │       │                             │
 │        ▼              │   │       ├─────────────┐               │
 │  PyPDF Loader         │   │       ▼             ▼               │
 │        │              │   │  ┌──────────┐  ┌──────────┐         │
 │        ▼              │   │  │ Semantic │  │ Keyword  │         │
 │  Recursive Text       │   │  │  Search  │  │  Search  │         │
 │  Splitter (1000-char  │   │  │(ChromaDB)│  │  (BM25)  │         │
 │  chunks, 200 overlap) │   │  └────┬─────┘  └────┬─────┘         │
 │        │              │   │       └──────┬──────┘               │
 │        ▼              │   │              ▼                      │
 │  OpenAI Embeddings    │   │  Reciprocal Rank Fusion             │
 │  (text-embedding-     │   │  (k=60, per Cormack 2009)           │
 │   3-small)            │   │              │                      │
 │        │              │   │              ▼                      │
 │        ▼              │   │  Re-ranked Results                  │
 │  ChromaDB + SHA-256   │   │              │                      │
 │  Dedup Store          │   │              ▼                      │
 └───────────────────────┘   │  GPT-4o-mini + Context              │
                             │              │                      │
                             │              ▼                      │
                             │  Answer + Source Citations          │
                             └─────────────────────────────────────┘

 ┌────────────────────────────────────────────────────────────────────┐
 │   FastAPI REST API                  Streamlit Frontend             │
 │   POST /ask     POST /upload        Interactive Document Q&A       │
 │   POST /search  POST /ingest        PDF Upload & Ingestion         │
 │   GET /stats    GET /health         Source Attribution View        │
 └────────────────────────────────────────────────────────────────────┘
```
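The chunking step in the ingestion pipeline above can be sketched in plain Python. This is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter` (which additionally respects separator boundaries such as paragraphs and sentences), using the same 1000-character window and 200-character overlap:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap,
    so a sentence cut at one boundary still appears whole in a neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final window already reaches the end of the text
    return chunks
```

The overlap means each chunk repeats the last 200 characters of the previous one, which is why deduplication (below) hashes whole chunks rather than raw pages.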
| Feature | Description |
|---|---|
| Hybrid Search | Combines dense vector retrieval (ChromaDB) with sparse BM25 keyword matching for robust recall |
| Reciprocal Rank Fusion | Merges ranked lists using RRF (score = 1/(k + rank)); outperforms single-strategy retrieval without tuning |
| Source Attribution | Every answer cites the exact source filename and page number |
| Content Deduplication | SHA-256 content hashing prevents duplicate embeddings across re-ingestions |
| REST API | Full FastAPI backend with OpenAPI docs, file upload, and typed request/response models |
| Interactive UI | Streamlit frontend for drag-and-drop PDF upload, Q&A, and retrieval-only search |
| CLI Ingestion | Batch-process entire directories of PDFs from the command line |
| Configurable | All parameters (chunk size, overlap, top-k, models) via environment variables |
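The deduplication row above comes down to a few lines: hash each chunk's exact text with SHA-256 and skip hashes that were already ingested. A minimal sketch (helper names are illustrative; the project's `embedder.py` may organize this differently):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable SHA-256 fingerprint of a chunk's exact text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def dedupe_chunks(chunks: list[str], seen: set[str]) -> list[str]:
    """Return only chunks whose content hash has not been seen before,
    recording the new hashes in `seen` as a side effect."""
    fresh = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            fresh.append(chunk)
    return fresh
```

Persisting `seen` alongside the vector store is what makes re-ingesting the same PDF a no-op.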
| Strategy | Strength | Weakness |
|---|---|---|
| Semantic only | Understands meaning, synonyms, paraphrasing | Misses exact terms, acronyms, proper nouns |
| Keyword only | Precise term matching, fast | No understanding of meaning or context |
| Hybrid + RRF | Best of both: high recall and precision | Slightly more compute (negligible in practice) |
RRF is effectively parameter-free (the single constant k=60 works well in practice) and consistently improves retrieval quality without requiring training data or score normalization across methods.
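In code, RRF amounts to a few lines; a minimal sketch (function name illustrative, not necessarily the repo's API):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs.
    Each list contributes 1 / (k + rank) to a document's score, so a
    document ranked high by multiple strategies rises to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, the semantic cosine scores and BM25 scores never need to be put on a common scale.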
| Component | Technology |
|---|---|
| LLM | OpenAI GPT-4o-mini |
| Embeddings | OpenAI text-embedding-3-small (1536 dims) |
| Vector Store | ChromaDB (persistent, local) |
| Framework | LangChain 0.3 |
| API | FastAPI + Uvicorn |
| Frontend | Streamlit |
| Keyword Search | BM25Okapi (rank-bm25) |
| Re-ranking | Reciprocal Rank Fusion |
| PDF Processing | PyPDF |
| Config | Pydantic Settings |
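The keyword side relies on rank-bm25's `BM25Okapi`. For intuition, the core Okapi BM25 scoring formula can be sketched in pure Python (a simplified version; the library's exact IDF formula and preprocessing differ slightly):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n   # average document length
    df = Counter()                          # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)                   # term frequency in this document
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores
```

This is what makes the keyword path strong on acronyms and proper nouns: a rare exact term gets a high IDF weight regardless of its embedding-space neighbors.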
```bash
git clone https://github.com/salehA13/rag-document-intelligence.git
cd rag-document-intelligence
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env       # then add your OpenAI API key to .env
```
```bash
# Ingest the sample PDFs
python ingest.py ./docs

# Ingest a single file
python ingest.py path/to/paper.pdf

# Check store stats
python ingest.py --stats .
```

Start the API server:

```bash
uvicorn src.api.server:app --reload --port 8000
```

Interactive docs at: http://localhost:8000/docs
Run the Streamlit frontend:

```bash
streamlit run src/frontend/app.py
```

`GET /health`: Health check.

```json
{ "status": "healthy", "version": "1.0.0" }
```

`GET /stats`: Vector store statistics.

```json
{ "total_documents": 142, "persist_dir": "./data/chroma" }
```

`POST /ask`: Full RAG pipeline; retrieve, re-rank, generate answer with citations.
```bash
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the self-attention mechanism?",
    "top_k": 10,
    "rerank_k": 5
  }'
```

Response:

```json
{
  "answer": "The self-attention mechanism allows each position in a sequence to attend to all other positions...",
  "sources": [
    { "filename": "transformer_survey.pdf", "page": 3 },
    { "filename": "transformer_survey.pdf", "page": 7 }
  ],
  "num_sources": 5
}
```

`POST /search`: Retrieval-only; returns ranked document chunks without LLM generation.
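The same `/ask` call can be made from Python with only the standard library (`API_URL` assumes the local server started above):

```python
import json
import urllib.request

API_URL = "http://localhost:8000"

def build_ask_payload(question: str, top_k: int = 10, rerank_k: int = 5) -> bytes:
    """JSON body matching the /ask request schema shown above."""
    return json.dumps(
        {"question": question, "top_k": top_k, "rerank_k": rerank_k}
    ).encode("utf-8")

def ask(question: str) -> dict:
    """POST a question to /ask and return the parsed answer with sources."""
    request = urllib.request.Request(
        f"{API_URL}/ask",
        data=build_ask_payload(question),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```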
```bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"question": "attention mechanisms", "top_k": 10, "rerank_k": 5}'
```

`POST /upload`: Upload and ingest a single PDF.
```bash
curl -X POST http://localhost:8000/upload \
  -F "file=@paper.pdf"
```

`POST /ingest`: Batch-ingest all PDFs from a directory.
```bash
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"directory": "./docs"}'
```

Run the test suite:

```bash
pytest tests/ -v
```

All 13 tests cover the ingestion pipeline, hybrid search logic, RRF correctness, and API endpoints:
- `tests/test_loader.py`: chunking, metadata enrichment, edge cases
- `tests/test_search.py`: BM25, RRF merging, tokenization
- `tests/test_api.py`: health, stats, upload validation, error handling
```
rag-document-intelligence/
├── src/
│   ├── config.py              # Centralized settings (pydantic-settings)
│   ├── ingestion/
│   │   ├── loader.py          # PDF loading & recursive text chunking
│   │   └── embedder.py        # ChromaDB vector store + SHA-256 dedup
│   ├── search/
│   │   ├── hybrid.py          # Hybrid search: semantic + BM25 + RRF
│   │   └── qa.py              # QA chain with source attribution
│   ├── api/
│   │   ├── models.py          # Typed Pydantic request/response schemas
│   │   └── server.py          # FastAPI application + CORS
│   └── frontend/
│       └── app.py             # Streamlit interactive UI
├── tests/
│   ├── test_loader.py         # Ingestion pipeline tests
│   ├── test_search.py         # Search & RRF tests
│   └── test_api.py            # API endpoint tests
├── docs/                      # Sample PDFs for demo
├── ingest.py                  # CLI ingestion tool
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
└── README.md
```
All settings are managed via environment variables (.env file) with sensible defaults:
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (none) | Your OpenAI API key (required) |
| `OPENAI_MODEL` | `gpt-4o-mini` | LLM for answer generation |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model |
| `CHROMA_PERSIST_DIR` | `./data/chroma` | ChromaDB storage path |
| `CHUNK_SIZE` | `1000` | Characters per chunk |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks |
| `TOP_K` | `10` | Retrieval candidates |
| `RERANK_TOP_K` | `5` | Final results after RRF |
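For illustration, here is a stdlib-only sketch of reading these variables with the defaults from the table (the project itself uses pydantic-settings, which adds type validation and `.env` loading on top of this idea):

```python
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Settings:
    """Environment-driven settings mirroring the defaults in the table above."""
    openai_model: str = field(
        default_factory=lambda: os.getenv("OPENAI_MODEL", "gpt-4o-mini"))
    embedding_model: str = field(
        default_factory=lambda: os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"))
    chroma_persist_dir: str = field(
        default_factory=lambda: os.getenv("CHROMA_PERSIST_DIR", "./data/chroma"))
    chunk_size: int = field(
        default_factory=lambda: int(os.getenv("CHUNK_SIZE", "1000")))
    chunk_overlap: int = field(
        default_factory=lambda: int(os.getenv("CHUNK_OVERLAP", "200")))
    top_k: int = field(
        default_factory=lambda: int(os.getenv("TOP_K", "10")))
    rerank_top_k: int = field(
        default_factory=lambda: int(os.getenv("RERANK_TOP_K", "5")))
```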
Built by Saleh