
πŸ” RAG Document Intelligence

CI

> Production-grade Retrieval-Augmented Generation system with hybrid semantic + keyword search and re-ranking

Python FastAPI LangChain ChromaDB License: MIT


## The Problem

Traditional keyword search fails when users phrase questions differently from the source text. Pure vector search misses exact term matches. Neither approach alone is reliable for document Q&A.

RAG Document Intelligence solves this with a hybrid retrieval pipeline that combines both strategies and fuses the results using Reciprocal Rank Fusion (RRF), the same technique used by production search engines, to deliver consistently relevant answers grounded in your documents.


## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                       RAG Document Intelligence                          │
│                                                                          │
│   ┌──────────────────────┐        ┌──────────────────────────────────┐   │
│   │  📄 Ingestion        │        │  🔍 Query Pipeline               │   │
│   │                      │        │                                  │   │
│   │  PDF Upload/Dir      │        │  User Question                   │   │
│   │       │              │        │       │                          │   │
│   │  PyPDF Loader        │        │       ├───────────┐              │   │
│   │       │              │        │       ▼           ▼              │   │
│   │  Recursive Text      │        │  ┌──────────┐ ┌─────────┐        │   │
│   │  Splitter            │        │  │ Semantic │ │ Keyword │        │   │
│   │  (1000-char chunks,  │        │  │  Search  │ │ Search  │        │   │
│   │   200 overlap)       │        │  │(ChromaDB)│ │ (BM25)  │        │   │
│   │       │              │        │  └────┬─────┘ └────┬────┘        │   │
│   │  OpenAI Embeddings   │        │       │            │             │   │
│   │  (text-embedding-    │        │       └─────┬──────┘             │   │
│   │   3-small)           │        │             ▼                    │   │
│   │       │              │        │  Reciprocal Rank Fusion          │   │
│   │  ChromaDB + SHA-256  │        │  (k=60, per Cormack 2009)        │   │
│   │  Dedup Store         │        │             │                    │   │
│   │                      │        │      Re-ranked Results           │   │
│   └──────────────────────┘        │             │                    │   │
│                                   │      GPT-4o-mini + Context       │   │
│                                   │             │                    │   │
│                                   │   Answer + Source Citations      │   │
│                                   └──────────────────────────────────┘   │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐   │
│   │  🌐 FastAPI REST API              📊 Streamlit Frontend          │   │
│   │  POST /ask    POST /upload        Interactive Document Q&A       │   │
│   │  POST /search POST /ingest        PDF Upload & Ingestion         │   │
│   │  GET  /stats  GET  /health        Source Attribution View        │   │
│   └──────────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────────┘
```
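The splitter settings in the diagram (1000-character chunks with 200 characters of overlap) can be illustrated with a plain sliding-window sketch. This is a simplification: the real `RecursiveCharacterTextSplitter` prefers to break on paragraph and sentence boundaries rather than at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size sliding-window chunking (simplified illustration)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
# 3 chunks; each chunk repeats the final 200 characters of the previous one,
# so a sentence cut at a boundary still appears whole in one of the chunks.
```

The overlap is what makes retrieval robust to unlucky split points: context that straddles a boundary is preserved intact in at least one chunk.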

## Key Features

| Feature | Description |
|---|---|
| Hybrid Search | Combines dense vector retrieval (ChromaDB) with sparse BM25 keyword matching for robust recall |
| Reciprocal Rank Fusion | Merges ranked lists using RRF (`1/(k+rank)`); outperforms single-strategy retrieval without tuning |
| Source Attribution | Every answer cites the exact source filename and page number |
| Content Deduplication | SHA-256 content hashing prevents duplicate embeddings across re-ingestions |
| REST API | Full FastAPI backend with OpenAPI docs, file upload, and typed request/response models |
| Interactive UI | Streamlit frontend for drag-and-drop PDF upload, Q&A, and retrieval-only search |
| CLI Ingestion | Batch-process entire directories of PDFs from the command line |
| Configurable | All parameters (chunk size, overlap, top-k, models) via environment variables |
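The deduplication feature boils down to hashing each chunk's text before embedding it. A stdlib sketch of the idea (the names `ingest_chunk` and `_seen` are illustrative, not the project's actual API, and a real store would persist the hashes alongside the vectors):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint: identical chunk text always yields the same hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

_seen: set[str] = set()

def ingest_chunk(text: str) -> bool:
    """Embed only unseen content; returns False for a duplicate chunk."""
    h = content_hash(text)
    if h in _seen:
        return False  # skip: this exact content was embedded on an earlier run
    _seen.add(h)
    # ... compute the embedding and write it to the vector store here ...
    return True
```

Because the hash depends only on content, re-ingesting the same PDF (or the same paragraph appearing in two files) never produces duplicate embeddings.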

## Why Hybrid Search + RRF?

| Strategy | Strength | Weakness |
|---|---|---|
| Semantic only | Understands meaning, synonyms, paraphrasing | Misses exact terms, acronyms, proper nouns |
| Keyword only | Precise term matching, fast | No understanding of meaning or context |
| Hybrid + RRF | Best of both: high recall and precision | Slightly more compute (negligible in practice) |
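For context on the keyword column, here is a plain-Python version of the Okapi BM25 formula that `rank-bm25`'s `BM25Okapi` implements; this sketch is shown instead of the library call so the scoring itself is visible.

```python
import math

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each tokenized document against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    scores = [0.0] * n
    for term in set(query):
        df = sum(1 for d in docs if term in d)     # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(term)                     # term frequency
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
    return scores

docs = [["attention", "is", "all", "you", "need"],
        ["convolutional", "networks", "for", "image", "recognition"]]
scores = bm25_scores(["attention"], docs)  # only the first doc contains the term
```

The `tf` saturation (controlled by `k1`) and length normalization (controlled by `b`) are exactly what makes BM25 strong on rare exact terms like acronyms, which dense embeddings often blur.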

RRF has a single constant (k=60) rather than tunable parameters, and it consistently improves retrieval quality without requiring training data or score normalization across methods.
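The fusion step is small enough to show in full. A minimal self-contained sketch of RRF over two ranked lists of document IDs (the repository's own implementation in `src/search/hybrid.py` may differ in details):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # dense retrieval order
keyword = ["doc_c", "doc_a", "doc_d"]   # BM25 order
fused = reciprocal_rank_fusion([semantic, keyword])
# docs ranked well by both retrievers (doc_a, doc_c) float to the top
```

Because RRF only looks at ranks, it sidesteps the incompatibility between cosine similarities and BM25 scores entirely; no normalization is needed before merging.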


## Tech Stack

| Component | Technology |
|---|---|
| LLM | OpenAI GPT-4o-mini |
| Embeddings | OpenAI `text-embedding-3-small` (1536 dims) |
| Vector Store | ChromaDB (persistent, local) |
| Framework | LangChain 0.3 |
| API | FastAPI + Uvicorn |
| Frontend | Streamlit |
| Keyword Search | BM25Okapi (`rank-bm25`) |
| Re-ranking | Reciprocal Rank Fusion |
| PDF Processing | PyPDF |
| Config | Pydantic Settings |

## Quick Start

### 1. Clone & Install

```bash
git clone https://github.com/salehA13/rag-document-intelligence.git
cd rag-document-intelligence

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt
```

### 2. Configure

```bash
cp .env.example .env
# Add your OpenAI API key to .env
```

### 3. Ingest Documents

```bash
# Ingest the sample PDFs
python ingest.py ./docs

# Ingest a single file
python ingest.py path/to/paper.pdf

# Check store stats
python ingest.py --stats .
```

### 4. Start the API

```bash
uvicorn src.api.server:app --reload --port 8000
```

Interactive docs at: http://localhost:8000/docs

### 5. Launch the UI (optional)

```bash
streamlit run src/frontend/app.py
```

## API Reference

### GET /health

Health check.

```json
{ "status": "healthy", "version": "1.0.0" }
```

### GET /stats

Vector store statistics.

```json
{ "total_documents": 142, "persist_dir": "./data/chroma" }
```

### POST /ask

Full RAG pipeline: retrieve, re-rank, and generate an answer with citations.

```bash
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the self-attention mechanism?",
    "top_k": 10,
    "rerank_k": 5
  }'
```

```json
{
  "answer": "The self-attention mechanism allows each position in a sequence to attend to all other positions...",
  "sources": [
    { "filename": "transformer_survey.pdf", "page": 3 },
    { "filename": "transformer_survey.pdf", "page": 7 }
  ],
  "num_sources": 5
}
```

### POST /search

Retrieval-only: returns ranked document chunks without LLM generation.

```bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"question": "attention mechanisms", "top_k": 10, "rerank_k": 5}'
```

### POST /upload

Upload and ingest a single PDF.

```bash
curl -X POST http://localhost:8000/upload \
  -F "file=@paper.pdf"
```

### POST /ingest

Batch-ingest all PDFs from a directory.

```bash
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"directory": "./docs"}'
```

## Testing

```bash
pytest tests/ -v
```

The 13 tests cover the ingestion pipeline, hybrid search logic, RRF correctness, and the API endpoints:

- `tests/test_loader.py`: chunking, metadata enrichment, edge cases
- `tests/test_search.py`: BM25, RRF merging, tokenization
- `tests/test_api.py`: health, stats, upload validation, error handling

## Project Structure

```
rag-document-intelligence/
├── src/
│   ├── config.py                  # Centralized settings (pydantic-settings)
│   ├── ingestion/
│   │   ├── loader.py              # PDF loading & recursive text chunking
│   │   └── embedder.py            # ChromaDB vector store + SHA-256 dedup
│   ├── search/
│   │   ├── hybrid.py              # Hybrid search: semantic + BM25 + RRF
│   │   └── qa.py                  # QA chain with source attribution
│   ├── api/
│   │   ├── models.py              # Typed Pydantic request/response schemas
│   │   └── server.py              # FastAPI application + CORS
│   └── frontend/
│       └── app.py                 # Streamlit interactive UI
├── tests/
│   ├── test_loader.py             # Ingestion pipeline tests
│   ├── test_search.py             # Search & RRF tests
│   └── test_api.py                # API endpoint tests
├── docs/                          # Sample PDFs for demo
├── ingest.py                      # CLI ingestion tool
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
└── README.md
```

## Configuration

All settings are managed via environment variables (`.env` file) with sensible defaults:

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (none) | Your OpenAI API key (required) |
| `OPENAI_MODEL` | `gpt-4o-mini` | LLM for answer generation |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model |
| `CHROMA_PERSIST_DIR` | `./data/chroma` | ChromaDB storage path |
| `CHUNK_SIZE` | `1000` | Characters per chunk |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks |
| `TOP_K` | `10` | Retrieval candidates |
| `RERANK_TOP_K` | `5` | Final results after RRF |
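With pydantic-settings, declaring these variables is a short class. The sketch below mirrors the field names and defaults in the table; the actual class in `src/config.py` may differ.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Loaded from the environment and .env; fields map to the variables above."""
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str                              # OPENAI_API_KEY (required)
    openai_model: str = "gpt-4o-mini"                # OPENAI_MODEL
    embedding_model: str = "text-embedding-3-small"  # EMBEDDING_MODEL
    chroma_persist_dir: str = "./data/chroma"        # CHROMA_PERSIST_DIR
    chunk_size: int = 1000                           # CHUNK_SIZE
    chunk_overlap: int = 200                         # CHUNK_OVERLAP
    top_k: int = 10                                  # TOP_K
    rerank_top_k: int = 5                            # RERANK_TOP_K

# settings = Settings()  # raises a validation error unless OPENAI_API_KEY is set
```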

## License

MIT


Built by Saleh
