The map of mailrag's docs. Each doc has one job; read them in the order that
matches how deep you want to go.
- `README.md` (repo root) - start here. What `mailrag` is, the one-command quickstart (`make demo`), the architecture sketch, and the case study (cleanup economics + the measured retrieval ladder).
- `QUICKSTART.md` - the 5-minute path and copy-paste usage patterns against a built collection.
- `SETUP.md` - full setup: the two conda environments, Qdrant, configuration, the local `.eml` pipeline (Pass-1 filter → LLM Pass-2 → build), and how to run the tests.
- Deep dives (read any, in any order):
  - `ARCHITECTURE.md` - design decisions and extension points.
  - `EMAIL_PREPROCESSING.md` - reply-chain stripping, posting styles, and chunk-size tuning.
  - `RETRIEVAL_GUIDE.md` - the retrieval stack end-to-end: dense vs learned-sparse, hybrid + RRF, reranking, and thread-aware retrieval (small→big expansion).
  - `EXPERIMENTS.md` - the measured findings: the cleanup funnel, the labeled-eval ladder (§9–§13), the corpus-portability result (§14), confound controls, and the negative results. Its terminology box defines the `C`/`C′` collection labels and the two senses of "thread-aware".
- `BACKENDS.md` - pointing mailrag at the LLM / embedder / vector store of your choice (LM Studio, Ollama, vLLM, NVIDIA NIM, OpenAI, Qdrant): the `RAG_*` variables, per-backend examples, and the dense-only "sparse caveat".
- `CLOUD_STORAGE_SETUP.md` - Azure Blob Storage + Qdrant Cloud (Pinecone optional): batch indexing, cost estimates, validation, and reset.
- `POETRY_MIGRATION.md` - Poetry dependency-management notes.
- `ARCHITECTURE_DIAGRAMS.py` - runnable script that prints the data-lifecycle and query-flow diagrams.
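
For readers who want a feel for the "hybrid + RRF" step named under `RETRIEVAL_GUIDE.md` before opening that guide, here is a minimal sketch of reciprocal rank fusion in general. It is not mailrag's implementation: the function name, the message IDs, and the constant `k=60` (a commonly used default for RRF) are all assumptions made for illustration, and the real stack may delegate fusion to the vector store.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Hypothetical helper for illustration only. Each ID scores
    1 / (k + rank) per list it appears in (rank starts at 1),
    and the per-list scores are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up result lists standing in for a dense and a learned-sparse retriever.
dense_hits = ["msg-3", "msg-1", "msg-7"]
sparse_hits = ["msg-1", "msg-9", "msg-3"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only ranks, the dense and sparse scores never need to be put on a common scale, which is the usual reason to prefer it over a weighted score sum.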
| You want to… | Use |
|---|---|
| Run the public demo | `make demo` → `main.py::run_demo` |
| Build a contextual index from loaded emails | `build_contextual_index(...)` in `src/indexing/contextual_index.py` |
| Query a collection | `build_hybrid_searcher(collection).search()` / `.search_threads()` in `src/query/hybrid.py` |
| Load emails from a source | `load_emails(source="enron" \| "mail_archive_x" \| "azure_blob")` in `src/data/` |
Older docs may mention `EmailIndexer` (`src/indexing/indexer.py`) and `EmailQueryEngine` (`src/query/engine.py`). Those classes have been retired; the live replacements are the three rows above.