Documentation index

The map of mailrag's docs. Each doc has one job; read them in the order that matches how deep you want to go.

Reader journey

1. README.md (repo root): start here. What mailrag is, the one-command quickstart (`make demo`), the architecture sketch, and the case study (cleanup economics + the measured retrieval ladder).
2. QUICKSTART.md: the 5-minute path and copy-paste usage patterns against a built collection.
3. SETUP.md: full setup, covering the two conda environments, Qdrant, configuration, the local .eml pipeline (Pass-1 filter → LLM Pass-2 → build), and how to run the tests.
4. Deep dives (read any, in any order):
   - ARCHITECTURE.md: design decisions and extension points.
   - EMAIL_PREPROCESSING.md: reply-chain stripping, posting styles, and chunk-size tuning.
   - RETRIEVAL_GUIDE.md: the retrieval stack end-to-end, covering dense vs learned-sparse, hybrid + RRF, reranking, and thread-aware retrieval (small→big expansion).
   - EXPERIMENTS.md: the measured findings, namely the cleanup funnel, the labeled-eval ladder (§9–§13), the corpus-portability result (§14), confound controls, and the negative results. Its terminology box defines the C/C′ collection labels and the two senses of "thread-aware".

Operations & reference

- BACKENDS.md: pointing mailrag at the LLM / embedder / vector store of your choice (LM Studio, Ollama, vLLM, NVIDIA NIM, OpenAI, Qdrant). Covers the RAG_* variables, per-backend examples, and the dense-only "sparse caveat".
- CLOUD_STORAGE_SETUP.md: Azure Blob Storage + Qdrant Cloud (Pinecone optional). Covers batch indexing, cost estimates, validation, and reset.
- POETRY_MIGRATION.md: Poetry dependency-management notes.
- ARCHITECTURE_DIAGRAMS.py: runnable script that prints the data-lifecycle and query-flow diagrams.

Live entry points (at a glance)

| You want to… | Use |
| --- | --- |
| Run the public demo | `make demo` → `main.py::run_demo` |
| Build a contextual index from loaded emails | `build_contextual_index(...)` in `src/indexing/contextual_index.py` |
| Query a collection | `build_hybrid_searcher(collection).search()` / `.search_threads()` in `src/query/hybrid.py` |
| Load emails from a source | `load_emails(source="enron" \| "mail_archive_x" \| "azure_blob")` in `src/data/` |

Older docs may mention `EmailIndexer` (`src/indexing/indexer.py`) and `EmailQueryEngine` (`src/query/engine.py`). Those classes have been retired; use the live entry points listed above instead.
