Documentation index

The map of mailrag's docs. Each doc has one job; read them in the order that matches how deep you want to go.

Reader journey

  1. README.md (repo root): start here. What mailrag is, the one-command quickstart (make demo), the architecture sketch, and the case study (cleanup economics + the measured retrieval ladder).
  2. QUICKSTART.md — the 5-minute path and copy-paste usage patterns against a built collection.
  3. SETUP.md — full setup: the two conda environments, Qdrant, configuration, the local .eml pipeline (Pass-1 filter → LLM Pass-2 → build), and how to run the tests.
  4. Deep dives (read any, in any order):
    • ARCHITECTURE.md — design decisions and extension points.
    • EMAIL_PREPROCESSING.md — reply-chain stripping, posting styles, and chunk-size tuning.
    • RETRIEVAL_GUIDE.md — the retrieval stack end-to-end: dense vs learned-sparse, hybrid + RRF, reranking, and thread-aware retrieval (small→big expansion).
    • EXPERIMENTS.md — the measured findings: the cleanup funnel, the labeled-eval ladder (§9–§13), the corpus-portability result (§14), confound controls, and the negative results. Its terminology box defines the C/C′ collection labels and the two senses of "thread-aware".
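For orientation before opening RETRIEVAL_GUIDE.md, the two ideas it names (hybrid + RRF, and small→big expansion) can be sketched in a few lines. This is a generic illustration under stated assumptions, not mailrag's implementation: the function names rrf_fuse and expand_to_threads, the constant k=60 (the value commonly used for reciprocal rank fusion), and the toy ids are all hypothetical, and the expansion shown is only one reading of "thread-aware".

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def expand_to_threads(hit_ids, thread_of, threads):
    """Small-to-big: replace each chunk hit with its whole thread, in first-hit order."""
    seen, expanded = set(), []
    for chunk_id in hit_ids:
        thread_id = thread_of[chunk_id]
        if thread_id not in seen:
            seen.add(thread_id)
            expanded.append((thread_id, threads[thread_id]))
    return expanded


# Toy data (hypothetical): two retrievers return ranked message ids.
dense = ["m3", "m1", "m7"]
sparse = ["m1", "m9", "m3"]
fused = rrf_fuse([dense, sparse])

thread_of = {"m1": "t1", "m3": "t1", "m9": "t2", "m7": "t3"}
threads = {"t1": ["m1", "m3"], "t2": ["m9"], "t3": ["m7"]}
expanded = expand_to_threads(fused, thread_of, threads)
```

RRF uses only ranks, so the dense and sparse scores never need to share a scale. The expansion step then trades precision for context by returning whole threads instead of isolated chunks.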

Operations & reference

  • BACKENDS.md — pointing mailrag at the LLM / embedder / vector store of your choice (LM Studio, Ollama, vLLM, NVIDIA NIM, OpenAI, Qdrant): the RAG_* variables, per-backend examples, and the dense-only "sparse caveat".
  • CLOUD_STORAGE_SETUP.md — Azure Blob Storage + Qdrant Cloud (Pinecone optional): batch indexing, cost estimates, validation, and reset.
  • POETRY_MIGRATION.md — Poetry dependency-management notes.
  • ARCHITECTURE_DIAGRAMS.py — runnable script that prints the data-lifecycle and query-flow diagrams.

Live entry points (at a glance)

| You want to… | Use |
| --- | --- |
| Run the public demo | make demo → main.py::run_demo |
| Build a contextual index from loaded emails | build_contextual_index(...) in src/indexing/contextual_index.py |
| Query a collection | build_hybrid_searcher(collection).search() / .search_threads() in src/query/hybrid.py |
| Load emails from a source | load_emails(source="enron" \| "mail_archive_x" \| "azure_blob") in src/data/ |

Older docs may mention EmailIndexer (src/indexing/indexer.py) and EmailQueryEngine (src/query/engine.py). Both classes have been retired; use the live entry points listed above instead.