Merged
7 changes: 7 additions & 0 deletions CLAUDE.md
@@ -20,6 +20,11 @@ python server.py # Start MCP server (stdio transport)
 | `python pipeline.py embed` | Generate embeddings, build SQLite DB |
 | `python pipeline.py rebuild` | Clone + chunk + embed |
 | `python pipeline.py stats` | Print database statistics |
+| `python pipeline.py verify` | Run search quality checks |
+| `python pipeline.py stale` | Check for stale chunks (local + upstream) |
+| `python pipeline.py freshness` | Unified freshness report (age, model, sources) |
+| `python pipeline.py ingest` | Incrementally ingest new chunks |
+| `python pipeline.py gotcha` | Tag chunks with known gotchas |
 
 ## Project Structure
 
@@ -40,6 +45,7 @@ python server.py # Start MCP server (stdio transport)
 ## Conventions
 
 - **Chunk format:** Every chunker returns dicts with keys: `id`, `text`, `source`, `module_path`, `type_name`, `category`, `heading`, `file_path`
+- **Database sources:** `db_sources` in config pulls chunks from SQLite databases via SQL queries. Column mapping is config-driven (`text_column`, `heading_column`, etc.). DB sources produce chunks in the same format as file-based chunkers — everything downstream (embed, search, verify) works unchanged.
 - **Chunker registration:** Each chunker calls `register_chunker("name", ClassName)` at module level
 - **Config-driven tools:** MCP tool names and descriptions come from `config.json`, not code
 - **Embedding prefix:** Documents get `"search_document: "`, queries get `"search_query: "` (nomic-embed-text convention)
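As a sketch of how the chunk-format, database-source, and embedding-prefix conventions above fit together. The table, column mapping, and helper names here are illustrative assumptions, not the project's actual API:

```python
import sqlite3

# Illustrative db_source entry; the real mapping comes from config.json.
source = {
    "name": "coding-standards",
    "query": "SELECT id, title, body, category FROM standards",
    "id_column": "id", "text_column": "body",
    "heading_column": "title", "category_column": "category",
    "source_tag": "standards",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE standards (id TEXT, title TEXT, body TEXT, category TEXT)")
conn.execute("INSERT INTO standards VALUES ('s1', 'Errors', 'Fail loudly.', 'style')")
conn.row_factory = sqlite3.Row

chunks = []
for row in conn.execute(source["query"]):
    # Same dict shape as the file-based chunkers, so everything
    # downstream (embed, search, verify) works unchanged.
    chunks.append({
        "id": f"{source['name']}:{row[source['id_column']]}",
        "text": row[source["text_column"]],
        "heading": row[source["heading_column"]],
        "category": row[source["category_column"]],
        "source": source["source_tag"],
        "module_path": "", "type_name": "", "file_path": "",
    })

def embed_input(text, is_query=False):
    """Apply the nomic-embed-text prefix convention before embedding."""
    return ("search_query: " if is_query else "search_document: ") + text

assert chunks[0]["text"] == "Fail loudly."
assert embed_input(chunks[0]["text"]) == "search_document: Fail loudly."
```

Because the DB rows are normalized into the shared chunk schema up front, no downstream stage needs to know whether a chunk came from a file or a SQL query.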
@@ -53,3 +59,4 @@ python server.py # Start MCP server (stdio transport)
 - **Logging:** Both `pipeline.py` and `server.py` use Python's `logging` module with module-level loggers (`log = logging.getLogger(...)`). Pipeline configures logging in `main()`. Server logs to stderr (MCP uses stdout for protocol). CLI usage/help text stays as `print()`.
 - **Transaction batching:** Pipeline wraps each embedding batch in an explicit `BEGIN`/`COMMIT` transaction. Uses `isolation_level=None` for manual control.
 - **Git timeouts:** Clone and pull operations have a 120-second timeout to prevent hung pipelines
+- **Index metadata:** `index_metadata` table stores `indexed_at`, `embed_model`, `embed_dimensions`, and `repo:<name>:commit` for provenance tracking. `cmd_stale` and `cmd_freshness` use this to detect model drift and upstream changes.
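The transaction-batching convention can be sketched as below, with an illustrative schema rather than the project's. `isolation_level=None` disables sqlite3's implicit transaction handling, so `BEGIN`/`COMMIT` bracket each batch explicitly:

```python
import sqlite3

# isolation_level=None gives manual transaction control: sqlite3 no longer
# opens implicit transactions, so BEGIN/COMMIT are issued per batch.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE embeddings (id TEXT PRIMARY KEY, vector BLOB)")

def write_batch(conn, batch):
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO embeddings VALUES (?, ?)", batch)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # a failed batch leaves no partial rows
        raise

write_batch(conn, [("chunk-1", b"\x00"), ("chunk-2", b"\x01")])
count = conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
assert count == 2
```

The payoff of per-batch transactions is atomicity at batch granularity: a crash mid-batch rolls back cleanly instead of leaving half-written rows.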
699 changes: 678 additions & 21 deletions LICENSE

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions README.md
@@ -207,6 +207,12 @@ mcp-rag/
 └── .github/workflows/ CI (lint + test)
 ```
 
+## Development
+
+Built as part of a local AI development infrastructure, then extracted and open-sourced as a standalone tool. Development uses a structured review process: each commit addresses specific findings from code review passes (SQL injection safety, transaction correctness, logging hygiene). The suite has 61 tests, with CI running lint ([ruff](https://github.com/astral-sh/ruff)) and pytest on every push.
+
+See the [commit history](https://github.com/JMRussas/mcp-rag/commits/main) for the review-driven development trail.
+
 ## Limitations
 
 - All embeddings loaded into memory at startup — practical up to ~50k chunks (~150 MB)
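The ~150 MB figure is consistent with float32 vectors at the 768 `embed_dimensions` shown in `config.example.json`:

```python
# Back-of-the-envelope memory check for the in-memory embeddings limitation:
# 50k chunks x 768 dims x 4 bytes (float32) is roughly 146 MiB.
chunks, dims, bytes_per_float = 50_000, 768, 4
total_bytes = chunks * dims * bytes_per_float
assert total_bytes == 153_600_000
assert round(total_bytes / 2**20) == 146  # ~150 MB, matching the stated limit
```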
37 changes: 35 additions & 2 deletions config.example.json
@@ -22,18 +22,38 @@
     "default_top_k": 8,
     "max_top_k": 20,
     "embed_dimensions": 768,
-    "min_score": 0.0
+    "min_score": 0.0,
+    "hybrid": false,
+    "retrieval_depth": 20,
+    "rrf_k": 60,
+    "confidence": {
+      "high": 0.85,
+      "medium": 0.65
+    },
+    "exclude_low_confidence": false
   },
+  "reranker": {
+    "enabled": false,
+    "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
+    "backend": "onnx"
+  },
   "sources": {
     "repos_dir": "data/repos",
-    "chunks_path": "data/chunks.jsonl"
+    "chunks_path": "data/chunks.jsonl",
+    "ingest_path": "data/ingest.jsonl"
   },
   "pipeline": {
     "concurrency": 4,
     "batch_size": 50,
     "progress_interval": 100,
     "max_embed_chars": 6000
   },
+  "verify": {
+    "queries": [
+      {"query": "how to create a class", "min_results": 1},
+      {"lookup": "Application", "min_results": 1}
+    ]
+  },
   "repos": [
     {
       "name": "my-source",
@@ -48,5 +68,18 @@
       "source_tag": "docs",
       "no_recurse": false
     }
   ],
+  "db_sources": [
+    {
+      "name": "coding-standards",
+      "type": "sqlite",
+      "path": "/path/to/standards.db",
+      "query": "SELECT id, title, body, category FROM standards",
+      "text_column": "body",
+      "id_column": "id",
+      "heading_column": "title",
+      "category_column": "category",
+      "source_tag": "standards"
+    }
+  ]
 }
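If `hybrid` search fuses vector and keyword rankings with reciprocal rank fusion (the `rrf_k` setting above suggests as much), the standard formula and the confidence bands can be sketched as follows. Function names are illustrative, not the project's API:

```python
# Reciprocal rank fusion with the rrf_k=60 default: each ranked list
# contributes 1 / (rrf_k + rank) per document, and contributions are summed.
def rrf_fuse(ranked_lists, rrf_k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def confidence_label(score, high=0.85, medium=0.65):
    """Map a similarity score to the config's confidence thresholds."""
    return "high" if score >= high else "medium" if score >= medium else "low"

vector_hits = ["a", "b", "c"]   # semantic ranking
keyword_hits = ["b", "d", "a"]  # keyword ranking
fused = rrf_fuse([vector_hits, keyword_hits])
assert fused[0] == "b"  # ranks 2 and 1 outscore ranks 1 and 3
assert confidence_label(0.9) == "high"
```

A larger `rrf_k` flattens the rank contributions, so agreement across lists matters more than a single top position; 60 is the commonly used default.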