Merged
7 changes: 7 additions & 0 deletions CLAUDE.md
@@ -20,6 +20,11 @@ python server.py # Start MCP server (stdio transport)
 | `python pipeline.py embed` | Generate embeddings, build SQLite DB |
 | `python pipeline.py rebuild` | Clone + chunk + embed |
 | `python pipeline.py stats` | Print database statistics |
+| `python pipeline.py verify` | Run search quality checks |
+| `python pipeline.py stale` | Check for stale chunks (local + upstream) |
+| `python pipeline.py freshness` | Unified freshness report (age, model, sources) |
+| `python pipeline.py ingest` | Incrementally ingest new chunks |
+| `python pipeline.py gotcha` | Tag chunks with known gotchas |
 
 ## Project Structure
 
@@ -40,6 +45,7 @@ python server.py # Start MCP server (stdio transport)
 ## Conventions
 
 - **Chunk format:** Every chunker returns dicts with keys: `id`, `text`, `source`, `module_path`, `type_name`, `category`, `heading`, `file_path`
+- **Database sources:** `db_sources` in config pulls chunks from SQLite databases via SQL queries. Column mapping is config-driven (`text_column`, `heading_column`, etc.). DB sources produce chunks in the same format as file-based chunkers — everything downstream (embed, search, verify) works unchanged.
 - **Chunker registration:** Each chunker calls `register_chunker("name", ClassName)` at module level
 - **Config-driven tools:** MCP tool names and descriptions come from `config.json`, not code
 - **Embedding prefix:** Documents get `"search_document: "`, queries get `"search_query: "` (nomic-embed-text convention)
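As a sketch of how the chunk-format, database-source, and embedding-prefix conventions above fit together. The table, column mapping, and helper names here are illustrative assumptions, not the project's actual API:

```python
import sqlite3

# Illustrative db_source entry; the real mapping comes from config.json.
source = {
    "name": "coding-standards",
    "query": "SELECT id, title, body, category FROM standards",
    "id_column": "id", "text_column": "body",
    "heading_column": "title", "category_column": "category",
    "source_tag": "standards",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE standards (id TEXT, title TEXT, body TEXT, category TEXT)")
conn.execute("INSERT INTO standards VALUES ('s1', 'Errors', 'Fail loudly.', 'style')")
conn.row_factory = sqlite3.Row

chunks = []
for row in conn.execute(source["query"]):
    # Same dict shape as the file-based chunkers, so everything
    # downstream (embed, search, verify) works unchanged.
    chunks.append({
        "id": f"{source['name']}:{row[source['id_column']]}",
        "text": row[source["text_column"]],
        "heading": row[source["heading_column"]],
        "category": row[source["category_column"]],
        "source": source["source_tag"],
        "module_path": "", "type_name": "", "file_path": "",
    })

def embed_input(text, is_query=False):
    """Apply the nomic-embed-text prefix convention before embedding."""
    return ("search_query: " if is_query else "search_document: ") + text

assert chunks[0]["text"] == "Fail loudly."
assert embed_input(chunks[0]["text"]) == "search_document: Fail loudly."
```

Because the DB rows are normalized into the shared chunk schema up front, no downstream stage needs to know whether a chunk came from a file or a SQL query.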
@@ -53,3 +59,4 @@ python server.py # Start MCP server (stdio transport)
 - **Logging:** Both `pipeline.py` and `server.py` use Python's `logging` module with module-level loggers (`log = logging.getLogger(...)`). Pipeline configures logging in `main()`. Server logs to stderr (MCP uses stdout for protocol). CLI usage/help text stays as `print()`.
 - **Transaction batching:** Pipeline wraps each embedding batch in an explicit `BEGIN`/`COMMIT` transaction. Uses `isolation_level=None` for manual control.
 - **Git timeouts:** Clone and pull operations have a 120-second timeout to prevent hung pipelines
+- **Index metadata:** `index_metadata` table stores `indexed_at`, `embed_model`, `embed_dimensions`, and `repo:<name>:commit` for provenance tracking. `cmd_stale` and `cmd_freshness` use this to detect model drift and upstream changes.
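The transaction-batching convention can be sketched as below, with an illustrative schema rather than the project's. `isolation_level=None` disables sqlite3's implicit transaction handling, so `BEGIN`/`COMMIT` bracket each batch explicitly:

```python
import sqlite3

# isolation_level=None gives manual transaction control: sqlite3 no longer
# opens implicit transactions, so BEGIN/COMMIT are issued per batch.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE embeddings (id TEXT PRIMARY KEY, vector BLOB)")

def write_batch(conn, batch):
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO embeddings VALUES (?, ?)", batch)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # a failed batch leaves no partial rows
        raise

write_batch(conn, [("chunk-1", b"\x00"), ("chunk-2", b"\x01")])
count = conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
assert count == 2
```

The payoff of per-batch transactions is atomicity at batch granularity: a crash mid-batch rolls back cleanly instead of leaving half-written rows.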
699 changes: 678 additions & 21 deletions LICENSE

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions README.md
@@ -207,6 +207,12 @@ mcp-rag/
 └── .github/workflows/ CI (lint + test)
 ```
 
+## Development
+
+Built as part of a local AI development infrastructure, then extracted and open-sourced as a standalone tool. Development uses a structured review process: each commit addresses specific findings from code review passes (SQL injection safety, transaction correctness, logging hygiene). The suite has 61 tests, with CI running lint ([ruff](https://github.com/astral-sh/ruff)) and pytest on every push.
+
+See the [commit history](https://github.com/JMRussas/mcp-rag/commits/main) for the review-driven development trail.
+
 ## Limitations
 
 - All embeddings loaded into memory at startup — practical up to ~50k chunks (~150 MB)
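The ~150 MB figure is consistent with float32 vectors at the 768 `embed_dimensions` shown in `config.example.json`:

```python
# Back-of-the-envelope memory check for the in-memory embeddings limitation:
# 50k chunks x 768 dims x 4 bytes (float32) is roughly 146 MiB.
chunks, dims, bytes_per_float = 50_000, 768, 4
total_bytes = chunks * dims * bytes_per_float
assert total_bytes == 153_600_000
assert round(total_bytes / 2**20) == 146  # ~150 MB, matching the stated limit
```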
37 changes: 35 additions & 2 deletions config.example.json
@@ -22,18 +22,38 @@
     "default_top_k": 8,
     "max_top_k": 20,
     "embed_dimensions": 768,
-    "min_score": 0.0
+    "min_score": 0.0,
+    "hybrid": false,
+    "retrieval_depth": 20,
+    "rrf_k": 60,
+    "confidence": {
+      "high": 0.85,
+      "medium": 0.65
+    },
+    "exclude_low_confidence": false
   },
+  "reranker": {
+    "enabled": false,
+    "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
+    "backend": "onnx"
+  },
   "sources": {
     "repos_dir": "data/repos",
-    "chunks_path": "data/chunks.jsonl"
+    "chunks_path": "data/chunks.jsonl",
+    "ingest_path": "data/ingest.jsonl"
   },
   "pipeline": {
     "concurrency": 4,
     "batch_size": 50,
     "progress_interval": 100,
     "max_embed_chars": 6000
   },
+  "verify": {
+    "queries": [
+      {"query": "how to create a class", "min_results": 1},
+      {"lookup": "Application", "min_results": 1}
+    ]
+  },
   "repos": [
     {
       "name": "my-source",
@@ -48,5 +68,18 @@
       "source_tag": "docs",
       "no_recurse": false
     }
   ],
+  "db_sources": [
+    {
+      "name": "coding-standards",
+      "type": "sqlite",
+      "path": "/path/to/standards.db",
+      "query": "SELECT id, title, body, category FROM standards",
+      "text_column": "body",
+      "id_column": "id",
+      "heading_column": "title",
+      "category_column": "category",
+      "source_tag": "standards"
+    }
+  ]
 }
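If `hybrid` search fuses vector and keyword rankings with reciprocal rank fusion (the `rrf_k` setting above suggests as much), the standard formula and the confidence bands can be sketched as follows. Function names are illustrative, not the project's API:

```python
# Reciprocal rank fusion with the rrf_k=60 default: each ranked list
# contributes 1 / (rrf_k + rank) per document, and contributions are summed.
def rrf_fuse(ranked_lists, rrf_k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def confidence_label(score, high=0.85, medium=0.65):
    """Map a similarity score to the config's confidence thresholds."""
    return "high" if score >= high else "medium" if score >= medium else "low"

vector_hits = ["a", "b", "c"]   # semantic ranking
keyword_hits = ["b", "d", "a"]  # keyword ranking
fused = rrf_fuse([vector_hits, keyword_hits])
assert fused[0] == "b"  # ranks 2 and 1 outscore ranks 1 and 3
assert confidence_label(0.9) == "high"
```

A larger `rrf_k` flattens the rank contributions, so agreement across lists matters more than a single top position; 60 is the commonly used default.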