smart-mcp-proxy · Dumbris · May 31, 2026
diff --git a/specs/065-evaluation-foundation/datasets/README.md b/specs/065-evaluation-foundation/datasets/README.md
@@ -1,6 +1,67 @@
 # Spec 065 — Evaluation datasets
 
-## `security_corpus_v1.json` (D2)
+Versioned, frozen evaluation artifacts for the tool-retrieval (D1) and security
+(D2) benchmarks. **Immutable once committed — a refresh is `*_v2.*`, never an edit
+of a `*_v1.*` file** (CN-002, FR-012).
+
+> **File-type cheat sheet — read this before running anything:**
+> - **`snapshot-servers.config.json`** is the ONLY servable file — it's a real
+>   mcpproxy config (`mcpproxy serve --config snapshot-servers.config.json`).
+> - **`corpus_v1.tools.json`** is the frozen *tool snapshot* the eval scores
+>   against — it is **NOT** a mcpproxy config; `serve --config corpus_v1.tools.json`
+>   will fail. Likewise `security_corpus_v1.json` is a labeled dataset, not a config.
+
+| File | What it is | Servable? | Committed? |
+|------|------------|-----------|------------|
+| `snapshot-servers.config.json` | mcpproxy config of 7 no-auth reference servers used to freeze the corpus (secret-free, reproducible) | **yes** (`serve --config`) | yes |
+| `corpus_v1.tools.json` | Frozen snapshot of 45 tools (`GET /api/v1/tools`) — the D1 universe the eval scores against | no (dataset) | yes |
+| `retrieval_golden_v1.json` | 47 graded queries → tool(s), relevance 0\|1\|2, ≥8 hard-negatives (FR-001); R-C (queries never name the tool) | no (dataset) | yes |
+| `baseline_v1.json` | Reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance — the CI regression anchor (FR-009). `security` section filled by D2 (CN-004) | no (dataset) | yes |
+| `security_corpus_v1.json` | D2 labeled security regression corpus (per-detector P/R/F1/FPR) | no (dataset) | yes |
+| score reports (`report.json` / `.html`) | Per-run output | no | **no** (CN-003 — stay local) |
+
+---
+
+## D1 — Tool-retrieval datasets
+
+Generated by A1's harness (`~/repos/mcp-eval`, `mcp-eval datasets` / `mcp-eval retrieval`).
+
+### Regenerate (documented + repeatable — FR-012)
+
+```bash
+# 1. Boot a throwaway mcpproxy over the committed SERVABLE config (fresh data-dir).
+mcpproxy serve --config specs/065-evaluation-foundation/datasets/snapshot-servers.config.json \
+  --data-dir /tmp/mcpproxy-corpus-snapshot --listen 127.0.0.1:8092
+#    (all 7 servers connect via npx/uvx, no tokens; quarantine disabled for a clean index)
+
+# 2. Freeze the corpus snapshot (only when intentionally cutting corpus_v2).
+cd ~/repos/mcp-eval && PYTHONPATH=src uv run python -m mcp_eval.cli datasets snapshot \
+  --out <repo>/specs/065-evaluation-foundation/datasets/corpus_v1.tools.json \
+  --base-url http://127.0.0.1:8092 --api-key eval-corpus-snapshot
+
+# 3. Validate the golden set (schema + INV-1: every tool_id ∈ corpus).
+PYTHONPATH=src uv run python -m mcp_eval.cli datasets validate \
+  --corpus .../corpus_v1.tools.json --golden .../retrieval_golden_v1.json
+
+# 4. Score + gate against the baseline (deterministic; gate = Recall@5 ≥ baseline−0.05).
+PYTHONPATH=src uv run python -m mcp_eval.cli retrieval \
+  --corpus .../corpus_v1.tools.json --golden .../retrieval_golden_v1.json \
+  --baseline .../baseline_v1.json --tolerance 0.05 \
+  --base-url http://127.0.0.1:8092 --api-key eval-corpus-snapshot
+```
+
+The golden set was seeded by intent and **hand-curated** for graded relevance and
+cross-server hard-negatives (e.g. `filesystem:search_files` vs `memory:search_nodes`;
+`sqlite:read_query` vs `filesystem:read_text_file`; `fetch:fetch` vs
+`filesystem:read_text_file`), then validated. Invariants: **INV-1** (no dangling
+labels), **INV-2** (removing a labeled tool drives that query's Recall→0 — proven by
+the harness scorer tests).
+
+---
+
+## D2 — Security regression corpus
+
+### `security_corpus_v1.json`
 
 Labeled security regression corpus the D2 detection scorer measures against
 (precision / recall / F1 / FPR per detector). Conforms to
@@ -25,7 +86,7 @@ legitimately says "ignore case"). They exist to expose noisy detectors
 (SC-004 / INV-3). Each hard-negative `id` is `hn_<attack_category>_<n>`, encoding
 the attack it mimics so false positives map back to a category.
 
-## Provenance & licensing (FR-007 / CN-005 / R-07 / R-A)
+### Provenance & licensing (FR-007 / CN-005 / R-07 / R-A)
 
 Every entry carries `provenance.{source,license}`, and the test fails the build
 if any license is outside the redistributable allowlist (CN-005 / INV-4).
@@ -37,7 +98,7 @@ if any license is outside the redistributable allowlist (CN-005 / INV-4).
   [Damn Vulnerable MCP](https://github.com/harishsg993010/damn-vulnerable-MCP-server)
   project.
 
-### External benchmarks (referenced, NOT vendored)
+#### External benchmarks (referenced, NOT vendored)
 
 Per CN-005 and risk R-A, the following are **referenced externally only** and no
 text from them is vendored into this repo:

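Note on the README's validate step (INV-1, "every tool_id ∈ corpus"): the check can be sketched as below. This is a minimal illustration, not the `mcp-eval` implementation. The function name `dangling_labels` and the in-memory shape of a golden entry (`query` plus a `relevant` map of tool id to grade) are assumptions made for the example; the real file schemas are not shown in this diff.

```python
def dangling_labels(corpus_tool_ids, golden_entries):
    """Return (query, tool_id) pairs whose tool_id is absent from the corpus.

    An empty result means INV-1 holds: no golden label points at a tool
    that is missing from the frozen snapshot.
    """
    known = set(corpus_tool_ids)
    return [
        (entry["query"], tool_id)
        for entry in golden_entries
        for tool_id in entry["relevant"]  # hypothetical field name
        if tool_id not in known
    ]


# Hypothetical miniature corpus and golden set (not the committed data).
CORPUS = ["filesystem:search_files", "memory:search_nodes"]
GOLDEN_OK = [
    {
        "query": "find files whose name matches a pattern",
        "relevant": {"filesystem:search_files": 2, "memory:search_nodes": 0},
    }
]
GOLDEN_BAD = [
    {"query": "run a read-only SQL statement", "relevant": {"sqlite:read_query": 2}}
]

if __name__ == "__main__":
    print(dangling_labels(CORPUS, GOLDEN_OK))
    print(dangling_labels(CORPUS, GOLDEN_BAD))
```

Returning the offending pairs rather than a boolean makes a failed validation easier to act on, since each pair names the query to fix.
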
diff --git a/specs/065-evaluation-foundation/datasets/baseline_v1.json b/specs/065-evaluation-foundation/datasets/baseline_v1.json
@@ -0,0 +1,26 @@
+{
+  "__doc__": "Spec 065 D1+D2 regression baseline. Retrieval metrics are TOP-LEVEL because mcp-eval RetrievalScorer reads baseline[\"recall_at\"][\"5\"] etc. directly (quickstart \u00a74: --baseline datasets/baseline_v1.json). The CI gate (FR-009/MCP-742) fails if a fresh Recall@5 < baseline.recall_at[5] - tolerance.recall_at_5. The \"security\" section is appended by the D2 issue (CN-004); it is intentionally empty here.",
+  "corpus_version": "corpus_v1",
+  "golden_version": "retrieval_golden_v1",
+  "generated_from": {
+    "harness": "mcp-eval retrieval @ cb37f84",
+    "source_config": "datasets/snapshot-servers.config.json",
+    "mcpproxy": "BM25 index over corpus_v1 (45 tools, 7 no-auth reference servers)",
+    "runs": 1,
+    "note": "Reference = current BM25 behavior, a regression anchor (NOT a quality target). Refresh requires re-freezing corpus + re-review."
+  },
+  "recall_at": {
+    "1": 0.4184397163120567,
+    "3": 0.5602836879432624,
+    "5": 0.6808510638297872,
+    "10": 0.7907801418439717
+  },
+  "mrr": 0.5684903748733535,
+  "ndcg_at_10": 0.6094872517781414,
+  "map": 0.5435916919959473,
+  "tolerance": {
+    "recall_at_5": 0.05
+  },
+  "runs_averaged": 1,
+  "security": {}
+}
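
Note on the gate described in `__doc__`: the rule "fail if a fresh Recall@5 < baseline.recall_at[5] - tolerance.recall_at_5" can be sketched as follows. This is an illustration under stated assumptions, not the `RetrievalScorer` code. The function name `passes_recall_gate` is hypothetical, and the inline JSON is a trimmed copy of the two fields the rule reads.

```python
import json


def passes_recall_gate(baseline: dict, fresh_recall_at_5: float) -> bool:
    """Return True when the fresh Recall@5 is within tolerance of the baseline."""
    reference = baseline["recall_at"]["5"]
    tolerance = baseline["tolerance"]["recall_at_5"]
    # The gate fails only when the fresh value drops below reference - tolerance.
    return fresh_recall_at_5 >= reference - tolerance


# Trimmed copy of the relevant baseline_v1.json fields.
BASELINE = json.loads(
    """
    {
      "recall_at": {"5": 0.6808510638297872},
      "tolerance": {"recall_at_5": 0.05}
    }
    """
)

if __name__ == "__main__":
    # The fresh values below are made up to exercise both outcomes.
    for fresh in (0.70, 0.64, 0.62):
        verdict = "pass" if passes_recall_gate(BASELINE, fresh) else "FAIL"
        print(f"Recall@5={fresh:.2f}: {verdict}")
```

With the committed numbers the threshold is roughly 0.6309, so a fresh Recall@5 of 0.64 passes and 0.62 fails. Keeping the tolerance in the baseline file rather than only on the command line means the anchor and its slack are versioned together.
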