67 changes: 64 additions & 3 deletions specs/065-evaluation-foundation/datasets/README.md
@@ -1,6 +1,67 @@
# Spec 065 — Evaluation datasets

## `security_corpus_v1.json` (D2)
Versioned, frozen evaluation artifacts for the tool-retrieval (D1) and security
(D2) benchmarks. **Immutable once committed — a refresh is `*_v2.*`, never an edit
of a `*_v1.*` file** (CN-002, FR-012).

> **File-type cheat sheet — read this before running anything:**
> - **`snapshot-servers.config.json`** is the ONLY servable file — it's a real
> mcpproxy config (`mcpproxy serve --config snapshot-servers.config.json`).
> - **`corpus_v1.tools.json`** is the frozen *tool snapshot* the eval scores
> against — it is **NOT** a mcpproxy config; `serve --config corpus_v1.tools.json`
> will fail. Likewise `security_corpus_v1.json` is a labeled dataset, not a config.

| File | What it is | Servable? | Committed? |
|------|------------|-----------|------------|
| `snapshot-servers.config.json` | mcpproxy config of 7 no-auth reference servers used to freeze the corpus (secret-free, reproducible) | **yes** (`serve --config`) | yes |
| `corpus_v1.tools.json` | Frozen snapshot of 45 tools (`GET /api/v1/tools`) — the D1 universe the eval scores against | no (dataset) | yes |
| `retrieval_golden_v1.json` | 47 graded queries → tool(s), relevance 0\|1\|2, ≥8 hard-negatives (FR-001); R-C (queries never name the tool) | no (dataset) | yes |
| `baseline_v1.json` | Reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance — the CI regression anchor (FR-009). `security` section filled by D2 (CN-004) | no (dataset) | yes |
| `security_corpus_v1.json` | D2 labeled security regression corpus (per-detector P/R/F1/FPR) | no (dataset) | yes |
| score reports (`report.json` / `.html`) | Per-run output | no | **no** (CN-003 — stay local) |

---

## D1 — Tool-retrieval datasets

Generated by A1's harness (`~/repos/mcp-eval`, `mcp-eval datasets` / `mcp-eval retrieval`).

### Regenerate (documented + repeatable — FR-012)

```bash
# 1. Boot a throwaway mcpproxy over the committed SERVABLE config (fresh data-dir).
mcpproxy serve --config specs/065-evaluation-foundation/datasets/snapshot-servers.config.json \
--data-dir /tmp/mcpproxy-corpus-snapshot --listen 127.0.0.1:8092
# (all 7 servers connect via npx/uvx, no tokens; quarantine disabled for a clean index)

# 2. Freeze the corpus snapshot (only when intentionally cutting corpus_v2).
cd ~/repos/mcp-eval && PYTHONPATH=src uv run python -m mcp_eval.cli datasets snapshot \
--out <repo>/specs/065-evaluation-foundation/datasets/corpus_v1.tools.json \
--base-url http://127.0.0.1:8092 --api-key eval-corpus-snapshot

# 3. Validate the golden set (schema + INV-1: every tool_id ∈ corpus).
PYTHONPATH=src uv run python -m mcp_eval.cli datasets validate \
--corpus .../corpus_v1.tools.json --golden .../retrieval_golden_v1.json

# 4. Score + gate against the baseline (deterministic; gate = Recall@5 ≥ baseline−0.05).
PYTHONPATH=src uv run python -m mcp_eval.cli retrieval \
--corpus .../corpus_v1.tools.json --golden .../retrieval_golden_v1.json \
--baseline .../baseline_v1.json --tolerance 0.05 \
--base-url http://127.0.0.1:8092 --api-key eval-corpus-snapshot
```

The golden set was seeded by intent and **hand-curated** for graded relevance and
cross-server hard-negatives (e.g. `filesystem:search_files` vs `memory:search_nodes`;
`sqlite:read_query` vs `filesystem:read_text_file`; `fetch:fetch` vs
`filesystem:read_text_file`), then validated. Invariants: **INV-1** (no dangling
labels), **INV-2** (removing a labeled tool drives that query's Recall→0 — proven by
the harness scorer tests).
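
The two invariants can be illustrated with a minimal sketch. This is not the harness's code: the golden-entry shape (`labels` mapping a tool id to a relevance grade) and the helper names `dangling_labels` and `recall_at_k` are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def dangling_labels(corpus_ids: Set[str], golden: List[dict]) -> List[str]:
    """INV-1 check: return every labeled tool id that is absent from the corpus."""
    return sorted(
        tool_id
        for query in golden
        for tool_id in query["labels"]  # hypothetical field name
        if tool_id not in corpus_ids
    )


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant tools that appear in the top-k ranked results."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool_id in ranked[:k] if tool_id in relevant)
    return hits / len(relevant)


# Toy data; the tool ids come from the hard-negative examples above.
corpus_ids = {"filesystem:search_files", "memory:search_nodes", "fetch:fetch"}
golden = [
    {
        "query": "find files whose name matches a pattern",
        "labels": {"filesystem:search_files": 2},
    }
]

# INV-1: no label may point at a tool outside the corpus.
inv1_violations = dangling_labels(corpus_ids, golden)

# INV-2: dropping the labeled tool from the ranking drives Recall to 0.
ranked = ["filesystem:search_files", "memory:search_nodes", "fetch:fetch"]
relevant = {tool for tool, grade in golden[0]["labels"].items() if grade > 0}
recall_before = recall_at_k(ranked, relevant, k=5)
ranked_without = [tool for tool in ranked if tool not in relevant]
recall_after = recall_at_k(ranked_without, relevant, k=5)
```

In this toy case `inv1_violations` is empty, and the query's Recall@5 drops from 1.0 to 0.0 once its only labeled tool is removed.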

---

## D2 — Security regression corpus

### `security_corpus_v1.json`

Labeled security regression corpus the D2 detection scorer measures against
(precision / recall / F1 / FPR per detector). Conforms to
@@ -25,7 +86,7 @@ legitimately says "ignore case"). They exist to expose noisy detectors
(SC-004 / INV-3). Each hard-negative `id` is `hn_<attack_category>_<n>`, encoding
the attack it mimics so false positives map back to a category.
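
As a rough sketch of how the `hn_<attack_category>_<n>` convention lets a false positive be traced back to a category, the id can be parsed as below. The helper name `hard_negative_category` and the category name `prompt_injection` are hypothetical, not taken from the corpus.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# The category may itself contain underscores, so only the trailing
# "_<n>" (digits) is treated as the counter.
HN_ID = re.compile(r"^hn_(?P<category>.+)_(?P<n>\d+)$")


def hard_negative_category(entry_id: str) -> Optional[str]:
    """Return the attack category encoded in a hard-negative id, else None."""
    match = HN_ID.match(entry_id)
    return match.group("category") if match else None


def false_positives_by_category(flagged_ids: Iterable[str]) -> Counter:
    """Count flagged hard-negatives per attack category; other ids are ignored."""
    categories = (hard_negative_category(i) for i in flagged_ids)
    return Counter(c for c in categories if c is not None)


# Hypothetical ids a noisy detector flagged by mistake.
counts = false_positives_by_category(
    ["hn_prompt_injection_1", "hn_prompt_injection_2", "not_a_hard_negative"]
)
```

A per-category count like this would show which attack family a detector is over-triggering on.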

-## Provenance & licensing (FR-007 / CN-005 / R-07 / R-A)
+### Provenance & licensing (FR-007 / CN-005 / R-07 / R-A)

Every entry carries `provenance.{source,license}`, and the test fails the build
if any license is outside the redistributable allowlist (CN-005 / INV-4).
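
A check of this kind could look like the sketch below. The allowlist contents, the `id` field, and the function name are assumptions for illustration; only the `provenance.license` field comes from the text above, and the real allowlist lives in the repository's test.

```python
from typing import Iterable, List

# Hypothetical allowlist; the real one is defined by the project's test.
ALLOWED_LICENSES = frozenset({"MIT", "Apache-2.0", "CC0-1.0"})


def license_violations(entries: Iterable[dict]) -> List[str]:
    """Return ids of entries whose provenance.license is missing or not allowed."""
    bad = []
    for entry in entries:
        license_id = entry.get("provenance", {}).get("license")
        if license_id not in ALLOWED_LICENSES:
            bad.append(entry.get("id", "<no id>"))
    return bad


# Toy entries; a build gate would fail when the returned list is non-empty.
sample = [
    {"id": "ok_1", "provenance": {"source": "hand-written", "license": "MIT"}},
    {"id": "bad_1", "provenance": {"source": "unknown", "license": "Proprietary"}},
    {"id": "bad_2", "provenance": {"source": "unknown"}},
]
violations = license_violations(sample)
```

Treating a missing license as a violation keeps the gate fail-closed, which matches the "every entry carries" wording.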
@@ -37,7 +98,7 @@ if any license is outside the redistributable allowlist (CN-005 / INV-4).
[Damn Vulnerable MCP](https://github.com/harishsg993010/damn-vulnerable-MCP-server)
project.

-### External benchmarks (referenced, NOT vendored)
+#### External benchmarks (referenced, NOT vendored)

Per CN-005 and risk R-A, the following are **referenced externally only** and no
text from them is vendored into this repo:
26 changes: 26 additions & 0 deletions specs/065-evaluation-foundation/datasets/baseline_v1.json
@@ -0,0 +1,26 @@
{
"__doc__": "Spec 065 D1+D2 regression baseline. Retrieval metrics are TOP-LEVEL because mcp-eval RetrievalScorer reads baseline[\"recall_at\"][\"5\"] etc. directly (quickstart \u00a74: --baseline datasets/baseline_v1.json). The CI gate (FR-009/MCP-742) fails if a fresh Recall@5 < baseline.recall_at[5] - tolerance.recall_at_5. The \"security\" section is appended by the D2 issue (CN-004); it is intentionally empty here.",
"corpus_version": "corpus_v1",
"golden_version": "retrieval_golden_v1",
"generated_from": {
"harness": "mcp-eval retrieval @ cb37f84",
"source_config": "datasets/snapshot-servers.config.json",
"mcpproxy": "BM25 index over corpus_v1 (45 tools, 7 no-auth reference servers)",
"runs": 1,
"note": "Reference = current BM25 behavior, a regression anchor (NOT a quality target). Refresh requires re-freezing corpus + re-review."
},
"recall_at": {
"1": 0.4184397163120567,
"3": 0.5602836879432624,
"5": 0.6808510638297872,
"10": 0.7907801418439717
},
"mrr": 0.5684903748733535,
"ndcg_at_10": 0.6094872517781414,
"map": 0.5435916919959473,
"tolerance": {
"recall_at_5": 0.05
},
"runs_averaged": 1,
"security": {}
}
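
The gate described in `__doc__` (fail when a fresh Recall@5 falls below `recall_at["5"]` minus `tolerance.recall_at_5`) can be sketched as follows. The function name is hypothetical and this is not the `RetrievalScorer` implementation; only the two keys it reads and their values are taken from the file above.

```python
# Subset of baseline_v1.json that the gate needs.
baseline = {
    "recall_at": {"5": 0.6808510638297872},
    "tolerance": {"recall_at_5": 0.05},
}


def recall_gate_passes(baseline: dict, fresh_recall_at_5: float) -> bool:
    """Pass unless the fresh Recall@5 is below baseline Recall@5 minus tolerance."""
    floor = baseline["recall_at"]["5"] - baseline["tolerance"]["recall_at_5"]
    return fresh_recall_at_5 >= floor


# With the committed numbers the floor is roughly 0.631.
passes_small_dip = recall_gate_passes(baseline, 0.65)  # within tolerance
passes_large_dip = recall_gate_passes(baseline, 0.60)  # regression
```

Because the baseline is a regression anchor rather than a quality target, the gate only bounds how far Recall@5 may drop; it does not reward improvements.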