feat(065): security_corpus_v1.json — D2 security regression dataset (MCP-739) by Dumbris · Pull Request #551 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-05-31T13:29:13Z

Summary

Adds the D2 security regression corpus for Spec 065 Evaluation Foundation — the labeled dataset the detection scorer measures precision/recall/F1/FPR against. Implements MCP-739 (Gate 2 design accepted).

specs/065-evaluation-foundation/datasets/security_corpus_v1.json — 43 entries
specs/065-evaluation-foundation/datasets/corpus_test.go — TDD validator (schema + invariants)
specs/065-evaluation-foundation/datasets/README.md — composition + provenance/licensing notes

Composition (43 entries)

Label	Category	Count
malicious	tool_poisoning	6
malicious	prompt_injection	6
malicious	shadowing	4
malicious	rug_pull	4
benign	benign (clean base rate)	15
benign	hard_negative (attack-resembling)	8 (2 per attack category)

Hard negatives expose noisy detectors (SC-004 / INV-3); each id is hn_<attack_category>_<n> so a false positive maps back to the attack it resembles.

Provenance & licensing (FR-007 / CN-005 / R-07 / R-A)

Sources limited to self-authored (dominant) + DVMCP (MIT).
MCPTox / MCP-AttackBench (restrictive) and mcp-injection-experiments (LICENSE unconfirmed, R-A) are referenced externally only — never vendored. The validator rejects any entry sourced from them or carrying a non-redistributable license (INV-4).

Verification

corpus_test.go validates the dataset against contracts/security-corpus.schema.json (santhosh-tekuri/jsonschema/v6) and asserts attack coverage + ≥1 hard negative per attack category.

$ go test ./specs/065-evaluation-foundation/datasets/ -v
=== RUN   TestCorpus_SchemaAndStructure
--- PASS: TestCorpus_SchemaAndStructure (0.00s)
=== RUN   TestCorpus_AttackCoverageAndHardNegatives
--- PASS: TestCorpus_AttackCoverageAndHardNegatives (0.00s)
PASS
ok  	github.com/smart-mcp-proxy/mcpproxy-go/specs/065-evaluation-foundation/datasets	0.227s

go vet, gofmt -l, and go build ./... all clean.

Scope

Data + test-tooling only — no internal//cmd/ runtime code. The cmd/scan-eval scorer bridge is sibling issue MCP-738. Unblocks B3.

Related #739

Add the D2 labeled security corpus the detection scorer measures against, plus a co-located validator test enforcing the contract and cross-entity invariants (INV-3, INV-4 / SC-004). - 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull), 15 clean benign, 8 attack-resembling hard negatives (2 per attack category). - Every entry carries label + category + provenance.{source,license} (FR-007). - Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the unconfirmed-license mcp-injection-experiments are referenced externally only, never vendored (CN-005, R-07, R-A). The validator fails the build on any non-redistributable license. - corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri jsonschema/v6) and asserts attack coverage + >=1 hard negative per category. Related #739 Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages · 2026-05-31T13:29:14Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`045de45`
Status:	✅ Deploy successful!
Preview URL:	https://290dd077.mcpproxy-docs.pages.dev
Branch Preview URL:	https://739-security-corpus.mcpproxy-docs.pages.dev

View logs

codecov-commenter · 2026-05-31T13:33:23Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

github-actions · 2026-05-31T13:37:47Z

📦 Build Artifacts

Workflow Run: View Run
Branch: 739-security-corpus

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26713946090 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

* docs(065): Evaluation Foundation spec (D1+D2) — measure security & discovery First implementation epic of the H2-2026 roadmap. Move from asserting to measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen corpus, the prerequisite + GEPA fitness function) and D2 security regression corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet- security metric). Both extend the existing mcp-eval harness; gated in CI. D3/D4/D5/D6 are follow-on specs that compose on these. * docs(065): plan + research + data-model + contracts + quickstart for Evaluation Foundation Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST /api/v1/index/search) and D2 (security regression corpus, per-detector precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart. Constitution PASS. D3/D4/D5/D6 remain follow-on specs. * feat(065): add cmd/scan-eval D2 detector bridge (B1) (#550) Bridge the Spec 065 / D2 security corpus to mcpproxy's production sensitive-data detector and emit per-entry, per-detector verdict JSON for the Python SecurityScorer (B3). Offline, deterministic test tooling only — no runtime or REST surface (Security-by-Default, R-03). - cmd/scan-eval: reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, emits scan-verdict.schema.json. - Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point for the deferred Docker bundled-scanner pass). - Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure. - contracts/scan-verdict.schema.json: the verdict output contract B3 consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR. - Test-first: TP (embedded AWS key), TN, missing/empty corpus, and deterministic-output coverage; committed minimal corpus fixture. The fixture demonstrates honest measurement (INV-3): the detector is a true positive on a credential-exfil description, a false negative on pure prompt-injection text, and a visible false positive on a benign doc referencing ~/.aws/credentials — i.e. it measures real coverage rather than trivially passing. Co-authored-by: Paperclip <noreply@paperclip.ing> * feat(065): security_corpus_v1.json (D2 security regression dataset) (#551) Add the D2 labeled security corpus the detection scorer measures against, plus a co-located validator test enforcing the contract and cross-entity invariants (INV-3, INV-4 / SC-004). - 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull), 15 clean benign, 8 attack-resembling hard negatives (2 per attack category). - Every entry carries label + category + provenance.{source,license} (FR-007). - Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the unconfirmed-license mcp-injection-experiments are referenced externally only, never vendored (CN-005, R-07, R-A). The validator fails the build on any non-redistributable license. - corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri jsonschema/v6) and asserts attack coverage + >=1 hard negative per category. Related #739 Co-authored-by: Paperclip <noreply@paperclip.ing> * feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README (#554) * feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2) Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's mcp-eval datasets/retrieval CLI (cb37f84): - corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012). - corpus_v1.source.json: secret-free, reproducible mcpproxy source config. - retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes schema + INV-1 validation. - baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05, the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads them directly); empty security section reserved for D2 (CN-004). - README.md: documented, repeatable regeneration procedure (FR-012). Verified end-to-end against a live BM25 index: validate OK, gate PASS (Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003). Related #MCP-740 Co-Authored-By: Paperclip <noreply@paperclip.ing> * fix(065): rename D1 dataset files for clarity + merge D1/D2 README Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config corpus_v1.json' which fails. Renamed: - corpus_v1.source.json -> snapshot-servers.config.json (the servable config) - corpus_v1.json -> corpus_v1.tools.json (the scored snapshot) Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2 dataset READMEs into one with an explicit servable-vs-dataset cheat sheet. --------- Co-authored-by: Paperclip <noreply@paperclip.ing> * ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) (#561) * ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both Spec-065 evaluations as a regression gate over the frozen datasets: - security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer defaults of 0.80 would always fail). Sourced in one place pending the MCP-815 baseline `security.gate` block so gate and baseline never drift. - retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs (npx/uvx fetch flake), blocking on the nightly schedule. Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043), actionlint clean. Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted). Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle The retrieval-d1 job failed: `mcpproxy serve` exited immediately with "data_dir: directory does not exist" (serve refuses to create a missing data_dir), and the server was backgrounded in a separate step from the readiness poll (a process backgrounded in one step is reaped when that step's shell exits). Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE step (shared shell) with a trap that stops the server however the step ends; also fail fast if the server process dies during startup. D2 gate unaffected. Verified: mcpproxy boots and serves /api/v1/status locally with the created data_dir; actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): fix D1 readiness probe — parse data.results from index/search envelope The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7 servers (45 tools), but the probe parsed the index/search response at the top level while results are nested under the `{"success":true,"data":{"results":[…]}}` envelope, so it read 0 every attempt and timed out. Fix: parse `data.results`. Verified locally — index returns 5 results for q=file within ~6s of boot. actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): D1 readiness waits for full tool catalog before scoring The retrieval scorer was running against a partially-indexed instance: the readiness probe passed at the first indexed tool (>=1 search result), so scoring started before all 7 reference servers connected -> Recall@5 measured 0.387 vs baseline threshold 0.631 (false regression). Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools across the 7 servers) and add a short settle for the index build, then score. Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 = 0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline is exactly reproducible. actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> --------- Co-authored-by: Paperclip <noreply@paperclip.ing> * ci(065): trigger eval gate on D1 retrieval system-under-test paths (#563) The eval.yml pull_request.paths filter only matched the D2 security surface (internal/security, cmd/scan-eval) and the harness/datasets, so changes to the D1 retrieval system it gates — BM25 index, MCP tool- discovery routing, the REST search envelope, server/CLI boot — never triggered the workflow. Spec 065 requires CI to catch discovery regressions when search/index/tool-discovery behavior changes. Add internal/index/**, internal/server/**, internal/httpapi/**, and cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2 security and D1 retrieval stay green. Related MCP-833 --------- Co-authored-by: Paperclip <noreply@paperclip.ing>

Dumbris merged commit 371af22 into 065-evaluation-foundation May 31, 2026
25 checks passed

Dumbris deleted the 739-security-corpus branch May 31, 2026 14:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(065): security_corpus_v1.json — D2 security regression dataset (MCP-739)#551

feat(065): security_corpus_v1.json — D2 security regression dataset (MCP-739)#551
Dumbris merged 1 commit into
065-evaluation-foundationfrom
739-security-corpus

Dumbris commented May 31, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented May 31, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 31, 2026

Uh oh!

github-actions Bot commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Dumbris commented May 31, 2026

Summary

Composition (43 entries)

Provenance & licensing (FR-007 / CN-005 / R-07 / R-A)

Verification

Scope

Uh oh!

cloudflare-workers-and-pages Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying mcpproxy-docs with Cloudflare Pages

Uh oh!

codecov-commenter commented May 31, 2026

Codecov Report

Uh oh!

github-actions Bot commented May 31, 2026

📦 Build Artifacts

Available Artifacts

How to Download

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cloudflare-workers-and-pages Bot commented May 31, 2026 •

edited

Loading