feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a) by Dumbris · Pull Request #748 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-06-22T18:08:42Z

Summary

Second slice of the MCP-42 benchmark harness, extending bench/ (PR #747) with a live run against a running proxy. Deterministic and LLM-free.

Adds the three measurements the issue asked for:

- Exact token number (full schemas). GET /api/v1/tools pulls upstream tools with their full JSON input schemas; the proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema, marshaled from the real tools/list InputSchema). Schemas are counted on both sides, so the headline savings figure is authoritative.
  - MCP-3161 guard: if any proxy tool lacks a schema, counting the baseline's schemas alone would overstate savings, so the run withholds the headline % (authoritative_headline: false) and reports raw token totals only.
- Accuracy. Replays the Spec 065 golden set (retrieval_golden_v1.json) through the proxy's BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}, MRR, nDCG@10, and MAP against graded labels. Field names mirror Spec 065 score-report.schema.json.
- Latency. Client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost. Measured client-side on purpose: the server's SearchToolsResponse.took is a "0ms" stub.
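For readers unfamiliar with the accuracy metrics named above, here is a minimal sketch of Recall@k, reciprocal rank (MRR is its mean over queries), and nDCG@k for a single query. It is not the code in bench/metrics.go: the function names, the linear-gain DCG formula, and the rule that any grade above zero counts as relevant are all assumptions made for illustration, and MAP is omitted for brevity.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// RecallAtK returns the share of relevant tools found in the top k results.
// Assumption: grades maps tool ID -> graded label, and grade > 0 is relevant.
// Assumption: ranked contains no duplicate IDs.
func RecallAtK(ranked []string, grades map[string]int, k int) float64 {
	relevant := 0
	for _, g := range grades {
		if g > 0 {
			relevant++
		}
	}
	if relevant == 0 {
		return 0
	}
	hits := 0
	for i, id := range ranked {
		if i >= k {
			break
		}
		if grades[id] > 0 {
			hits++
		}
	}
	return float64(hits) / float64(relevant)
}

// ReciprocalRank returns 1/rank of the first relevant result, or 0 if none.
func ReciprocalRank(ranked []string, grades map[string]int) float64 {
	for i, id := range ranked {
		if grades[id] > 0 {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// NDCGAtK uses linear gain: DCG = sum(grade_i / log2(i+2)) over the top k,
// normalized by the DCG of the ideal (grade-descending) ordering.
func NDCGAtK(ranked []string, grades map[string]int, k int) float64 {
	dcg := 0.0
	for i, id := range ranked {
		if i >= k {
			break
		}
		dcg += float64(grades[id]) / math.Log2(float64(i+2))
	}
	ideal := make([]int, 0, len(grades))
	for _, g := range grades {
		if g > 0 {
			ideal = append(ideal, g)
		}
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	idcg := 0.0
	for i, g := range ideal {
		if i >= k {
			break
		}
		idcg += float64(g) / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

func main() {
	// Hypothetical query: the search returned b, x, a; labels grade a=2, b=1.
	ranked := []string{"b", "x", "a"}
	grades := map[string]int{"a": 2, "b": 1}
	fmt.Printf("Recall@1=%.2f Recall@3=%.2f RR=%.2f nDCG@10=%.4f\n",
		RecallAtK(ranked, grades, 1), RecallAtK(ranked, grades, 3),
		ReciprocalRank(ranked, grades), NDCGAtK(ranked, grades, 10))
}
```

Because every function is a pure function of the ranked list and the labels, the scores are reproducible across runs, which is what makes an LLM-free replay deterministic.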

How to run

docker compose -f bench/docker-compose.yml up --build -d
go run ./bench/cmd/bench -live -proxy http://127.0.0.1:8092 -api-key eval-corpus-snapshot

-live writes bench/results/live_report.json (gitignored, CN-003). Default (no -live) keeps the offline token run unchanged.

Design notes

- Reuses the external mcp-eval D1 approach (re-implemented in Go), not its code.
- Proxy schemas come from the live builders (zero drift): no hand-maintained fixture, no separate MCP handshake.

Tests

- metrics_test.go: Recall@k/MRR/nDCG/MAP against hand-computed worked examples.
- live_test.go: httptest-stubbed /api/v1/tools + /api/v1/index/search; schema-aware token counting.
- live_report_test.go: real golden-set load (47 queries), latency percentiles, authoritative-headline path, and the MCP-3161 withhold guard.
- go test ./bench/... -race, go vet, and strict golangci-lint v2 all clean.
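The latency percentiles covered by these tests can be illustrated with a small sketch. It assumes the nearest-rank definition of a percentile, which may differ from what the harness uses, and `Percentile`, `timeCall`, and `msSamples` are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// msSamples builds synthetic latencies of 1ms..n ms for the demo.
func msSamples(n int) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = time.Duration(i+1) * time.Millisecond
	}
	return out
}

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// It sorts a copy, so the caller's slice is left untouched.
func Percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// timeCall measures the wall-clock duration of fn as the client sees it,
// which is the alternative to trusting a server-reported "took" field.
func timeCall(fn func() error) (time.Duration, error) {
	start := time.Now()
	err := fn()
	return time.Since(start), err
}

func main() {
	samples := msSamples(10)
	for _, p := range []float64{50, 95, 99, 100} {
		fmt.Printf("p%.0f=%v\n", p, Percentile(samples, p))
	}
	d, err := timeCall(func() error {
		time.Sleep(2 * time.Millisecond) // stand-in for one search request
		return nil
	})
	fmt.Println("measured:", d, "err:", err)
}
```

With only a few dozen queries, p99 and max land on the same sample under nearest-rank, so reporting both mostly guards against a single slow outlier being hidden.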

Out of scope (follow-ups)

LLM end-to-end task success (pinned model + budget); CI publish-on-tag (Release lane, MCP-3133).

Closes MCP-3132.

…MCP-42a)

Extends the bench/ harness (PR #747) with a live run against a running proxy:

- Exact token number: GET /api/v1/tools pulls upstream tools WITH full JSON input schemas; proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema). Schemas counted on BOTH sides so the headline savings is authoritative, and withheld (authoritative_headline=false) if any proxy tool lacks a schema (the MCP-3161 overstatement guard).
- Accuracy: replays the Spec 065 retrieval golden set through the proxy BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}/MRR/nDCG@10/MAP against graded labels (deterministic, no LLM). Field names mirror Spec 065 score-report.schema.json.
- Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost (server "took" is a 0ms stub).

CLI: `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`. Reports stay gitignored (CN-003). All metric math + the live client are unit-tested with httptest stubs; the docker-compose substrate is the live-reproduction path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
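The withhold rule described in this commit can be sketched roughly as follows. Everything here is illustrative: the `Tool` and `Headline` types, `computeHeadline`, and the four-characters-per-token `countTokens` are hypothetical stand-ins, not the harness's real types or tokenizer.

```go
package main

import "fmt"

// Tool is a hypothetical stand-in for a tool definition.
type Tool struct {
	Name        string
	Description string
	Schema      string // JSON input schema; empty when the tool has none
}

// countTokens is a crude placeholder (about 4 characters per token),
// NOT the tokenizer the benchmark uses.
func countTokens(s string) int { return (len(s) + 3) / 4 }

// totalTokens counts name, description, and schema for every tool.
func totalTokens(tools []Tool) int {
	total := 0
	for _, t := range tools {
		total += countTokens(t.Name + t.Description + t.Schema)
	}
	return total
}

// allHaveSchema reports whether every tool carries a non-empty schema.
func allHaveSchema(tools []Tool) bool {
	for _, t := range tools {
		if t.Schema == "" {
			return false
		}
	}
	return true
}

// Headline holds raw totals plus the savings %, which is only meaningful
// when Authoritative is true.
type Headline struct {
	BaselineTokens int
	ProxyTokens    int
	Authoritative  bool
	SavingsPercent float64
}

// computeHeadline always reports raw totals, but withholds the percentage
// when a proxy tool lacks a schema, since the comparison would be lopsided.
func computeHeadline(baseline, proxy []Tool) Headline {
	h := Headline{
		BaselineTokens: totalTokens(baseline),
		ProxyTokens:    totalTokens(proxy),
	}
	h.Authoritative = allHaveSchema(proxy) && h.BaselineTokens > 0
	if h.Authoritative {
		h.SavingsPercent = 100 * float64(h.BaselineTokens-h.ProxyTokens) /
			float64(h.BaselineTokens)
	}
	return h
}

func main() {
	tool := Tool{Name: "t", Schema: `{"a":1}`}
	baseline := []Tool{tool, tool, tool, tool}
	fmt.Printf("%+v\n", computeHeadline(baseline, []Tool{tool}))
	fmt.Printf("%+v\n", computeHeadline(baseline, []Tool{{Name: "t"}}))
}
```

Keeping the raw totals in the report even when the percentage is withheld means a reader can still see what was measured without being handed a figure that overstates the savings.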

cloudflare-workers-and-pages · 2026-06-22T18:11:10Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`0602786`
Status:	✅ Deploy successful!
Preview URL:	https://2112823c.mcpproxy-docs.pages.dev
Branch Preview URL:	https://feat-mcp-42a-live-bench.mcpproxy-docs.pages.dev

codecov-commenter · 2026-06-22T18:14:17Z

Codecov Report

❌ Patch coverage is 67.34694% with 112 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bench/cmd/bench/main.go | 0.00% | 44 Missing ⚠️ |
| bench/live_report.go | 75.00% | 18 Missing and 8 partials ⚠️ |
| bench/live.go | 69.23% | 12 Missing and 12 partials ⚠️ |
| bench/metrics.go | 81.81% | 9 Missing and 9 partials ⚠️ |

github-actions · 2026-06-22T18:16:12Z

📦 Build Artifacts

Workflow Run: View Run
Branch: feat/mcp-42a-live-bench

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (28 MB)
archive-windows-arm64 (25 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 27976001588 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

ConvertGenericToolsToTyped read generic["schema"], but every producer of the generic tool map (runtime/server GetServerTools, mcp.go) emits the upstream input schema under "inputSchema". The /api/v1/tools response therefore dropped every schema, so the MCP-42a live benchmark baseline was silently a description-only token count instead of the required full-schema count, while still able to emit authoritative_headline=true.

- Read "inputSchema" first in the converter, keep "schema" as a legacy fallback.
- Gate the live headline on baseline schemas too (BaselineSchemasCounted via anyHaveSchema): a systematically schema-less baseline now withholds the headline instead of claiming a full-schema baseline it never had.
- Tests: converter preserves inputSchema (+legacy schema fallback); headline withheld when the baseline carries no schemas.

Related #748
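The key-lookup fix described in this commit might look roughly like the sketch below. It is not the repository's ConvertGenericToolsToTyped: `schemaFromGeneric` is a hypothetical helper, the tool maps are simplified to `map[string]any`, and this `anyHaveSchema` only borrows the name mentioned in the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schemaFromGeneric extracts the input schema from a generic tool map,
// preferring "inputSchema" and falling back to the legacy "schema" key.
func schemaFromGeneric(generic map[string]any) (json.RawMessage, bool) {
	for _, key := range []string{"inputSchema", "schema"} {
		v, ok := generic[key]
		if !ok || v == nil {
			continue
		}
		raw, err := json.Marshal(v)
		if err != nil {
			continue // unserializable value: try the next key
		}
		return raw, true
	}
	return nil, false
}

// anyHaveSchema reports whether at least one tool carries a schema. A
// baseline where this is false was counted from descriptions only, so a
// full-schema headline should not be claimed for it.
func anyHaveSchema(tools []map[string]any) bool {
	for _, t := range tools {
		if _, ok := schemaFromGeneric(t); ok {
			return true
		}
	}
	return false
}

func main() {
	current := map[string]any{
		"name":        "search",
		"inputSchema": map[string]any{"type": "object"},
	}
	legacy := map[string]any{
		"name":   "old",
		"schema": map[string]any{"type": "object"},
	}
	bare := map[string]any{"name": "bare"}

	for _, t := range []map[string]any{current, legacy, bare} {
		raw, ok := schemaFromGeneric(t)
		fmt.Printf("%v: schema=%s found=%v\n", t["name"], raw, ok)
	}
	fmt.Println("baseline has schemas:", anyHaveSchema([]map[string]any{bare}))
}
```

Checking the keys in priority order keeps older producers working while matching what the current producers emit.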

mcpproxy-gatekeeper

✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

…hema

Addresses CodexReviewer finding on PR #748 / MCP-3167: the live `retrieval` payload emitted flat metric fields, but score-report.schema.json requires nested `retrieval.metrics` + `retrieval.gate`.

Restructure RetrievalMetrics into {metrics, gate} so live_report.json validates against the contract, proven by a new jsonschema-validation test (TestRetrievalMetricsConformsToScoreReportSchema).

A standalone live run has no stored baseline, so gate.passed is true by construction (CI regression-gating against a committed baseline is MCP-3133).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
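As a rough picture of the nested shape this commit describes, the sketch below marshals a `retrieval` object with `metrics` and `gate` sub-objects. Only the keys `metrics`, `gate`, and `passed` come from the commit message; the Go type layout, the use of a map for metrics, and the `mrr` key are assumptions, and the real field names are defined by score-report.schema.json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Gate carries the pass/fail verdict of a regression gate.
type Gate struct {
	Passed bool `json:"passed"`
}

// RetrievalMetrics nests scores under "metrics" and the verdict under "gate".
// Assumption: metrics are modeled as a map so that no schema field names
// need to be guessed here.
type RetrievalMetrics struct {
	Metrics map[string]float64 `json:"metrics"`
	Gate    Gate               `json:"gate"`
}

// newStandaloneRetrieval builds the payload for a run with no stored
// baseline, where the gate passes by construction.
func newStandaloneRetrieval(metrics map[string]float64) RetrievalMetrics {
	return RetrievalMetrics{Metrics: metrics, Gate: Gate{Passed: true}}
}

// encode marshals the payload, returning an empty string on failure.
func encode(r RetrievalMetrics) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// "mrr" is a placeholder key, not necessarily the schema's field name.
	payload := newStandaloneRetrieval(map[string]float64{"mrr": 0.5})
	fmt.Println(encode(payload))
}
```

Validating the marshaled output against the JSON schema in a test, as the commit does, catches this kind of flat-versus-nested drift before a consumer does.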

mcpproxy-gatekeeper

✅ Gatekeeper approval — MCP-42a live benchmark run (full schemas + Recall@k + latency). CodexReviewer ACCEPT + QATester PASS on head 0602786. Live CLI (-live/-proxy/-api-key/-golden) writes live_report.json without changing offline mode; pulls /api/v1/tools schemas + scores /api/v1/index/search with client-measured latency. CI green. Author≠approver.

mcpproxy-gatekeeper Bot approved these changes Jun 22, 2026

mcpproxy-gatekeeper Bot approved these changes Jun 23, 2026

Dumbris merged commit e3588fa into main Jun 23, 2026
38 checks passed

Dumbris merged 3 commits into main from feat/mcp-42a-live-bench
