feat(bench): token-reduction benchmark harness over frozen corpus (MCP-42) #747
…P-42)

Ship the first, fully-deterministic slice of the roadmap-#19 benchmark: the token-reduction numbers behind the "massive token savings" claim. Reuses the frozen Spec 065 tool corpus (45 tools, 7 reference servers) as a versioned, non-drifting universe and tiktoken cl100k_base (already a dep) as a reproducible model-agnostic estimator.

Compares the three routing modes' static context cost:
- baseline (all upstream tools loaded directly)
- retrieve_tools (BM25 discovery + call_tool variants)
- code_execution (orchestration + retrieve_tools)
over the corpus and reports per-mode savings.

Real proxy tool defs are captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded). Emits report.json + a self-contained dashboard.html (gitignored; reports never committed, per Spec 065 CN-003).

Conservative by construction: input schemas excluded uniformly understates the baseline, so measured savings (65.5% / 70.3% on the 45-tool corpus) are a floor. Methodology, limitations, and the scoped-but-not-yet-built follow-ups (live run with full schemas + accuracy/latency, LLM e2e, CI publish) are in bench/README.md.

Related #MCP-42

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Deploying mcpproxy-docs with
| Latest commit: | 9a92d71 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://5553fbac.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://feat-mcp-42-bench-harness.mcpproxy-docs.pages.dev |
📦 Build Artifacts
Workflow Run: View Run
Available Artifacts
How to Download
Option 1: GitHub Web UI (easiest)
Option 2: GitHub CLI: gh run download 27971074417 --repo smart-mcp-proxy/mcpproxy-go
… smoke test

KimiReviewer finding 2: code_execution is at line 626 in mcp.go at 89f06b5, not 675 as claimed. Line numbers drift with unrelated edits and the actual function names are the stable identifier; remove all line numbers from the provenance comment to prevent future rot.

KimiReviewer finding 3: add TestWriteReports_SmokeTest covering WriteReports output (JSON round-trips to Report, HTML is non-empty and contains all mode names).

All 5 tests pass; golangci-lint v2 clean.

Related #MCP-42

Co-Authored-By: Paperclip <noreply@paperclip.ing>
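For readers without the diff open, the smoke test described in finding 3 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: the `Report` and `ModeResult` shapes, `writeReports`, `smokeCheck`, and `runSmoke` are hypothetical stand-ins, and the token values are placeholders rather than measured numbers. Only the file names (report.json, dashboard.html) and the three checks (JSON round-trips, HTML is non-empty, HTML contains every mode name) come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// ModeResult and Report are hypothetical, simplified shapes; the real
// Report type in bench/ may carry more fields.
type ModeResult struct {
	Name   string `json:"name"`
	Tokens int    `json:"tokens"`
}

type Report struct {
	Modes []ModeResult `json:"modes"`
}

// writeReports is a stand-in for WriteReports: it writes report.json and a
// self-contained dashboard.html into dir.
func writeReports(dir string, r Report) error {
	data, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, "report.json"), data, 0o644); err != nil {
		return err
	}
	var b strings.Builder
	b.WriteString("<html><body><ul>")
	for _, m := range r.Modes {
		fmt.Fprintf(&b, "<li>%s: %d tokens</li>", m.Name, m.Tokens)
	}
	b.WriteString("</ul></body></html>")
	return os.WriteFile(filepath.Join(dir, "dashboard.html"), []byte(b.String()), 0o644)
}

// smokeCheck re-reads both files and verifies that the JSON round-trips to
// the expected Report and that the HTML is non-empty and names every mode.
func smokeCheck(dir string, want Report) error {
	raw, err := os.ReadFile(filepath.Join(dir, "report.json"))
	if err != nil {
		return err
	}
	var got Report
	if err := json.Unmarshal(raw, &got); err != nil {
		return err
	}
	if len(got.Modes) != len(want.Modes) {
		return fmt.Errorf("got %d modes, want %d", len(got.Modes), len(want.Modes))
	}
	html, err := os.ReadFile(filepath.Join(dir, "dashboard.html"))
	if err != nil {
		return err
	}
	if len(html) == 0 {
		return fmt.Errorf("dashboard.html is empty")
	}
	for i, m := range want.Modes {
		if got.Modes[i] != m {
			return fmt.Errorf("mode %d did not round-trip: %+v", i, got.Modes[i])
		}
		if !strings.Contains(string(html), m.Name) {
			return fmt.Errorf("dashboard.html does not mention %q", m.Name)
		}
	}
	return nil
}

// runSmoke writes placeholder results into a temp dir and checks them.
func runSmoke() error {
	dir, err := os.MkdirTemp("", "bench-smoke-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	want := Report{Modes: []ModeResult{
		{Name: "baseline", Tokens: 300}, // placeholder values, not measurements
		{Name: "retrieve_tools", Tokens: 200},
		{Name: "code_execution", Tokens: 100},
	}}
	if err := writeReports(dir, want); err != nil {
		return err
	}
	return smokeCheck(dir, want)
}

func main() {
	if err := runSmoke(); err != nil {
		fmt.Println("smoke check failed:", err)
		os.Exit(1)
	}
	fmt.Println("smoke check passed")
}
```

Writing into a temp directory keeps the check from touching bench/results/, which matters here because reports are never committed.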
✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).
This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.
Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).
CodexReviewer: changes requested (benchmark integrity). The fixture (…
…cl. management tools (MCP-3161)

The token-reduction benchmark scored only 6 hand-maintained proxy tools and omitted the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries) that both routing modes append via buildManagementTools. That undercounted the proxy-mode context cost and inflated the headline savings (Codex finding on PR #747).

Replace bench/proxy_tools_v1.json with server.ProxyModeToolDefs, which builds the catalog from the live builders (buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go) so it can never drift from production and always reflects the tools the agent actually sees. This also fixes a second drift: the fixture's retrieve_tools descriptions did not match the per-mode builder descriptions.

Corrected figures over the 45-tool Spec 065 corpus (name+description only): retrieve_tools ~17% (10 tools), code_execution ~43% (6 tools). Updated README and notes; the schema-exclusion claim is no longer unambiguously conservative now that large-schema management tools are in the proxy cost.

Tests: bench asserts both modes include the 4 management tools; internal/server pins ProxyModeToolDefs to the builders so the catalog can't silently drift.

Related #747
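The management-tool assertion described in this commit can be illustrated with a short sketch. It is a hedged example, not the test in bench/: `missing`, `codeExecCatalog`, and the stale-fixture slice are hypothetical stand-ins, and the code_execution catalog composition (code_execution + retrieve_tools + the four management tools, six in total) is an assumption inferred from the commit messages. Only the four management tool names are taken from the text above.

```go
package main

import "fmt"

// managementTools lists the four shared tools named in the commit message.
var managementTools = []string{
	"upstream_servers",
	"quarantine_security",
	"search_servers",
	"list_registries",
}

// missing returns the required names that are absent from catalog,
// preserving the order of required.
func missing(catalog, required []string) []string {
	seen := make(map[string]bool, len(catalog))
	for _, name := range catalog {
		seen[name] = true
	}
	var out []string
	for _, name := range required {
		if !seen[name] {
			out = append(out, name)
		}
	}
	return out
}

// codeExecCatalog is a hypothetical stand-in for a catalog derived from the
// live builders; the composition here is an assumption for illustration.
func codeExecCatalog() []string {
	return append([]string{"code_execution", "retrieve_tools"}, managementTools...)
}

func main() {
	// A hand-maintained fixture that forgot the management tools is flagged.
	staleFixture := []string{"code_execution", "retrieve_tools"}
	fmt.Println("stale fixture is missing:", missing(staleFixture, managementTools))

	// A catalog built the same way production builds it passes the check.
	fmt.Println("derived catalog is missing:", missing(codeExecCatalog(), managementTools))
}
```

Checking by name rather than by count keeps the assertion meaningful if a mode later gains or loses an unrelated tool.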
✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).
This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.
Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).
✅ Gatekeeper approval — MCP-42 benchmark harness on corrected head 9a92d71. Full mandated gate satisfied: CodexReviewer (first review) caught inflated savings (fixture omitted management tools); BackendEngineer fixed it (derive per-mode catalog from live server builders); KimiReviewer ACCEPT (model-diverse) + QATester PASS (MCP-3162) on this head + operator-verified. Honest numbers now: retrieve_tools ~17%, code_execution ~43% (were 65.5/70.3). CI green. Author≠approver.
What
First, fully-deterministic slice of the roadmap-#19 benchmark harness (MCP-42): the token-reduction numbers behind mcpproxy's "massive token savings" claim. In-repo under bench/ (per board decision: no separate public repo).

Compares the static context-token cost of the three routing modes over a frozen tool corpus:
- baseline (all tools loaded)
- retrieve_tools (BM25 discovery)
- code_execution (orchestration)

These are a conservative floor: input schemas are excluded uniformly (the committed corpus has none), which understates the baseline; and savings scale with tool count (real deployments expose hundreds–thousands of tools).
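As a rough illustration of the comparison above, the per-mode saving is one minus the ratio of the mode's token cost to the baseline's. The sketch below is not the harness code: `Tool`, `ContextCost`, and `Savings` are hypothetical names, the tool lists are made up, and `estimateTokens` is a whitespace-splitting stand-in for the tiktoken cl100k_base estimator the harness uses, so its counts will differ from real ones.

```go
package main

import (
	"fmt"
	"strings"
)

// Tool is a hypothetical name+description pair; input schemas are left out,
// mirroring the name+description-only scoring described in this PR.
type Tool struct {
	Name        string
	Description string
}

// estimateTokens is a stand-in estimator that counts whitespace-separated
// fields. The real harness uses tiktoken cl100k_base instead.
func estimateTokens(s string) int {
	return len(strings.Fields(s))
}

// ContextCost sums the estimated tokens of every tool's name and description.
func ContextCost(tools []Tool) int {
	total := 0
	for _, t := range tools {
		total += estimateTokens(t.Name) + estimateTokens(t.Description)
	}
	return total
}

// Savings returns the fraction of baseline tokens saved by a mode. It is
// negative when the mode costs more than the baseline, and 0 when the
// baseline is not positive.
func Savings(baseline, mode int) float64 {
	if baseline <= 0 {
		return 0
	}
	return 1 - float64(mode)/float64(baseline)
}

func main() {
	// Made-up tools for illustration only; not the Spec 065 corpus.
	baseline := []Tool{
		{"read_file", "Read a file from disk"},
		{"write_file", "Write a file to disk"},
		{"list_dir", "List entries in a directory"},
		{"delete_file", "Delete a file from disk"},
	}
	proxy := []Tool{{"retrieve_tools", "Find tools by query"}}

	b, p := ContextCost(baseline), ContextCost(proxy)
	fmt.Printf("baseline=%d proxy=%d savings=%.1f%%\n", b, p, 100*Savings(b, p))
}
```

Because the proxy-mode cost is fixed while the baseline grows with every upstream tool, the saving rises with corpus size, which is why the figures depend so heavily on which tools are counted on each side.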
How
- Reuses the frozen Spec 065 tool corpus (specs/065-evaluation-foundation/datasets/corpus_v1.tools.json) as a versioned, non-drifting universe (CN-002).
- tiktoken cl100k_base: already a repo dependency, reproducible, model-agnostic estimator. No new deps.
- Real proxy tool defs captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded in-file).
- go run ./bench/cmd/bench → report.json + self-contained dashboard.html in bench/results/ (gitignored; reports never committed per Spec 065 CN-003).
- Methodology and limitations are in bench/README.md.

Tests
go test ./bench/: deterministic tokenizer, per-mode tool exposure, real savings in (0,1), baseline monotonicity. Race-clean. gofmt, go vet, and golangci-lint v2 (strict CI config) all clean.

Scoped but NOT in this PR (tracked as follow-ups)
These need decisions / other lanes, so they're deliberately deferred (see bench/README.md):
- … GET /api/v1/tools for the exact headline number + Recall@k accuracy (reusing the Spec 065 retrieval golden set) + latency.

Related #MCP-42