feat(bench): token-reduction benchmark harness over frozen corpus (MCP-42)#747

Merged
Dumbris merged 3 commits into main from feat/mcp-42-bench-harness
Jun 22, 2026

Dumbris commented Jun 22, 2026

Member

What

First, fully-deterministic slice of the roadmap-#19 benchmark harness (MCP-42): the token-reduction numbers behind mcpproxy's "massive token savings" claim. In-repo under bench/ (per board decision — no separate public repo).

Compares the static context-token cost of the three routing modes over a frozen tool corpus:

| Mode | Tools in context | Tokens | Savings |
|---|---|---|---|
| baseline (all tools loaded) | 45 | 1730 | |
| retrieve_tools (BM25 discovery) | 5 | 596 | 65.5% |
| code_execution (orchestration) | 2 | 513 | 70.3% |

These are a conservative floor: input schemas are excluded uniformly (the committed corpus has none), which understates the baseline; and savings scale with tool count (real deployments expose hundreds–thousands of tools).
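
The Savings column follows from the token counts in the table as 1 - mode/baseline. A minimal sketch of that arithmetic is below; the function name `Savings` is illustrative and is not taken from the harness.

```go
package main

import "fmt"

// Savings returns the fractional reduction in static context tokens of a
// routing mode relative to the baseline: 1 - mode/baseline.
// The name and signature are illustrative, not the harness's API.
func Savings(baselineTokens, modeTokens int) float64 {
	if baselineTokens <= 0 {
		return 0 // no baseline to compare against
	}
	return 1 - float64(modeTokens)/float64(baselineTokens)
}

func main() {
	const baseline = 1730 // all 45 corpus tools loaded, from the table
	fmt.Printf("retrieve_tools: %.1f%%\n", 100*Savings(baseline, 596)) // 65.5%
	fmt.Printf("code_execution: %.1f%%\n", 100*Savings(baseline, 513)) // 70.3%
}
```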

How

  • Reuses the Spec 065 frozen corpus (specs/065-evaluation-foundation/datasets/corpus_v1.tools.json) as a versioned, non-drifting universe (CN-002).
  • Tokenizer: tiktoken cl100k_base — already a repo dependency, reproducible, model-agnostic estimator. No new deps.
  • Real proxy tool definitions captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded in-file).
  • go run ./bench/cmd/bench → report.json + self-contained dashboard.html in bench/results/ (gitignored; reports never committed per Spec 065 CN-003).
  • Methodology, scoring rubric, dataset sources, known limitations, reviewer contact: bench/README.md.

Tests

  • TDD: go test ./bench/ — deterministic tokenizer, per-mode tool exposure, real savings in (0,1), baseline monotonicity. Race-clean.
  • gofmt, go vet, and golangci-lint v2 (strict CI config) all clean.

Scoped but NOT in this PR (tracked as follow-ups)

These need decisions / other lanes, so they're deliberately deferred (see bench/README.md):

  • Live run (docker-compose skeleton included): full schemas from GET /api/v1/tools for the exact headline number + Recall@k accuracy (reusing the Spec 065 retrieval golden set) + latency.
  • End-to-end task success with a pinned LLM — needs a pinned model + LLM-call budget.
  • CI publish-on-release-tag → public dashboard — Release/DevOps lane.
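
For the deferred accuracy measurement, Recall@k is the fraction of relevant tools that appear among the first k retrieved results. The sketch below shows the standard definition only; the function name, the tool names, and the input shapes are assumptions, since the Spec 065 golden-set format is not described here.

```go
package main

import "fmt"

// RecallAtK returns |relevant ∩ top-k retrieved| / |relevant|.
// Duplicate names in the retrieved list are counted once.
func RecallAtK(retrieved []string, relevant map[string]struct{}, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}
	seen := make(map[string]struct{}, k)
	hits := 0
	for _, name := range retrieved[:k] {
		if _, ok := relevant[name]; !ok {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		hits++
	}
	return float64(hits) / float64(len(relevant))
}

func main() {
	// Hypothetical query result and golden labels.
	retrieved := []string{"read_file", "list_dir", "write_file", "git_log"}
	relevant := map[string]struct{}{"read_file": {}, "write_file": {}}
	fmt.Println(RecallAtK(retrieved, relevant, 2)) // 0.5
	fmt.Println(RecallAtK(retrieved, relevant, 3)) // 1
}
```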

Related #MCP-42

…P-42)

Ship the first, fully-deterministic slice of the roadmap-#19 benchmark: the
token-reduction numbers behind the "massive token savings" claim. Reuses the
frozen Spec 065 tool corpus (45 tools, 7 reference servers) as a versioned,
non-drifting universe and tiktoken cl100k_base (already a dep) as a
reproducible model-agnostic estimator.

Compares the three routing modes' static context cost:
- baseline (all upstream tools loaded directly)
- retrieve_tools (BM25 discovery + call_tool variants)
- code_execution (orchestration + retrieve_tools)

over the corpus and reports per-mode savings. Real proxy tool defs are captured
verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance
recorded). Emits report.json + a self-contained dashboard.html (gitignored;
reports never committed, per Spec 065 CN-003).

Conservative by construction: input schemas excluded uniformly understates the
baseline, so measured savings (65.5% / 70.3% on the 45-tool corpus) are a floor.

Methodology, limitations, and the scoped-but-not-yet-built follow-ups (live run
with full schemas + accuracy/latency, LLM e2e, CI publish) are in bench/README.md.

Related #MCP-42

Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages Bot commented Jun 22, 2026

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit: 9a92d71
Status: ✅  Deploy successful!
Preview URL: https://5553fbac.mcpproxy-docs.pages.dev
Branch Preview URL: https://feat-mcp-42-bench-harness.mcpproxy-docs.pages.dev

codecov-commenter commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 63.50365% with 50 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bench/cmd/bench/main.go | 0.00% | 22 Missing ⚠️ |
| bench/report.go | 44.82% | 8 Missing and 8 partials ⚠️ |
| bench/tokens.go | 79.59% | 5 Missing and 5 partials ⚠️ |
| bench/proxytools.go | 88.88% | 2 Missing ⚠️ |

github-actions Bot commented Jun 22, 2026

📦 Build Artifacts

Workflow Run: View Run
Branch: feat/mcp-42-bench-harness

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (28 MB)
  • archive-windows-arm64 (25 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 27971074417 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

… smoke test

KimiReviewer finding 2: code_execution is at line 626 in mcp.go at 89f06b5,
not 675 as claimed. Line numbers drift with unrelated edits and the actual
function names are the stable identifier — remove all line numbers from the
provenance comment to prevent future rot.

KimiReviewer finding 3: add TestWriteReports_SmokeTest covering WriteReports
output (JSON round-trips to Report, HTML is non-empty and contains all mode
names). All 5 tests pass; golangci-lint v2 clean.

Related #MCP-42

Co-Authored-By: Paperclip <noreply@paperclip.ing>
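
The smoke test described in the commit message above checks two properties: the JSON output round-trips to a Report, and the HTML mentions every mode name. A minimal sketch of those two checks follows; the `Report` and `ModeResult` shapes and both helper names are hypothetical and do not reproduce the actual bench types or the WriteReports signature.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// ModeResult and Report are hypothetical shapes, not the harness's real types.
type ModeResult struct {
	Mode   string `json:"mode"`
	Tools  int    `json:"tools"`
	Tokens int    `json:"tokens"`
}

type Report struct {
	Modes []ModeResult `json:"modes"`
}

// RoundTrips reports whether r is unchanged after a JSON encode and decode.
func RoundTrips(r Report) (bool, error) {
	data, err := json.Marshal(r)
	if err != nil {
		return false, err
	}
	var got Report
	if err := json.Unmarshal(data, &got); err != nil {
		return false, err
	}
	return reflect.DeepEqual(r, got), nil
}

// MissingModes lists every mode name in r that does not appear in html.
func MissingModes(html string, r Report) []string {
	var missing []string
	for _, m := range r.Modes {
		if !strings.Contains(html, m.Mode) {
			missing = append(missing, m.Mode)
		}
	}
	return missing
}

func main() {
	// Counts for the non-baseline modes are left at zero; only names matter here.
	r := Report{Modes: []ModeResult{
		{Mode: "baseline", Tools: 45, Tokens: 1730},
		{Mode: "retrieve_tools"},
		{Mode: "code_execution"},
	}}
	ok, err := RoundTrips(r)
	fmt.Println(ok, err) // true <nil>

	html := "<h1>baseline</h1><h1>retrieve_tools</h1>"
	fmt.Println(MissingModes(html, r)) // [code_execution]
}
```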

mcpproxy-gatekeeper Bot left a comment

Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

Dumbris commented Jun 22, 2026

Member Author

CodexReviewer: changes requested (benchmark integrity). The fixture (bench/proxy_tools_v1.json) + README model only 6 proxy tools and assume minimal per-mode tool sets, but the real routing modes append the shared management tools (mcp_routing.go:337/437, mcp.go:656/780) — so proxy context cost is undercounted and the 65.5%/70.3% savings are overstated. Derive the per-mode tool catalog from the server builders (not a hand-maintained JSON) and re-run so the headline numbers are real. Details on MCP-42.

…cl. management tools (MCP-3161)

The token-reduction benchmark scored only 6 hand-maintained proxy tools and
omitted the shared management tool set (upstream_servers, quarantine_security,
search_servers, list_registries) that both routing modes append via
buildManagementTools. That undercounted the proxy-mode context cost and
inflated the headline savings (Codex finding on PR #747).

Replace bench/proxy_tools_v1.json with server.ProxyModeToolDefs, which builds
the catalog from the live builders (buildCallToolModeTools /
buildCodeExecModeTools in internal/server/mcp_routing.go) so it can never drift
from production and always reflects the tools the agent actually sees. This
also fixes a second drift: the fixture's retrieve_tools descriptions did not
match the per-mode builder descriptions.

Corrected figures over the 45-tool Spec 065 corpus (name+description only):
retrieve_tools ~17% (10 tools), code_execution ~43% (6 tools). Updated README
and notes; the schema-exclusion claim is no longer unambiguously conservative
now that large-schema management tools are in the proxy cost.

Tests: bench asserts both modes include the 4 management tools; internal/server
pins ProxyModeToolDefs to the builders so the catalog can't silently drift.

Related #747
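
The bench assertion described in the commit message above, that both proxy modes include the four management tools, can be sketched as a simple set check. The helper name and the sample catalog are illustrative; in the PR the catalog comes from server.ProxyModeToolDefs rather than a hand-written list.

```go
package main

import "fmt"

// managementTools are the four shared tools named in the commit message.
var managementTools = []string{
	"upstream_servers", "quarantine_security", "search_servers", "list_registries",
}

// MissingManagementTools returns the management tools absent from a mode's
// tool catalog. An empty result means the catalog is complete in that respect.
func MissingManagementTools(modeTools []string) []string {
	present := make(map[string]bool, len(modeTools))
	for _, name := range modeTools {
		present[name] = true
	}
	var missing []string
	for _, name := range managementTools {
		if !present[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Illustrative catalog only; the real one is built by the server builders.
	catalog := []string{"retrieve_tools", "upstream_servers", "quarantine_security", "search_servers"}
	fmt.Println(MissingManagementTools(catalog)) // [list_registries]
}
```

A check of this kind fails loudly if a fixture or builder drops a management tool, which is the undercounting that the first review flagged.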

mcpproxy-gatekeeper Bot left a comment

Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

mcpproxy-gatekeeper Bot left a comment

✅ Gatekeeper approval — MCP-42 benchmark harness on corrected head 9a92d71. Full mandated gate satisfied: CodexReviewer (first review) caught inflated savings (fixture omitted management tools); BackendEngineer fixed it (derive per-mode catalog from live server builders); KimiReviewer ACCEPT (model-diverse) + QATester PASS (MCP-3162) on this head + operator-verified. Honest numbers now: retrieve_tools ~17%, code_execution ~43% (were 65.5/70.3). CI green. Author≠approver.

@Dumbris Dumbris merged commit 4a24175 into main Jun 22, 2026
38 checks passed