feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a)#748

Merged
Dumbris merged 3 commits into main from feat/mcp-42a-live-bench
Jun 23, 2026
Conversation

@Dumbris Dumbris commented Jun 22, 2026

Member

Summary

Second slice of the MCP-42 benchmark harness, extending bench/ (PR #747) with a live run against a running proxy. Deterministic and LLM-free.

Adds the three measurements the issue asked for:

  • Exact token number (full schemas). GET /api/v1/tools pulls upstream tools with their full JSON input schemas; the proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema, marshaled from the real tools/list InputSchema). Schemas are counted on both sides, so the headline savings is authoritative.
    • MCP-3161 guard: if any proxy tool lacks a schema, counting the baseline's schemas alone would overstate savings, so the run withholds the headline % (authoritative_headline: false) and reports raw token totals only.
  • Accuracy. Replays the Spec 065 golden set (retrieval_golden_v1.json) through the proxy's BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}, MRR, nDCG@10, MAP against graded labels. Field names mirror Spec 065 score-report.schema.json.
  • Latency. Client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost. Client-side on purpose — the server SearchToolsResponse.took is a "0ms" stub.
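
To make the accuracy measurement concrete, here is a minimal Go sketch of Recall@k and reciprocal rank for a single query. The function names, the tool IDs, and the rule that any grade above zero counts as relevant are assumptions for illustration; this is not the code in bench/metrics.go.

```go
package main

import "fmt"

// recallAtK returns the share of relevant tools found in the top k results.
// relevant maps tool ID -> graded label; a grade > 0 is treated as relevant
// (an assumption made for this sketch).
func recallAtK(ranked []string, relevant map[string]int, k int) float64 {
	total := 0
	for _, grade := range relevant {
		if grade > 0 {
			total++
		}
	}
	if total == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	seen := map[string]bool{} // guard against duplicate IDs in the ranking
	for _, id := range ranked[:k] {
		if relevant[id] > 0 && !seen[id] {
			hits++
			seen[id] = true
		}
	}
	return float64(hits) / float64(total)
}

// reciprocalRank returns 1/rank of the first relevant result, or 0 if none.
// MRR is the mean of this value over all queries.
func reciprocalRank(ranked []string, relevant map[string]int) float64 {
	for i, id := range ranked {
		if relevant[id] > 0 {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	// Hypothetical search output and golden labels for one query.
	ranked := []string{"github:search_issues", "jira:create_ticket", "github:create_issue"}
	relevant := map[string]int{"github:create_issue": 2, "github:list_issues": 1}

	for _, k := range []int{1, 3, 5, 10} {
		fmt.Printf("Recall@%d = %.2f\n", k, recallAtK(ranked, relevant, k))
	}
	fmt.Printf("RR = %.3f\n", reciprocalRank(ranked, relevant))
}
```

In this example one of the two relevant tools is retrieved at rank 3, so Recall@1 is 0, Recall@3 and above are 0.5, and the reciprocal rank is 1/3.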

How to run

docker compose -f bench/docker-compose.yml up --build -d
go run ./bench/cmd/bench -live -proxy http://127.0.0.1:8092 -api-key eval-corpus-snapshot

-live writes bench/results/live_report.json (gitignored, CN-003). Default (no -live) keeps the offline token run unchanged.

Design notes

  • Reuses the external mcp-eval D1 approach (re-implemented in Go), not its code.
  • Proxy schemas come from the live builders (zero drift) — no hand-maintained fixture, no separate MCP handshake.

Tests

  • metrics_test.go: Recall@k/MRR/nDCG/MAP against hand-computed worked examples.
  • live_test.go: httptest-stubbed /api/v1/tools + /api/v1/index/search; schema-aware token counting.
  • live_report_test.go: real golden-set load (47 queries), latency percentiles, authoritative-headline path, and the MCP-3161 withhold guard.
  • go test ./bench/... -race, go vet, and strict golangci-lint v2 all clean.
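
For readers unfamiliar with the kind of hand-computed example mentioned above, here is a small Go sketch of nDCG@k with one worked case. It assumes linear gain (grade divided by log2(rank+1)) and unique IDs in the ranking; the real metrics.go may use a different gain function, and the names here are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dcg sums grade/log2(rank+1) over the first k grades (linear gain, assumed).
func dcg(grades []int, k int) float64 {
	if k > len(grades) {
		k = len(grades)
	}
	sum := 0.0
	for i := 0; i < k; i++ {
		sum += float64(grades[i]) / math.Log2(float64(i+2))
	}
	return sum
}

// ndcgAtK normalizes the DCG of the actual ranking by the DCG of the ideal
// ranking (all labeled grades sorted in descending order).
func ndcgAtK(ranked []string, labels map[string]int, k int) float64 {
	got := make([]int, len(ranked))
	for i, id := range ranked {
		got[i] = labels[id] // unlabeled IDs get grade 0
	}
	ideal := make([]int, 0, len(labels))
	for _, grade := range labels {
		ideal = append(ideal, grade)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))

	idcg := dcg(ideal, k)
	if idcg == 0 {
		return 0
	}
	return dcg(got, k) / idcg
}

func main() {
	// Worked example: the grade-1 tool is ranked above the grade-2 tool.
	//   DCG  = 1/log2(2) + 2/log2(3) ≈ 2.2619
	//   IDCG = 2/log2(2) + 1/log2(3) ≈ 2.6309
	//   nDCG ≈ 0.8597
	ranked := []string{"b", "a", "c"}
	labels := map[string]int{"a": 2, "b": 1}
	fmt.Printf("nDCG@10 = %.4f\n", ndcgAtK(ranked, labels, 10))
}
```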

Out of scope (follow-ups)

LLM end-to-end task success (pinned model + budget); CI publish-on-tag (Release lane, MCP-3133).

Closes MCP-3132.

…MCP-42a)

Extends the bench/ harness (PR #747) with a live run against a running proxy:

- Exact token number: GET /api/v1/tools pulls upstream tools WITH full JSON
  input schemas; proxy-mode tools carry their live schemas via the extended
  server.ProxyModeToolDefs (BenchProxyToolDef.Schema). Schemas counted on BOTH
  sides so the headline savings is authoritative — and withheld
  (authoritative_headline=false) if any proxy tool lacks a schema, the MCP-3161
  overstatement guard.
- Accuracy: replays the Spec 065 retrieval golden set through the proxy BM25
  search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}/MRR/nDCG@10/MAP
  against graded labels (deterministic, no LLM). Field names mirror Spec 065
  score-report.schema.json.
- Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the
  one-shot load-all-tools cost (server "took" is a 0ms stub).

CLI: `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`. Reports stay
gitignored (CN-003). All metric math + the live client are unit-tested with
httptest stubs; the docker-compose substrate is the live-reproduction path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
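
The client-side latency measurement can be sketched roughly as follows. This is an illustration under assumptions, not the harness code: the `q` query parameter, the stub response body, and the nearest-rank percentile method are all hypothetical choices, and the API-key header is omitted because its name is not given here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile of an ascending-sorted
// slice, or the zero value when the slice is empty.
func percentile[T any](sorted []T, p int) T {
	var zero T
	if len(sorted) == 0 {
		return zero
	}
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// timeSearch measures one search round trip on the client, including the
// time to read the full response body.
func timeSearch(client *http.Client, baseURL, query string) (time.Duration, error) {
	start := time.Now()
	// The "q" parameter name is an assumption for this sketch.
	resp, err := client.Get(baseURL + "/api/v1/index/search?q=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	// Stand-in for the proxy; the response shape is made up.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, `{"results":[]}`)
	}))
	defer stub.Close()

	queries := []string{"create issue", "send message", "list files", "run query"}
	var samples []time.Duration
	for _, q := range queries {
		d, err := timeSearch(stub.Client(), stub.URL, q)
		if err != nil {
			fmt.Println("search failed:", err)
			return
		}
		samples = append(samples, d)
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })

	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p95:", percentile(samples, 95))
	fmt.Println("p99:", percentile(samples, 99))
	fmt.Println("max:", samples[len(samples)-1])
}
```

Timing on the client captures network and serialization cost as well as search time, which is the only option while the server-reported value is a stub.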
cloudflare-workers-and-pages Bot commented Jun 22, 2026


Deploying mcpproxy-docs with Cloudflare Pages

Latest commit: 0602786
Status: ✅  Deploy successful!
Preview URL: https://2112823c.mcpproxy-docs.pages.dev
Branch Preview URL: https://feat-mcp-42a-live-bench.mcpproxy-docs.pages.dev

codecov-commenter commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 67.34694% with 112 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bench/cmd/bench/main.go | 0.00% | 44 Missing ⚠️ |
| bench/live_report.go | 75.00% | 18 Missing and 8 partials ⚠️ |
| bench/live.go | 69.23% | 12 Missing and 12 partials ⚠️ |
| bench/metrics.go | 81.81% | 9 Missing and 9 partials ⚠️ |

github-actions Bot commented Jun 22, 2026


📦 Build Artifacts

Workflow Run: View Run
Branch: feat/mcp-42a-live-bench

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (28 MB)
  • archive-windows-arm64 (25 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 27976001588 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

ConvertGenericToolsToTyped read generic["schema"], but every producer
of the generic tool map (runtime/server GetServerTools, mcp.go) emits the
upstream input schema under "inputSchema". The /api/v1/tools response
therefore dropped every schema, so the MCP-42a live benchmark baseline was
silently a description-only token count instead of the required full-schema
count, while still able to emit authoritative_headline=true.

- Read "inputSchema" first in the converter, keep "schema" as a legacy fallback.
- Gate the live headline on baseline schemas too (BaselineSchemasCounted via
  anyHaveSchema): a systematically schema-less baseline now withholds the
  headline instead of claiming a full-schema baseline it never had.
- Tests: converter preserves inputSchema (+legacy schema fallback); headline
  withheld when the baseline carries no schemas.

Related #748
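
The two parts of this fix can be sketched as below. This is a simplified illustration, not the actual ConvertGenericToolsToTyped or live-report code: the `tool` type, `schemaOf`, `allHaveSchema`, and `authoritativeHeadline` are hypothetical names, and only `anyHaveSchema` and the `inputSchema`/`schema` keys come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tool is a hypothetical typed tool record used only in this sketch.
type tool struct {
	Name   string
	Schema json.RawMessage
}

// schemaOf reads the schema under "inputSchema" first and falls back to the
// legacy "schema" key, returning nil when neither is present.
func schemaOf(generic map[string]any) json.RawMessage {
	for _, key := range []string{"inputSchema", "schema"} {
		v, ok := generic[key]
		if !ok || v == nil {
			continue
		}
		raw, err := json.Marshal(v)
		if err != nil {
			continue
		}
		return raw
	}
	return nil
}

// anyHaveSchema reports whether at least one tool carries a schema.
func anyHaveSchema(tools []tool) bool {
	for _, t := range tools {
		if len(t.Schema) > 0 {
			return true
		}
	}
	return false
}

// allHaveSchema reports whether the list is non-empty and every tool carries a schema.
func allHaveSchema(tools []tool) bool {
	for _, t := range tools {
		if len(t.Schema) == 0 {
			return false
		}
	}
	return len(tools) > 0
}

// authoritativeHeadline publishes the savings percentage only when schemas
// were counted on both sides; otherwise the headline is withheld.
func authoritativeHeadline(baseline, proxy []tool) bool {
	return anyHaveSchema(baseline) && allHaveSchema(proxy)
}

func main() {
	upstream := map[string]any{
		"name":        "create_issue",
		"inputSchema": map[string]any{"type": "object"},
	}
	baseline := []tool{{Name: "create_issue", Schema: schemaOf(upstream)}}
	proxy := []tool{{Name: "retrieve_tools", Schema: json.RawMessage(`{"type":"object"}`)}}

	fmt.Println("authoritative_headline:", authoritativeHeadline(baseline, proxy))

	// A baseline that lost every schema must not produce a headline.
	stripped := []tool{{Name: "create_issue"}}
	fmt.Println("authoritative_headline:", authoritativeHeadline(stripped, proxy))
}
```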

@mcpproxy-gatekeeper mcpproxy-gatekeeper Bot left a comment


Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

…hema

Addresses CodexReviewer finding on PR #748 / MCP-3167: the live `retrieval`
payload emitted flat metric fields, but score-report.schema.json requires
nested `retrieval.metrics` + `retrieval.gate`. Restructure RetrievalMetrics into
{metrics, gate} so live_report.json validates against the contract, proven by a
new jsonschema-validation test (TestRetrievalMetricsConformsToScoreReportSchema).

A standalone live run has no stored baseline, so gate.passed is true by
construction (CI regression-gating against a committed baseline is MCP-3133).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
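
The shape change described in this commit might look like the following in Go. Only the nesting (`retrieval.metrics`, `retrieval.gate`) and `gate.passed` come from the commit message; the struct names and the metric key `mrr` are placeholders, and the real field names are whatever score-report.schema.json specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// retrievalGate holds the regression-gate outcome. A standalone live run has
// no stored baseline, so Passed is set to true by construction.
type retrievalGate struct {
	Passed bool `json:"passed"`
}

// retrievalReport nests the scores under "metrics" instead of emitting them
// as flat fields. The metric key names are placeholders in this sketch.
type retrievalReport struct {
	Metrics map[string]float64 `json:"metrics"`
	Gate    retrievalGate      `json:"gate"`
}

// liveReport is a hypothetical top-level report containing only the
// retrieval section.
type liveReport struct {
	Retrieval retrievalReport `json:"retrieval"`
}

// encode serializes the report to compact JSON.
func encode(r liveReport) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	report := liveReport{
		Retrieval: retrievalReport{
			Metrics: map[string]float64{"mrr": 0.5},
			Gate:    retrievalGate{Passed: true},
		},
	}
	out, err := encode(report)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```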

@mcpproxy-gatekeeper mcpproxy-gatekeeper Bot left a comment


Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

@mcpproxy-gatekeeper mcpproxy-gatekeeper Bot left a comment


✅ Gatekeeper approval — MCP-42a live benchmark run (full schemas + Recall@k + latency). CodexReviewer ACCEPT + QATester PASS on head 0602786. Live CLI (-live/-proxy/-api-key/-golden) writes live_report.json without changing offline mode; pulls /api/v1/tools schemas + scores /api/v1/index/search with client-measured latency. CI green. Author≠approver.

@Dumbris Dumbris merged commit e3588fa into main Jun 23, 2026
38 checks passed