smart-mcp-proxy · Dumbris · Jun 22, 2026
diff --git a/bench/.gitignore b/bench/.gitignore
@@ -0,0 +1,2 @@
+# Benchmark run artifacts are never committed (Spec 065 CN-003).
+results/
diff --git a/bench/README.md b/bench/README.md
@@ -0,0 +1,135 @@
+# mcpproxy benchmark harness
+
+The reproducible numbers behind mcpproxy's marketing claims — **token reduction**,
+**discovery accuracy**, and **latency** — comparing three ways an agent can be
+wired to upstream MCP tools.
+
+> Roadmap item #19 (MCP-42). In-repo (`bench/`), reproducible, intended to be
+> refreshed on release. Reports are **never committed** (Spec 065 CN-003); only
+> code, fixtures, and this methodology are versioned.
+
+## The three modes
+
+| Mode | What the agent sees in context | mcpproxy server |
+|------|--------------------------------|-----------------|
+| `baseline` | Every upstream tool definition, loaded directly | (no proxy discovery) |
+| `retrieve_tools` | `retrieve_tools` + `call_tool_read/write/destructive` + `read_cache` + `code_execution` + management tools; tools found on demand via BM25 | `callToolServer` |
+| `code_execution` | `code_execution` + `retrieve_tools` + management tools; many tools orchestrated from sandboxed JS in one round-trip | `codeExecServer` |
+
+Both proxy modes also append the shared **management tool set** —
+`upstream_servers`, `quarantine_security`, `search_servers`, `list_registries`
+— that the live routing-mode servers expose. These count against the proxy
+context cost: omitting them undercounts that cost and inflates the savings.
+
+The per-mode catalog is **derived directly from the live tool builders**
+(`buildCallToolModeTools` / `buildCodeExecModeTools` in
+`internal/server/mcp_routing.go`, via `server.ProxyModeToolDefs`), so it can
+never drift from production.
+
+## What ships today (deterministic, offline)
+
+The **token-reduction** measurement is fully deterministic and runs with no
+network or LLM:
+
+```bash
+go run ./bench/cmd/bench            # scores the committed Spec 065 corpus
+go test ./bench/                    # unit + invariant tests
+```
+
+It counts the context-token cost of each mode over a **frozen tool corpus** and
+reports the savings of each proxy mode versus the baseline. Output: a
+`report.json` and a self-contained `dashboard.html` in `bench/results/`
+(gitignored).
+
+### Current deterministic result
+
+Over the 45-tool Spec 065 reference corpus, counting **tool name + description
+only** (schemas excluded uniformly — see limitations), `cl100k_base`:
+
+| Mode | Context tools | Tokens | Savings vs. baseline |
+|------|---------------|--------|----------------------|
+| `baseline` | 45 | 1730 | — |
+| `retrieve_tools` | 10 | 1431 | **~17%** |
+| `code_execution` | 6 | 986 | **~43%** |
+
+These are deliberately modest: the proxy context here is the *full* per-mode
+tool set (discovery + call-tool variants + management tools), and the corpus is
+small. Savings approach 100% as the upstream tool count rises (the baseline
+grows linearly while the proxy context stays fixed), so always quote the corpus
+size alongside a percentage. Reproduce with `go run ./bench/cmd/bench`.
+
+### Scoring rubric — token reduction
+
+- **Tool universe**: the frozen Spec 065 snapshot
+  `specs/065-evaluation-foundation/datasets/corpus_v1.tools.json` — 45 tools
+  across 7 no-auth reference servers. Frozen + versioned so scoring never runs
+  against a drifting corpus (CN-002).
+- **Tokenizer**: `tiktoken cl100k_base`, a widely used, reproducible BPE
+  (already a repo dependency). It is a **model-agnostic estimator**; exact
+  counts for a specific pinned model (e.g. Claude) will differ, but the
+  *relative* savings between modes are stable.
+- **Proxy-mode tools**: the *complete* per-mode catalog, derived from the live
+  server builders — discovery, the call-tool variants, `code_execution`, **and
+  the shared management tool set** (`upstream_servers`, `quarantine_security`,
+  `search_servers`, `list_registries`). Nothing the agent actually sees is
+  dropped from the proxy cost.
+- **Cost of a tool**: `name + "\n" + description`. JSON input schemas are
+  excluded **uniformly** across all modes (the committed corpus snapshot does
+  not carry schemas).
+- **Savings** for a mode `m`: `1 - tokens(m) / tokens(baseline)`.
+
+### Known limitations (read before quoting a number)
+
+- **Schemas excluded — direction is not clean.** Input schemas are dropped from
+  *both* sides. The 45 baseline tools lose their schemas, but so do the proxy
+  modes' management tools (e.g. `upstream_servers` carries a large multi-field
+  schema). So the name+description-only number is **not** unambiguously
+  conservative; it is its own well-defined metric. The planned live run will add
+  full schemas from `GET /api/v1/tools` for the exact headline number; quote
+  that for marketing, not this offline estimate.
+- **Savings scale with tool count.** The 45-tool reference corpus is small; real
+  deployments expose hundreds to thousands of tools, where the baseline grows
+  linearly and the proxy context stays fixed, so savings approach 100%.
+  Quote the corpus size alongside any percentage.
+- **`cl100k_base` ≠ the pinned model's tokenizer.** Pinning the exact tokenizer
+  for the headline model is tracked as a follow-up (see the follow-ups below).
+
+## What is scoped but not yet built (follow-ups)
+
+These require decisions and/or other roles, so they are tracked as child issues
+rather than landed here:
+
+- **Live run with full schemas + accuracy + latency** — boot mcpproxy over the
+  Spec 065 `snapshot-servers.config.json` (see `docker-compose.yml`), pull
+  `GET /api/v1/tools` for exact schemas, and:
+  - **Accuracy**: replay the Spec 065 retrieval golden set
+    (`retrieval_golden_v1.json`) through `retrieve_tools` and score Recall@k /
+    MRR / nDCG (deterministic, no LLM) — reuses the D1 scorer.
+  - **Latency**: measure proxy-side `retrieve_tools` search latency vs. the
+    fixed cost of loading all tools.
+- **End-to-end task success with a pinned LLM** — requires a pinned model + an
+  LLM-call budget; this is the only part that costs spend.
+- **CI publish-on-release-tag → public static dashboard** — Release/DevOps lane.
+
+## Dataset sources & provenance
+
+- Tool corpus + retrieval golden set: Spec 065 frozen datasets
+  (`specs/065-evaluation-foundation/datasets/`), generated from 7 permissively
+  reachable no-auth reference servers (filesystem, git, memory, sqlite, fetch,
+  time, sequential-thinking).
+- Proxy + management tool definitions: derived at run time from the live server
+  tool builders (`internal/server/mcp_routing.go` →
+  `buildCallToolModeTools` / `buildCodeExecModeTools`, exposed via
+  `internal/server.ProxyModeToolDefs`). No hand-maintained fixture — the
+  benchmark cannot drift from the tools the proxy actually serves.
+
+## Reproducible live run (skeleton)
+
+`docker-compose.yml` boots mcpproxy over the frozen reference-server config so
+the corpus and live tool list are reproducible across machines. Wiring the live
+accuracy/latency scorers into it is the follow-up above.
+
+## Reviewer contact
+
+Methodology questions / disputes: open an issue in `smart-mcp-proxy/mcpproxy-go`
+and tag the maintainers, or comment on the roadmap benchmark ticket (MCP-42).
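Reviewer note on the README above: the rubric's savings formula, `1 - tokens(m) / tokens(baseline)`, can be checked against the result table in a few lines of Go. This is a minimal sketch, not code from the `bench` package. `savings` and `projectedSavings` are hypothetical helpers, and the projection assumes the mean per-tool cost stays at the corpus average (1730 / 45), so its output is an illustration of the scaling argument and not a measured result.

```go
package main

import "fmt"

// savings implements the README rubric: 1 - tokens(m) / tokens(baseline).
func savings(modeTokens, baselineTokens int) float64 {
	return 1 - float64(modeTokens)/float64(baselineTokens)
}

// projectedSavings is a hypothetical extrapolation: it assumes the baseline
// grows linearly at a constant per-tool cost while the proxy context is fixed.
func projectedSavings(proxyTokens int, perToolTokens float64, upstreamTools int) float64 {
	return 1 - float64(proxyTokens)/(perToolTokens*float64(upstreamTools))
}

func main() {
	const baseline = 1730 // tokens for the 45-tool corpus, from the README table
	modes := []struct {
		name   string
		tokens int
	}{
		{"retrieve_tools", 1431},
		{"code_execution", 986},
	}
	for _, m := range modes {
		fmt.Printf("%-16s %.1f%%\n", m.name, savings(m.tokens, baseline)*100)
	}

	// Illustration only: assumed average per-tool cost, not a benchmark result.
	perTool := float64(baseline) / 45
	for _, n := range []int{45, 450, 4500} {
		fmt.Printf("%5d tools: retrieve_tools ~%.1f%% (projected)\n",
			n, projectedSavings(1431, perTool, n)*100)
	}
}
```

The first loop reproduces the table's rounded percentages. The second shows why the README asks for the corpus size next to any percentage: the same fixed proxy cost yields very different ratios at different tool counts.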
diff --git a/bench/cmd/bench/main.go b/bench/cmd/bench/main.go
@@ -0,0 +1,52 @@
+// Command bench runs the mcpproxy token-reduction benchmark over a frozen tool
+// corpus and writes a JSON report plus a static HTML dashboard.
+//
+// Usage:
+//
+//	go run ./bench/cmd/bench [-corpus PATH] [-out DIR] [-encoding NAME]
+//
+// With no flags it scores the committed Spec 065 frozen corpus and writes the
+// reports to bench/results/ (gitignored — reports are never committed, per the
+// Spec 065 CN-003 repo rule).
+package main
+
+import (
+	"flag"
+	"fmt"
+	"log"
+	"os"
+
+	"github.com/smart-mcp-proxy/mcpproxy-go/bench"
+)
+
+func main() {
+	corpusPath := flag.String("corpus", "specs/065-evaluation-foundation/datasets/corpus_v1.tools.json", "path to the frozen tool corpus snapshot")
+	outDir := flag.String("out", "bench/results", "output directory for report.json and dashboard.html")
+	encoding := flag.String("encoding", bench.DefaultEncoding, "tiktoken encoding name")
+	flag.Parse()
+
+	tk, err := bench.NewTokenizer(*encoding)
+	if err != nil {
+		log.Fatalf("bench: %v", err)
+	}
+	corpus, err := bench.LoadCorpus(*corpusPath)
+	if err != nil {
+		log.Fatalf("bench: %v", err)
+	}
+
+	report := bench.ComputeReport(tk, corpus)
+	jsonPath, htmlPath, err := report.WriteReports(*outDir)
+	if err != nil {
+		log.Fatalf("bench: %v", err)
+	}
+
+	fmt.Fprintf(os.Stdout, "mcpproxy token-reduction benchmark (corpus %s, %d tools, %s)\n", report.CorpusVersion, report.CorpusTools, report.Encoding)
+	for _, m := range report.Modes {
+		if m.Mode == bench.ModeBaseline {
+			fmt.Fprintf(os.Stdout, "  %-16s %6d tokens (%d tools)  baseline\n", m.Mode, m.Tokens, m.ContextTools)
+			continue
+		}
+		fmt.Fprintf(os.Stdout, "  %-16s %6d tokens (%d tools)  %.1f%% fewer tokens\n", m.Mode, m.Tokens, m.ContextTools, m.SavingsRatio*100)
+	}
+	fmt.Fprintf(os.Stdout, "wrote %s and %s\n", jsonPath, htmlPath)
+}
diff --git a/bench/docker-compose.yml b/bench/docker-compose.yml
@@ -0,0 +1,37 @@
+# Reproducible benchmark substrate (skeleton).
+#
+# Boots mcpproxy over the frozen Spec 065 reference-server config so the tool
+# corpus and live tool list are identical across machines. The live
+# accuracy/latency scorers (see bench/README.md "follow-ups") attach to this.
+#
+# Usage:
+#   docker compose -f bench/docker-compose.yml up --build
+#   # then, against the running proxy on 127.0.0.1:8092:
+#   #   GET /api/v1/tools     -> full tool defs (with schemas) for the live token run
+#   #   retrieve_tools        -> Recall@k accuracy over retrieval_golden_v1.json
+#
+# The committed corpus_v1 snapshot was frozen from exactly this config
+# (specs/065-evaluation-foundation/datasets/README.md), so a live snapshot here
+# reproduces it (modulo upstream-server version drift — pin images before
+# publishing headline numbers).
+services:
+  mcpproxy:
+    build:
+      context: ..
+      dockerfile: Dockerfile
+    command:
+      - serve
+      - --config=/data/snapshot-servers.config.json
+      - --data-dir=/data/state
+      - --listen=0.0.0.0:8092
+    environment:
+      MCPPROXY_API_KEY: eval-corpus-snapshot
+    ports:
+      - "127.0.0.1:8092:8092"
+    volumes:
+      # The frozen, servable reference-server config (7 no-auth servers).
+      - ../specs/065-evaluation-foundation/datasets/snapshot-servers.config.json:/data/snapshot-servers.config.json:ro
+      - bench-state:/data/state
+
+volumes:
+  bench-state:
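Reviewer note on the compose file: the usage comment mentions `GET /api/v1/tools` against `127.0.0.1:8092`, so a small client sketch may help whoever picks up the live run. The code below only builds the request and does not send it. The host, port, path, and key value come from the compose file; the `X-API-Key` header name is an assumption that should be checked against the mcpproxy API documentation, and `newToolsRequest` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
)

// newToolsRequest builds (but does not send) the GET /api/v1/tools request.
// ASSUMPTION: the API key travels in an X-API-Key header; verify before use.
func newToolsRequest(baseURL, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/v1/tools", nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("X-API-Key", apiKey)
	return req, nil
}

func main() {
	// Address and key match bench/docker-compose.yml.
	req, err := newToolsRequest("http://127.0.0.1:8092", "eval-corpus-snapshot")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

To actually fetch the tool list, the request could be passed to an `http.Client` with a timeout once the compose stack is up; the response shape is not specified in this diff, so decoding is left out.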
diff --git a/bench/proxytools.go b/bench/proxytools.go
@@ -0,0 +1,40 @@
+package bench
+
+import (
+	"github.com/smart-mcp-proxy/mcpproxy-go/internal/config"
+	"github.com/smart-mcp-proxy/mcpproxy-go/internal/server"
+)
+
+// ProxyToolsForMode returns the built-in mcpproxy proxy + management tool
+// definitions that occupy the agent's context window in the given routing mode.
+//
+// The catalog is derived directly from the live server tool builders
+// (internal/server.ProxyModeToolDefs → buildCallToolModeTools /
+// buildCodeExecModeTools in internal/server/mcp_routing.go). This is the single
+// source of truth: both routing modes append the shared management tool set
+// (upstream_servers, quarantine_security, search_servers, list_registries), so
+// deriving from the builders guarantees the benchmark counts the real per-mode
+// context cost, cannot drift from production, and cannot reintroduce the
+// undercount that inflated the headline savings (MCP-3161).
+func ProxyToolsForMode(mode string) []Tool {
+	var routingMode string
+	switch mode {
+	case ModeCodeExecution:
+		routingMode = config.RoutingModeCodeExecution
+	case ModeRetrieveTools:
+		routingMode = config.RoutingModeRetrieveTools
+	default:
+		return nil
+	}
+
+	defs := server.ProxyModeToolDefs(routingMode)
+	out := make([]Tool, 0, len(defs))
+	for _, d := range defs {
+		out = append(out, Tool{
+			ToolID:      "mcpproxy:" + d.Name,
+			Name:        d.Name,
+			Description: d.Description,
+		})
+	}
+	return out
+}
diff --git a/bench/report.go b/bench/report.go
@@ -0,0 +1,105 @@
+package bench
+
+import (
+	"encoding/json"
+	"fmt"
+	"html/template"
+	"os"
+	"path/filepath"
+)
+
+// WriteJSON writes the report as indented JSON to path.
+func (r *Report) WriteJSON(path string) error {
+	data, err := json.MarshalIndent(r, "", "  ")
+	if err != nil {
+		return fmt.Errorf("marshal report: %w", err)
+	}
+	if err := os.WriteFile(path, append(data, '\n'), 0o644); err != nil {
+		return fmt.Errorf("write %q: %w", path, err)
+	}
+	return nil
+}
+
+// WriteHTML renders the report as a self-contained static dashboard. The output
+// is a single file with no external assets so it can be published as-is to a
+// static host (CI release-tag publishing is tracked as a follow-up).
+func (r *Report) WriteHTML(path string) error {
+	tmpl, err := template.New("dashboard").Funcs(template.FuncMap{
+		"pct": func(f float64) string { return fmt.Sprintf("%.1f%%", f*100) },
+	}).Parse(dashboardHTML)
+	if err != nil {
+		return fmt.Errorf("parse template: %w", err)
+	}
+	f, err := os.Create(path)
+	if err != nil {
+		return fmt.Errorf("create %q: %w", path, err)
+	}
+	defer f.Close()
+	if err := tmpl.Execute(f, r); err != nil {
+		return fmt.Errorf("render dashboard: %w", err)
+	}
+	return nil
+}
+
+// WriteReports writes both report.json and dashboard.html into dir.
+func (r *Report) WriteReports(dir string) (jsonPath, htmlPath string, err error) {
+	if err = os.MkdirAll(dir, 0o755); err != nil {
+		return "", "", fmt.Errorf("mkdir %q: %w", dir, err)
+	}
+	jsonPath = filepath.Join(dir, "report.json")
+	htmlPath = filepath.Join(dir, "dashboard.html")
+	if err = r.WriteJSON(jsonPath); err != nil {
+		return "", "", err
+	}
+	if err = r.WriteHTML(htmlPath); err != nil {
+		return "", "", err
+	}
+	return jsonPath, htmlPath, nil
+}
+
+const dashboardHTML = `<!doctype html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>mcpproxy benchmark — token reduction</title>
+<style>
+  :root { color-scheme: light dark; }
+  body { font: 16px/1.5 system-ui, sans-serif; max-width: 880px; margin: 2rem auto; padding: 0 1rem; }
+  h1 { margin-bottom: .25rem; }
+  .sub { opacity: .7; margin-top: 0; }
+  table { border-collapse: collapse; width: 100%; margin: 1.5rem 0; }
+  th, td { padding: .6rem .8rem; text-align: right; border-bottom: 1px solid #8884; }
+  th:first-child, td:first-child { text-align: left; }
+  .save { font-weight: 600; color: #1a8f3c; }
+  code { background: #8881; padding: .1rem .35rem; border-radius: 4px; }
+  .notes { font-size: .9rem; opacity: .8; }
+  .notes li { margin: .3rem 0; }
+</style>
+</head>
+<body>
+<h1>mcpproxy benchmark</h1>
+<p class="sub">Token cost of loading tools into an agent's context, by routing mode.</p>
+<p>Corpus <code>{{.CorpusVersion}}</code> &middot; {{.CorpusTools}} tools &middot; encoding <code>{{.Encoding}}</code></p>
+<table>
+  <thead>
+    <tr><th>Mode</th><th>Tools in context</th><th>Context tokens</th><th>Savings vs. baseline</th></tr>
+  </thead>
+  <tbody>
+  {{range .Modes}}
+    <tr>
+      <td><code>{{.Mode}}</code></td>
+      <td>{{.ContextTools}}</td>
+      <td>{{.Tokens}}</td>
+      <td class="save">{{if eq .Mode "baseline"}}&mdash;{{else}}{{pct .SavingsRatio}}{{end}}</td>
+    </tr>
+  {{end}}
+  </tbody>
+</table>
+<h2>Methodology notes</h2>
+<ul class="notes">
+{{range .Notes}}<li>{{.}}</li>{{end}}
+</ul>
+</body>
+</html>
+`