2 changes: 2 additions & 0 deletions bench/.gitignore
@@ -0,0 +1,2 @@
# Benchmark run artifacts are never committed (Spec 065 CN-003).
results/
135 changes: 135 additions & 0 deletions bench/README.md
@@ -0,0 +1,135 @@
# mcpproxy benchmark harness

The reproducible numbers behind mcpproxy's marketing claims — **token reduction**,
**discovery accuracy**, and **latency** — comparing three ways an agent can be
wired to upstream MCP tools.

> Roadmap item #19 (MCP-42). In-repo (`bench/`), reproducible, intended to be
> refreshed on release. Reports are **never committed** (Spec 065 CN-003); only
> code, fixtures, and this methodology are versioned.

## The three modes

| Mode | What the agent sees in context | mcpproxy server |
|------|--------------------------------|-----------------|
| `baseline` | Every upstream tool definition, loaded directly | (no proxy discovery) |
| `retrieve_tools` | `retrieve_tools` + `call_tool_read/write/destructive` + `read_cache` + `code_execution` + management tools; tools found on demand via BM25 | `callToolServer` |
| `code_execution` | `code_execution` + `retrieve_tools` + management tools; many tools orchestrated from sandboxed JS in one round-trip | `codeExecServer` |

Both proxy modes also append the shared **management tool set** —
`upstream_servers`, `quarantine_security`, `search_servers`, `list_registries`
— that the live routing-mode servers expose. These count against the proxy
context cost: omitting them undercounts that cost and inflates the savings.

The per-mode catalog is **derived directly from the live tool builders**
(`buildCallToolModeTools` / `buildCodeExecModeTools` in
`internal/server/mcp_routing.go`, via `server.ProxyModeToolDefs`), so it can
never drift from production.

## What ships today (deterministic, offline)

The **token-reduction** measurement is fully deterministic and runs with no
network or LLM:

```bash
go run ./bench/cmd/bench # scores the committed Spec 065 corpus
go test ./bench/ # unit + invariant tests
```

It counts the context-token cost of each mode over a **frozen tool corpus** and
reports the savings of each proxy mode versus the baseline. Output: a
`report.json` and a self-contained `dashboard.html` in `bench/results/`
(gitignored).

### Current deterministic result

Over the 45-tool Spec 065 reference corpus, counting **tool name + description
only** (schemas excluded uniformly — see limitations), `cl100k_base`:

| Mode | Context tools | Tokens | Savings vs. baseline |
|------|---------------|--------|----------------------|
| `baseline` | 45 | 1730 | — |
| `retrieve_tools` | 10 | 1431 | **~17%** |
| `code_execution` | 6 | 986 | **~43%** |

These are deliberately modest: the proxy context here is the *full* per-mode
tool set (discovery + call-tool variants + management tools), and the corpus is
small. Savings grow toward the asymptote as the upstream tool count rises (the
baseline grows linearly while the proxy context stays fixed) — always quote the
corpus size alongside a percentage. Reproduce with `go run ./bench/cmd/bench`.

### Scoring rubric — token reduction

- **Tool universe**: the frozen Spec 065 snapshot
`specs/065-evaluation-foundation/datasets/corpus_v1.tools.json` — 45 tools
across 7 no-auth reference servers. Frozen + versioned so scoring never runs
against a drifting corpus (CN-002).
- **Tokenizer**: `tiktoken cl100k_base`, a widely-used reproducible BPE
(already a repo dependency). It is a **model-agnostic estimator**; exact
counts for a specific pinned model (e.g. Claude) will differ, but the
*relative* savings between modes are stable.
- **Proxy-mode tools**: the *complete* per-mode catalog, derived from the live
server builders — discovery, the call-tool variants, `code_execution`, **and
the shared management tool set** (`upstream_servers`, `quarantine_security`,
`search_servers`, `list_registries`). Nothing the agent actually sees is
dropped from the proxy cost.
- **Cost of a tool**: `name + "\n" + description`. JSON input schemas are
excluded **uniformly** across all modes (the committed corpus snapshot does
not carry schemas).
- **Savings** for a mode `m`: `1 - tokens(m) / tokens(baseline)`.

### Known limitations (read before quoting a number)

- **Schemas excluded — direction is not clean.** Input schemas are dropped from
*both* sides. The 45 baseline tools lose their schemas, but so do the proxy
modes' management tools (e.g. `upstream_servers` carries a large multi-field
schema). So the name+description-only number is **not** unambiguously
conservative — it is its own well-defined metric. The live run below adds full
schemas from `GET /api/v1/tools` for the exact headline number; quote that for
marketing, not this offline estimate.
- **Savings scale with tool count.** The 45-tool reference corpus is small; real
deployments expose hundreds–thousands of tools, where the baseline grows
linearly and the proxy context stays fixed, so savings approach the asymptote.
Quote the corpus size alongside any percentage.
- **`cl100k_base` ≠ the pinned model's tokenizer.** Pinning the exact tokenizer
for the headline model is tracked as a follow-up (see "What is scoped but not
yet built" below).

## What is scoped but not yet built (follow-ups)

These require decisions and/or other roles, so they are tracked as child issues
rather than landed here:

- **Live run with full schemas + accuracy + latency** — boot mcpproxy over the
  Spec 065 `snapshot-servers.config.json` (see `docker-compose.yml`), pull
  `GET /api/v1/tools` for exact schemas, and:
  - **Accuracy**: replay the Spec 065 retrieval golden set
    (`retrieval_golden_v1.json`) through `retrieve_tools` and score Recall@k /
    MRR / nDCG (deterministic, no LLM) — reuses the D1 scorer.
  - **Latency**: measure proxy-side `retrieve_tools` search latency vs. the
    fixed cost of loading all tools.
- **End-to-end task success with a pinned LLM** — requires a pinned model + an
  LLM-call budget; this is the only part that costs spend.
- **CI publish-on-release-tag → public static dashboard** — Release/DevOps lane.

## Dataset sources & provenance

- Tool corpus + retrieval golden set: Spec 065 frozen datasets
(`specs/065-evaluation-foundation/datasets/`), generated from 7 permissively
reachable no-auth reference servers (filesystem, git, memory, sqlite, fetch,
time, sequential-thinking).
- Proxy + management tool definitions: derived at run time from the live server
tool builders (`internal/server/mcp_routing.go` →
`buildCallToolModeTools` / `buildCodeExecModeTools`, exposed via
`internal/server.ProxyModeToolDefs`). No hand-maintained fixture — the
benchmark cannot drift from the tools the proxy actually serves.

## Reproducible live run (skeleton)

`docker-compose.yml` boots mcpproxy over the frozen reference-server config so
the corpus and live tool list are reproducible across machines. Wiring the live
accuracy/latency scorers into it is the follow-up above.

## Reviewer contact

Methodology questions / disputes: open an issue in `smart-mcp-proxy/mcpproxy-go`
and tag the maintainers, or comment on the roadmap benchmark ticket (MCP-42).
52 changes: 52 additions & 0 deletions bench/cmd/bench/main.go
@@ -0,0 +1,52 @@
// Command bench runs the mcpproxy token-reduction benchmark over a frozen tool
// corpus and writes a JSON report plus a static HTML dashboard.
//
// Usage:
//
//	go run ./bench/cmd/bench [-corpus PATH] [-out DIR] [-encoding NAME]
//
// With no flags it scores the committed Spec 065 frozen corpus and writes the
// reports to bench/results/ (gitignored — reports are never committed, per the
// Spec 065 CN-003 repo rule).
package main

import (
	"flag"
	"fmt"
	"log"
	"os"

	"github.com/smart-mcp-proxy/mcpproxy-go/bench"
)

func main() {
	corpusPath := flag.String("corpus", "specs/065-evaluation-foundation/datasets/corpus_v1.tools.json", "path to the frozen tool corpus snapshot")
	outDir := flag.String("out", "bench/results", "output directory for report.json and dashboard.html")
	encoding := flag.String("encoding", bench.DefaultEncoding, "tiktoken encoding name")
	flag.Parse()

	tk, err := bench.NewTokenizer(*encoding)
	if err != nil {
		log.Fatalf("bench: %v", err)
	}
	corpus, err := bench.LoadCorpus(*corpusPath)
	if err != nil {
		log.Fatalf("bench: %v", err)
	}

	report := bench.ComputeReport(tk, corpus)
	jsonPath, htmlPath, err := report.WriteReports(*outDir)
	if err != nil {
		log.Fatalf("bench: %v", err)
	}

	fmt.Fprintf(os.Stdout, "mcpproxy token-reduction benchmark (corpus %s, %d tools, %s)\n", report.CorpusVersion, report.CorpusTools, report.Encoding)
	for _, m := range report.Modes {
		if m.Mode == bench.ModeBaseline {
			fmt.Fprintf(os.Stdout, " %-16s %6d tokens (%d tools) baseline\n", m.Mode, m.Tokens, m.ContextTools)
			continue
		}
		fmt.Fprintf(os.Stdout, " %-16s %6d tokens (%d tools) %.1f%% fewer tokens\n", m.Mode, m.Tokens, m.ContextTools, m.SavingsRatio*100)
	}
	fmt.Fprintf(os.Stdout, "wrote %s and %s\n", jsonPath, htmlPath)
}
37 changes: 37 additions & 0 deletions bench/docker-compose.yml
@@ -0,0 +1,37 @@
# Reproducible benchmark substrate (skeleton).
#
# Boots mcpproxy over the frozen Spec 065 reference-server config so the tool
# corpus and live tool list are identical across machines. The live
# accuracy/latency scorers (see bench/README.md "follow-ups") attach to this.
#
# Usage:
# docker compose -f bench/docker-compose.yml up --build
# # then, against the running proxy on 127.0.0.1:8092:
# # GET /api/v1/tools -> full tool defs (with schemas) for the live token run
# # retrieve_tools -> Recall@k accuracy over retrieval_golden_v1.json
#
# The committed corpus_v1 snapshot was frozen from exactly this config
# (specs/065-evaluation-foundation/datasets/README.md), so a live snapshot here
# reproduces it (modulo upstream-server version drift — pin images before
# publishing headline numbers).
services:
  mcpproxy:
    build:
      context: ..
      dockerfile: Dockerfile
    command:
      - serve
      - --config=/data/snapshot-servers.config.json
      - --data-dir=/data/state
      - --listen=0.0.0.0:8092
    environment:
      MCPPROXY_API_KEY: eval-corpus-snapshot
    ports:
      - "127.0.0.1:8092:8092"
    volumes:
      # The frozen, servable reference-server config (7 no-auth servers).
      - ../specs/065-evaluation-foundation/datasets/snapshot-servers.config.json:/data/snapshot-servers.config.json:ro
      - bench-state:/data/state

volumes:
  bench-state:
40 changes: 40 additions & 0 deletions bench/proxytools.go
@@ -0,0 +1,40 @@
package bench

import (
	"github.com/smart-mcp-proxy/mcpproxy-go/internal/config"
	"github.com/smart-mcp-proxy/mcpproxy-go/internal/server"
)

// ProxyToolsForMode returns the built-in mcpproxy proxy + management tool
// definitions that occupy the agent's context window in the given routing mode.
//
// The catalog is derived directly from the live server tool builders
// (internal/server.ProxyModeToolDefs → buildCallToolModeTools /
// buildCodeExecModeTools in internal/server/mcp_routing.go). This is the single
// source of truth: both routing modes append the shared management tool set
// (upstream_servers, quarantine_security, search_servers, list_registries), so
// deriving from the builders guarantees the benchmark counts the real per-mode
// context cost. It cannot drift from production and so cannot re-introduce the
// undercount that inflated the headline savings (MCP-3161).
func ProxyToolsForMode(mode string) []Tool {
	var routingMode string
	switch mode {
	case ModeCodeExecution:
		routingMode = config.RoutingModeCodeExecution
	case ModeRetrieveTools:
		routingMode = config.RoutingModeRetrieveTools
	default:
		return nil
	}

	defs := server.ProxyModeToolDefs(routingMode)
	out := make([]Tool, 0, len(defs))
	for _, d := range defs {
		out = append(out, Tool{
			ToolID:      "mcpproxy:" + d.Name,
			Name:        d.Name,
			Description: d.Description,
		})
	}
	return out
}
105 changes: 105 additions & 0 deletions bench/report.go
@@ -0,0 +1,105 @@
package bench

import (
	"encoding/json"
	"fmt"
	"html/template"
	"os"
	"path/filepath"
)

// WriteJSON writes the report as indented JSON to path.
func (r *Report) WriteJSON(path string) error {
	data, err := json.MarshalIndent(r, "", " ")
	if err != nil {
		return fmt.Errorf("marshal report: %w", err)
	}
	if err := os.WriteFile(path, append(data, '\n'), 0o644); err != nil {
		return fmt.Errorf("write %q: %w", path, err)
	}
	return nil
}

// WriteHTML renders the report as a self-contained static dashboard. The output
// is a single file with no external assets so it can be published as-is to a
// static host (CI release-tag publishing is tracked as a follow-up).
func (r *Report) WriteHTML(path string) error {
	tmpl, err := template.New("dashboard").Funcs(template.FuncMap{
		"pct": func(f float64) string { return fmt.Sprintf("%.1f%%", f*100) },
	}).Parse(dashboardHTML)
	if err != nil {
		return fmt.Errorf("parse template: %w", err)
	}
	f, err := os.Create(path)
	if err != nil {
		return fmt.Errorf("create %q: %w", path, err)
	}
	if err := tmpl.Execute(f, r); err != nil {
		f.Close() // best effort; the render error is the one worth reporting
		return fmt.Errorf("render dashboard: %w", err)
	}
	// Close explicitly so a failed flush is reported instead of silently dropped.
	if err := f.Close(); err != nil {
		return fmt.Errorf("close %q: %w", path, err)
	}
	return nil
}

// WriteReports writes both report.json and dashboard.html into dir.
func (r *Report) WriteReports(dir string) (jsonPath, htmlPath string, err error) {
	if err = os.MkdirAll(dir, 0o755); err != nil {
		return "", "", fmt.Errorf("mkdir %q: %w", dir, err)
	}
	jsonPath = filepath.Join(dir, "report.json")
	htmlPath = filepath.Join(dir, "dashboard.html")
	if err = r.WriteJSON(jsonPath); err != nil {
		return "", "", err
	}
	if err = r.WriteHTML(htmlPath); err != nil {
		return "", "", err
	}
	return jsonPath, htmlPath, nil
}

const dashboardHTML = `<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>mcpproxy benchmark — token reduction</title>
<style>
:root { color-scheme: light dark; }
body { font: 16px/1.5 system-ui, sans-serif; max-width: 880px; margin: 2rem auto; padding: 0 1rem; }
h1 { margin-bottom: .25rem; }
.sub { opacity: .7; margin-top: 0; }
table { border-collapse: collapse; width: 100%; margin: 1.5rem 0; }
th, td { padding: .6rem .8rem; text-align: right; border-bottom: 1px solid #8884; }
th:first-child, td:first-child { text-align: left; }
.save { font-weight: 600; color: #1a8f3c; }
code { background: #8881; padding: .1rem .35rem; border-radius: 4px; }
.notes { font-size: .9rem; opacity: .8; }
.notes li { margin: .3rem 0; }
</style>
</head>
<body>
<h1>mcpproxy benchmark</h1>
<p class="sub">Token cost of loading tools into an agent's context, by routing mode.</p>
<p>Corpus <code>{{.CorpusVersion}}</code> &middot; {{.CorpusTools}} tools &middot; encoding <code>{{.Encoding}}</code></p>
<table>
<thead>
<tr><th>Mode</th><th>Tools in context</th><th>Context tokens</th><th>Savings vs. baseline</th></tr>
</thead>
<tbody>
{{range .Modes}}
<tr>
<td><code>{{.Mode}}</code></td>
<td>{{.ContextTools}}</td>
<td>{{.Tokens}}</td>
<td class="save">{{if eq .Mode "baseline"}}&mdash;{{else}}{{pct .SavingsRatio}}{{end}}</td>
</tr>
{{end}}
</tbody>
</table>
<h2>Methodology notes</h2>
<ul class="notes">
{{range .Notes}}<li>{{.}}</li>{{end}}
</ul>
</body>
</html>
`