CHANGELOG.md (21 additions, 0 deletions)

@@ -7,6 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- **SWE-bench Pro benchmark**: Multi-language benchmark support (Python, Go, TypeScript, JavaScript) with 731 instances across 11 repositories
- DockerHub-hosted pre-built images via `dockerhub_tag` field
- Official run scripts from `scaleapi/SWE-bench_Pro-os` for per-repo test infrastructure
- Filter by language or repository substring with `--filter-category`
- **Preflight check command**: `mcpbr preflight` validates golden patches pass all tests before evaluation
- Concurrent validation with configurable parallelism (`--max-concurrent`)
- Fail-fast mode (`--fail-fast`) for quick CI checks
- Per-instance and aggregate reporting with language breakdown
- **Case-insensitive test list field access**: `get_test_list_field()` helper supports both SWE-bench (`FAIL_TO_PASS`) and SWE-bench Pro (`fail_to_pass`) conventions
- **Docker image override support**: `_image_override` task field allows benchmarks to specify custom Docker images
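
The case-insensitive helper above could look roughly like the following sketch. The function name `get_test_list_field()` and the two field conventions come from the changelog entry; the JSON-string fallback and the exact lookup order are assumptions, not the project's confirmed implementation.

```python
import json
from typing import Any


def get_test_list_field(task: dict[str, Any], name: str) -> list[str]:
    """Fetch a test-list field regardless of key casing.

    SWE-bench stores test lists under upper-case keys (``FAIL_TO_PASS``)
    while SWE-bench Pro uses lower-case (``fail_to_pass``). Some datasets
    also serialize the list as a JSON string rather than a real list
    (an assumption here, not confirmed by the changelog).
    """
    for key in (name, name.upper(), name.lower()):
        if key in task:
            value = task[key]
            if isinstance(value, str):
                # JSON-encoded list fallback.
                value = json.loads(value)
            return list(value)
    return []


# Both conventions resolve to the same list:
swebench_task = {"FAIL_TO_PASS": '["tests/test_a.py::test_x"]'}
pro_task = {"fail_to_pass": ["tests/test_a.py::test_x"]}
assert get_test_list_field(swebench_task, "fail_to_pass") == ["tests/test_a.py::test_x"]
assert get_test_list_field(pro_task, "FAIL_TO_PASS") == ["tests/test_a.py::test_x"]
```

One lookup helper keeps call sites benchmark-agnostic, so adding a third dataset convention only touches this function.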

### Changed

- **SWE-bench Pro test execution**: Replaced custom language-specific test command building (jest/mocha/go test) with official `run_script.sh` + `parser.py` from `scaleapi/SWE-bench_Pro-os`
- Each of the 11 repos has unique test infrastructure (e.g., Redis for NodeBB, `ansible-test` for ansible, custom runners for tutanota) that the official scripts handle correctly
- Parser runs locally on the host, avoiding Python dependency in Go/JS/TS container images
- Scripts repo is shallow-cloned and cached in `~/.cache/mcpbr/swebench-pro-scripts/`
- Falls back to standard `evaluate_patch()` for Python tasks without official scripts
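
The shallow-clone-and-cache step described above can be sketched as follows. The repo name and the `~/.cache/mcpbr/swebench-pro-scripts/` location come from the changelog; the function name `ensure_scripts` and its parameters are hypothetical, added here only so the sketch is testable.

```python
import subprocess
from pathlib import Path

SCRIPTS_REPO = "https://github.com/scaleapi/SWE-bench_Pro-os.git"
CACHE_DIR = Path.home() / ".cache" / "mcpbr" / "swebench-pro-scripts"


def ensure_scripts(repo_url: str = SCRIPTS_REPO, cache_dir: Path = CACHE_DIR) -> Path:
    """Shallow-clone the official run scripts once; reuse the cached copy after."""
    if not (cache_dir / ".git").exists():
        cache_dir.parent.mkdir(parents=True, exist_ok=True)
        # --depth 1 keeps the clone small; only the latest snapshot is needed.
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(cache_dir)],
            check=True,
        )
    return cache_dir
```

Caching on the host also keeps `parser.py` out of the container images, which matters for the Go/JS/TS images that ship without Python.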

## [0.14.0] - 2026-02-13

### Added
README.md (5 additions, 2 deletions)

@@ -55,7 +55,10 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
- **Real GitHub issues** from SWE-bench (not toy examples)
- **Reproducible results** via Docker containers with pinned dependencies

- > Read the full origin story: **[Why I Built mcpbr](https://greynewell.com/blog/why-i-built-mcpbr/)** — the problem, the approach, and where the project is headed.
+ ## Blog
+
+ - [SWE-bench Verified Is Broken: 5 Things I Found in the Source Code](https://greynewell.com/blog/swe-bench-verified-broken-5-things-source-code/)
+ - [SWE-bench Tests Run 6x Faster on ARM64 with Native Containers](https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/)

## Research Paper

@@ -1536,4 +1539,4 @@ MIT - see [LICENSE](LICENSE) for details.

---

- Built by [Grey Newell](https://greynewell.com) | [Why I Built mcpbr](https://greynewell.com/blog/why-i-built-mcpbr/) | [About](https://mcpbr.org/about/)
+ Built by [Grey Newell](https://greynewell.com) | [About](https://mcpbr.org/about/)
site/pages/about.njk (2 additions, 2 deletions)

@@ -49,7 +49,7 @@ headExtra: |
<p>mcpbr was created by <a href="https://greynewell.com">Grey Newell</a> after identifying a critical gap in the MCP ecosystem: <strong>no tool existed to measure whether an MCP server actually made an AI agent better at its job.</strong></p>
<p>Existing coding benchmarks like SWE-bench measured raw language model capabilities. MCP server developers relied on anecdotal evidence and demo videos. There was no way to answer the fundamental question: <em>does adding this MCP server to an agent improve its performance on real tasks?</em></p>
<p>mcpbr was built to answer that question with hard data.</p>
- <blockquote><p>"No available tool allowed users to easily measure the performance improvement of introducing their MCP server to an agent."</p><p>&mdash; <a href="https://greynewell.com/blog/why-i-built-mcpbr/">Grey Newell, "Why I Built mcpbr"</a></p></blockquote>
+ <blockquote><p>"No available tool allowed users to easily measure the performance improvement of introducing their MCP server to an agent."</p><p>&mdash; Grey Newell</p></blockquote>

<h2>The Problem mcpbr Solves</h2>
<p>Before mcpbr, MCP server evaluation looked like this:</p>
@@ -84,7 +84,7 @@ headExtra: |
<tr><td>GitHub</td><td><a href="https://github.com/supermodeltools/mcpbr">github.com/greynewell/mcpbr</a></td></tr>
<tr><td>PyPI</td><td><a href="https://pypi.org/project/mcpbr/">pypi.org/project/mcpbr</a></td></tr>
<tr><td>npm</td><td><a href="https://www.npmjs.com/package/mcpbr-cli">npmjs.com/package/mcpbr-cli</a></td></tr>
- <tr><td>Blog Post</td><td><a href="https://greynewell.com/blog/why-i-built-mcpbr/">Why I Built mcpbr</a></td></tr>
<tr><td>Creator</td><td><a href="https://greynewell.com">greynewell.com</a></td></tr>
<tr><td>SchemaFlux</td><td><a href="https://schemaflux.dev">schemaflux.dev</a></td></tr>
<tr><td>License</td><td><a href="https://github.com/supermodeltools/mcpbr/blob/main/LICENSE">MIT</a></td></tr>