CHANGELOG.md (21 additions, 0 deletions)

@@ -7,6 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- **SWE-bench Pro benchmark**: Multi-language benchmark support (Python, Go, TypeScript, JavaScript) with 731 instances across 11 repositories
- DockerHub-hosted pre-built images via `dockerhub_tag` field
- Official run scripts from `scaleapi/SWE-bench_Pro-os` for per-repo test infrastructure
- Filter by language or repository substring with `--filter-category`
- **Preflight check command**: `mcpbr preflight` validates golden patches pass all tests before evaluation
- Concurrent validation with configurable parallelism (`--max-concurrent`)
- Fail-fast mode (`--fail-fast`) for quick CI checks
- Per-instance and aggregate reporting with language breakdown
- **Case-insensitive test list field access**: `get_test_list_field()` helper supports both SWE-bench (`FAIL_TO_PASS`) and SWE-bench Pro (`fail_to_pass`) conventions
- **Docker image override support**: `_image_override` task field allows benchmarks to specify custom Docker images
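
The case-insensitive helper above could look roughly like the following sketch. The function name `get_test_list_field()` and the two field conventions come from the changelog entry; the JSON-string fallback and the exact lookup order are assumptions, not the project's confirmed implementation.

```python
import json
from typing import Any


def get_test_list_field(task: dict[str, Any], name: str) -> list[str]:
    """Fetch a test-list field regardless of key casing.

    SWE-bench stores test lists under upper-case keys (``FAIL_TO_PASS``)
    while SWE-bench Pro uses lower-case (``fail_to_pass``). Some datasets
    also serialize the list as a JSON string rather than a real list
    (an assumption here, not confirmed by the changelog).
    """
    for key in (name, name.upper(), name.lower()):
        if key in task:
            value = task[key]
            if isinstance(value, str):
                # JSON-encoded list fallback.
                value = json.loads(value)
            return list(value)
    return []


# Both conventions resolve to the same list:
swebench_task = {"FAIL_TO_PASS": '["tests/test_a.py::test_x"]'}
pro_task = {"fail_to_pass": ["tests/test_a.py::test_x"]}
assert get_test_list_field(swebench_task, "fail_to_pass") == ["tests/test_a.py::test_x"]
assert get_test_list_field(pro_task, "FAIL_TO_PASS") == ["tests/test_a.py::test_x"]
```

One lookup helper keeps call sites benchmark-agnostic, so adding a third dataset convention only touches this function.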

### Changed

- **SWE-bench Pro test execution**: Replaced custom language-specific test command building (jest/mocha/go test) with official `run_script.sh` + `parser.py` from `scaleapi/SWE-bench_Pro-os`
- Each of the 11 repos has unique test infrastructure (e.g., Redis for NodeBB, `ansible-test` for ansible, custom runners for tutanota) that the official scripts handle correctly
- Parser runs locally on the host, avoiding Python dependency in Go/JS/TS container images
- Scripts repo is shallow-cloned and cached in `~/.cache/mcpbr/swebench-pro-scripts/`
- Falls back to standard `evaluate_patch()` for Python tasks without official scripts
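
The shallow-clone-and-cache step described above can be sketched as follows. The repo name and the `~/.cache/mcpbr/swebench-pro-scripts/` location come from the changelog; the function name `ensure_scripts` and its parameters are hypothetical, added here only so the sketch is testable.

```python
import subprocess
from pathlib import Path

SCRIPTS_REPO = "https://github.com/scaleapi/SWE-bench_Pro-os.git"
CACHE_DIR = Path.home() / ".cache" / "mcpbr" / "swebench-pro-scripts"


def ensure_scripts(repo_url: str = SCRIPTS_REPO, cache_dir: Path = CACHE_DIR) -> Path:
    """Shallow-clone the official run scripts once; reuse the cached copy after."""
    if not (cache_dir / ".git").exists():
        cache_dir.parent.mkdir(parents=True, exist_ok=True)
        # --depth 1 keeps the clone small; only the latest snapshot is needed.
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(cache_dir)],
            check=True,
        )
    return cache_dir
```

Caching on the host also keeps `parser.py` out of the container images, which matters for the Go/JS/TS images that ship without Python.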

## [0.14.0] - 2026-02-13

### Added
README.md (5 additions, 2 deletions)

@@ -55,7 +55,10 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
- **Real GitHub issues** from SWE-bench (not toy examples)
- **Reproducible results** via Docker containers with pinned dependencies

- > Read the full origin story: **[Why I Built mcpbr](https://greynewell.com/blog/why-i-built-mcpbr/)** — the problem, the approach, and where the project is headed.
+ ## Blog
+
+ - [SWE-bench Verified Is Broken: 5 Things I Found in the Source Code](https://greynewell.com/blog/swe-bench-verified-broken-5-things-source-code/)
+ - [SWE-bench Tests Run 6x Faster on ARM64 with Native Containers](https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/)

## Research Paper

@@ -1536,4 +1539,4 @@ MIT - see [LICENSE](LICENSE) for details.

---

- Built by [Grey Newell](https://greynewell.com) | [Why I Built mcpbr](https://greynewell.com/blog/why-i-built-mcpbr/) | [About](https://mcpbr.org/about/)
+ Built by [Grey Newell](https://greynewell.com) | [About](https://mcpbr.org/about/)
site/pages/about.njk (2 additions, 2 deletions)

@@ -49,7 +49,7 @@ headExtra: |
<p>mcpbr was created by <a href="https://greynewell.com">Grey Newell</a> after identifying a critical gap in the MCP ecosystem: <strong>no tool existed to measure whether an MCP server actually made an AI agent better at its job.</strong></p>
<p>Existing coding benchmarks like SWE-bench measured raw language model capabilities. MCP server developers relied on anecdotal evidence and demo videos. There was no way to answer the fundamental question: <em>does adding this MCP server to an agent improve its performance on real tasks?</em></p>
<p>mcpbr was built to answer that question with hard data.</p>
- <blockquote><p>"No available tool allowed users to easily measure the performance improvement of introducing their MCP server to an agent."</p><p>&mdash; <a href="https://greynewell.com/blog/why-i-built-mcpbr/">Grey Newell, "Why I Built mcpbr"</a></p></blockquote>
+ <blockquote><p>"No available tool allowed users to easily measure the performance improvement of introducing their MCP server to an agent."</p><p>&mdash; Grey Newell</p></blockquote>

<h2>The Problem mcpbr Solves</h2>
<p>Before mcpbr, MCP server evaluation looked like this:</p>
@@ -84,7 +84,7 @@ headExtra: |
<tr><td>GitHub</td><td><a href="https://github.com/supermodeltools/mcpbr">github.com/greynewell/mcpbr</a></td></tr>
<tr><td>PyPI</td><td><a href="https://pypi.org/project/mcpbr/">pypi.org/project/mcpbr</a></td></tr>
<tr><td>npm</td><td><a href="https://www.npmjs.com/package/mcpbr-cli">npmjs.com/package/mcpbr-cli</a></td></tr>
- <tr><td>Blog Post</td><td><a href="https://greynewell.com/blog/why-i-built-mcpbr/">Why I Built mcpbr</a></td></tr>
<tr><td>Creator</td><td><a href="https://greynewell.com">greynewell.com</a></td></tr>
<tr><td>SchemaFlux</td><td><a href="https://schemaflux.dev">schemaflux.dev</a></td></tr>
<tr><td>License</td><td><a href="https://github.com/supermodeltools/mcpbr/blob/main/LICENSE">MIT</a></td></tr>