All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Remove GT and analysis caching from SupermodelBenchmark (supermodeltools/supermodel-public-api#714): Both caches had no invalidation mechanism, causing stale data to persist silently across runs. The GT cache bypassed all fixes applied to `extract_ground_truth` (FP filters, pattern additions). The analysis cache was keyed by zip hash, so server-side idempotency key version bumps did not bust it: the server was never reached and old results were served indefinitely. Both caches are now removed. GT extraction is a single GitHub API call (cheap). The Supermodel API handles server-side deduplication via the idempotency key. Also removes the `DEFAULT_GT_DIR` constant, the `ground_truth_dir` constructor parameter, and the `cached_analysis` task config field, all of which existed solely to support the now-removed caching paths.
- Dead code benchmark: filter feature-removal false positives from ground truth (supermodeltools/supermodel-public-api#714): The ground truth extractor now applies the existing `_is_feature_removal_fp` filter (which was implemented but never called). Symbols deleted in a PR that are also imported by other files deleted in the same PR are excluded from GT; they were live code co-removed with their consumers, not dead code. Genuinely orphaned symbols with no deleted importer are kept. This fixes 0-recall scores for PRs like n8n #23572 and prisma #28485 where whole files were removed as part of a feature deletion.
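The feature-removal rule described above can be pictured roughly as follows. This is a minimal sketch, not the project's `_is_feature_removal_fp` implementation: the `DeletedSymbol` type, the `deleted_imports` mapping, and both function names are hypothetical, and it assumes the PR diff has already been parsed into deleted symbols plus the imports made by each deleted file. Matching by bare symbol name is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletedSymbol:
    """A symbol removed by the PR (hypothetical shape, for illustration)."""
    name: str
    file: str


def is_feature_removal_fp(symbol, deleted_imports):
    """Return True if another file deleted in the same PR imported `symbol`.

    `deleted_imports` maps each deleted file path to the set of symbol
    names it imported before deletion (an assumed, pre-parsed input).
    """
    return any(
        symbol.name in imported
        for path, imported in deleted_imports.items()
        if path != symbol.file  # the defining file itself does not count
    )


def filter_ground_truth(symbols, deleted_imports):
    """Keep only symbols with no deleted importer (genuinely orphaned)."""
    return [s for s in symbols if not is_feature_removal_fp(s, deleted_imports)]


deleted = [
    DeletedSymbol("renderWidget", "src/widget.ts"),   # imported by deleted page.ts
    DeletedSymbol("legacyHelper", "src/helpers.ts"),  # nothing imported it
]
imports_of_deleted_files = {
    "src/page.ts": {"renderWidget"},
    "src/widget.ts": set(),
    "src/helpers.ts": set(),
}
kept = filter_ground_truth(deleted, imports_of_deleted_files)
print([s.name for s in kept])  # ['legacyHelper']
```

In this example `renderWidget` is dropped from ground truth because its consumer was deleted alongside it, while `legacyHelper` stays because no deleted file imported it.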
0.14.0 - 2026-02-13
- Strict code quality enforcement: Expanded Ruff linting rules (B, UP, SIM, RUF, C4, PIE, PT,
ASYNC, S, T20) and added mypy type checking with Pydantic plugin across all 134 source files
- Added mypy pre-commit hook and CI type-check job
- Zero ruff violations (72 fixed across 36 files)
- Zero mypy errors (267 fixed across 39 files)
- All 4293 tests pass with no regressions
- 72 ruff lint violations across 36 files: B904 (raise-without-from), SIM102/SIM105/SIM115/SIM116/SIM117 (simplifications), RUF059/RUF003 (unused vars, Unicode), B007 (unused loop vars), PT019 (pytest fixtures), S-rules (security: S310 URL validation, S108 temp dirs, S311 non-crypto random, S110 exception handling, S608 SQL, S112 try-except-continue, S104 binding, S602 shell)
- 267 mypy type errors across 39 files: union-attr (128), assignment (33), no-any-return (28), arg-type (23), and others. Fixed with proper type narrowing, assertions, annotations, and type-safe patterns across infrastructure providers (GCP, AWS, Azure, Cloudflare, K8s), core modules (harness, CLI, docker_env), and utility modules (providers, notifications, benchmarks)
- Zombie Docker containers (#439): Containers left running indefinitely after task completion (observed 15+ hours in a 500-task eval). `TaskEnvironment.cleanup()` now retries `container.remove()` up to 3 times with backoff, logs warnings instead of silently swallowing errors, and handles `docker.errors.NotFound` gracefully. Successful cleanup removes the container from the tracking list; failed cleanup keeps it for `cleanup_all_sync` to retry at exit. New `cleanup_stale_session_containers()` method removes orphans from previous sessions at startup.
- setup_command cache not found by MCP server (#440): `basename("/workspace")` returned `"workspace"`, which didn't match the repo name used as cache key, causing 5/446 tasks (1.1%) to time out with 0 iterations. Added `repo`, `base_commit` fields and `repo_name` property to `TaskEnvironment`. Injects `MCPBR_REPO`, `MCPBR_REPO_NAME`, `MCPBR_BASE_COMMIT`, `MCPBR_INSTANCE_ID` env vars into `run_setup_command`, `_solve_in_docker` env file, and `.mcp.json` env block so the MCP server can locate cached data by repo name.
- Config placeholder expansion (#440): `{repo}`, `{repo_name}`, `{base_commit}`, and `{instance_id}` placeholders in `mcp_server.args` and `mcp_server.setup_command` are now expanded automatically, in addition to the existing `{workdir}`
- Config `task_ids` ignored for local runs (#458): Falls back to `config.task_ids` when no CLI `--task` flags are provided
- Progress output unparseable when piped (#459): Emits plain-text `[START]`/`[DONE]`/`[ERROR]`/`[SKIP]`/`[SUMMARY]` lines to stderr when not connected to a TTY
- Containers OOM-kill each other under concurrency (#460): Adds `mem_limit=4g`/`memswap_limit=6g` defaults to container creation
- Conda testbed environment not activated for agent (#457): Prebuilt SWE-bench Docker images install project dependencies in a conda `testbed` environment, but the agent ran with system Python which lacked these packages. The agent's self-verification attempts (e.g. `python -m pytest`) failed with `ModuleNotFoundError`. Now prepends conda testbed activation to the env file sourced before launching Claude, so the agent and its Bash tool children inherit the correct PATH
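The retry behaviour described for the zombie-container fix (#439) might look something like the sketch below. It is an illustration under stated assumptions, not the code in `TaskEnvironment.cleanup()`: the function name `remove_with_retry`, the doubling delay schedule, and the local `NotFound` class (a stand-in for `docker.errors.NotFound` so the example runs without Docker) are all hypothetical.

```python
import logging
import time

logger = logging.getLogger("cleanup_sketch")


class NotFound(Exception):
    """Stand-in for docker.errors.NotFound so the sketch runs without Docker."""


def remove_with_retry(container, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try container.remove(force=True) up to `attempts` times.

    Returns True when the container is gone (removed or already missing),
    False when every attempt failed and the caller should retry later.
    """
    for attempt in range(1, attempts + 1):
        try:
            container.remove(force=True)
            return True
        except NotFound:
            # Already removed elsewhere: treat as success.
            return True
        except Exception as exc:  # log instead of silently swallowing
            logger.warning("remove attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # assumed schedule: 1s, 2s
    return False


class FlakyContainer:
    """Fake container that fails twice, then succeeds (for demonstration)."""

    def __init__(self):
        self.calls = 0

    def remove(self, force=False):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("device busy")


tracked = [FlakyContainer()]
delays = []  # record requested sleeps instead of actually waiting
# Drop containers whose removal succeeded; keep failures for a later retry.
tracked = [c for c in tracked if not remove_with_retry(c, sleep=delays.append)]
print(tracked, delays)  # [] [1.0, 2.0]
```

Returning a boolean lets the caller decide what stays in the tracking list, which mirrors the "keep failed containers for a retry at exit" behaviour described above.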
0.12.6 - 2026-02-13
- Updated package author from "mcpbr Contributors" to "Grey Newell" across PyPI, npm, and Claude plugin metadata
- Changed homepage URL from GitHub to mcpbr.org in all package registries
- Added author website URL to PyPI project URLs
0.12.5 - 2026-02-12
- MCP native tool blocking (#450): New `disallowed_native_tools` and `system_prompt_append` fields in `mcp_server` config. When set, `--disallowedTools` strips specified native tools (e.g., Read, Grep, Glob) from the API tool definitions, forcing the agent to use MCP equivalents. `--append-system-prompt` can guide the agent toward MCP tools. Essential for meaningful MCP vs native tool benchmarking.
- Claude Code version pinning (#453): New `claude_code_version` config field lets you pin a specific Claude Code version inside Docker containers (e.g., `claude_code_version: "2.1.37"`). Useful for version comparison experiments. Older versions that lack `--max-turns` support (pre-2.1.x) are automatically detected and the flag is omitted.
- Eval started notifications (#413): Notifications are now sent when an evaluation begins, including benchmark name, model, task count, concurrency, and infrastructure mode
- Infrastructure lifecycle notifications (#414): `infra_provisioned` and `infra_teardown` events notify when cloud VMs are provisioned (with SSH command) and torn down
- Progress notifications (#415): Periodic progress updates during long-running evaluations, configurable via `notify_progress_interval` (every N tasks) and `notify_progress_time_minutes` (every N minutes). Includes completed/total counts, elapsed time, ETA, and running cost
- Failure notifications (#416): Immediate notification when an evaluation fails with error details, completed task count, and last successful task ID
- Cryptography dependency update (#447): Bumped `cryptography` from 46.0.4 to 46.0.5 to address CVE-2026-26007 (malicious public key could reveal private key portions on binary elliptic curves)
- Sandbox seccomp default-deny allowlist (#417): Strict sandbox mode now uses a seccomp allowlist (default-deny) instead of a blocklist, preventing bypass via unlisted syscalls
- Shell injection prevention in infrastructure providers (#421, #422): `python_version` and env key names are now validated with strict regex before interpolation into shell commands in AWS and GCP providers. SSH firewall rules raise `RuntimeError` instead of falling back to `0.0.0.0/0` when IP detection fails.
- Audit HMAC chain tamper protection (#419): Audit log entries now use a secret HMAC key and chain on recomputed checksums, making the tamper-evident chain resistant to forgery
- API authentication, CORS headers, and limit validation (#420): REST API now supports bearer token authentication, sends security headers (`X-Content-Type-Options: nosniff`), warns when binding to `0.0.0.0`, and validates the `?limit=` query parameter
- Plugin registry SSRF prevention (#428): Registry URL scheme is now validated (must be `https://`) and response body size is limited to prevent SSRF and resource exhaustion
- Prompt scanner multi-match fix (#423): Prompt security scanner now checks all regex matches per pattern instead of stopping at the first match, preventing evasion by placing benign content before malicious payloads
- Cloudflare Worker auth enforcement (#426): Worker deployments now require authentication by default. A secure token is auto-generated if none is provided in the config.
- Safer K8s DinD defaults (#426): Docker-in-Docker sidecar uses rootless Docker image (`docker:27-dind-rootless`) instead of `--privileged` by default. Privileged mode can be explicitly enabled via `dind_privileged: true`.
- Sandbox strict mode validation (#427): `validate_sandbox()` now raises `ValueError` in strict mode when container settings don't match the security profile, instead of only logging warnings
- Default disk size increased to 1 TB (#434): `disk_gb` default changed from 250 to 1000 across AWS, GCP, and Azure providers to accommodate swe-bench-verified evaluations
- Python 3.13 and 3.14 CI support: Added Python 3.13 and 3.14 to the CI test matrix and declared support in pyproject.toml classifiers
- Azure SSH-resilient eval execution (#446): Remote evaluation on Azure VMs now runs via `nohup setsid` so the process survives SSH disconnects (laptop sleep, network hiccups). The local orchestrator polls the log file and automatically reconnects if the SSH session drops, with a 24-hour deadline and max 10 reconnect attempts
- Azure Slack notifications (#446): Azure VMs now install `mcpbr[slack]` instead of bare `mcpbr`, so `slack_sdk` is available for lifecycle notifications on remote runs
- Optional slack_sdk test resilience: Slack bot notification tests now skip gracefully when `slack_sdk` is not installed, instead of failing with `ModuleNotFoundError`
- MCP log directory permission error (#451): Claude CLI running as non-root `mcpbr` user could fail with `EACCES` when creating MCP log directories under `~/.cache/`. The home directory and `.cache` subdirectory are now explicitly created and owned by the `mcpbr` user during container setup
- K8s async context crash (#424): Replaced `run_until_complete` with `await` in K8s provider async methods, fixing crashes when called from an existing event loop
- SQLite UPSERT timestamp preservation (#425): Changed `INSERT OR REPLACE` to `INSERT ... ON CONFLICT DO UPDATE` so `created_at` timestamps are preserved on re-runs
- Infrastructure resource leak fixes (#430): SFTP connections in AWS/GCP now close in `finally` blocks; Kubernetes `setup()` cleans up partial ConfigMaps/Secrets on failure
- PII pattern coverage (#431): Added abbreviated IPv6, SSN without dashes, Amex credit cards, and international phone number detection to the privacy module
- Distributed coordinator reliability (#432): Added per-worker timeouts via `asyncio.wait_for()`, config isolation with `copy.deepcopy()`, `asyncio.Lock` for shared state safety, and `DistributedExecutionError` with fail-fast error propagation
- Cloud storage error handling (#429): Added retry logic with exponential backoff, `CloudStorageError` for timeout/auth errors, partial file cleanup on failed downloads, and CLI-not-found handling for GCS/Azure
- network_allowlist warning (#418): Logs a warning when `network_allowlist` is configured but not yet enforced at runtime, instead of silently ignoring the setting
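To illustrate why the SQLite UPSERT change (#425) preserves `created_at`, here is a small standalone sketch using the standard-library `sqlite3` module. The `runs` table, its `run_id` and `status` columns, and the sample values are assumptions made for the example and are not taken from the project's schema. It needs an SQLite build with UPSERT support (3.24 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (run_id TEXT PRIMARY KEY, created_at TEXT NOT NULL, status TEXT)"
)

# On a key conflict, update only the mutable column and leave created_at alone.
UPSERT = """
INSERT INTO runs (run_id, created_at, status)
VALUES (?, ?, ?)
ON CONFLICT(run_id) DO UPDATE SET status = excluded.status
"""

conn.execute(UPSERT, ("run-1", "2026-02-01T00:00:00", "running"))
conn.execute(UPSERT, ("run-1", "2026-02-12T09:30:00", "done"))  # re-run, same key

row = conn.execute(
    "SELECT created_at, status FROM runs WHERE run_id = 'run-1'"
).fetchone()
print(row)  # ('2026-02-01T00:00:00', 'done')
```

`INSERT OR REPLACE` would instead delete the existing row and insert a fresh one, so the second call's timestamp would overwrite the original.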
0.12.4 - 2026-02-09
- Eval lifecycle notifications (#438): Notifications are now sent when an evaluation begins, including benchmark name, model, task count, concurrency, and infrastructure mode
- Setup-only mode (#445): New `setup_only` flag for static file generation workflows that skip evaluation after environment setup
0.12.3 - 2026-02-07
- Version sync fix: Fixed `__version__` in `__init__.py` being stuck at `0.12.0`, which caused Azure VMs to install the wrong mcpbr version. Added `__init__.py` to the `sync_version.py` pre-commit hook so versions stay in sync automatically
0.12.2 - 2026-02-07
- Incremental cloud storage uploads (#435, #436): Upload evaluation results to cloud storage
(Azure Blob, S3, GCS) after each task completes, not just at the end. Prevents total data loss
when VMs crash or are terminated mid-evaluation. Non-blocking background uploads via
`_upload_to_cloud_async()`. Wired `incremental_save_path` from CLI to harness
- v0.12.0 security and reliability audit (#433): 16 issues fixed across the codebase
0.12.1 - 2026-02-07
- SUPERMODEL_CACHE_DIR wiring (#412): Wire up `SUPERMODEL_CACHE_DIR` in supermodel config
0.12.0 - 2026-02-07
- Docker sandbox enhancements (#381): Extended sandbox security profiles with new isolation features
- User namespace remapping (`userns_mode`) for container user isolation
- I/O rate limits (`device_read_bps`, `device_write_bps`) for disk throughput control
- Network allowlist (`network_allowlist`) for fine-grained network access control
- Sandbox validation (`validate_sandbox()`) to verify container settings match the profile
- Per-benchmark sandbox defaults: CyberGym and Adversarial benchmarks default to strict mode
- Audit logging for sandbox events (`SANDBOX_APPLIED`, `SANDBOX_VALIDATED`)
- Strict profile now enables user namespace remapping by default
- Prompt security scanning (#108): Detect prompt injection attacks in benchmark tasks
- 10 built-in detection patterns: instruction override, identity injection, system prompt leak, jailbreak prefixes (DAN/STAN/DUDE/AIM), role-play injection, base64 injection, unicode obfuscation, shell injection, reverse shell, and delimiter injection
- Configurable scan levels: `full` (all patterns) or `minimal` (HIGH/CRITICAL only)
- Three action modes: `audit` (log only), `warn` (log warnings), `block` (skip task)
- Custom pattern support via YAML configuration
- Allowlist patterns to suppress false positives
- Audit logging for scan findings (`SECURITY_SCAN_FINDING`, `SECURITY_SCAN_BLOCKED`)
- New `prompt_security` configuration section in YAML
- Misleading 'no changes detected' error (#409): When an agent wrote changes that were later reverted, the error field incorrectly reported "no changes detected." Now reports "working tree is clean - changes were likely reverted" with tool call counts, and suppresses the error entirely when a patch was generated and evaluated.
- About page: New docs/about.md page with the mcpbr origin story, project vision, community links, and prominent links to the Why I Built mcpbr blog post
- Testing Philosophy page: New docs/philosophy.md page expanding on the principle that "MCP servers should be tested like APIs, not like plugins" with detailed evaluation design guidance
- Blog post integration: Added links to the origin story blog post across README, docs homepage, FAQ, architecture, MCP integration, best practices, and contributing pages
- Navigation update: Added About section to mkdocs.yml with About and Testing Philosophy pages
- Retry empty workspace after Docker copy (#405): Under high concurrency, the Docker filesystem copy from `/testbed` to `/workspace` can silently produce an empty workspace. Now retries with a sync and, if necessary, a full re-copy before raising an error.
- setup_command runs as mcpbr user in Docker (#386): setup_command now executes as the mcpbr user instead of root, preventing EACCES permission errors when the agent accesses files created during setup (e.g., npm cache)
- Workspace verification after copy (#387): Added filesystem sync and file-count check after copying the repo to /workspace, raising an early error if the workspace is empty
- setup_command failures are always visible (#388): Non-zero exit codes from setup_command now always print a warning, not just in `--verbose` mode
- Azure health check timeout (#375): Increased Azure quota health check timeout from 60s to 120s (configurable) and made quota check non-fatal with warnings instead of errors
- CLI task filters not forwarded (#367): Task IDs specified via `-t` CLI flags are now correctly forwarded to remote infrastructure providers during evaluation
- Sandbox execution environment (#109): Configurable Docker security profiles for
benchmark evaluation containers
- Three security levels: permissive, standard (default), and strict
- Seccomp profile generation with syscall allow/deny lists
- Network, filesystem, and capability controls per security level
- YAML config integration via `sandbox` section
- Database backend (#86): SQLite storage backend for persistent evaluation results
- Async `SQLiteStorage` with automatic schema creation (runs + task_results tables)
- Store, query, list, and delete evaluation runs
- Task-level result storage with JSON serialization
- `StorageBackend` abstract base class for future backends
- Cloud storage support (#87): Upload results and artifacts to cloud object storage
- `S3Storage` for AWS S3 with presigned URL generation
- `GCSStorage` for Google Cloud Storage
- `AzureBlobStorage` for Azure Blob Storage with SAS URL generation
- `create_cloud_storage()` factory for provider selection
- CLI tool-based implementation (no SDK dependencies)
- AWS EC2 infrastructure provider (#352): Run evaluations on AWS EC2 instances
- Automatic instance provisioning with configurable instance types and AMIs
- SSH connectivity via paramiko with key management
- Remote Docker/Python/Node.js dependency installation
- Health checks for AWS CLI, authentication, and instance type availability
- Full lifecycle: setup, evaluation, artifact collection, cleanup
- GCP Compute Engine infrastructure provider (#353): Run evaluations on GCP VMs
- Instance provisioning via `gcloud compute` CLI
- SSH connectivity with GCP-managed keys
- Health checks for gcloud CLI, authentication, project, and zone configuration
- Configurable machine types, disk sizes, and network settings
- Kubernetes infrastructure provider (#355): Run evaluations as Kubernetes Jobs
- Job manifest generation with configurable resource requests/limits
- ConfigMap and Secret creation for benchmark configuration
- Pod log streaming and artifact collection
- Health checks for kubectl, cluster access, namespace, and resource quotas
- Automatic cleanup of Jobs, ConfigMaps, and Secrets
- Cloudflare Workers infrastructure provider (#354): Hybrid Worker-based evaluation
- Wrangler CLI integration for Worker deployment
- KV namespace management for result storage
- Worker script generation with benchmark compatibility checks
- Health checks for wrangler CLI, authentication, and Node.js
- Supports custom and HumanEval benchmarks via Worker execution
- Distributed execution coordinator (#116): Coordinate running evaluations across multiple
workers/machines with automatic task partitioning and result merging
- `TaskPartitioner` with even round-robin and difficulty-balanced (LPT) partitioning strategies
- `WorkerResult` dataclass for per-worker execution results with timing and error tracking
- `DistributedCoordinator` that partitions tasks, launches workers concurrently via `asyncio.gather`, and merges results into a unified `EvaluationResults` object
- Supports all infrastructure providers: local, AWS, GCP, Kubernetes, Azure
- Multi-benchmark runner (#359): Run multiple benchmarks in a single invocation with parallel
execution and aggregated results
- `MultiBenchmarkRunner` with asyncio semaphore-based concurrency control
- `BenchmarkRun`, `BenchmarkResult`, and `MultiBenchmarkResults` dataclasses for configuration and result aggregation
- `partition_tasks` utility for sharding task lists across runners
- Summary property with aggregate pass rate, task counts, and duration
- REST API server (#83): HTTP API for querying evaluation results and server status
- `GET /api/v1/health`: server health check
- `GET /api/v1/runs`: list all evaluation runs
- `GET /api/v1/runs/{id}`: get a specific run with results
- `GET /api/v1/runs/{id}/tasks`: get task-level results for a run
- `DELETE /api/v1/runs/{id}`: delete an evaluation run
- `GET /api/v1/stats`: aggregate statistics across all runs
- Built on Python `http.server` (no external dependencies)
- `mcpbr serve` CLI command for starting the API server
- Cross-platform compatibility (#161): Windows support utilities and path normalization
- Platform detection helpers (`is_windows`, `is_macos`, `is_linux`)
- Platform-aware directory paths for data, cache, config, and temp
- Windows Docker path normalization (e.g., `C:\Users\...` → `/c/Users/...`)
- Shell detection (PowerShell/cmd fallback on Windows, bash on Unix)
- Docker socket path abstraction (Unix socket vs Windows named pipe)
- VS Code extension scaffold (#163): Extension for running benchmarks from VS Code
- Sidebar views for benchmark results and configuration
- Commands: run benchmark, stop, select config, view results, open dashboard
- `BenchmarkRunner` class spawning mcpbr CLI as child process
- Tree data providers for results and config display
- Configuration settings for Python path and API host
- Plugin registry (#267): Discover and register MCP server benchmark plugins
- `PluginEntry` and `Registry` dataclasses with search by name, description, and tags
- `RegistryClient` with HTTP fetch, caching, search, and list operations
- `generate_registry_entry()` for mcpbr self-registration
- JSON-based registry format hosted on GitHub
- Rate limiting for API calls (#196): Intelligent rate limiting to prevent API quota exhaustion
- Token bucket algorithm with configurable requests-per-minute (`rate_limit_rpm`)
- Three backoff strategies: fixed, exponential, and adaptive (with jitter)
- Retry-After header parsing for automatic 429 recovery
- Real-time metrics tracking (throttled requests, total wait time, error counts)
- Configurable safety margin to stay within provider limits
- Benchmark reproducibility (#136): Ensure evaluations are reproducible across runs
- Global random seed control (`global_seed`) for deterministic evaluation ordering
- Environment snapshot capture (Python version, platform, installed packages)
- Deterministic mode with `PYTHONHASHSEED` control
- SHA256 checksums for reproducibility report verification
- JSON-serializable reproducibility reports for sharing and auditing
- Privacy controls (#120): PII detection, redaction, and data governance
- Three redaction levels: none, basic (emails/keys/IPs), strict (all PII patterns)
- 7 built-in PII patterns (email, API key, IPv4/IPv6, credit card, SSN, phone)
- Custom regex pattern support for organization-specific redaction
- Recursive dict redaction for nested result structures
- Task ID anonymization via SHA256 hashing
- Data retention policies with configurable expiry
- Field exclusion to strip sensitive result fields before saving
- Audit logging (#118): Tamper-proof audit trail for compliance and security
- 13 auditable event types covering config, benchmark, task, result, and data lifecycle
- HMAC-SHA256 hash chain for tamper detection and integrity verification
- JSON and CSV export formats for compliance reporting
- Configurable event filtering (log all or specific event types)
- Append-only JSONL file logging with automatic directory creation
- Standalone `verify_integrity()` to detect log tampering
- Interactive tutorial system (#198, #157): CLI-based tutorials for learning mcpbr
- `mcpbr tutorial list`: browse available tutorials with difficulty levels
- `mcpbr tutorial start <id>`: step-by-step walkthrough with validation
- `mcpbr tutorial progress`: track completion across all tutorials
- `mcpbr tutorial reset <id>`: restart a tutorial from scratch
- 4 built-in tutorials: Getting Started, Configuration, Benchmarks, Analytics
- Progress persistence in `~/.mcpbr_state/tutorials/`
- Comprehensive API reference documentation (#202, #156): Auto-generated docs from source
- SDK, Configuration, Benchmarks, Analytics, and Reports API pages
- mkdocstrings integration for live source rendering
- Enhanced benchmark guides (#203): 6 benchmark docs expanded to comprehensive guides
- HumanEval, GSM8K, GAIA, CyberGym, TerminalBench, MCPToolBench++ — each with evaluation methodology, configuration examples, best practices, and troubleshooting
- Plugin development guide (#160): Complete guide for extending mcpbr
- Custom benchmark creation with Benchmark protocol walkthrough
- Custom provider and custom metrics implementation
- Testing guidelines and publishing instructions
- Best practices guide enhancements (#159): 6 new sections added
- Security, Performance Optimization, CI/CD Integration, Troubleshooting, Cost Management, Analytics Best Practices
- 6 new FAQ entries covering common workflows
- Custom domain (#379): Documentation site now served at https://mcpbr.org/
- GitHub Pages CNAME configuration
- All documentation URLs updated to mcpbr.org
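For readers unfamiliar with the token bucket algorithm named under "Rate limiting for API calls (#196)" in this release, the following is a generic sketch of the technique. It is not mcpbr's rate limiter: the `TokenBucket` class, its method names, and the injectable `clock` are hypothetical, and only the requests-per-minute budget corresponds to the `rate_limit_rpm` setting mentioned above.

```python
import time


class TokenBucket:
    """Token bucket sized from a requests-per-minute budget."""

    def __init__(self, rpm, clock=time.monotonic):
        self.capacity = float(rpm)
        self.refill_per_sec = rpm / 60.0
        self.tokens = float(rpm)  # start full, so a burst of `rpm` is allowed
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.last) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def try_acquire(self):
        """Consume one token if available; return False when throttled."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.refill_per_sec)


# Drive the bucket with a fake clock so the example is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rpm=60, clock=lambda: fake_now[0])
burst = sum(bucket.try_acquire() for _ in range(61))      # 61st call is throttled
fake_now[0] += 2.0                                        # two seconds pass
after_wait = sum(bucket.try_acquire() for _ in range(5))  # two tokens refilled
print(burst, after_wait)  # 60 2
```

A caller that is throttled could sleep for `wait_time()` before retrying; backoff strategies and Retry-After handling would sit on top of this and are not shown.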
0.11.1 - 2026-02-07
- Misleading 'no changes detected' error (#409): When an agent wrote changes that were later reverted, the error field incorrectly reported "no changes detected." Now reports "working tree is clean - changes were likely reverted" with tool call counts, and suppresses the error entirely when a patch was generated and evaluated
0.11.0 - 2026-02-07
- AWS EC2 infrastructure provider (#352): Run evaluations on AWS EC2 instances with automatic provisioning, SSH connectivity, health checks, and full lifecycle management
- GCP Compute Engine infrastructure provider (#353): Run evaluations on GCP VMs with instance provisioning via `gcloud compute` CLI and GCP-managed keys
- Kubernetes infrastructure provider (#355): Run evaluations as Kubernetes Jobs with resource management, ConfigMap/Secret creation, and pod log streaming
- Cloudflare Workers infrastructure provider (#354): Hybrid Worker-based evaluation with Wrangler CLI integration and KV namespace management
- Cloud storage support (#87): Upload results to S3, GCS, or Azure Blob Storage via CLI `--upload-to` flag or YAML `cloud_storage` config
- Database backend (#86): SQLite storage for persistent evaluation results with async interface and query capabilities
- Sandbox execution (#109): Configurable Docker security profiles (permissive, standard, strict) with seccomp profiles and network controls
- Distributed execution (#116): Coordinate evaluations across multiple workers with task partitioning and automatic result merging
- Multi-benchmark runner (#359): Run multiple benchmarks in parallel with aggregated results
- REST API server (#83): HTTP API for querying results and server status via `mcpbr serve`
- Cross-platform compatibility (#161): Windows support with platform detection and Docker path normalization
- VS Code extension scaffold (#163): Extension with sidebar views, benchmark commands, and result tree providers
- Plugin registry (#267): Discover and register MCP server benchmark plugins
- Azure health check timeout (#375): Increased timeout from 60s to 120s (configurable), made quota check non-fatal
- CLI task filters not forwarded (#367): Task IDs via `-t` flags now correctly forwarded to remote infrastructure providers
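The task partitioning behind distributed execution (#116) includes a difficulty-balanced LPT strategy, per the 0.12.0 notes. LPT (longest processing time first) is a standard greedy heuristic, sketched here in generic form. The function name `partition_lpt` and the idea of using a per-task numeric cost as the difficulty estimate are assumptions for illustration, not the project's `TaskPartitioner` API.

```python
import heapq


def partition_lpt(task_costs, num_workers):
    """Greedy LPT: assign each task, costliest first, to the least-loaded worker."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    buckets = [[] for _ in range(num_workers)]
    heap = [(0.0, i) for i in range(num_workers)]  # (current load, worker index)
    heapq.heapify(heap)
    # Sort by descending cost; break ties by task id for determinism.
    for task_id, cost in sorted(task_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(task_id)
        heapq.heappush(heap, (load + cost, idx))
    return buckets


costs = {"t1": 8, "t2": 7, "t3": 6, "t4": 5, "t5": 4}
parts = partition_lpt(costs, 2)
loads = [sum(costs[t] for t in p) for p in parts]
print(parts, loads)  # [['t1', 't4', 't5'], ['t2', 't3']] [17, 13]
```

Plain round-robin over the same five tasks would give loads of 18 and 12, so the greedy assignment is somewhat better balanced here, though LPT is a heuristic and does not guarantee the optimal split.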
0.10.4 - 2026-02-06
- Retry empty workspace after Docker copy (#405): Under high concurrency, the Docker filesystem copy from `/testbed` to `/workspace` can silently produce an empty workspace. Now retries with a sync and, if necessary, a full re-copy before raising an error
0.10.3 - 2026-02-06
- Slack notifications migrated to Bot API (#404): Uses `slack_sdk.WebClient.chat_postMessage` instead of raw webhook POST for richer formatting and thread support. Results JSON uploaded as a file snippet in a thread reply. GitHub Gist integration for result sharing. Webhook fallback on bot failure. `slack_sdk` is now an optional dependency (`pip install mcpbr[slack]`)
0.10.2 - 2026-02-06
- Eval timeout misclassification (#399, #402): `exec_command` timeouts in `apply_patch()` and `_apply_test_patch()` no longer bubble up as `asyncio.TimeoutError`. Timeouts increased from 30s to 120s for Docker exec calls under concurrent load
- Container leak prevention (#400, #402): `env.cleanup()` now has a 60s timeout with force-kill fallback to prevent leaked Docker containers
- Cold-start staggering (#401, #403): First-batch concurrent task launches staggered by 1s per slot to avoid overwhelming Docker with simultaneous image pulls
- Zero-iteration retry (#401, #403): Tasks that time out with 0 iterations are automatically retried once, since images/containers are cached on the second attempt
- Rich Slack notifications (#398): Slack notifications now include file uploads and Gist links for full result sharing
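One way to picture the cold-start staggering from this release (#401, #403) is the asyncio sketch below. It is an assumption-laden illustration rather than the harness code: `run_staggered` and its parameters are hypothetical names, and the demo uses a 0.05s stagger instead of the 1s per slot described above so that it finishes quickly.

```python
import asyncio


async def run_staggered(factories, concurrency, stagger_seconds=1.0):
    """Run coroutine factories under a semaphore, delaying only the first batch."""
    sem = asyncio.Semaphore(concurrency)

    async def runner(index, factory):
        async with sem:
            if index < concurrency:
                # Slot 0 starts at once, slot 1 after one stagger, and so on.
                await asyncio.sleep(index * stagger_seconds)
            return await factory()

    return await asyncio.gather(*(runner(i, f) for i, f in enumerate(factories)))


async def _demo():
    loop = asyncio.get_running_loop()
    t0 = loop.time()
    starts = []

    async def task():
        starts.append(loop.time() - t0)  # record when the task really started
        await asyncio.sleep(0.01)

    await run_staggered([task] * 4, concurrency=2, stagger_seconds=0.05)
    return starts


starts = asyncio.run(_demo())
print([round(s, 2) for s in starts])
```

Only tasks in the first batch are delayed; later tasks start as soon as a slot frees up, on the assumption that images are already pulled by then.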
0.10.0 - 2026-02-06
- Random sampling (#372): Wire `sampling.py` into harness/config with `--sampling-strategy` (sequential/random/stratified), `--random-seed`, and `--stratify-field` CLI flags. Non-deterministic by default; same seed guarantees reproducibility
- Dataset loading performance (#360, #361): Add `dataset.select(range(n))` optimization to all 26 HuggingFace benchmarks to avoid materializing full datasets when sampling
- Notifications (#80, #206, #207, #208): Slack, Discord, and email completion alerts auto-dispatched from harness. `--notify-slack`, `--notify-discord`, `--notify-email` CLI flags
- Azure monitoring (#363): Run state persistence, `run-status`, `run-ssh`, `run-stop`, `run-logs` CLI commands for managing Azure VMs
- Prometheus metrics (#81, #209): `prometheus.py` with proper exposition format and `export-metrics` CLI command
- W&B integration (#82): `wandb_integration.py` with lazy import, `--wandb`/`--no-wandb` and `--wandb-project` CLI flags
- Result badges (#211): `badges.py` with shields.io URL generation and `badge` CLI command
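The "same seed guarantees reproducibility" property of random sampling (#372) can be shown with a few lines of standard-library Python. This is a sketch only: `sample_tasks` is a hypothetical helper, not the interface of `sampling.py`, and the stratified strategy is left out for brevity.

```python
import random


def sample_tasks(task_ids, n, strategy="sequential", seed=None):
    """Pick up to `n` task IDs using a sequential or random strategy."""
    ids = list(task_ids)
    n = min(n, len(ids))
    if strategy == "sequential":
        return ids[:n]
    if strategy == "random":
        rng = random.Random(seed)  # seed=None seeds from OS entropy
        return rng.sample(ids, n)
    raise ValueError(f"unsupported strategy: {strategy!r}")


tasks = [f"task-{i:03d}" for i in range(100)]
first = sample_tasks(tasks, 5, strategy="random", seed=42)
second = sample_tasks(tasks, 5, strategy="random", seed=42)
print(first == second)  # True: same seed, same sample
```

Using a private `random.Random` instance rather than the module-level functions keeps the sample independent of any other code that touches the global random state.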
0.9.1 - 2026-02-06
- Docker container name collision (#383): Added unique UUID suffix to container names and 409 Conflict recovery with stale container removal and retry
- asyncio.TimeoutError not caught (#384): Catch both `TimeoutError` and `asyncio.TimeoutError` for Python <3.11 compatibility; added `eval_timeout_seconds` config field (default 600s)
- MCP agent prompt missing workdir (#385): Added `{workdir}` placeholder to `MCP_PROMPT_SUFFIX` so the agent knows the repository location
- setup_command runs as wrong user (#386): setup_command now executes as the mcpbr user instead of root
- Workspace verification after copy (#387): Added filesystem sync and file-count check
- setup_command failures silently swallowed (#388): Non-zero exit codes now always print a warning
0.9.0 - 2026-02-06
- Rate limiting (#196): Token bucket algorithm with configurable requests-per-minute, three backoff strategies (fixed, exponential, adaptive), Retry-After header parsing, and real-time metrics tracking
- Benchmark reproducibility (#136): Global random seed control, environment snapshot capture, deterministic mode with `PYTHONHASHSEED`, and SHA256-checksummed reproducibility reports
- Privacy controls (#120): Three-tier PII redaction (none, basic, strict), 7 built-in patterns, custom regex support, task ID anonymization, and data retention policies
- Audit logging (#118): HMAC-SHA256 hash chain for tamper detection, 13 auditable event types, JSON/CSV export, and JSONL append-only file logging
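The HMAC-SHA256 hash chain used for audit logging (#118) follows a well-known pattern: each entry's MAC covers the previous entry's MAC, so editing any earlier entry invalidates everything after it. The sketch below shows that pattern in general terms. The entry layout, the event names, and the `append_entry` / `verify_chain` helpers are hypothetical and do not describe the project's log format or its verification function.

```python
import hashlib
import hmac
import json


def _entry_mac(key, prev_mac, event):
    """MAC over the previous MAC plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hmac.new(key, prev_mac.encode() + payload, hashlib.sha256).hexdigest()


def append_entry(log, key, event):
    prev_mac = log[-1]["mac"] if log else ""
    log.append({"event": event, "mac": _entry_mac(key, prev_mac, event)})


def verify_chain(log, key):
    prev_mac = ""
    for entry in log:
        expected = _entry_mac(key, prev_mac, entry["event"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev_mac = expected  # chain on the recomputed value, not the stored one
    return True


key = b"demo-secret"  # in practice this would be a secret kept out of the log
log = []
append_entry(log, key, {"type": "RUN_STARTED", "run": "r1"})
append_entry(log, key, {"type": "TASK_DONE", "task": "t1"})

ok_before = verify_chain(log, key)
log[0]["event"]["run"] = "r2"  # tamper with the first entry
ok_after = verify_chain(log, key)
print(ok_before, ok_after)  # True False
```

Because the MAC is keyed, someone who can edit the log file but does not hold the key cannot recompute a consistent chain after tampering.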
0.8.0 - 2026-02-06
- Interactive tutorial system (#198, #157): CLI-based tutorials with `mcpbr tutorial list`, `start`, `progress`, and `reset` commands. 4 built-in tutorials with progress persistence
- API reference documentation (#202, #156): Auto-generated docs with mkdocstrings
- Enhanced benchmark guides (#203): 6 benchmark docs expanded to comprehensive guides
- Plugin development guide (#160): Complete guide for extending mcpbr
- Best practices guide enhancements (#159): 6 new sections including Security, Performance Optimization, CI/CD Integration, Troubleshooting, Cost Management
- Custom domain (#379): Documentation site now served at https://mcpbr.org/
0.7.0 - 2026-02-06
- Analytics package (#178, #179, #183, #61, #63, #57, #60, #180, #181, #182, #184, #185, #187, #226, #227): Comprehensive analytics and insights engine
- Historical results database (#178): SQLite-backed `ResultsDatabase` for storing, querying, and tracking evaluation runs over time
- Performance regression detection (#179): `RegressionDetector` with statistical significance testing for score, cost, latency, and token regressions
- Multi-model comparison (#183): `ComparisonEngine` with Pareto frontier analysis, winner analysis, and pairwise statistical significance testing
- Statistical significance testing (#61, #63): Chi-squared test, bootstrap confidence intervals, Cohen's d effect size, Mann-Whitney U, and permutation tests, all implemented with stdlib only (no numpy/scipy)
- Trend analysis (#57, #60): Time-series trend detection with linear regression, moving averages, and direction classification
- A/B testing framework (#180): `ABTest` class for controlled experiment analysis with resolution rate comparison
- Leaderboard generation (#181): Ranked leaderboards with ASCII table and Markdown output, sortable by any metric
- Custom metrics registry (#182): `MetricsRegistry` with 5 built-in metrics and extensible custom metric registration
- Benchmark difficulty estimation (#184): Task difficulty scoring based on resolution rates, cost, and iterations
- Correlation analysis (#185): Pearson and Spearman correlation with automatic metric pair analysis
- Anomaly detection (#187): Z-score, IQR, and MAD methods for detecting outlier tasks across cost, tokens, runtime, and iterations
- Error pattern analysis (#227): `ErrorPatternAnalyzer` with Jaccard similarity clustering, temporal pattern detection, and actionable recommendations
- Cross-benchmark comparison (#226): Side-by-side comparison with task-level overlap analysis and unique-win identification
- Interactive HTML reports (#39, #55): `HTMLReportGenerator` with Chart.js charts, dark mode toggle, sortable task tables, and responsive layout
- Enhanced Markdown reports (#42): `EnhancedMarkdownGenerator` with shields.io badges, mermaid pie/bar charts, collapsible sections, and analysis tables
- PDF reports (#56): `PDFReportGenerator` with CSS `@media print` styles, page numbers, custom branding, and optional weasyprint PDF export
- Comparison reports (#53, #59): `mcpbr compare` CLI command for multi-run comparison with statistical significance, winner analysis, and Pareto frontier
- CLI report output flags: `--output-html`, `--output-markdown`, `--output-pdf` options on the `mcpbr run` command for generating reports alongside evaluation
- CLI analytics subcommands: `mcpbr analytics store`, `mcpbr analytics trends`, `mcpbr analytics leaderboard`, `mcpbr analytics regression` for database-backed analytics workflows
- 310 new tests for analytics modules, report generators, and CLI commands
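
The three outlier methods named in the anomaly detection entry above (Z-score, IQR, MAD) can each be sketched with the standard library alone. The function names, default thresholds, and sample data below are assumptions chosen for illustration and are not taken from the mcpbr analytics package.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` population std devs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normal.
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical per-task costs with one obviously expensive task.
costs = [1.0, 1.1, 0.9, 1.2, 1.0, 1.05, 0.95, 9.0]
```

On small samples a single extreme value inflates the standard deviation, so the plain Z-score can need a lower threshold than the median-based methods to flag the same point.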
0.6.0 - 2026-02-05
- Graceful degradation (#70): Fault-tolerant task execution with failure isolation, classification (transient/permanent/unknown), configurable `continue_on_error` and `max_failures` policies, execution checkpointing for crash recovery, and partial report generation
- New config fields: `continue_on_error`, `max_failures`, `checkpoint_interval`, `resume_from_checkpoint`
- Multi-provider support (#229): Added OpenAI, Google Gemini, and Alibaba Qwen as model providers alongside Anthropic
- `OpenAIProvider` for GPT-4o, GPT-4 Turbo, and GPT-4o Mini models
- `GeminiProvider` for Gemini 2.0 Flash, Gemini 1.5 Pro, and Gemini 1.5 Flash models
- `QwenProvider` for Qwen Plus, Qwen Turbo, and Qwen Max models via DashScope API
- New optional dependencies: `openai`, `gemini`, `all-providers` extras
- Pricing data for all 9 new models
- Model registry entries with context window and tool support metadata
- Multi-language support (#230): Cross-language benchmark execution for Python, JavaScript, TypeScript, Java, and Go
- Per-language Docker images, run/compile commands, and test framework configs
- Automatic language detection from filenames and code patterns
- Cross-language metrics aggregation
- Structured logging (#231): JSON and human-readable log formatters with contextual metadata
- File rotation, configurable log levels via `MCPBR_LOG_LEVEL` env var
- `LogContext` context manager for injecting task/benchmark fields into log records
- Public Python SDK (#232): Programmatic API for configuring and running benchmarks
- `MCPBenchmark` class with config from dict, YAML, or `HarnessConfig`
- `list_benchmarks()`, `list_providers()`, `list_models()`, `get_version()` helpers
- Exported in top-level `mcpbr` package for `from mcpbr import MCPBenchmark`
- Platform distribution files: Docker, Conda, Homebrew, GitHub Actions, and CI templates
- `Dockerfile.app` multi-stage build for container deployment
- `docker/docker-compose.yml` for multi-container orchestration
- `conda/meta.yaml` recipe for Conda packaging
- `action/action.yml` GitHub Action with basic and matrix examples
- `ci-templates/` for GitLab CI and CircleCI integration
0.5.4 - 2026-02-06
- Retry empty workspace after Docker copy (#405): Under high concurrency, the Docker filesystem copy from `/testbed` to `/workspace` can silently produce an empty workspace. Now retries with a sync and, if necessary, a full re-copy before raising an error.
0.5.3 - 2026-02-06
- Azure provisioner version pinning (#391): `pip install mcpbr` changed to `pip install mcpbr==<local_version>` to prevent version mismatch on remote VMs
0.5.2 - 2026-02-06
- Docker container name collision (#383): Added unique UUID suffix to container names (backported from 0.9.1)
- asyncio.TimeoutError not caught (#384): Catch both `TimeoutError` and `asyncio.TimeoutError` (backported from 0.9.1)
- MCP agent prompt missing workdir (#385): Added `{workdir}` placeholder (backported)
- setup_command runs as wrong user (#386): Executes as mcpbr user (backported)
- Workspace verification after copy (#387): Added sync and file-count check (backported)
- setup_command failures silently swallowed (#388): Non-zero exit codes visible (backported)
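
For context on the `asyncio.TimeoutError` fix (#384) above: on Python versions before 3.11 the built-in `TimeoutError` and `asyncio.TimeoutError` are distinct classes, so catching only one can miss the other. The sketch below shows the general pattern; `run_with_timeout` and `slow` are hypothetical names for this example, not mcpbr functions.

```python
import asyncio


async def run_with_timeout(coro, seconds):
    """Await coro, mapping either timeout exception type to a sentinel string."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except (TimeoutError, asyncio.TimeoutError):
        # On Python 3.11+ these two names refer to the same class;
        # on older versions they differ, so both are listed.
        return "timed out"


async def slow():
    """Stand-in for a long-running task."""
    await asyncio.sleep(1)
    return "done"


result = asyncio.run(run_with_timeout(slow(), 0.01))
```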
0.5.1 - 2026-02-06
- setup_command timing (#378): setup_command now runs outside the agent timer, preventing expensive pre-computation from counting against the task timeout. Backported from main
0.5.0 - 2026-02-05
- Real-time web dashboard (#58): Live monitoring of benchmark evaluations via `DashboardServer` with FastAPI + WebSocket, task progress, resolution rate, ETA, and pause/resume/cancel controls
- Interactive configuration wizard (#74): Step-by-step CLI wizard for creating config files with presets (filesystem, web-search, database), model/benchmark selection, and MCP server setup
- Dry-run mode (#84): Preview evaluation plan without executing — shows tasks, estimated cost/time, validates config, checks Docker and MCP server availability
- Task prioritization and scheduling (#92): Intelligent task ordering with speed-first, cost-first, coverage-first, and custom scoring strategies
- Color and formatting options (#105): Configurable output themes (default, minimal, plain) with NO_COLOR convention support and MCPBR_THEME env var
- Docker image pre-warming (#128): Pre-pull Docker images in parallel before evaluation starts with progress reporting and cache detection
- Result streaming to external storage (#131): Stream results as tasks complete to local JSONL files, S3-compatible storage, or webhooks with buffering and retry
- Memory-efficient large dataset handling (#134): Streaming and chunked loading of large HuggingFace datasets with memory monitoring and automatic chunk-size adaptation
- Task batching with smart scheduling (#137): Group similar tasks by repo/image/category to minimize Docker container restarts with adaptive batch sizing
- Resource limit configuration (#139): Configure CPU, memory, disk, PID, and network limits for Docker containers with monitoring and violation reporting
- Configuration migration tool (#195): Detect and migrate old config formats (V1→V4) with dry-run preview, backup, and chained migration steps
- Docker image caching optimization (#228): LRU cache management with size limits, usage tracking, eviction, warmup recommendations, and dangling image cleanup
- Zero-cost metrics on evaluation timeout (#374): Agent metrics (cost, tokens, iterations) were discarded when `benchmark.evaluate()` timed out after the agent had already completed successfully. Now preserves agent results when available.
- Process hang after evaluation completes (#374): `asyncio.run()` blocked indefinitely during cleanup because Docker SDK urllib3 background threads kept the default executor alive. Now force-shuts down the executor with `wait=False`.
0.4.16 - 2026-02-05
- Custom benchmark support via YAML (#29, #47): Users can define custom benchmarks without writing Python code using YAML definition files with configurable evaluation types (exact_match, numeric, regex, script)
- Custom metrics framework (#64): Define and compute custom evaluation metrics beyond standard accuracy/pass rates, with composite metrics support and a built-in metric registry
- Failure analysis module (#67): Categorize and analyze evaluation failures with pattern extraction, failure reports, and actionable recommendations
- Random and stratified sampling (#142): Add sampling strategies (sequential, random, stratified) with seed control for reproducible benchmark task selection
- Dataset versioning (#138): Pin and track HuggingFace dataset versions for reproducible benchmark runs with manifest save/load support
- Latency and performance metrics (#129): Track task latency, time-to-first-tool-call, throughput, and percentile statistics (p50/p95/p99)
- GPU support for Docker containers (#121): Detect NVIDIA GPUs and configure Docker containers with GPU access for ML benchmarks
- Few-shot learning support (#127): Variable shot counts with selection strategies (random, similar, diverse) and learning curve analysis
- MMMU multi-modal benchmark (#123): Massive Multi-discipline Multimodal Understanding benchmark for image understanding tasks
- LongBench long-context benchmark (#125): Long-context benchmark with F1, ROUGE-L, classification accuracy, and edit similarity metrics across 21 subsets
- Adversarial testing benchmark (#126): Safety and robustness benchmark using HarmBench with refusal detection across jailbreak, hallucination, bias, and robustness categories
- MCPToolBench++ integration tests (#232): Comprehensive test suite for the MCPToolBench++ benchmark implementation
- 21 new benchmark implementations (#6, #7, #18, #19, #20, #22, #24, #25, #26, #27, #28, #33, #34, #35, #37, #38, #40, #45, #46, #49): Initial stub implementations for all planned benchmarks
- Repository-aware test commands for non-pytest projects (#365): Use upstream SWE-bench test command specs for sympy (`bin/test`), sphinx (`tox`), and other non-pytest repos instead of defaulting to `python -m pytest`
- Flaky Azure and trial mode tests: Fixed tests that depended on local `~/.ssh/mcpbr_azure` state and updated assertions for multi-step dependency installation
- SEO improvements for documentation site
- Added robots.txt with sitemap reference
- Added Open Graph and Twitter Card meta tags on all pages
- Added per-page meta descriptions from frontmatter across all doc pages
- Updated site description to reflect all 25+ benchmarks (was SWE-bench only)
- Fixed SoftwareApplication schema version (was 0.1.0, now 0.4.5)
- Updated SoftwareApplication schema description to cover full benchmark suite
- Comprehensive benchmark documentation with individual tutorial pages for all 25+ benchmarks
- Each benchmark has a dedicated page with overview, task structure, CLI/YAML usage, evaluation methodology, examples, troubleshooting, and best practices
- Benchmarks organized by category: Software Engineering, Code Generation, Math & Reasoning, Knowledge & QA, Tool Use & Agents, ML Research, Code Understanding, Security
- New benchmarks overview page with master comparison table and category navigation
- HowTo JSON-LD structured data for SEO on benchmark pages
- Updated README with complete benchmark category table
- Fixed internal links across FAQ, configuration, CLI, and best practices docs
- Code generation: MBPP, APPS, CodeContests, BigCodeBench, LeetCode, CoderEval
- Math/reasoning: MATH, BigBench-Hard
- Knowledge/QA: TruthfulQA, HellaSwag, ARC
- Agent/interactive: GAIA, AgentBench, WebArena, TerminalBench, MLAgentBench, InterCode
- Tool-use: ToolBench
- Code editing: Aider Polyglot
- Repository understanding: RepoQA
- All benchmarks implement the full `Benchmark` protocol with `load_tasks`, `normalize_task`, `create_environment`, `evaluate`, `get_prebuilt_image`, and `get_prompt_template`
0.4.15 - 2026-02-05
- Evaluation hang prevention (#370): Prevent evaluation from hanging after task completion
0.4.14 - 2026-02-05
- Setup script hook (#366): Add setup script hook before timed tasks begin
- Non-pytest test commands (#365): Use upstream SWE-bench test commands for non-pytest projects (sympy, sphinx, etc.)
0.4.13 - 2026-02-02
- MCP tool registration: Replaced `claude mcp add` CLI command with `.mcp.json` file-based configuration. The previous approach created broken tool registration where the MCP server connected but its tools failed with "No such tool available"
0.4.12 - 2026-02-02
- Azure recursive provisioning: Override infrastructure mode to local in VM config to prevent recursive Azure provisioning on the VM
0.4.11 - 2026-02-02
- Azure Docker daemon: Start and enable Docker daemon after installation via systemctl. Use `bash -lc` + `sg docker` for remote commands. Add `--skip-preflight` to test task validation
0.4.10 - 2026-02-02
- Azure VM dependency installation: Install Docker, Python (via deadsnakes PPA), Node.js, and mcpbr in separate steps. Use `python{version} -m pip/mcpbr` instead of system binaries. Install Node.js for npx (needed by MCP servers)
0.4.9 - 2026-02-02
- Azure availability zones: New `zone` field in Azure config for specifying availability zones. Health checks now distinguish zone-specific vs location-wide SKU restrictions
0.4.8 - 2026-02-02
- Azure health check key mismatch: `InfrastructureManager.run_with_infrastructure()` expected `healthy`/`failures` keys but `AzureProvider.health_check()` returns `errors`. Now handles both
0.4.7 - 2026-02-02
- Azure infrastructure CLI wiring: `infrastructure.mode: azure` in config was parsed but never acted on. The CLI now routes to `InfrastructureManager.run_with_infrastructure()` for non-local infrastructure modes
0.4.5 - 2026-02-01
- 21 new benchmark implementations (#6, #7, #18, #19, #20, #22, #24, #25, #26, #27, #28, #33, #34, #35, #37, #38, #40, #45, #46, #49): Expanded from 4 benchmarks to 25+
- Software Engineering: APPS, CodeContests, BigCodeBench, LeetCode, CoderEval, Aider Polyglot
- Code Generation: MBPP
- Math & Reasoning: MATH, BigBench-Hard
- Knowledge & QA: TruthfulQA, HellaSwag, ARC, GAIA
- Tool Use & Agents: ToolBench, AgentBench, WebArena, TerminalBench, InterCode
- ML Research: MLAgentBench
- Code Understanding: RepoQA
- SEO improvements: Added robots.txt, Open Graph/Twitter Card meta tags, per-page meta descriptions, and fixed SoftwareApplication schema
- 26 new benchmark documentation pages with individual tutorials, evaluation methodology, configuration examples, and troubleshooting for all benchmarks
0.4.4 - 2026-02-01
- Azure Infrastructure Provider (#351, #356): Run evaluations on Azure VMs instead of local Docker
- Automatic VM provisioning with configurable CPU, memory, and disk
- SSH connectivity with key management
- Remote dependency installation (Docker, Python, mcpbr)
- Configuration transfer and environment variable export
- Test task validation before full evaluation
- Artifact collection with ZIP archiving
- Auto-cleanup with configurable preservation on error
- Comprehensive health checks (Azure CLI, authentication, quotas)
- Abstract `InfrastructureProvider` base class for future providers
- `InfrastructureManager` factory with full lifecycle orchestration
- Custom exceptions (`UnknownInfrastructureModeError`, `InfrastructureHealthCheckError`)
- New config fields: `infrastructure.mode` (local/azure), `infrastructure.azure.*`
- Full backward compatibility when `infrastructure` field is omitted
- Django test runner support (#93): Detect Django test format and use `./tests/runtests.py` instead of pytest
- Django tasks in SWE-bench previously had 0% success rate due to wrong test runner
- Detects both parenthesized format (`test_method (module.tests.TestClass)`) and dot-separated format (`module.tests.TestClass.test_method`)
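
The two Django test ID formats mentioned in the Django test runner entry above can be recognized with a pair of regular expressions. This is a rough sketch under the assumption that test methods start with `test`; the patterns and the `to_django_label` helper are illustrative and not the detection logic mcpbr ships.

```python
import re

# Parenthesized format: "test_method (module.tests.TestClass)"
PAREN_RE = re.compile(r"^(?P<method>\w+) \((?P<path>[\w.]+)\)$")
# Dot-separated format: "module.tests.TestClass.test_method"
DOTTED_RE = re.compile(r"^(?:\w+\.)+test\w*$")


def to_django_label(test_id):
    """Return a dotted test label, or None if the ID is not Django-style."""
    match = PAREN_RE.match(test_id)
    if match:
        # Move the method name to the end of the dotted class path.
        return f"{match['path']}.{match['method']}"
    if DOTTED_RE.match(test_id):
        return test_id
    return None


label = to_django_label("test_method (module.tests.TestClass)")
```

A pytest-style node ID such as `tests/test_foo.py::test_bar` matches neither pattern, so a caller could fall back to pytest in that case.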
0.4.3 - 2026-02-01
- Side-by-Side MCP Server Comparison (#350): A/B test two MCP servers in a single evaluation run
- New `comparison_mode` flag enables dual-server evaluation
- Configure servers with `mcp_server_a` and `mcp_server_b` fields
- Results show side-by-side comparison with delta metrics and improvement percentages
- Per-task comparison showing which server won each task
- Unique wins analysis (tasks where only one server succeeded)
- Full backward compatibility with single-server mode
- Comprehensive documentation in `docs/comparison-mode.md`
- GSM8K Benchmark (#306): Math reasoning evaluation support
- Evaluate LLM math reasoning on Grade School Math 8K dataset
- 1,319 test problems covering arithmetic, algebra, and multi-step reasoning
- Automatic answer extraction from multiple formats (GSM8K, LaTeX, natural language)
- Configurable tolerance for numeric comparison
- Chain-of-thought prompting support
- Comprehensive documentation in `docs/benchmarks.md`
- Docker Environment Robustness (#337, #339, #342, #341): Multiple improvements to Docker-based evaluation
- Git installation: Automatically install git in containers that don't have it (#337)
- Fallback images: Handle missing pre-built images gracefully with build-from-scratch fallback (#339)
- Retry logic: Retry Docker operations on transient failures with exponential backoff (#342)
- Cleanup: Ensure temp files are cleaned up on all exit paths (#341)
- Error Handling (#343, #344): Improved error messages and resilience
- Parse errors: Better handling of malformed JSON responses from agents (#343)
- Error messages: More detailed error context including log paths and suggestions (#344)
- Health Check Tests: Fixed 3 failing tests by using `sys.executable` instead of `python` command
- Issue: `python` command doesn't exist on many modern systems (only `python3`)
- Tests now use the actual Python interpreter being used by pytest
- Comparison Mode Integration: Fixed state tracker and preflight checks for comparison mode
- State tracker now computes correct config hash when using mcp_server_a/b
- Preflight checks now validate both servers in comparison mode
- Fixes AttributeError when running side-by-side evaluations
- GSM8K Configuration: Added gsm8k to VALID_BENCHMARKS list
- Benchmark validation now accepts gsm8k as a valid benchmark option
- Updated field description to include gsm8k and humaneval
- GSM8K Compatibility: Added filter parameters to GSM8K load_tasks signature
- Added filter_difficulty, filter_category, filter_tags for compatibility
- Parameters unused for GSM8K but needed for benchmark interface consistency
- Fixes TypeError when running GSM8K evaluations
- Refactored release workflow to auto-bump version after GitHub UI release
- Removed automated release creation workflow (releases now created manually in GitHub UI)
- Restored Release Drafter for auto-generating draft releases from PRs
- Added post-release workflow that automatically bumps patch version on main after release publish
0.4.2 - 2026-01-29
- Added npm package links throughout documentation and README
- Added thinking_budget configuration documentation with examples and validation notes
- Fixed incorrect CLI override documentation (thinking_budget is YAML-only, no CLI flag exists)
- Added npm badge and installation section to docs
0.4.1 - 2026-01-29
- HumanEval Benchmark (#304): Code generation benchmark support
- Evaluate LLM code generation capabilities on HumanEval dataset
- Test function implementations against unit test suites
- Automatic solution extraction and evaluation
- Extended Thinking Mode (#332): Support for Claude's extended thinking
- `thinking_budget` configuration option (1024-31999 tokens)
- Enables deeper reasoning for complex tasks
- Validation with clear error messages for invalid values
- Environment variable injection for both Docker and local execution
- Performance Profiling (#331): Comprehensive profiling infrastructure
- Track tool call latencies with percentiles (p50, p95, p99)
- Memory usage monitoring (peak and average RSS/VMS)
- Infrastructure overhead measurement (Docker, MCP startup)
- Automated insights generation from profiling data
- Enable with `--profile` flag or `enable_profiling` config
- Benchmark Filtering (#305): Filter tasks by difficulty, category, and tags
- `filter_difficulty` - Filter by task difficulty level
- `filter_category` - Filter by task category
- `filter_tags` - Filter by custom tags (all must match)
- CLI flags: `--filter-difficulty`, `--filter-category`, `--filter-tags`
- Runtime Tracking (#326): Track task execution duration
- Records total runtime for each task evaluation
- Helps identify performance bottlenecks
- Included in profiling reports
- HumanEval Git Detection (#335): Fixed git diff not detecting newly created files
- Added fallback to unfiltered git diff when filtered diff is empty
- Resolves issue where `solution.py` wasn't detected in HumanEval tasks
- Applied same pattern to both Docker and local execution paths
- Cost Calculation (#330): Use total_cost_usd from API to include cache tokens
- Accurate cost tracking including prompt caching discounts
- Prevents underreporting of actual API costs
0.4.0 - 2026-01-28
- MCP Server Log Capture (#287): Comprehensive logging for MCP server debugging
- Automatic capture of MCP server stdout/stderr to `~/.mcpbr_state/logs/{instance_id}_mcp.log`
- MCP log path included in all error messages for easy access
- Real-time log writing during task execution for troubleshooting
- Proper cleanup in finally block to ensure logs are always saved
- Separate MCP Registration Phase (#285): Better error visibility for MCP initialization
- MCP server registration now separate from Claude CLI execution with its own 60s timeout
- Detailed logging showing exact MCP registration command and arguments
- Clear success/failure messages for MCP server startup
- Early detection of MCP server initialization issues
- MCP Timeout Configuration (#286): Configurable timeouts for MCP servers
- `startup_timeout_ms` field for MCP server startup (default: 60 seconds)
- `tool_timeout_ms` field for long-running MCP tool calls (default: 15 minutes)
- Environment variables passed to Claude CLI for timeout configuration
- Documented recommended timeouts for different server types
- Integration Tests (#286): Comprehensive test coverage for MCP logging features
- Tests for log file creation and cleanup
- Tests for error message formatting with log paths
- Tests for stdout/stderr capture in error messages
- Tests for timeout handling and temp file cleanup
- Improved Error Messages (#285, #286): Much more detailed MCP-related error reporting
- Registration failures now show exact command, exit code, and stderr
- MCP stdout included in error messages for better diagnostics
- Timeouts distinguish between init timeout vs execution timeout
- 0-iteration failures include hint to check MCP server logs
- All MCP errors include log file path for debugging
- Better Resource Cleanup (#286): Temp files cleaned up on all exit paths
- API keys in env files cleaned up before early returns
- Cleanup happens on registration failure, timeout, and normal exit
- Cleanup errors logged when verbose mode enabled
- Documentation (#286): Added comprehensive MCP debugging section
- MCP server log location and access instructions
- MCP tool timeout configuration examples
- Recommended timeout values by server type
- Registration failure troubleshooting steps
- Examples of new error message formats
- `startup_timeout_ms` and `tool_timeout_ms` configuration examples
- Shell Injection Prevention (#286): All MCP commands properly escaped
- All MCP command arguments now quoted with `shlex.quote()`
- Environment variable values safely escaped
- Handles spaces and special characters correctly in paths
- Path traversal prevention in MCP log filenames
- Resource Leaks (#286): Fixed temp file leaks on MCP registration failure
- Temp files containing `ANTHROPIC_API_KEY` now cleaned up before early returns
- Fixed cleanup bypass in registration failure and timeout paths
- Unused Variables (#286): Fixed RUF059 lint warnings
- `mcp_stdout` now used in error messages instead of discarded
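
To illustrate the `shlex.quote()` approach from the Shell Injection Prevention entry above, the sketch below quotes each argument and environment value separately before joining them into a command string. `build_command`, the server name, and the sample values are hypothetical; the sketch also assumes environment variable names are already valid identifiers.

```python
import shlex


def build_command(executable, args, env=None):
    """Build a shell-safe command string; every value is quoted individually."""
    # Assumption: env names are plain identifiers, so only values need quoting.
    env_prefix = [f"{name}={shlex.quote(value)}" for name, value in (env or {}).items()]
    quoted = [shlex.quote(executable), *(shlex.quote(arg) for arg in args)]
    return " ".join(env_prefix + quoted)


cmd = build_command(
    "my-mcp-server",  # hypothetical server executable
    ["--root", "/tmp/my project", "--name", "x; rm -rf /"],
    env={"API_TOKEN": "abc def"},
)
```

Splitting the result with `shlex.split` returns the original values unchanged, which shows that the space in the path and the `;` in the name are treated as data, not shell syntax.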
0.3.29 - 2026-01-26
- XML Output Format (#301): Convert results to XML format with `mcpbr export xml` (closes #13)
- XML export for results data
- Easy integration with XML-based tools
- One-Liner Install (#302): Simplified installation with auto-init
- `curl -sSL https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash`
- Automatically installs and runs quick test
- Improved onboarding experience
- Expanded Claude Code plugin installation options with multiple methods (#275)
0.3.28 - 2026-01-26
- Default Logging (#295): Detailed execution logs enabled by default
- Logs automatically saved to `output_dir/logs/` to prevent data loss
- Critical for expensive evaluation runs ($50+)
- Add `--disable-logs` or `disable_logs: true` to opt-out
- Consolidated Output Directory (#296): All outputs in single timestamped directory
- Default: `.mcpbr_run_YYYYMMDD_HHMMSS/`
- Contains: config.yaml, evaluation_state.json, logs/, README.txt
- Easy archiving: `tar -czf results.tar.gz .mcpbr_run_*`
- Customizable with `--output-dir` flag or `output_dir` config option
- Fixed timeout statistics calculation bug
0.3.27 - 2026-01-24
- TypeError in Log Formatter (#294): Fixed Read tool offset/limit parameter handling
- Changed defaults from empty string to None
- Added proper type conversion before arithmetic operations
- Resolves intermittent failures affecting ~5-7% of SWE-bench tasks
- Added comprehensive regression tests
0.3.26 - 2026-01-23
- MCP Observability (#292): Enhanced logging and error reporting for MCP servers
- Better visibility into MCP server behavior
- Improved error messages for debugging
- MCP Pre-Flight Health Checks (#293): Validate MCP server before evaluation
- Early detection of MCP server issues
- Prevent wasted evaluation time on misconfigured servers
0.3.25 - 2026-01-23
- Testing automated release workflow
0.3.24 - 2026-01-23
- Automated Release Workflow: One-command releases via GitHub Actions
- Command: `gh workflow run release.yml -f version_bump=patch`
- Automatic version syncing across all package files
- Integrated with PyPI and npm publication
- New Release Guide and AI Agent Guide
- Docker TypeError (#290): Convert instance_id to string in Docker labels
- Prevents TypeError when instance_id is not a string
- Improves Docker label handling robustness
- Added release documentation links to README
0.3.23 - 2026-01-23
- Comprehensive MCP Initialization Logging (#286): Better error handling and logging
- Detailed logging for MCP server initialization
- Enhanced error messages for troubleshooting
- Improved timeout handling
- Fixed version sync across all package files
0.3.22 - 2026-01-22
- npm package now correctly includes README.md (removed NPM.md experiment)
- Previous attempt to use separate NPM.md did not work
- Package now displays main README on npm registry
0.3.21 - 2026-01-22
- npm Package Documentation: Attempted to use NPM.md for npm-specific docs
- Configured npm package to display NPM.md on registry
- Added npm installation instructions to README quick-start
- Added dedicated npm installation section before pip instructions
0.3.20 - 2026-01-22
- npm Publishing: Updated unscoped CLI package name to `mcpbr-cli`
- Unscoped CLI package is now `mcpbr-cli` (command is still `mcpbr`)
- Scoped package remains `@greynewell/mcpbr`
- Plugin packages remain `mcpbr-claude-plugin` (unscoped) and `@greynewell/mcpbr-claude-plugin` (scoped)
- Updated documentation to reflect new package names
- All packages provide the same CLI command: `mcpbr`
0.3.19 - 2026-01-22
- Documentation: Updated all documentation to prefer unscoped npm package names
- Primary installation method now uses `mcpbr-cli` instead of `@greynewell/mcpbr`
- Plugin installation now uses `mcpbr-claude-plugin` instead of `@greynewell/mcpbr-claude-plugin`
- Scoped versions still available and mentioned as alternatives
- Updated NPM.md, .claude-plugin/README.md with new package references
0.3.18 - 2026-01-22
- Plugin Marketplace: Added `marketplace.json` for Claude Code plugin manager installation (closes #266)
- Proper marketplace manifest with owner information
- GitHub source configuration for plugin distribution
- Automatic installation via Claude Code plugin manager
- Plugin Documentation: Comprehensive documentation for the Claude Code plugin (closes #273)
- Installation guide with multiple methods
- Complete skills reference for all three skills
- Architecture explanation and execution flow
- Troubleshooting section with common issues
- FAQ covering plugin usage and development
- Plugin Validation: Automated validation of plugin manifests in CI
- Python-based validation script for plugin.json and marketplace.json
- Checks required fields, structure, and schema compliance
- Runs on every PR to prevent breaking changes
- Community Guidelines: Added Code of Conduct for project contributors
- Contributor Covenant code of conduct
- Guidelines for respectful collaboration
- Version Sync: Enhanced version sync script to include marketplace.json
- Syncs version across plugin.json, marketplace.json, and package.json files
- Updates both root and nested plugin versions
- Ensures consistency across all distribution channels
- CI/CD: Updated CI to validate all version files are synced
- Checks plugin.json, marketplace.json, package.json versions
- Prevents version drift across npm packages
- npm Publishing: First official npm package releases
- Scoped packages: `@greynewell/mcpbr` (CLI wrapper for cross-platform installation) and `@greynewell/mcpbr-claude-plugin` (Claude Code plugin package)
- Unscoped packages (same content, easier names): `mcpbr` (CLI wrapper for cross-platform installation) and `mcpbr-claude-plugin` (Claude Code plugin package)
- Automated publishing workflow on GitHub releases
- Cross-publishes to both scoped and unscoped names for maximum discoverability
0.3.17 - 2026-01-22
- Claude Code Plugin: Added specialized Claude Code plugin with custom skills for enhanced MCP server evaluation
- Custom skills tailored for benchmarking workflows (`benchmark-swe-lite`, `mcpbr-config`, `mcpbr-eval`)
- Integration with Claude Code CLI for improved agent capabilities
- Published as npm package `@greynewell/mcpbr-claude-plugin`
- npm CLI Wrapper: Added Node.js wrapper (`bin/mcpbr.js`) for cross-platform npm installation
- Published as `@greynewell/mcpbr` on npm
- Automatic detection and invocation of Python CLI
- Supports npx usage: `npx @greynewell/mcpbr`
- Improved pre-commit configuration for consistent linting in CI
- Addressed code quality issues from CodeRabbit review
- Added `__main__.py` to enable `python -m mcpbr` execution
- Fixed `--version` flag handling at CLI group level
0.3.16 - 2026-01-22
- MCP Docker context preservation: Use `su` without login flag to preserve project context for MCP servers
0.3.15 - 2026-01-21
- CSV Export: New `mcpbr export csv` command to convert results to CSV format (closes #11)
- Export evaluation results with customizable columns
- Supports filtering by status (pass/fail/error)
- Easy integration with spreadsheet tools
0.3.14 - 2026-01-21
- Streaming Results Support: Real-time streaming of evaluation results to SSE endpoint (closes #259)
- Enable with `--stream-to` flag or `stream_to` config option
- Immediate visibility into evaluation progress
- Compatible with web dashboards and monitoring tools
0.3.13 - 2026-01-21
- Incremental Evaluation: Only re-run failed tasks with `--incremental` flag (closes #258)
- Automatically resume from previous failed evaluations
- Skip already passing tasks to save time and cost
- Useful for iterating on fixes or retrying flaky tests
0.3.12 - 2026-01-20
- Configuration Validation Schema: Comprehensive JSON schema validation for config files (closes #98)
- Validate config files with `mcpbr config validate`
- Clear error messages for invalid configurations
- IDE autocompletion support via schema
0.3.11 - 2026-01-20
- Benchmark Result Caching: Cache evaluation results to avoid redundant runs (closes #255)
- Automatic caching of task evaluations
- Significant speed improvements for repeated runs
- Cache invalidation based on configuration changes
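
One way to get cache invalidation on configuration changes is to fold the configuration into the cache key. The sketch below is illustrative only: the `cache_key` helper and the config fields shown are hypothetical, not mcpbr's actual implementation.

```python
import hashlib
import json
from typing import Any


def cache_key(task_id: str, config: dict[str, Any]) -> str:
    """Derive a stable key from a task ID and the config used to run it.

    Hypothetical helper: serialising with sorted keys makes the key
    independent of dict ordering, so only a real config change alters it.
    """
    payload = json.dumps(
        {"task": task_id, "config": config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same settings in a different order map to the same key.
key_a = cache_key("task-1", {"model": "m1", "timeout": 300})
key_b = cache_key("task-1", {"timeout": 300, "model": "m1"})

# Changing any setting produces a different key, so the old entry is never hit.
key_c = cache_key("task-1", {"model": "m1", "timeout": 600})
```

With this scheme stale entries are never read after a config change. They are not deleted either, so a real cache would still need an eviction policy.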
0.3.10 - 2026-01-19
- Config Inheritance: Support for `extends` field in configuration files (closes #106)
- Reuse common settings across multiple config files
- Override specific fields while inheriting base configuration
- Supports relative and absolute paths
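
A rough sketch of how an `extends` chain could be resolved. It assumes nested mappings are merged recursively and that relative paths are resolved against the extending file's directory. Neither detail is confirmed by this changelog, and `resolve_config`, `deep_merge`, and the `load` callback are hypothetical names.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return base with override applied; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(
    path: Path,
    load: Callable[[Path], dict[str, Any]],
    _seen: frozenset[Path] = frozenset(),
) -> dict[str, Any]:
    """Load a config and fold in its `extends` parent, if any.

    `load` parses one file into a dict (for example a YAML loader);
    it is injected so the sketch stays free of third-party imports.
    """
    if path in _seen:
        raise ValueError(f"circular extends chain at {path}")
    config = dict(load(path))
    parent_ref = config.pop("extends", None)
    if parent_ref is None:
        return config
    parent_path = Path(parent_ref)
    if not parent_path.is_absolute():
        # Assumption: relative paths are relative to the extending file.
        parent_path = path.parent / parent_path
    parent = resolve_config(parent_path, load, _seen | {path})
    return deep_merge(parent, config)


# In-memory stand-in for files on disk.
files = {
    Path("/cfg/base.yaml"): {"model": "a", "agent": {"timeout": 60, "retries": 2}},
    Path("/cfg/child.yaml"): {"extends": "base.yaml", "agent": {"timeout": 120}},
}
result = resolve_config(Path("/cfg/child.yaml"), files.__getitem__)
```

Here the child overrides `agent.timeout` and inherits everything else from the base file.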
0.3.9 - 2026-01-19
- Smoke Test Mode: New `--smoke-test` flag to validate setup without full evaluation (closes #253)
- Quickly verify environment configuration
- Test single task instance to ensure everything works
- Useful for CI/CD pipeline validation
- Incremental Saving: Automatically save results after each task completion (closes #252)
- Prevents data loss on crashes or interruptions
- Resume evaluations from last completed task
- Intermediate results available during long-running evaluations
0.3.8 - 2026-01-18
- Environment Variable Support: Use environment variables in config files with `${VAR}` syntax (closes #251)
- Reference secrets and dynamic values
- Example: `api_key: ${ANTHROPIC_API_KEY}`
- Supports default values: `${VAR:-default}`
- Documentation: Fixed broken links in FAQ
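
The `${VAR}` and `${VAR:-default}` forms from this release can be illustrated with a small expander. This is a sketch under assumptions, not mcpbr's actual code: the `expand_env` name is hypothetical, and the shell-like rule that an empty value also falls back to the default is assumed.

```python
from __future__ import annotations

import os
import re
from typing import Mapping

# Matches ${VAR} and ${VAR:-default}; the exact grammar accepted is an assumption.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text: str, env: Mapping[str, str] | None = None) -> str:
    """Replace ${VAR} / ${VAR:-default} references in a config string."""
    source = os.environ if env is None else env

    def _substitute(match: re.Match[str]) -> str:
        name, default = match.group(1), match.group(2)
        value = source.get(name)
        if value:
            return value  # set and non-empty
        if default is not None:
            return default  # unset or empty, but a default was given
        if value is None:
            raise KeyError(f"environment variable {name!r} is not set")
        return value  # set but empty, no default

    return _VAR_PATTERN.sub(_substitute, text)


expanded = expand_env(
    "api_key: ${ANTHROPIC_API_KEY}", {"ANTHROPIC_API_KEY": "sk-test"}
)
with_default = expand_env("model: ${MODEL:-default}", {})
```

Raising on an unset variable without a default is a design choice made for the sketch; it surfaces a missing secret at load time rather than sending an empty key to a provider.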
0.3.7 - 2026-01-18
- Best Practices Guide: Comprehensive guide for effective MCP server evaluation (closes #199)
- Configuration optimization tips
- Performance tuning recommendations
- Common pitfalls and solutions
0.3.6 - 2026-01-17
- Example Configurations Repository: Added collection of example config files (closes #200)
- Ready-to-use configurations for common scenarios
- Templates for different providers and harnesses
- Documented best practices
0.3.5 - 2026-01-17
- FAQ Documentation: Comprehensive frequently asked questions section (closes #204)
- Common issues and solutions
- Setup troubleshooting
- Performance optimization tips
0.3.3 - 2026-01-17
- Docker Cleanup Enhancements: Comprehensive Docker resource cleanup on errors (closes #194)
- Automatic cleanup of containers on failure
- Prevention of resource leaks
- Improved error handling
0.3.2 - 2026-01-17
- Tool Coverage Reporting: Analyze which MCP tools were used during evaluation (closes #225)
- Track tool usage frequency
- Identify unused tools
- Coverage metrics in results JSON
0.3.1 - 2026-01-17
- MCPToolBench++ Integration: Support for MCPToolBench++ benchmark (closes #223)
- Evaluate MCP servers on standardized tool usage tasks
- Integration with MCPToolBench++ dataset
- Comprehensive tool evaluation capabilities
0.3.0 - 2026-01-17
- Regression Detection: Automated detection of performance regressions (closes #65)
- Compare results against baseline
- Configurable regression thresholds
- Alert notifications for detected regressions
- YAML Export: Convert results to YAML format with `mcpbr export yaml` (closes #12)
- Human-readable output format
- Easy integration with other tools
- Enhanced Agent Documentation: Comprehensive AGENTS.md with guidelines for AI agents
- Expanded documentation with comprehensive guidelines
- Updated roadmap with v1.0 milestones
0.2.3 - 2026-01-17
- v1.0 Roadmap: Comprehensive roadmap section in README
- Planned features and improvements
- Timeline and milestones
- Improved MCP tool error handling and logging (closes #10)
- Better error messages for tool failures
- Enhanced debugging capabilities
0.2.2 - 2026-01-17
- Version bump for release consistency
0.2.1 - 2026-01-17
- Prevent dataset configuration override issues
- Ensure dataset settings are properly applied
- Fix configuration precedence
- Improved README with one-line install instructions
0.2.0 - 2026-01-17
- CyberGym Benchmark Support: Added support for CyberGym security vulnerability benchmark
- Agents generate Proof-of-Concept (PoC) exploits for real CVEs
- 4 difficulty levels (0-3) controlling context provided to agent
- Evaluation by crash detection with AddressSanitizer
- Automatic C/C++ build environment setup with sanitizers
- Benchmark Abstraction Layer: New flexible architecture for multiple benchmarks
- Protocol-based `Benchmark` interface with standardized methods
- Easy to add new benchmarks by implementing the protocol
- `BenchmarkTask` and `EvaluationSpec` for normalized task representation
- Benchmark Registry: Factory pattern for creating benchmark instances
- `create_benchmark()` function for dynamic benchmark selection
- `list_benchmarks()` to see available benchmarks
- New CLI Commands:
- `mcpbr benchmarks` - List available benchmarks with descriptions
- New CLI Options:
- `--benchmark` / `-b` - Select benchmark (swe-bench or cybergym)
- `--level` - Set CyberGym difficulty level (0-3)
- Configuration Fields:
- `benchmark`: Select which benchmark to run (default: "swe-bench")
- `cybergym_level`: Set CyberGym difficulty (default: 1)
- `dataset`: Now optional, benchmark provides default
- Benchmark-Specific Agent Prompts: Different prompts for bug fixing vs exploit generation
- Comprehensive Documentation:
- New Benchmarks Guide page
- Updated README with CyberGym information
- Updated configuration and CLI documentation
- Architecture documentation updated for benchmark abstraction
- Test Coverage: 20 new tests for benchmark implementations and protocol compliance
- Refactored SWE-bench logic into `benchmarks/swebench.py` (no behavior change)
- Harness orchestrator now uses benchmark abstraction instead of direct dataset loading
- Agent prompts are now provided by benchmarks (can still be overridden in config)
- Dataset field in config is now optional (benchmarks provide defaults)
- Updated example configurations to include benchmark options
- Fully backward compatible - all existing SWE-bench workflows continue to work unchanged
- Default benchmark is still SWE-bench
- All existing configuration files work without modification
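
To illustrate the protocol-plus-registry shape described for 0.2.0, here is a minimal sketch. The names `Benchmark`, `BenchmarkTask`, `create_benchmark()`, and `list_benchmarks()` come from the entries above. The fields, the `load_tasks` method, the `ToyBenchmark` class, and the registry dict are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class BenchmarkTask:
    """Normalized task record; these two fields are illustrative."""

    task_id: str
    problem_statement: str


@runtime_checkable
class Benchmark(Protocol):
    """Structural interface: any class with these members qualifies."""

    name: str

    def load_tasks(self) -> list[BenchmarkTask]: ...


class ToyBenchmark:
    """Hypothetical benchmark; note it does not inherit from Benchmark."""

    name = "toy"

    def load_tasks(self) -> list[BenchmarkTask]:
        return [BenchmarkTask("toy-1", "Fix the failing test")]


# Hypothetical registry mapping benchmark names to classes.
_REGISTRY: dict[str, type] = {"toy": ToyBenchmark}


def list_benchmarks() -> list[str]:
    """Return the registered benchmark names in sorted order."""
    return sorted(_REGISTRY)


def create_benchmark(name: str) -> Benchmark:
    """Instantiate a registered benchmark by name."""
    try:
        benchmark_cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"unknown benchmark {name!r}; available: {list_benchmarks()}"
        ) from None
    return benchmark_cls()


bench = create_benchmark("toy")
tasks = bench.load_tasks()
```

Because the interface is a `Protocol`, adding a benchmark in this sketch means writing a class with the right members and registering it; no base class is required.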
0.1.1 - 2026-01-18
- Fixed README image display on mirrors such as PyPI
0.1.0 - 2026-01-17
- Initial release of mcpbr (Model Context Protocol Benchmark Runner)
- Support for multiple LLM providers:
- OpenRouter (multi-provider gateway)
- OpenAI (direct API)
- Anthropic (direct API)
- Google AI / Gemini (direct API)
- Agent harness implementations:
- Claude Code CLI with MCP server support
- OpenAI Codex CLI
- OpenCode CLI
- Gemini CLI
- Docker-based task isolation for SWE-bench evaluation
- Baseline agent for comparison (no tools)
- JSON and Markdown report generation
- Configurable evaluation parameters via YAML
- CLI commands: `run`, `init`, `models`, `harnesses`, `providers`
- Real-time streaming output when using `--verbose` flag with Claude Code harness
- Tool usage tracking: counts tool calls by name and includes breakdown in results JSON
- Each streamed output line is prefixed with task instance ID for parallel worker disambiguation
- In-container agent execution: The Claude Code CLI now runs inside the Docker container where all dependencies are installed. This ensures Python imports work correctly (e.g., `from astropy import ...`) and the agent can verify fixes by running tests.
- Pre-built SWE-bench Docker images from Epoch AI's registry (`ghcr.io/epoch-research/swe-bench.eval`) are now used when available, providing:
- Repository at the correct commit with all dependencies pre-installed
- Consistent, reproducible evaluation environments
- x86_64 images with automatic emulation on ARM64 (Apple Silicon) Macs
- Timestamped log filenames to prevent overwrites: `{instance_id}_{run_type}_{YYYYMMDD_HHMMSS}.json`
- `--no-prebuilt` CLI flag to disable pre-built images and build from scratch
- Network access for containers to enable API calls from within Docker
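
The timestamped log filename pattern `{instance_id}_{run_type}_{YYYYMMDD_HHMMSS}.json` can be reproduced with `strftime`. The helper name `log_filename` and the example `run_type` value are assumptions for illustration.

```python
from __future__ import annotations

from datetime import datetime


def log_filename(
    instance_id: str, run_type: str, now: datetime | None = None
) -> str:
    """Build '{instance_id}_{run_type}_{YYYYMMDD_HHMMSS}.json'.

    `now` is injectable so the result is deterministic in tests.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{instance_id}_{run_type}_{stamp}.json"


# "mcp" as a run type is an assumed example value.
name = log_filename(
    "astropy__astropy-12907", "mcp", datetime(2026, 1, 17, 9, 5, 3)
)
```

Second-level resolution means two runs of the same instance and run type within the same second would still collide.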