
fix: dead-code benchmark correctness + binary asset filtering#5

Merged
greynewell merged 4 commits into main from feat/supermodel-dead-code-fixes
Mar 25, 2026

Conversation

@greynewell
Contributor

@greynewell greynewell commented Mar 25, 2026

Summary

Three fixes to the Supermodel dead-code benchmark discovered during a full 18-task benchmark run against real GitHub PRs.

  • _run_mcp_evaluation was sending the wrong prompt to the agent. create_environment() updated a local copy of the task with problem_statement_enhanced, but agent.solve() received the original task with the baseline prompt. The agent was doing full codebase exploration instead of running the pre-computed filter script — causing token limit errors, path-traversal errors, and 30+ minute timeouts on large repos.

  • API failure left analysis file missing. When the Supermodel API fails (network error, expired key, server timeout), no analysis file was written. With enhanced_prompt_v2, the agent then errored with FileNotFoundError. Now writes an empty analysis file so the agent completes cleanly with 0 results instead of crashing.

  • Binary assets inflate zip size 9×. Repos like cal.com track images, videos, and fonts in git. git archive includes everything, producing a 144 MB upload. Added BINARY_EXCLUDE_PATTERNS applied to every zip — drops cal.com from 144 MB → 16 MB. Post-processes the zip using Python's zipfile module since git archive has no native exclude support.

Test plan

  • Run a trial against a task that uses enhanced_prompt_v2 — confirm agent runs the Python filter script in 2 Bash calls instead of exploring the codebase
  • Simulate an API failure — confirm empty analysis file is written and agent completes without error
  • Archive cal.com repo — confirm zip size is ~16 MB instead of ~144 MB
  • Verify that *.png, *.mp4, *.woff2, etc. are absent from the zip

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Configurable option to suppress the MCP guidance suffix when running evaluations.
    • Repository zips now automatically exclude binary files.
  • Bug Fixes

    • Analysis runs now produce a safe, empty analysis artifact when generation fails.
  • Enhanced Functionality

    • Dead-code reporting simplified to a single consolidated analysis file and a streamlined verification/report output.

greynewell and others added 3 commits March 24, 2026 20:13
_run_mcp_evaluation was passing the original task (with baseline
problem_statement) to agent.solve, even though create_environment
had already switched to problem_statement_enhanced locally.

The fix creates task_for_agent with the enhanced prompt before calling
agent.solve, ensuring the agent receives enhanced_prompt_v2 (run the
pre-computed analysis script) instead of the baseline prompt (explore
the codebase).

Also adds suppress_mcp_suffix mechanism so SupermodelBenchmark can opt
out of MCP_PROMPT_SUFFIX (which conflicted with enhanced_prompt_v2's
"do not grep the codebase" instruction), and rewrites enhanced_prompt_v2
to instruct the agent to run a Python script rather than read the large
analysis file directly into context.

Fixes: path-outside-allowed-directory errors (logto, n8n), token-limit
errors (latitude).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the API call fails (expired key, network error, etc.), the analysis
file was never written. This caused enhanced_prompt_v2 to error out
(FileNotFoundError in the Python script), making the task show as ERROR
instead of FAIL with 0% recall.

Now an empty analysis file is written on failure so the agent can
complete its task cleanly, producing an empty REPORT.json and a clear
0% recall result.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds BINARY_EXCLUDE_PATTERNS (images, video, fonts, archives, audio)
applied to every zip regardless of per-task config. For git-archive
repos this rewrites the zip with zipfile to strip matching entries
since git archive has no native exclude support.

For cal.com this drops the upload from ~144 MB to ~16 MB.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
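Since git archive has no native exclude support, the zip is rewritten after the fact with the stdlib zipfile module. A minimal sketch of that post-processing step (BINARY_EXCLUDE_PATTERNS is named in the PR but only a subset of its patterns is shown here; the function body is illustrative, and note the review below points out that basename-only matching misses path patterns):

```python
# Sketch of stripping binary entries from a git-archive zip.
import fnmatch
import os
import zipfile

# Subset of the PR's pattern list (images, video, fonts, archives, audio).
BINARY_EXCLUDE_PATTERNS = ["*.png", "*.jpg", "*.mp4", "*.woff2", "*.zip", "*.mp3"]


def filter_zip(src: str, dst: str, patterns: list[str]) -> int:
    """Copy src to dst, dropping entries whose basename matches any glob.

    Returns the number of removed entries.
    """
    removed = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            basename = os.path.basename(item.filename)
            if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
                removed += 1
                continue
            zout.writestr(item, zin.read(item.filename))
    return removed
```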
@coderabbitai

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5c289ad5-224d-45e9-b326-a5e43b9a5099

📥 Commits

Reviewing files that changed from the base of the PR and between bd4fae5 and c3db97b.

📒 Files selected for processing (1)
  • src/mcpbr/benchmarks/supermodel/benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mcpbr/benchmarks/supermodel/benchmark.py

Walkthrough

Added a per-benchmark flag to suppress appending a generic MCP prompt suffix, consolidated dead-code analysis into a single-file script-driven prompt, and added zip post-processing to exclude binary basenames when creating repository archives; the suppress flag is threaded through harness creation and agent execution.

Changes

Cohort / File(s): Summary

  • MCP suffix wiring (src/mcpbr/benchmarks/supermodel/benchmark.py, src/mcpbr/harness.py, src/mcpbr/harnesses.py)
    Added suppress_mcp_suffix = True on the Supermodel benchmark; harness creation now reads suppress_mcp_suffix from a benchmark and forwards it into create_harness; ClaudeCodeHarness/prompt construction conditionally omits MCP_PROMPT_SUFFIX when the flag is true.
  • Dead-code endpoint prompt (src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py)
    Rewrote DeadCodePlugin.enhanced_prompt_v2 to use a single self-contained Python script that reads supermodel_dead_code_analysis.json, builds an entry-point whitelist, filters deadCodeCandidates (skipping entry points and Type/interface reasons), and writes REPORT.json; removed the earlier chunk-file aggregation and analysis-quality checks.
  • Repository zipping & filtering (src/mcpbr/benchmarks/supermodel/git_utils.py)
    Introduced BINARY_EXCLUDE_PATTERNS, merged exclude patterns into zip_repo, added exclude_patterns support to the git-archive path, and implemented _filter_zip_entries to stream-copy a zip and remove entries whose basenames match glob patterns, logging the removed count and safely replacing the original archive.
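The script-driven filter the walkthrough describes can be sketched roughly as follows. Field names (deadCodeCandidates, entryPoints, reason) follow the prompt excerpt quoted later in this review; the exact matching logic, and the assumption that entryPoints is a list of symbol names, are illustrative:

```python
# Rough sketch of the dead-code filter script; not the actual prompt script.
import json


def filter_candidates(analysis: dict) -> list[dict]:
    """Filter deadCodeCandidates against the entry-point whitelist."""
    entry_points = set(analysis.get("entryPoints", []))  # assumed: symbol names
    dead_code = []
    for c in analysis.get("deadCodeCandidates", []):
        if c.get("symbol") in entry_points:
            continue  # confirmed alive: treat as a false positive
        if c.get("reason", "").lower().startswith(("type", "interface")):
            continue  # skip Type/interface reasons, per the walkthrough
        dead_code.append(c)
    return dead_code


def write_report(analysis_path: str, report_path: str = "REPORT.json") -> None:
    """Read the analysis file and write the filtered result to REPORT.json."""
    with open(analysis_path) as f:
        analysis = json.load(f)
    with open(report_path, "w") as f:
        json.dump({"dead_code": filter_candidates(analysis)}, f)
```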

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

Quiet flag flips off the suffix light,
One JSON holds the dead-code sight,
Zip gets a scrub, basenames take flight,
Bench, harness, archive—tidy and tight. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly summarizes the main changes (fixes to dead-code benchmark correctness and binary asset filtering), which align with the core objectives of the PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 83.33%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/mcpbr/harnesses.py (1)

1377-1390: Update create_harness docstring to include suppress_mcp_suffix.

The new public parameter is wired correctly but undocumented in the Args section, which makes the factory API easy to misuse.

📝 Suggested docstring patch
     Args:
         harness_name: Name of the harness (currently only 'claude-code').
         model: Optional model override.
         mcp_server: MCP server configuration (used by claude-code harness).
         prompt: Custom prompt template. Use {problem_statement} placeholder.
         max_iterations: Maximum agent iterations (used by claude-code harness).
         verbosity: Verbosity level for logging (0=silent, 1=summary, 2=detailed).
         log_file: Optional file handle for writing raw JSON logs.
         mcp_logs_dir: Directory for MCP server logs.
         thinking_budget: Extended thinking token budget. Set to enable thinking mode.
+        suppress_mcp_suffix: If True, skip appending MCP_PROMPT_SUFFIX when MCP is active.

As per coding guidelines, "Add docstrings to all public functions, classes, and modules using Google-style docstrings".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/harnesses.py` around lines 1377 - 1390, The create_harness
docstring is missing documentation for the new public parameter
suppress_mcp_suffix; update the Args section of the create_harness docstring to
add an entry for suppress_mcp_suffix (bool, default False) describing that when
True the harness will not append the MCP-specific suffix to generated harness
names/IDs (affects MCP-related naming/behavior), and mention its effect in the
return/context if relevant so callers understand how this flag changes naming
and logging for MCP-enabled harnesses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccc1a7bf-8921-4670-a8ef-5c8cc65f6636

📥 Commits

Reviewing files that changed from the base of the PR and between 6c76da8 and bd4fae5.

📒 Files selected for processing (5)
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/benchmarks/supermodel/git_utils.py
  • src/mcpbr/harness.py
  • src/mcpbr/harnesses.py

Comment on lines +126 to 129
The file `supermodel_dead_code_analysis.json` in your working directory contains:
- `metadataSummary`: totalCandidates, rootFilesCount, reasonBreakdown, confidenceBreakdown
- `chunkFiles`: list of chunk files with candidate details
- `deadCodeCandidates`: all candidates (may be large — do NOT read the whole file manually)
- `entryPoints`: symbols confirmed alive — any candidate matching an entry point is a false positive

⚠️ Potential issue | 🔴 Critical

Prompt/schema mismatch will zero out results on normal runs.

This script reads only analysis["deadCodeCandidates"], but the generated supermodel_dead_code_analysis.json index is chunked (chunkFiles) in src/mcpbr/benchmarks/supermodel/benchmark.py (Line 514-518). Result: script writes empty dead_code despite real candidates existing in chunk files.

🔧 Suggested prompt-script update (support both schemas)
-# Filter candidates
-dead_code = []
-for c in analysis.get("deadCodeCandidates", []):
+# Collect candidates (direct list or chunked index format)
+candidates = analysis.get("deadCodeCandidates")
+if candidates is None:
+    candidates = []
+    for ref in analysis.get("chunkFiles", []):
+        chunk_file = ref.get("file")
+        if not chunk_file:
+            continue
+        with open(chunk_file) as cf:
+            chunk_json = json.load(cf)
+        candidates.extend(chunk_json.get("deadCodeCandidates", []))
+
+# Filter candidates
+dead_code = []
+for c in candidates:

Also applies to: 147-148, 171-174

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 126 -
129, The script only reads analysis["deadCodeCandidates"] and misses chunked
schemas; update the load logic in dead_code.py to support both formats by
checking if analysis contains "chunkFiles" and, if so, iterating over and
loading each referenced chunk file to aggregate candidates, otherwise fall back
to analysis["deadCodeCandidates"]; preserve existing filtering against
analysis["entryPoints"] (so entryPoints still remove false positives) and ensure
the final dead_code output is written from the aggregated candidate list.

Comment on lines +199 to +203
for item in zin.infolist():
basename = os.path.basename(item.filename)
if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
removed += 1
continue

⚠️ Potential issue | 🟠 Major

Path-based excludes are silently ignored in git-archive filtering.

_filter_zip_entries only matches os.path.basename(item.filename), so excludes like loc/* / lib/* (documented in zip_repo) never match. This makes caller zip_exclude ineffective on the git-archive path.

💡 Proposed fix
 def _filter_zip_entries(zip_path: str, patterns: list[str]) -> None:
     """Rewrite zip in-place, removing entries whose basename matches any glob pattern."""
@@
             for item in zin.infolist():
-                basename = os.path.basename(item.filename)
-                if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
+                rel_path = item.filename.lstrip("./")
+                basename = os.path.basename(rel_path)
+                if any(
+                    fnmatch.fnmatch(rel_path, pat) or fnmatch.fnmatch(basename, pat)
+                    for pat in patterns
+                ):
                     removed += 1
                     continue
                 zout.writestr(item, zin.read(item.filename))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 199 - 203,
_filter_zip_entries currently matches only os.path.basename(item.filename), so
path-based patterns like "loc/*" or "lib/*" never match; update the filtering to
match against the full (normalized) entry path instead of basename (e.g., use
the archive entry path from item.filename after normalizing separators) and
apply fnmatch.fnmatch to that full path so zip_exclude/zip_repo path patterns
work; ensure pattern normalization/leading "./" differences are handled
consistently with how zip_exclude patterns are specified.

… file

Chunking was the root cause of multiple failures:
- Missing cross-chunk import edges caused GT items to appear "alive"
- Agents reading incomplete chunk sets got degraded recall
- Multiple files complicated the write-tracking session issue

All candidates now go directly into supermodel_dead_code_analysis.json
under a top-level deadCodeCandidates array. Prompt updated to match.
chunk_and_analyze.py and the cached chunked analysis files are deleted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
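Collapsing the chunked layout into the single-file format this commit describes can be sketched as a one-shot consolidation. The chunk-file reference shape ({"file": ...}) is taken from the review diff above; everything else here is an illustrative assumption:

```python
# Illustrative consolidation of the old chunked index into a single file
# with a top-level deadCodeCandidates array; not the actual migration code.
import json
from pathlib import Path


def consolidate(index_path: Path) -> dict:
    """Merge all chunk-file candidates into a single deadCodeCandidates list."""
    index = json.loads(index_path.read_text())
    candidates = list(index.get("deadCodeCandidates", []))
    for ref in index.get("chunkFiles", []):  # old chunked schema
        chunk = json.loads((index_path.parent / ref["file"]).read_text())
        candidates.extend(chunk.get("deadCodeCandidates", []))
    index["deadCodeCandidates"] = candidates
    index.pop("chunkFiles", None)  # single-file format: no chunk references
    return index
```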