fix: dead-code benchmark correctness + binary asset filtering #5
greynewell merged 4 commits into main
Conversation
`_run_mcp_evaluation` was passing the original task (with the baseline `problem_statement`) to `agent.solve`, even though `create_environment` had already switched to `problem_statement_enhanced` locally. The fix creates `task_for_agent` with the enhanced prompt before calling `agent.solve`, ensuring the agent receives `enhanced_prompt_v2` (run the pre-computed analysis script) instead of the baseline prompt (explore the codebase).

Also adds a `suppress_mcp_suffix` mechanism so `SupermodelBenchmark` can opt out of `MCP_PROMPT_SUFFIX` (which conflicted with `enhanced_prompt_v2`'s "do not grep the codebase" instruction), and rewrites `enhanced_prompt_v2` to instruct the agent to run a Python script rather than read the large analysis file directly into context.

Fixes: path-outside-allowed-directory errors (logto, n8n), token-limit errors (latitude).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the API call fails (expired key, network error, etc.), the analysis file was never written. This caused `enhanced_prompt_v2` to error out (`FileNotFoundError` in the Python script), making the task show as ERROR instead of FAIL with 0% recall. Now an empty analysis file is written on failure so the agent can complete its task cleanly, producing an empty `REPORT.json` and a clear 0% recall result.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
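The failure-path behavior this commit describes can be sketched as follows. This is an illustrative sketch, not the actual mcpbr code: the `call_supermodel_api` helper and the exact file layout are assumptions.

```python
import json
from pathlib import Path

# A valid-but-empty analysis, so the agent's filter script finds zero
# candidates and the task records FAIL (0% recall) rather than ERROR.
EMPTY_ANALYSIS = {"deadCodeCandidates": [], "entryPoints": []}

def write_analysis(repo_dir: str) -> Path:
    """Write the analysis file, falling back to an empty one on API failure.

    `call_supermodel_api` is a hypothetical stand-in for the real API call;
    it may raise on network errors, expired keys, or server timeouts.
    """
    out = Path(repo_dir) / "supermodel_dead_code_analysis.json"
    try:
        analysis = call_supermodel_api(repo_dir)
    except Exception:
        # Any failure still leaves a readable file behind, so the
        # downstream Python script never hits FileNotFoundError.
        analysis = EMPTY_ANALYSIS
    out.write_text(json.dumps(analysis))
    return out
```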
Adds `BINARY_EXCLUDE_PATTERNS` (images, video, fonts, archives, audio), applied to every zip regardless of per-task config. For git-archive repos this rewrites the zip with `zipfile` to strip matching entries, since `git archive` has no native exclude support. For cal.com this drops the upload from ~144 MB to ~16 MB.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
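The post-processing step this commit describes can be sketched with the standard-library `zipfile` and `fnmatch` modules. The pattern list below is an illustrative subset, and `filter_zip` is a hypothetical name rather than the actual mcpbr function:

```python
import fnmatch
import os
import zipfile

# Illustrative subset; the real BINARY_EXCLUDE_PATTERNS covers images,
# video, fonts, archives, and audio.
BINARY_EXCLUDE_PATTERNS = ["*.png", "*.jpg", "*.mp4", "*.woff2", "*.mp3"]

def filter_zip(zip_path: str, patterns: list[str]) -> int:
    """Rewrite a zip in place, dropping entries whose basename matches a pattern.

    Needed because `git archive` has no exclude option, so binary assets
    must be stripped after the archive is produced. Returns the number of
    entries removed.
    """
    removed = 0
    tmp_path = zip_path + ".tmp"
    with zipfile.ZipFile(zip_path) as zin, zipfile.ZipFile(
        tmp_path, "w", zipfile.ZIP_DEFLATED
    ) as zout:
        for item in zin.infolist():
            basename = os.path.basename(item.filename)
            if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
                removed += 1
                continue
            zout.writestr(item, zin.read(item.filename))
    os.replace(tmp_path, zip_path)
    return removed
```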
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough

Added a per-benchmark flag to suppress appending a generic MCP prompt suffix, consolidated dead-code analysis into a single-file script-driven prompt, and added zip post-processing to exclude binary basenames when creating repository archives; the suppress flag is threaded through harness creation and agent execution.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/mcpbr/harnesses.py (1)
1377-1390: Update `create_harness` docstring to include `suppress_mcp_suffix`.

The new public parameter is wired correctly but undocumented in the Args section, which makes the factory API easy to misuse.
📝 Suggested docstring patch
 Args:
     harness_name: Name of the harness (currently only 'claude-code').
     model: Optional model override.
     mcp_server: MCP server configuration (used by claude-code harness).
     prompt: Custom prompt template. Use {problem_statement} placeholder.
     max_iterations: Maximum agent iterations (used by claude-code harness).
     verbosity: Verbosity level for logging (0=silent, 1=summary, 2=detailed).
     log_file: Optional file handle for writing raw JSON logs.
     mcp_logs_dir: Directory for MCP server logs.
     thinking_budget: Extended thinking token budget. Set to enable thinking mode.
+    suppress_mcp_suffix: If True, skip appending MCP_PROMPT_SUFFIX when MCP is active.

As per coding guidelines, "Add docstrings to all public functions, classes, and modules using Google-style docstrings".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/harnesses.py` around lines 1377 - 1390, The create_harness docstring is missing documentation for the new public parameter suppress_mcp_suffix; update the Args section of the create_harness docstring to add an entry for suppress_mcp_suffix (bool, default False) describing that when True the harness will not append the MCP-specific suffix to generated harness names/IDs (affects MCP-related naming/behavior), and mention its effect in the return/context if relevant so callers understand how this flag changes naming and logging for MCP-enabled harnesses.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ccc1a7bf-8921-4670-a8ef-5c8cc65f6636
📒 Files selected for processing (5)
src/mcpbr/benchmarks/supermodel/benchmark.py
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
src/mcpbr/benchmarks/supermodel/git_utils.py
src/mcpbr/harness.py
src/mcpbr/harnesses.py
The file `supermodel_dead_code_analysis.json` in your working directory contains:
- `metadataSummary`: totalCandidates, rootFilesCount, reasonBreakdown, confidenceBreakdown
- `chunkFiles`: list of chunk files with candidate details
- `deadCodeCandidates`: all candidates (may be large — do NOT read the whole file manually)
- `entryPoints`: symbols confirmed alive — any candidate matching an entry point is a false positive
Prompt/schema mismatch will zero out results on normal runs.
This script reads only `analysis["deadCodeCandidates"]`, but the generated `supermodel_dead_code_analysis.json` index is chunked (`chunkFiles`) in `src/mcpbr/benchmarks/supermodel/benchmark.py` (lines 514-518). Result: the script writes an empty `dead_code` list despite real candidates existing in chunk files.
🔧 Suggested prompt-script update (support both schemas)
-# Filter candidates
-dead_code = []
-for c in analysis.get("deadCodeCandidates", []):
+# Collect candidates (direct list or chunked index format)
+candidates = analysis.get("deadCodeCandidates")
+if candidates is None:
+ candidates = []
+ for ref in analysis.get("chunkFiles", []):
+ chunk_file = ref.get("file")
+ if not chunk_file:
+ continue
+ with open(chunk_file) as cf:
+ chunk_json = json.load(cf)
+ candidates.extend(chunk_json.get("deadCodeCandidates", []))
+
+# Filter candidates
+dead_code = []
+for c in candidates:

Also applies to: 147-148, 171-174
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 126 -
129, The script only reads analysis["deadCodeCandidates"] and misses chunked
schemas; update the load logic in dead_code.py to support both formats by
checking if analysis contains "chunkFiles" and, if so, iterating over and
loading each referenced chunk file to aggregate candidates, otherwise fall back
to analysis["deadCodeCandidates"]; preserve existing filtering against
analysis["entryPoints"] (so entryPoints still remove false positives) and ensure
the final dead_code output is written from the aggregated candidate list.
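A runnable sketch of the dual-schema loader the suggested diff describes. Field names follow the analysis-file format quoted in the review; the `symbol` key on candidates and the relative chunk-file paths are assumptions for illustration.

```python
import json
import os

def load_candidates(analysis_path: str) -> list[dict]:
    """Load dead-code candidates from either schema.

    Supports both a direct top-level `deadCodeCandidates` list and a
    chunked index whose `chunkFiles` entries reference per-chunk JSON
    files. Entry points remain the false-positive filter in either case.
    """
    with open(analysis_path) as f:
        analysis = json.load(f)

    candidates = analysis.get("deadCodeCandidates")
    if candidates is None:
        # Chunked index format: aggregate candidates from each chunk file,
        # resolved relative to the index file's directory (an assumption).
        candidates = []
        base = os.path.dirname(analysis_path)
        for ref in analysis.get("chunkFiles", []):
            chunk_file = ref.get("file")
            if not chunk_file:
                continue
            with open(os.path.join(base, chunk_file)) as cf:
                candidates.extend(json.load(cf).get("deadCodeCandidates", []))

    # Any candidate matching an entry point is a false positive.
    entry_points = set(analysis.get("entryPoints", []))
    return [c for c in candidates if c.get("symbol") not in entry_points]
```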
for item in zin.infolist():
    basename = os.path.basename(item.filename)
    if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
        removed += 1
        continue
Path-based excludes are silently ignored in git-archive filtering.
`_filter_zip_entries` only matches `os.path.basename(item.filename)`, so excludes like `loc/*` or `lib/*` (documented in `zip_repo`) never match. This makes the caller's `zip_exclude` ineffective on the git-archive path.
💡 Proposed fix
def _filter_zip_entries(zip_path: str, patterns: list[str]) -> None:
"""Rewrite zip in-place, removing entries whose basename matches any glob pattern."""
@@
for item in zin.infolist():
- basename = os.path.basename(item.filename)
- if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
+        rel_path = item.filename.removeprefix("./")
+ basename = os.path.basename(rel_path)
+ if any(
+ fnmatch.fnmatch(rel_path, pat) or fnmatch.fnmatch(basename, pat)
+ for pat in patterns
+ ):
removed += 1
continue
            zout.writestr(item, zin.read(item.filename))

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 199 - 203,
_filter_zip_entries currently matches only os.path.basename(item.filename), so
path-based patterns like "loc/*" or "lib/*" never match; update the filtering to
match against the full (normalized) entry path instead of basename (e.g., use
the archive entry path from item.filename after normalizing separators) and
apply fnmatch.fnmatch to that full path so zip_exclude/zip_repo path patterns
work; ensure pattern normalization/leading "./" differences are handled
consistently with how zip_exclude patterns are specified.
… file

Chunking was the root cause of multiple failures:
- Missing cross-chunk import edges caused GT items to appear "alive"
- Agents reading incomplete chunk sets got degraded recall
- Multiple files complicated the write-tracking session issue

All candidates now go directly into `supermodel_dead_code_analysis.json` under a top-level `deadCodeCandidates` array. Prompt updated to match. `chunk_and_analyze.py` and the cached chunked analysis files are deleted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Three fixes to the Supermodel dead-code benchmark discovered during a full 18-task benchmark run against real GitHub PRs.
1. `_run_mcp_evaluation` was sending the wrong prompt to the agent. `create_environment()` updated a local copy of the task with `problem_statement_enhanced`, but `agent.solve()` received the original task with the baseline prompt. The agent was doing full codebase exploration instead of running the pre-computed filter script — causing token-limit errors, path-traversal errors, and 30+ minute timeouts on large repos.
2. API failure left the analysis file missing. When the Supermodel API fails (network error, expired key, server timeout), no analysis file was written. With `enhanced_prompt_v2`, the agent then errored with `FileNotFoundError`. Now an empty analysis file is written so the agent completes cleanly with 0 results instead of crashing.
3. Binary assets inflate zip size 9×. Repos like cal.com track images, videos, and fonts in git. `git archive` includes everything, producing a 144 MB upload. Added `BINARY_EXCLUDE_PATTERNS`, applied to every zip — drops cal.com from 144 MB → 16 MB. Post-processes the zip using Python's `zipfile` module since `git archive` has no native exclude support.

Test plan

- Run a task with `enhanced_prompt_v2` — confirm the agent runs the Python filter script in 2 Bash calls instead of exploring the codebase
- Confirm `*.png`, `*.mp4`, `*.woff2` etc. are absent from the zip

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Enhanced Functionality