
fix: dead-code benchmark correctness + binary asset filtering#5

Merged
greynewell merged 4 commits into main from feat/supermodel-dead-code-fixes
Mar 25, 2026

Conversation

@greynewell
Contributor

@greynewell greynewell commented Mar 25, 2026

Summary

Three fixes to the Supermodel dead-code benchmark discovered during a full 18-task benchmark run against real GitHub PRs.

  • _run_mcp_evaluation was sending the wrong prompt to the agent. create_environment() updated a local copy of the task with problem_statement_enhanced, but agent.solve() received the original task with the baseline prompt. The agent was doing full codebase exploration instead of running the pre-computed filter script — causing token limit errors, path-traversal errors, and 30+ minute timeouts on large repos.

  • API failure left analysis file missing. When the Supermodel API fails (network error, expired key, server timeout), no analysis file was written. With enhanced_prompt_v2, the agent then errored with FileNotFoundError. Now writes an empty analysis file so the agent completes cleanly with 0 results instead of crashing.

  • Binary assets inflate zip size 9×. Repos like cal.com track images, videos, and fonts in git. git archive includes everything, producing a 144 MB upload. Added BINARY_EXCLUDE_PATTERNS applied to every zip — drops cal.com from 144 MB → 16 MB. Post-processes the zip using Python's zipfile module since git archive has no native exclude support.

Test plan

  • Run a trial against a task that uses enhanced_prompt_v2 — confirm agent runs the Python filter script in 2 Bash calls instead of exploring the codebase
  • Simulate an API failure — confirm empty analysis file is written and agent completes without error
  • Archive cal.com repo — confirm zip size is ~16 MB instead of ~144 MB
  • Verify that *.png, *.mp4, *.woff2, etc. are absent from the zip

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Configurable option to suppress the MCP guidance suffix when running evaluations.
    • Repository zips now automatically exclude binary files.
  • Bug Fixes

    • Analysis runs now produce a safe, empty analysis artifact when generation fails.
  • Enhanced Functionality

    • Dead-code reporting simplified to a single consolidated analysis file and a streamlined verification/report output.

greynewell and others added 3 commits March 24, 2026 20:13
_run_mcp_evaluation was passing the original task (with baseline
problem_statement) to agent.solve, even though create_environment
had already switched to problem_statement_enhanced locally.

The fix creates task_for_agent with the enhanced prompt before calling
agent.solve, ensuring the agent receives enhanced_prompt_v2 (run the
pre-computed analysis script) instead of the baseline prompt (explore
the codebase).

Also adds suppress_mcp_suffix mechanism so SupermodelBenchmark can opt
out of MCP_PROMPT_SUFFIX (which conflicted with enhanced_prompt_v2's
"do not grep the codebase" instruction), and rewrites enhanced_prompt_v2
to instruct the agent to run a Python script rather than read the large
analysis file directly into context.

Fixes: path-outside-allowed-directory errors (logto, n8n), token-limit
errors (latitude).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the API call fails (expired key, network error, etc.), the analysis
file was never written. This caused enhanced_prompt_v2 to error out
(FileNotFoundError in the Python script), making the task show as ERROR
instead of FAIL with 0% recall.

Now an empty analysis file is written on failure so the agent can
complete its task cleanly, producing an empty REPORT.json and a clear
0% recall result.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds BINARY_EXCLUDE_PATTERNS (images, video, fonts, archives, audio)
applied to every zip regardless of per-task config. For git-archive
repos this rewrites the zip with zipfile to strip matching entries
since git archive has no native exclude support.

For cal.com this drops the upload from ~144 MB to ~16 MB.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
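Since git archive has no native exclude support, the zip is rewritten after the fact with the stdlib zipfile module. A minimal sketch of that post-processing step (BINARY_EXCLUDE_PATTERNS is named in the PR but only a subset of its patterns is shown here; the function body is illustrative, and note the review below points out that basename-only matching misses path patterns):

```python
# Sketch of stripping binary entries from a git-archive zip.
import fnmatch
import os
import zipfile

# Subset of the PR's pattern list (images, video, fonts, archives, audio).
BINARY_EXCLUDE_PATTERNS = ["*.png", "*.jpg", "*.mp4", "*.woff2", "*.zip", "*.mp3"]


def filter_zip(src: str, dst: str, patterns: list[str]) -> int:
    """Copy src to dst, dropping entries whose basename matches any glob.

    Returns the number of removed entries.
    """
    removed = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            basename = os.path.basename(item.filename)
            if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
                removed += 1
                continue
            zout.writestr(item, zin.read(item.filename))
    return removed
```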
@coderabbitai

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5c289ad5-224d-45e9-b326-a5e43b9a5099

📥 Commits

Reviewing files that changed from the base of the PR and between bd4fae5 and c3db97b.

📒 Files selected for processing (1)
  • src/mcpbr/benchmarks/supermodel/benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mcpbr/benchmarks/supermodel/benchmark.py

Walkthrough

Added a per-benchmark flag to suppress appending a generic MCP prompt suffix, consolidated dead-code analysis into a single-file script-driven prompt, and added zip post-processing to exclude binary basenames when creating repository archives; the suppress flag is threaded through harness creation and agent execution.

Changes

Cohort / File(s): Summary

  • MCP suffix wiring (src/mcpbr/benchmarks/supermodel/benchmark.py, src/mcpbr/harness.py, src/mcpbr/harnesses.py)
    Added suppress_mcp_suffix = True on the Supermodel benchmark; harness creation now reads suppress_mcp_suffix from a benchmark and forwards it into create_harness; ClaudeCodeHarness/prompt construction conditionally omits MCP_PROMPT_SUFFIX when the flag is true.
  • Dead-code endpoint prompt (src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py)
    Rewrote DeadCodePlugin.enhanced_prompt_v2 to use a single self-contained Python script that reads supermodel_dead_code_analysis.json, builds an entry-point whitelist, filters deadCodeCandidates (skipping entry points and Type/interface reasons), and writes REPORT.json; removed the earlier chunk-file aggregation and analysis-quality checks.
  • Repository zipping & filtering (src/mcpbr/benchmarks/supermodel/git_utils.py)
    Introduced BINARY_EXCLUDE_PATTERNS, merged exclude patterns into zip_repo, added exclude_patterns support to the git-archive path, and implemented _filter_zip_entries to stream-copy a zip and remove entries whose basenames match glob patterns, logging the removed count and safely replacing the original archive.
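The script-driven filter the walkthrough describes can be sketched roughly as follows. Field names (deadCodeCandidates, entryPoints, reason) follow the prompt excerpt quoted later in this review; the exact matching logic, and the assumption that entryPoints is a list of symbol names, are illustrative:

```python
# Rough sketch of the dead-code filter script; not the actual prompt script.
import json


def filter_candidates(analysis: dict) -> list[dict]:
    """Filter deadCodeCandidates against the entry-point whitelist."""
    entry_points = set(analysis.get("entryPoints", []))  # assumed: symbol names
    dead_code = []
    for c in analysis.get("deadCodeCandidates", []):
        if c.get("symbol") in entry_points:
            continue  # confirmed alive: treat as a false positive
        if c.get("reason", "").lower().startswith(("type", "interface")):
            continue  # skip Type/interface reasons, per the walkthrough
        dead_code.append(c)
    return dead_code


def write_report(analysis_path: str, report_path: str = "REPORT.json") -> None:
    """Read the analysis file and write the filtered result to REPORT.json."""
    with open(analysis_path) as f:
        analysis = json.load(f)
    with open(report_path, "w") as f:
        json.dump({"dead_code": filter_candidates(analysis)}, f)
```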

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

Quiet flag flips off the suffix light,
One JSON holds the dead-code sight,
Zip gets a scrub, basenames take flight,
Bench, harness, archive—tidy and tight. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly summarizes the main changes (fixes to dead-code benchmark correctness and binary asset filtering), which align with the core objectives of the PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 83.33%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/mcpbr/harnesses.py (1)

1377-1390: Update create_harness docstring to include suppress_mcp_suffix.

The new public parameter is wired correctly but undocumented in the Args section, which makes the factory API easy to misuse.

📝 Suggested docstring patch
     Args:
         harness_name: Name of the harness (currently only 'claude-code').
         model: Optional model override.
         mcp_server: MCP server configuration (used by claude-code harness).
         prompt: Custom prompt template. Use {problem_statement} placeholder.
         max_iterations: Maximum agent iterations (used by claude-code harness).
         verbosity: Verbosity level for logging (0=silent, 1=summary, 2=detailed).
         log_file: Optional file handle for writing raw JSON logs.
         mcp_logs_dir: Directory for MCP server logs.
         thinking_budget: Extended thinking token budget. Set to enable thinking mode.
+        suppress_mcp_suffix: If True, skip appending MCP_PROMPT_SUFFIX when MCP is active.

As per coding guidelines, "Add docstrings to all public functions, classes, and modules using Google-style docstrings".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/harnesses.py` around lines 1377 - 1390, The create_harness
docstring is missing documentation for the new public parameter
suppress_mcp_suffix; update the Args section of the create_harness docstring to
add an entry for suppress_mcp_suffix (bool, default False) describing that when
True the harness will not append the MCP-specific suffix to generated harness
names/IDs (affects MCP-related naming/behavior), and mention its effect in the
return/context if relevant so callers understand how this flag changes naming
and logging for MCP-enabled harnesses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccc1a7bf-8921-4670-a8ef-5c8cc65f6636

📥 Commits

Reviewing files that changed from the base of the PR and between 6c76da8 and bd4fae5.

📒 Files selected for processing (5)
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/benchmarks/supermodel/git_utils.py
  • src/mcpbr/harness.py
  • src/mcpbr/harnesses.py

Comment on lines +126 to 129
The file `supermodel_dead_code_analysis.json` in your working directory contains:
- `metadataSummary`: totalCandidates, rootFilesCount, reasonBreakdown, confidenceBreakdown
- `chunkFiles`: list of chunk files with candidate details
- `deadCodeCandidates`: all candidates (may be large — do NOT read the whole file manually)
- `entryPoints`: symbols confirmed alive — any candidate matching an entry point is a false positive

⚠️ Potential issue | 🔴 Critical

Prompt/schema mismatch will zero out results on normal runs.

This script reads only analysis["deadCodeCandidates"], but the generated supermodel_dead_code_analysis.json index is chunked (chunkFiles) in src/mcpbr/benchmarks/supermodel/benchmark.py (Line 514-518). Result: script writes empty dead_code despite real candidates existing in chunk files.

🔧 Suggested prompt-script update (support both schemas)
-# Filter candidates
-dead_code = []
-for c in analysis.get("deadCodeCandidates", []):
+# Collect candidates (direct list or chunked index format)
+candidates = analysis.get("deadCodeCandidates")
+if candidates is None:
+    candidates = []
+    for ref in analysis.get("chunkFiles", []):
+        chunk_file = ref.get("file")
+        if not chunk_file:
+            continue
+        with open(chunk_file) as cf:
+            chunk_json = json.load(cf)
+        candidates.extend(chunk_json.get("deadCodeCandidates", []))
+
+# Filter candidates
+dead_code = []
+for c in candidates:

Also applies to: 147-148, 171-174

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 126 -
129, The script only reads analysis["deadCodeCandidates"] and misses chunked
schemas; update the load logic in dead_code.py to support both formats by
checking if analysis contains "chunkFiles" and, if so, iterating over and
loading each referenced chunk file to aggregate candidates, otherwise fall back
to analysis["deadCodeCandidates"]; preserve existing filtering against
analysis["entryPoints"] (so entryPoints still remove false positives) and ensure
the final dead_code output is written from the aggregated candidate list.

Comment on lines +199 to +203
for item in zin.infolist():
basename = os.path.basename(item.filename)
if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
removed += 1
continue

⚠️ Potential issue | 🟠 Major

Path-based excludes are silently ignored in git-archive filtering.

_filter_zip_entries only matches os.path.basename(item.filename), so excludes like loc/* / lib/* (documented in zip_repo) never match. This makes caller zip_exclude ineffective on the git-archive path.

💡 Proposed fix
 def _filter_zip_entries(zip_path: str, patterns: list[str]) -> None:
     """Rewrite zip in-place, removing entries whose basename matches any glob pattern."""
@@
             for item in zin.infolist():
-                basename = os.path.basename(item.filename)
-                if any(fnmatch.fnmatch(basename, pat) for pat in patterns):
+                rel_path = item.filename.lstrip("./")
+                basename = os.path.basename(rel_path)
+                if any(
+                    fnmatch.fnmatch(rel_path, pat) or fnmatch.fnmatch(basename, pat)
+                    for pat in patterns
+                ):
                     removed += 1
                     continue
                 zout.writestr(item, zin.read(item.filename))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 199 - 203,
_filter_zip_entries currently matches only os.path.basename(item.filename), so
path-based patterns like "loc/*" or "lib/*" never match; update the filtering to
match against the full (normalized) entry path instead of basename (e.g., use
the archive entry path from item.filename after normalizing separators) and
apply fnmatch.fnmatch to that full path so zip_exclude/zip_repo path patterns
work; ensure pattern normalization/leading "./" differences are handled
consistently with how zip_exclude patterns are specified.

… file

Chunking was the root cause of multiple failures:
- Missing cross-chunk import edges caused GT items to appear "alive"
- Agents reading incomplete chunk sets got degraded recall
- Multiple files complicated the write-tracking session issue

All candidates now go directly into supermodel_dead_code_analysis.json
under a top-level deadCodeCandidates array. Prompt updated to match.
chunk_and_analyze.py and the cached chunked analysis files are deleted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
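Collapsing the chunked layout into the single-file format this commit describes can be sketched as a one-shot consolidation. The chunk-file reference shape ({"file": ...}) is taken from the review diff above; everything else here is an illustrative assumption:

```python
# Illustrative consolidation of the old chunked index into a single file
# with a top-level deadCodeCandidates array; not the actual migration code.
import json
from pathlib import Path


def consolidate(index_path: Path) -> dict:
    """Merge all chunk-file candidates into a single deadCodeCandidates list."""
    index = json.loads(index_path.read_text())
    candidates = list(index.get("deadCodeCandidates", []))
    for ref in index.get("chunkFiles", []):  # old chunked schema
        chunk = json.loads((index_path.parent / ref["file"]).read_text())
        candidates.extend(chunk.get("deadCodeCandidates", []))
    index["deadCodeCandidates"] = candidates
    index.pop("chunkFiles", None)  # single-file format: no chunk references
    return index
```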