
feat: add Supermodel dead-code benchmark plugin #4

Merged
greynewell merged 6 commits into main from feat/supermodel-dead-code-benchmark on Mar 24, 2026

Conversation

@greynewell
Contributor

@greynewell greynewell commented Mar 24, 2026

Summary

  • Adds SupermodelBenchmark — a PR-based dead-code detection benchmark that uses GitHub PRs as ground truth and the Supermodel API for pre-computed dependency graph analysis
  • Adds DeadCodeBenchmark — a corpus-based variant for offline evaluation
  • Registers both in the benchmark registry with full config, harness, and model wiring

New files

  • src/mcpbr/benchmarks/supermodel/ — full benchmark package
    • benchmark.py — SupermodelBenchmark class (PR-based GT extraction, Supermodel API, chunked analysis v3)
    • api_client.py — async polling client for the Supermodel API
    • evaluation.py — P/R/F1 scoring
  • git_utils.py — clone_repo_at_commit, zip_repo helpers
    • endpoints/ — pluggable endpoint system (dead_code implemented; impact, test-coverage, circular-deps stubbed)
  • src/mcpbr/benchmarks/deadcode.py — corpus-based DeadCodeBenchmark variant

Config fields added

benchmark: "supermodel"
analysis_type: "dead-code"
tasks:
  - id: directus_pr26311
    repo: directus/directus
    pr_number: 26311
    merge_commit: "acf6d3d4..."
    language: typescript
supermodel_api_base: "https://api.supermodel.dev"
supermodel_api_key: "${SUPERMODEL_API_KEY}"
supermodel_api_timeout: 900
resolved_threshold: 0.8
ground_truth_dir: ~/.cache/mcpbr/supermodel_ground_truth

Other changes

  • models.py / pricing.py — add claude-opus-4-6 and claude-sonnet-4-6
  • cli.py — microsecond timestamps for output dirs (avoids collisions on concurrent runs)
  • docker_env.py + Dockerfile — use MCR mirror for python:3.11-slim to work around Docker Hub rate limiting
  • pyproject.toml — S608 per-file ignore for benchmarks/ (false positive on prompt template f-strings)

Test plan

  • python3 -m mcpbr run -c <config> --mcp-only -t directus_pr26311 --trial-mode -vv completes without error
  • Ground truth is extracted from PR diff and cached
  • Supermodel API is called and analysis JSON is placed in workdir
  • Agent scores produce a P/R/F1 result
  • benchmark: "dead-code" and benchmark: "supermodel" configs both validate

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added Dead Code and Supermodel benchmark suites, with endpoint plugins, Supermodel API client, repo/git utilities, evaluation helpers, environment provisioning, task loading, caching, and reporting.
  • New Models

    • Added Claude Opus 4.6 and Claude Sonnet 4.6 model entries (200K context window) and pricing.
  • Chores

    • Bumped package version to 0.14.1.
    • Switched Docker base image source to a mirrored python:3.11-slim.
    • Added per-file lint ignores for benchmark files and updated config/pricing schema.

Adds a first-class Supermodel benchmark to mcpbr that evaluates dead-code
detection using GitHub PRs as ground truth and the Supermodel API for
pre-computed dependency graph analysis.

## New files
- `benchmarks/supermodel/` — benchmark package
  - `benchmark.py` — SupermodelBenchmark; PR-based GT extraction,
    Supermodel API integration, chunked analysis packaging (v3)
  - `api_client.py` — async polling client for Supermodel API
  - `evaluation.py` — P/R/F1 scoring
  - `git_utils.py` — clone_repo_at_commit, zip_repo helpers
  - `endpoints/` — pluggable endpoint system
    - `dead_code.py` — DeadCodePlugin with enhanced_prompt_v2
    - stubs for impact/test-coverage/circular-deps
- `benchmarks/deadcode.py` — corpus-based DeadCodeBenchmark variant
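As a rough sketch of the P/R/F1 scoring that `evaluation.py` provides (the real `compute_prf1` signature and finding normalization may differ):

```python
def compute_prf1(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Compute precision, recall, and F1 over two sets of findings."""
    tp = len(predicted & ground_truth)  # correctly flagged items
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = compute_prf1({"a.ts:foo", "b.ts:bar"}, {"a.ts:foo", "c.ts:baz"})
# precision = 0.5, recall = 0.5, f1 = 0.5
```

With `resolved_threshold: 0.8`, a task would then count as resolved only when both precision and recall clear 0.8.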

## Changes to existing files
- `benchmarks/__init__.py` — register SupermodelBenchmark + DeadCodeBenchmark
- `config.py` — add "supermodel" + "dead-code" to VALID_BENCHMARKS; add
  supermodel config fields (analysis_type, tasks, supermodel_api_base,
  supermodel_api_key, supermodel_api_timeout, resolved_threshold,
  ground_truth_dir)
- `harness.py` — wire supermodel/dead-code benchmark_kwargs
- `models.py` — add claude-opus-4-6, claude-sonnet-4-6
- `pricing.py` — add 4.6 model pricing
- `cli.py` — use microsecond timestamps for output dirs to avoid collisions
- `docker_env.py` + `Dockerfile` — use MCR mirror for python:3.11-slim base
  to work around Docker Hub rate limiting
- `pyproject.toml` — add S608 per-file-ignore for benchmarks/ (false positive
  on prompt templates)
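The `cli.py` collision fix above amounts to including microseconds in the output-directory name. A minimal sketch (the actual prefix and format string in mcpbr may differ):

```python
from datetime import datetime


def make_output_dir_name(prefix: str = "run") -> str:
    # %f appends six microsecond digits, so two runs launched within the
    # same second still get distinct directory names.
    return f"{prefix}_{datetime.now().strftime('%Y%m%d_%H%M%S_%f')}"
```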

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 24, 2026

Walkthrough

Adds two new benchmarks ("dead-code" and "supermodel") with Supermodel endpoint/plugin infra, async API client, git/zip helpers, evaluation utilities, and bench registration; updates config/harness, Docker fallback image mirror, Claude 4.6 models/pricing, and bumps package version to 0.14.1.

Changes

Cohort / File(s) Summary
Docker & fallback
Dockerfile, src/mcpbr/docker_env.py
Swapped Docker base image source to Microsoft mirror (mcr.microsoft.com/mirror/docker/library/python:3.11-slim).
Project config / Lint
pyproject.toml
Added per-file Ruff ignore entries disabling S608 for two benchmark files.
Version bump
src/mcpbr/__init__.py
Bumped package __version__ from 0.14.0 to 0.14.1.
Benchmark registry & exports
src/mcpbr/benchmarks/__init__.py
Registered and exported DeadCodeBenchmark and SupermodelBenchmark in __all__ and BENCHMARK_REGISTRY.
Dead-code benchmark
src/mcpbr/benchmarks/deadcode.py
New DeadCodeBenchmark: corpus cloning/updating, task loading/filtering, prompt generation, env prep (safe writes + git init), evaluation via REPORT.json or extracted text, and PR/corpus handling.
Supermodel benchmark & client
src/mcpbr/benchmarks/supermodel/benchmark.py, src/mcpbr/benchmarks/supermodel/api_client.py
New SupermodelBenchmark and async call_supermodel_api: zip/upload repos, idempotency-key handling, polling loop (dummy re-post), response parsing, caching, chunked analysis packaging, and evaluation.
Endpoint plugins & base
src/mcpbr/benchmarks/supermodel/endpoints/...
Added EndpointPlugin base, endpoint registry (get_endpoint), fully implemented DeadCodePlugin (diff parsing + response filtering), and stubs for CircularDepsPlugin, ImpactAnalysisPlugin, TestCoveragePlugin.
Supermodel utilities
src/mcpbr/benchmarks/supermodel/{evaluation,git_utils}.py
Added evaluation helpers (normalize/build sets, compute PRF1) and async git utilities (clone at commit, pre-merge commit, zip repo with scope/excludes).
Shared bench utilities
src/mcpbr/benchmarks/_bench_utils.py
New helpers: safe_write_file, init_git_workdir, and extract_findings_from_text for parsing agent output JSON arrays.
Config & harness
src/mcpbr/config.py, src/mcpbr/harness.py
Added dead-code and supermodel to VALID_BENCHMARKS; extended HarnessConfig with Supermodel fields; harness forwards resolved_threshold and Supermodel options into benchmark creation.
Models & pricing
src/mcpbr/models.py, src/mcpbr/pricing.py
Added Claude 4.6 variants (claude-opus-4-6, claude-sonnet-4-6) to supported models and pricing (200k context, prompt-caching entries).
Minor updates / glue
src/mcpbr/benchmarks/codegraph.py, other small files
Minor formatting/UTC constant change and small glue edits to integrate new benchmarks.
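The shared `extract_findings_from_text` helper listed in the table could plausibly look like this bracket-matching sketch (the real `_bench_utils.py` implementation may differ in details):

```python
import json


def extract_findings_from_text(text: str) -> list[dict]:
    """Best-effort extraction of the first JSON array of objects from free text."""
    start = text.find("[")
    while start != -1:
        depth = 0
        in_str = False
        escape = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_str:
                # Track string state so brackets inside strings are ignored.
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_str = False
            elif ch == '"':
                in_str = True
            elif ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
                if depth == 0:
                    try:
                        parsed = json.loads(text[start : i + 1])
                        if isinstance(parsed, list):
                            return [x for x in parsed if isinstance(x, dict)]
                    except json.JSONDecodeError:
                        pass
                    break
        start = text.find("[", start + 1)
    return []
```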

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Harness
    participant Bench as SupermodelBenchmark
    participant GitHub
    participant SupermodelAPI as Supermodel<br/>API
    participant Docker as Docker<br/>Container
    participant Eval as Evaluation<br/>Module

    User->>Harness: run_evaluation(config)
    Harness->>Bench: create_benchmark(..., analysis_type, tasks, api_key, ...)
    Bench->>GitHub: (optional) gh api / fetch PR diff
    GitHub-->>Bench: diff text
    Bench->>Bench: extract_ground_truth / load cached GT
    Bench->>Bench: prepare workspace + REPORT.json
    Bench->>Bench: zip_repo(workspace)
    Bench->>SupermodelAPI: call_supermodel_api(zip) (upload + poll)
    SupermodelAPI-->>Bench: analysis result
    Bench->>Docker: write analysis files into container workspace
    Harness->>Docker: start agent (agent may update REPORT.json)
    Docker-->>Harness: solution / REPORT.json
    Harness->>Bench: evaluate(env, task, solution)
    Bench->>Eval: compute_prf1(predictions, ground_truth)
    Eval-->>Bench: metrics
    Bench-->>Harness: {metrics, resolved}
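The upload-and-poll leg of the diagram can be sketched as a generic timeout-bounded polling loop. `poll_until_complete` and the status field names here are illustrative, not the actual `call_supermodel_api` implementation; only the 900 s default mirrors `supermodel_api_timeout`:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def poll_until_complete(
    fetch_status: Callable[[], Awaitable[dict[str, Any]]],
    timeout: float = 900.0,  # mirrors the supermodel_api_timeout default
    interval: float = 5.0,
) -> dict[str, Any]:
    """Poll an async status endpoint until completion, failure, or timeout."""

    async def _loop() -> dict[str, Any]:
        while True:
            response = await fetch_status()
            if response.get("status") == "completed":
                return response
            if response.get("status") == "failed":
                raise RuntimeError(f"analysis failed: {response}")
            await asyncio.sleep(interval)

    # asyncio.wait_for cancels the loop once the overall timeout elapses.
    return await asyncio.wait_for(_loop(), timeout=timeout)
```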

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🛠️ Two benchmarks land, neat and spry,
Dead code hunted, Supermodel to try.
Zips, diffs, and polling threads hum,
Mirrors updated, Claude 4.6 come.
Tests report metrics — launch the drum! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 63.79% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main addition: a new Supermodel dead-code benchmark plugin with clear focus on the primary deliverable.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (17)
src/mcpbr/config.py (1)

877-911: Supermodel config fields look well-structured.

A few observations:

  1. resolved_threshold (line 903): The default of 0.8 means a task is "resolved" if both precision AND recall are ≥ 80%. That's a reasonable bar. You might want to add a validator to ensure it's in the [0.0, 1.0] range — right now nothing stops someone from setting resolved_threshold: 1.5 which would make everything fail.

  2. supermodel_api_timeout (line 898): 900 seconds (15 min) is generous but reasonable for large codebases. The SupermodelBenchmark class already handles this correctly.

  3. supermodel_api_key (line 893): Good that this is optional since the benchmark falls back to SUPERMODEL_API_KEY env var.

🔧 Optional: Add validation for resolved_threshold
+    @field_validator("resolved_threshold")
+    @classmethod
+    def validate_resolved_threshold(cls, v: float) -> float:
+        """Validate resolved_threshold is between 0 and 1."""
+        if not 0.0 <= v <= 1.0:
+            raise ValueError("resolved_threshold must be between 0.0 and 1.0")
+        return v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/config.py` around lines 877 - 911, Add a Pydantic validator to
enforce resolved_threshold is within [0.0, 1.0]: implement a method (e.g.
validate_resolved_threshold) on the same config model that declares the
resolved_threshold Field, use @field_validator("resolved_threshold") to check the
value is >= 0.0 and <= 1.0 and raise ValueError if not, so assignments like
resolved_threshold = 1.5 are rejected at model validation time.
src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py (1)

1-26: Looks good — stub follows the pattern nicely.

This is a clean stub implementation that matches the other endpoint plugins. One small thing: the extract_ground_truth method on line 25 is missing type hints for the parameters and return type that the base class defines. Not a blocker, but adding them keeps things consistent:

def extract_ground_truth(
    self, repo: str, pr_number: int, language: str = "typescript", scope_prefix: str | None = None
) -> list[dict]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py` around lines 1 -
26, The extract_ground_truth method in TestCoveragePlugin is missing type
annotations; update TestCoveragePlugin.extract_ground_truth signature to match
the base class by adding parameter and return type hints (repo: str, pr_number:
int, language: str = "typescript", scope_prefix: str | None = None) and a return
type of list[dict] so the method signature aligns with the base EndpointPlugin
definition.
src/mcpbr/benchmarks/supermodel/api_client.py (2)

89-94: Lazy imports work but are a bit unusual.

Moving import tempfile and import os to the top of the file would be more conventional. Not a blocker — the current approach works fine and has minimal overhead since the import only happens once per polling sequence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/api_client.py` around lines 89 - 94, Move the
lazy imports out of the polling block: add "import tempfile" (and "import os" if
used elsewhere) to the module-level imports at the top of
src/mcpbr/benchmarks/supermodel/api_client.py, then remove the local "import
tempfile" inside the poll_dummy_path handling so the code that creates the
NamedTemporaryFile (poll_dummy.write and poll_dummy.name usage) uses the
top-level tempfile import instead.

74-74: Consider wrapping JSON decode in a try/except for better error messages.

If the API returns non-JSON (like an HTML error page from a gateway timeout), json.loads(stdout.decode()) will raise a JSONDecodeError with a confusing message. A small wrapper could help:

💡 Suggested improvement
-    response = json.loads(stdout.decode())
+    try:
+        response = json.loads(stdout.decode())
+    except json.JSONDecodeError as e:
+        raise RuntimeError(f"Supermodel API returned invalid JSON: {stdout.decode()[:500]}") from e

Same applies to line 126 inside the polling loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/api_client.py` at line 74, Wrap the
json.loads(stdout.decode()) calls (the one assigning to response and the one
inside the polling loop) in a try/except catching json.JSONDecodeError, include
the raw decoded stdout (or a snippet) and the original exception in the error
message, and either raise a clearer exception (e.g., ValueError with context) or
log and re-raise so callers see the API response that caused the parse failure;
specifically change the json.loads(stdout.decode()) usage that sets response and
the identical call at the polling loop to this guarded pattern.
src/mcpbr/benchmarks/supermodel/endpoints/base.py (1)

91-94: Minor: should_skip_file could raise on invalid regex patterns.

If someone puts a malformed regex in skip_patterns, re.search(p, filepath) will raise re.error. In practice this is unlikely since patterns are config-controlled, but a try/except or pre-validation could make debugging easier:

# Example: this would blow up
skip_patterns = ["[invalid"]  # unclosed bracket

Not a blocker — just something to keep in mind if you expose this to less careful config authors.
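One way to harden `should_skip_file` along these lines is to precompile the patterns once and drop invalid ones with a warning; the helper names below are hypothetical:

```python
import logging
import re

logger = logging.getLogger(__name__)


def compile_skip_patterns(skip_patterns: list[str]) -> list[re.Pattern]:
    """Precompile patterns, logging and skipping any invalid regex."""
    compiled = []
    for p in skip_patterns:
        try:
            compiled.append(re.compile(p))
        except re.error as e:
            logger.warning("Ignoring invalid skip pattern %r: %s", p, e)
    return compiled


def should_skip_file(filepath: str, compiled: list[re.Pattern]) -> bool:
    # A malformed pattern was already discarded at compile time, so this
    # loop can no longer raise re.error.
    return any(pat.search(filepath) for pat in compiled)
```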

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/base.py` around lines 91 - 94,
should_skip_file currently calls re.search(p, filepath) for each pattern and can
raise re.error on malformed regexes; update should_skip_file to either
precompile skip_patterns (using re.compile) and catch re.error during
compilation or wrap the per-pattern re.search in a try/except catching re.error,
then handle invalid patterns by logging a warning or skipping that pattern and
continuing (or re-raising a clearer ValueError with context) so a bad regex in
skip_patterns doesn't crash the function.
src/mcpbr/pricing.py (1)

34-54: Pricing entries look correct and consistent.

The new Claude 4.6 pricing follows the same structure as existing entries. One tiny thing: line 6 says "Pricing is per million tokens (MTok) and is current as of January 2026" but you're adding March 2026 models here. Might want to bump that date to keep the header accurate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/pricing.py` around lines 34 - 54, Update the header that states
"Pricing is per million tokens (MTok) and is current as of January 2026" to
reflect March 2026 so it matches the newly added Claude 4.6 entries (e.g., the
ModelPricing entries "claude-opus-4-6" and "claude-sonnet-4-6"); locate the
top-of-file header/comment in the same module that describes the pricing date
and change "January 2026" to "March 2026".
src/mcpbr/benchmarks/supermodel/git_utils.py (3)

71-88: Consider adding a timeout to get_pre_merge_commit.

This is the only subprocess call in the file without a timeout. If the GitHub API is slow or hangs, this could block indefinitely. The async functions use asyncio.wait_for with timeouts, but this sync function doesn't have the same protection.

🔧 Add a timeout for consistency
     result = subprocess.run(
         ["gh", "api", f"repos/{repo}/commits/{merge_commit}", "--jq", ".parents[0].sha"],
         capture_output=True,
         text=True,
+        timeout=60,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 71 - 88, The
synchronous get_pre_merge_commit function currently calls subprocess.run without
a timeout; update the subprocess.run invocation in get_pre_merge_commit to
include a timeout (e.g., timeout=30) so the call cannot hang indefinitely, and
handle the TimeoutExpired exception by raising a RuntimeError with a clear
message including the merge_commit and the exception details; ensure you modify
only the subprocess.run call and add the TimeoutExpired exception handling near
get_pre_merge_commit so behavior matches the async functions' timeout
protection.

169-179: Edge case: exclude patterns that are absolute paths won't work correctly.

So the logic here is: if there's a scope_prefix and a pattern doesn't already start with it, prepend the prefix. But imagine someone passes an absolute path pattern like /node_modules/* - it'll get the prefix prepended and become something like packages/api//node_modules/* which probably isn't what they wanted.

In practice, the base_excludes list uses relative patterns, so this is fine for the defaults. Just something to be aware of if external callers pass weird patterns.

💡 Optional: add a check for absolute patterns
     prefixed_excludes = []
     for pattern in base_excludes + (exclude_patterns or []):
-        if scope_prefix and not pattern.startswith(scope_prefix):
+        if scope_prefix and not pattern.startswith(scope_prefix) and not pattern.startswith("/"):
             prefixed_excludes.append(f"{scope_prefix}/{pattern}")
         else:
             prefixed_excludes.append(pattern)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 169 - 179, The
current loop that builds prefixed_excludes blindly prepends scope_prefix to
patterns (base_excludes and exclude_patterns), which corrupts absolute paths
like "/node_modules/*"; update the logic in the block that constructs
prefixed_excludes (the loop referencing scope_prefix, base_excludes,
exclude_patterns and variable pattern) to first detect absolute patterns (use
os.path.isabs(pattern) or equivalent) and skip prefixing for those, i.e., if
pattern is absolute leave it unchanged, otherwise apply the existing
scope_prefix-prefixing check; ensure the rest of the code that extends cmd with
"-x" uses the adjusted prefixed_excludes.

122-141: git archive ignores exclude_patterns - is that intentional?

When the repo has a .git directory, you use git archive which is great because it respects .gitignore. But the exclude_patterns parameter is only used in the fallback path (_zip_repo_fallback), not in _zip_repo_git_archive.

If a caller explicitly passes exclude_patterns, they might expect those to apply regardless of whether it's a git repo or not. Since git archive naturally excludes untracked files anyway, this might be fine in practice, but it's a slight API inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 122 - 141,
_zip_repo_git_archive currently ignores any exclude_patterns; update it to
accept an exclude_patterns parameter (same shape as used by _zip_repo_fallback)
and apply those exclusions even when using git. Implement by listing tracked
files with "git ls-files -z" (run from repo_dir), filter that list against
exclude_patterns, and then create the zip from the remaining paths (either by
passing the filtered paths to git archive if possible or by building the zip in
Python from those files); reference _zip_repo_git_archive and _zip_repo_fallback
so the caller behavior is consistent and exclusions are applied in both code
paths.
src/mcpbr/benchmarks/supermodel/benchmark.py (5)

151-157: filter_category compares against language field - might be confusing.

On line 157, you filter by filter_category but compare against t.get("language", "typescript"). This works, but the naming is a bit asymmetric - a user might expect filter_category to filter on a category field, not language.

For example, if I call load_tasks(filter_category=["python"]), it filters tasks where language == "python". Makes sense in context, but the naming could trip someone up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 151 - 157, The
code filters tasks using filter_category but checks t.get("language", ...),
which is confusing; either change the key to "category" or rename the filter to
match "language". Update the filtering line inside the function that uses
filter_category (the list comprehension over tasks referencing t.get("language",
"typescript")) to t.get("category", "default_category") if you want to keep the
filter name, or rename the parameter filter_category -> filter_language and
update all callers and the comprehension to use the new name so the intent and
field name align.

615-639: Git subprocess calls are missing timeouts.

These subprocess.run calls for initializing the git repo don't have timeouts. The DeadCodeBenchmark in the other file does add timeouts (e.g., timeout=30 for git init). For consistency and reliability, consider adding timeouts here too.

In practice, git init and git config are near-instant, but git add -A on a large repo could potentially hang if there's filesystem issues.

🔧 Add timeouts for consistency
-        subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False)
+        subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False, timeout=30)
         subprocess.run(
             ["git", "config", "user.email", "mcpbr@test.com"],
             cwd=host_workdir,
             capture_output=True,
             check=False,
+            timeout=10,
         )
         # ... same for the other calls
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 615 - 639, The git
subprocess.run invocations that initialize and configure the repo (the calls
using ["git", "init"], ["git", "config", ...], ["git", "add", "-A"], and ["git",
"commit", "-m", "Initial"] with cwd=host_workdir) lack timeouts; update each
subprocess.run to include a reasonable timeout (e.g., timeout=30) to match the
pattern used in DeadCodeBenchmark and prevent hangs, keeping capture_output=True
and check=False unchanged.

574-581: Catching broad Exception is intentional here, but consider logging the stack trace.

I see the static analysis flagged this. In a benchmark runner, it makes sense to catch all exceptions so one failing task doesn't kill the whole run. You're already logging the error and printing the traceback, which is good.

One thing: logger.error on line 575 only gets the error message, while the print on 576-581 gets the full traceback. If someone is only watching logs (not stderr), they'd miss the traceback. Consider using logger.exception() instead which automatically includes the traceback.

🔧 Use logger.exception for full traceback in logs
             except Exception as e:
-                logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")
+                logger.exception(f"Failed to get Supermodel analysis for {instance_id}")
                 print(
                     f"\n*** SUPERMODEL ANALYSIS FAILED for {instance_id} ***\n"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 574 - 581, In the
except block that currently logs "Failed to get Supermodel analysis for
{instance_id}: {e}" (the handler catching Exception as e for Supermodel
analysis), change the log call to include the traceback—either replace
logger.error(...) with logger.exception(...) or call logger.error(...,
exc_info=True)—so the full stack trace for the failure of instance_id is
recorded in the logs (leave the stderr print as-is).

83-83: Temp directory is created but never explicitly cleaned up.

The self._work_dir is created in __init__ using tempfile.mkdtemp(), but there's no corresponding cleanup in a destructor or context manager. For short-lived benchmark runs this is fine (OS cleans it up on exit), but if you run multiple benchmarks in a loop or have a long-running process, these temp directories will accumulate.

💡 Consider adding cleanup

You could either:

  1. Add a cleanup() method that users call explicitly
  2. Use tempfile.TemporaryDirectory and keep a reference to it
  3. Add a __del__ method (though those can be unreliable)
# Option 1: Explicit cleanup method
def cleanup(self) -> None:
    """Remove temporary work directory."""
    if self._work_dir.exists():
        shutil.rmtree(self._work_dir)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` at line 83, The temp work
directory created in __init__ as self._work_dir via tempfile.mkdtemp() is never
cleaned up; add explicit cleanup by either (a) switching to
tempfile.TemporaryDirectory and storing the object (e.g. self._tmpdir) so the
directory is removed when the object is closed or deleted, or (b) add a public
cleanup() method that calls shutil.rmtree(self._work_dir) (guarded by exists()),
and call cleanup from any teardown path; make sure to import shutil if using
rmtree and update any code that relied on self._work_dir to refer to
self._tmpdir.name if you choose the TemporaryDirectory approach.

361-365: Accessing private methods on docker_manager.

Calling docker_manager._ensure_fallback_image() and later docker_manager._temp_dirs.append(temp_dir) accesses private attributes (indicated by the _ prefix). This creates tight coupling to the internal implementation of DockerEnvironmentManager.

This might be intentional if the benchmark protocol expects benchmarks to manage their own temp directories this way, but it means changes to DockerEnvironmentManager's internals could break this code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 361 - 365, This
code accesses DockerEnvironmentManager internals by calling
docker_manager._ensure_fallback_image() and appending to
docker_manager._temp_dirs; instead call a public API (e.g.,
docker_manager.ensure_fallback_image()) and use a public method to register temp
dirs (e.g., docker_manager.register_temp_dir(temp_dir) or
docker_manager.create_temp_dir(prefix)) or, if those methods don't exist, add
them to DockerEnvironmentManager (implement ensure_fallback_image() that wraps
_ensure_fallback_image and register_temp_dir(temp_dir) that appends to the
internal list) and then replace the direct accesses to _ensure_fallback_image
and _temp_dirs with those public methods while still using
docker_manager.FALLBACK_IMAGE and temp_dir/instance_id as before.
src/mcpbr/benchmarks/deadcode.py (2)

455-481: Nice fallback extraction logic - duplicated from SupermodelBenchmark.

The _extract_findings_from_text method here is nearly identical to the one in SupermodelBenchmark. This is copy-paste duplication.

Consider extracting this to a shared utility, maybe in evaluation.py or a new utils.py:

# In evaluation.py or utils.py
def extract_dead_code_from_text(text: str) -> list[dict]:
    """Extract dead_code array from text containing JSON."""
    ...

Not urgent since it's only in two places, but if you add more benchmarks that need this pattern, it'd be worth consolidating.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/deadcode.py` around lines 455 - 481, The
_extract_findings_from_text method is duplicated (also in SupermodelBenchmark);
refactor by extracting the JSON-array-parsing logic into a shared function like
extract_dead_code_from_text(text: str) -> list[dict[str, Any]] placed in
evaluation.py or a new utils.py, move the extraction code (including the
try/except for json.JSONDecodeError/ValueError and the same bracket-matching
logic) into that function, update _extract_findings_from_text in this class and
the SupermodelBenchmark to call the new extract_dead_code_from_text, and add the
necessary import where used so both benchmarks reuse the single implementation.

49-50: shutil.rmtree could fail on permission issues.

If corpus_dir exists but has permission problems (e.g., some files are read-only), shutil.rmtree will raise an exception. This would crash the task loading process rather than gracefully degrading.

Since this is a cache directory that the user might not care about, you could wrap this in a try/except:

🔧 Handle rmtree failures gracefully
         if corpus_dir.exists():
-            shutil.rmtree(corpus_dir)
+            try:
+                shutil.rmtree(corpus_dir)
+            except OSError as e:
+                # If we can't remove the old cache, try cloning anyway
+                import logging
+                logging.getLogger("mcpbr.deadcode").warning(f"Could not remove old cache: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/deadcode.py` around lines 49 - 50, Wrap the call to
shutil.rmtree(corpus_dir) in a try/except so permission errors don't crash
loading: in src/mcpbr/benchmarks/deadcode.py surround the existing if
corpus_dir.exists(): shutil.rmtree(corpus_dir) with a try block, catch
OSError/Exception, and log a non-fatal warning that includes the corpus_dir path
and the exception (e.g. via logging.getLogger(__name__).warning(...)) so the
code continues if deletion fails; alternatively, you can use shutil.rmtree(...,
onerror=...) to attempt os.chmod before retrying, but ensure failure is handled
gracefully and does not propagate.
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py (1)

202-248: Method signature differs from base class - could cause polymorphic call issues.

The base EndpointPlugin.parse_api_response(self, response: dict) -> dict takes just response, but DeadCodePlugin.parse_api_response adds prefilter_types: bool = False.

This works because of the default value, but if the benchmark code ever calls endpoint.parse_api_response(response, True) on a generic EndpointPlugin, you'd get a TypeError for other endpoint implementations.

Looking at benchmark.py line 676, it calls self._endpoint.parse_api_response(raw_response) without the extra arg, so you're safe for now. Just something to keep in mind if you add more endpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 202 -
248, DeadCodePlugin.parse_api_response currently adds a second parameter
prefilter_types which diverges from the base EndpointPlugin.parse_api_response
signature and can break polymorphic calls; change
DeadCodePlugin.parse_api_response to match the base signature
(parse_api_response(self, response: dict) -> dict) and accept the optional flag
via kwargs or response metadata instead (e.g. extract prefilter_types =
kwargs.get("prefilter_types", False) if you add **kwargs, or read
response.get("_prefilter_types", False)), update references to prefilter_types
inside the method accordingly, and ensure calls that previously passed the flag
either pass it via kwargs or set it on the response/metadata so behavior remains
the same while preserving polymorphism with EndpointPlugin.parse_api_response.
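A minimal sketch of the `**kwargs` approach (class bodies trimmed; the `"kind"` field is an assumption for illustration, not the PR's actual schema):

```python
class EndpointPlugin:
    # Base signature as described in the review.
    def parse_api_response(self, response: dict) -> dict:
        raise NotImplementedError

class DeadCodePlugin(EndpointPlugin):
    def parse_api_response(self, response: dict, **kwargs) -> dict:
        # The optional flag rides in via kwargs, so a generic
        # endpoint.parse_api_response(response) call still works
        # for every endpoint implementation.
        prefilter_types = bool(kwargs.get("prefilter_types", False))
        findings = response.get("dead_code", [])
        if prefilter_types:
            findings = [
                f for f in findings
                if f.get("kind") not in ("type", "interface")
            ]
        return {"dead_code": findings}
```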
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mcpbr/benchmarks/deadcode.py`:
- Around line 420-422: DeadCodeBenchmark currently computes false positives as
fp = len(found_set & alive_set) which only counts findings explicitly labeled
alive; change it to count any predicted item not in the dead_set (consistent
with SupermodelBenchmark/compute_prf1) by computing fp = len(found_set -
dead_set) so that any prediction outside the ground-truth dead_set is treated as
an FP; update references around the fp calculation in the DeadCodeBenchmark
function/block (variables: found_set, dead_set, alive_set) and ensure downstream
metrics use the new fp value.

---

Nitpick comments:
In `@src/mcpbr/benchmarks/deadcode.py`:
- Around line 455-481: The _extract_findings_from_text method is duplicated
(also in SupermodelBenchmark); refactor by extracting the JSON-array-parsing
logic into a shared function like extract_dead_code_from_text(text: str) ->
list[dict[str, Any]] placed in evaluation.py or a new utils.py, move the
extraction code (including the try/except for json.JSONDecodeError/ValueError
and the same bracket-matching logic) into that function, update
_extract_findings_from_text in this class and the SupermodelBenchmark to call
the new extract_dead_code_from_text, and add the necessary import where used so
both benchmarks reuse the single implementation.
- Around line 49-50: Wrap the call to shutil.rmtree(corpus_dir) in a try/except
so permission errors don't crash loading: in src/mcpbr/benchmarks/deadcode.py
surround the existing if corpus_dir.exists(): shutil.rmtree(corpus_dir) with a
try block, catch OSError/Exception, and log a non-fatal warning that includes
the corpus_dir path and the exception (e.g. via
logging.getLogger(__name__).warning(...)) so the code continues if deletion
fails; alternatively, you can use shutil.rmtree(..., onerror=...) to attempt
os.chmod before retrying, but ensure failure is handled gracefully and does not
propagate.

In `@src/mcpbr/benchmarks/supermodel/api_client.py`:
- Around line 89-94: Move the lazy imports out of the polling block: add "import
tempfile" (and "import os" if used elsewhere) to the module-level imports at the
top of src/mcpbr/benchmarks/supermodel/api_client.py, then remove the local
"import tempfile" inside the poll_dummy_path handling so the code that creates
the NamedTemporaryFile (poll_dummy.write and poll_dummy.name usage) uses the
top-level tempfile import instead.
- Line 74: Wrap the json.loads(stdout.decode()) calls (the one assigning to
response and the one inside the polling loop) in a try/except catching
json.JSONDecodeError, include the raw decoded stdout (or a snippet) and the
original exception in the error message, and either raise a clearer exception
(e.g., ValueError with context) or log and re-raise so callers see the API
response that caused the parse failure; specifically change the
json.loads(stdout.decode()) usage that sets response and the identical call at
the polling loop to this guarded pattern.
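The guarded-parse pattern from the last bullet could look like this (hypothetical helper name; the real call sites would inline or share it):

```python
import json

def parse_api_stdout(stdout: bytes) -> dict:
    # Surface the offending payload instead of a bare JSONDecodeError
    # with no context about what the API actually returned.
    text = stdout.decode(errors="replace")
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        snippet = text[:200]
        raise ValueError(
            f"Supermodel API returned non-JSON output: {snippet!r}"
        ) from e
```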

In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 151-157: The code filters tasks using filter_category but checks
t.get("language", ...), which is confusing; either change the key to "category"
or rename the filter to match "language". Update the filtering line inside the
function that uses filter_category (the list comprehension over tasks
referencing t.get("language", "typescript")) to t.get("category",
"default_category") if you want to keep the filter name, or rename the parameter
filter_category -> filter_language and update all callers and the comprehension
to use the new name so the intent and field name align.
- Around line 615-639: The git subprocess.run invocations that initialize and
configure the repo (the calls using ["git", "init"], ["git", "config", ...],
["git", "add", "-A"], and ["git", "commit", "-m", "Initial"] with
cwd=host_workdir) lack timeouts; update each subprocess.run to include a
reasonable timeout (e.g., timeout=30) to match the pattern used in
DeadCodeBenchmark and prevent hangs, keeping capture_output=True and check=False
unchanged.
- Around line 574-581: In the except block that currently logs "Failed to get
Supermodel analysis for {instance_id}: {e}" (the handler catching Exception as e
for Supermodel analysis), change the log call to include the traceback—either
replace logger.error(...) with logger.exception(...) or call logger.error(...,
exc_info=True)—so the full stack trace for the failure of instance_id is
recorded in the logs (leave the stderr print as-is).
- Line 83: The temp work directory created in __init__ as self._work_dir via
tempfile.mkdtemp() is never cleaned up; add explicit cleanup by either (a)
switching to tempfile.TemporaryDirectory and storing the object (e.g.
self._tmpdir) so the directory is removed when the object is closed or deleted,
or (b) add a public cleanup() method that calls shutil.rmtree(self._work_dir)
(guarded by exists()), and call cleanup from any teardown path; make sure to
import shutil if using rmtree and update any code that relied on self._work_dir
to refer to self._tmpdir.name if you choose the TemporaryDirectory approach.
- Around line 361-365: This code accesses DockerEnvironmentManager internals by
calling docker_manager._ensure_fallback_image() and appending to
docker_manager._temp_dirs; instead call a public API (e.g.,
docker_manager.ensure_fallback_image()) and use a public method to register temp
dirs (e.g., docker_manager.register_temp_dir(temp_dir) or
docker_manager.create_temp_dir(prefix)) or, if those methods don't exist, add
them to DockerEnvironmentManager (implement ensure_fallback_image() that wraps
_ensure_fallback_image and register_temp_dir(temp_dir) that appends to the
internal list) and then replace the direct accesses to _ensure_fallback_image
and _temp_dirs with those public methods while still using
docker_manager.FALLBACK_IMAGE and temp_dir/instance_id as before.
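For the leaked `_work_dir` noted above, the `TemporaryDirectory` variant is a one-line swap plus a `close()` hook (minimal sketch; the real constructor arguments are omitted):

```python
import tempfile
from pathlib import Path

class SupermodelBenchmark:
    # Only the work-dir handling is shown.
    def __init__(self) -> None:
        # TemporaryDirectory owns the directory and removes it on
        # cleanup() or, failing that, via its finalizer at exit.
        self._tmpdir = tempfile.TemporaryDirectory(prefix="mcpbr_supermodel_")
        self._work_dir = Path(self._tmpdir.name)

    def close(self) -> None:
        # Idempotent explicit cleanup for teardown paths.
        self._tmpdir.cleanup()
```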

In `@src/mcpbr/benchmarks/supermodel/endpoints/base.py`:
- Around line 91-94: should_skip_file currently calls re.search(p, filepath) for
each pattern and can raise re.error on malformed regexes; update
should_skip_file to either precompile skip_patterns (using re.compile) and catch
re.error during compilation or wrap the per-pattern re.search in a try/except
catching re.error, then handle invalid patterns by logging a warning or skipping
that pattern and continuing (or re-raising a clearer ValueError with context) so
a bad regex in skip_patterns doesn't crash the function.
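The precompile-and-warn option might be sketched as (helper names are assumptions; the real `should_skip_file` is a method on the plugin):

```python
import logging
import re

logger = logging.getLogger("mcpbr.endpoints")

def compile_skip_patterns(skip_patterns: list[str]) -> list[re.Pattern]:
    # Compile once up front; log and drop malformed patterns instead of
    # letting re.error escape from every should_skip_file call.
    compiled: list[re.Pattern] = []
    for p in skip_patterns:
        try:
            compiled.append(re.compile(p))
        except re.error as e:
            logger.warning("Ignoring invalid skip pattern %r: %s", p, e)
    return compiled

def should_skip_file(filepath: str, patterns: list[re.Pattern]) -> bool:
    return any(p.search(filepath) for p in patterns)
```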

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:
- Around line 202-248: DeadCodePlugin.parse_api_response currently adds a second
parameter prefilter_types which diverges from the base
EndpointPlugin.parse_api_response signature and can break polymorphic calls;
change DeadCodePlugin.parse_api_response to match the base signature
(parse_api_response(self, response: dict) -> dict) and accept the optional flag
via kwargs or response metadata instead (e.g. extract prefilter_types =
kwargs.get("prefilter_types", False) if you add **kwargs, or read
response.get("_prefilter_types", False)), update references to prefilter_types
inside the method accordingly, and ensure calls that previously passed the flag
either pass it via kwargs or set it on the response/metadata so behavior remains
the same while preserving polymorphism with EndpointPlugin.parse_api_response.

In `@src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py`:
- Around line 1-26: The extract_ground_truth method in TestCoveragePlugin is
missing type annotations; update TestCoveragePlugin.extract_ground_truth
signature to match the base class by adding parameter and return type hints
(repo: str, pr_number: int, language: str = "typescript", scope_prefix: str |
None = None) and a return type of list[dict] so the method signature aligns with
the base EndpointPlugin definition.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py`:
- Around line 71-88: The synchronous get_pre_merge_commit function currently
calls subprocess.run without a timeout; update the subprocess.run invocation in
get_pre_merge_commit to include a timeout (e.g., timeout=30) so the call cannot
hang indefinitely, and handle the TimeoutExpired exception by raising a
RuntimeError with a clear message including the merge_commit and the exception
details; ensure you modify only the subprocess.run call and add the
TimeoutExpired exception handling near get_pre_merge_commit so behavior matches
the async functions' timeout protection.
- Around line 169-179: The current loop that builds prefixed_excludes blindly
prepends scope_prefix to patterns (base_excludes and exclude_patterns), which
corrupts absolute paths like "/node_modules/*"; update the logic in the block
that constructs prefixed_excludes (the loop referencing scope_prefix,
base_excludes, exclude_patterns and variable pattern) to first detect absolute
patterns (use os.path.isabs(pattern) or equivalent) and skip prefixing for
those, i.e., if pattern is absolute leave it unchanged, otherwise apply the
existing scope_prefix-prefixing check; ensure the rest of the code that extends
cmd with "-x" uses the adjusted prefixed_excludes.
- Around line 122-141: _zip_repo_git_archive currently ignores any
exclude_patterns; update it to accept an exclude_patterns parameter (same shape
as used by _zip_repo_fallback) and apply those exclusions even when using git.
Implement by listing tracked files with "git ls-files -z" (run from repo_dir),
filter that list against exclude_patterns, and then create the zip from the
remaining paths (either by passing the filtered paths to git archive if possible
or by building the zip in Python from those files); reference
_zip_repo_git_archive and _zip_repo_fallback so the caller behavior is
consistent and exclusions are applied in both code paths.
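The absolute-pattern guard from the second bullet can be sketched as (hypothetical helper; the real code builds the list inline):

```python
import os

def prefix_excludes(scope_prefix: str, patterns: list[str]) -> list[str]:
    # Prefix only relative patterns; absolute ones like
    # "/node_modules/*" pass through unchanged.
    prefixed = []
    for pattern in patterns:
        if os.path.isabs(pattern):
            prefixed.append(pattern)
        else:
            prefixed.append(f"{scope_prefix.rstrip('/')}/{pattern}")
    return prefixed
```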

In `@src/mcpbr/config.py`:
- Around line 877-911: Add a Pydantic validator to enforce resolved_threshold is
within [0.0, 1.0]: implement a method (e.g. validate_resolved_threshold) on the
same config model that declares the resolved_threshold Field, use
`@validator`("resolved_threshold") to check the value is >= 0.0 and <= 1.0 and
raise ValueError if not, so assignments like resolved_threshold = 1.5 are
rejected at model validation time.
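With Pydantic v2 the decorator is `field_validator` rather than the v1 `@validator` named above (or, simpler still, `Field(ge=0.0, le=1.0)`). A minimal sketch, assuming a model shape like the real config:

```python
from pydantic import BaseModel, field_validator

class SupermodelConfig(BaseModel):
    # Hypothetical subset of the real config model.
    resolved_threshold: float = 0.8

    @field_validator("resolved_threshold")
    @classmethod
    def _check_threshold(cls, v: float) -> float:
        # Reject out-of-range values at model construction time.
        if not 0.0 <= v <= 1.0:
            raise ValueError(
                f"resolved_threshold must be in [0.0, 1.0], got {v}"
            )
        return v
```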

In `@src/mcpbr/pricing.py`:
- Around line 34-54: Update the header that states "Pricing is per million
tokens (MTok) and is current as of January 2026" to reflect March 2026 so it
matches the newly added Claude 4.6 entries (e.g., the ModelPricing entries
"claude-opus-4-6" and "claude-sonnet-4-6"); locate the top-of-file
header/comment in the same module that describes the pricing date and change
"January 2026" to "March 2026".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 43bdb941-4a9a-45b7-8d00-46b0623b59a9

📥 Commits

Reviewing files that changed from the base of the PR and between ad016d8 and c05fa91.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • Dockerfile
  • pyproject.toml
  • src/mcpbr/__init__.py
  • src/mcpbr/benchmarks/__init__.py
  • src/mcpbr/benchmarks/deadcode.py
  • src/mcpbr/benchmarks/supermodel/__init__.py
  • src/mcpbr/benchmarks/supermodel/api_client.py
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/__init__.py
  • src/mcpbr/benchmarks/supermodel/endpoints/base.py
  • src/mcpbr/benchmarks/supermodel/endpoints/circular_deps.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/benchmarks/supermodel/endpoints/impact_analysis.py
  • src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py
  • src/mcpbr/benchmarks/supermodel/evaluation.py
  • src/mcpbr/benchmarks/supermodel/git_utils.py
  • src/mcpbr/config.py
  • src/mcpbr/docker_env.py
  • src/mcpbr/harness.py
  • src/mcpbr/models.py
  • src/mcpbr/pricing.py

Copy link
Copy Markdown
Contributor

@jonathanpopham jonathanpopham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — Ghost spectral analysis

+2284 / -382 across 22 files. Architecturally sound plugin-based design with solid separation of PR-based vs corpus-based benchmarks.

Critical (fix before merge)

  1. FP calculation wrong in DeadCodeBenchmark — undercounts false positives, inflates precision
  2. Task dict mutation — breaks A/B baseline/enhanced comparisons
  3. Sync subprocess.run in async context — blocks event loop, no timeout
  4. Path traversal in corpus file writing — no containment check
  5. API key in process table — `curl -H` exposes the key in `ps aux`

Important

  1. Hardcoded "dead_code" key makes multi-analysis-type support non-functional
  2. _work_dir temp directory leaks on every run
  3. json.loads on raw curl output with no error handling
  4. Only globs *.ts — misses .tsx, .js, .jsx, .mjs
  5. Git subprocess calls missing timeouts (unlike DeadCodeBenchmark which has them)

Minor / Cleanup

  1. _score_and_cap_candidates is dead code
  2. prefilter_types parameter is dead — duplicated filtering logic
  3. reason_filtered always reports 0
  4. Broad S608 lint suppression covers all benchmark files
  5. Significant code duplication between the two benchmarks: _extract_findings_from_text, REPORT_PLACEHOLDER, git init sequence, Docker manager private attribute access — worth extracting to shared utilities

alive_set = {(a.get("file", ""), a.get("name", "")) for a in expected_alive}

tp = len(found_set & dead_set)
fp = len(found_set & alive_set)

Bug: FP calculation undercounts false positives — corrupts precision metric.

fp = len(found_set & alive_set)

This only counts FPs that happen to be in the explicit alive_set. If the agent reports a symbol that's in neither dead_set nor alive_set (hallucinated, or just not in ground truth), it's silently ignored — not counted as TP or FP. This inflates precision.

SupermodelBenchmark does it correctly via compute_prf1: fp = len(pred_set - gt_set). The two benchmarks produce incomparable precision metrics for the same concept.

Fix:

fp = len(found_set - dead_set)
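A worked example of the difference (hypothetical symbols):

```python
# Ground truth: two dead symbols, one explicitly alive.
dead_set = {("a.ts", "foo"), ("a.ts", "bar")}
alive_set = {("b.ts", "baz")}
# Prediction: one true positive plus one hallucinated symbol.
found_set = {("a.ts", "foo"), ("c.ts", "ghost")}

tp = len(found_set & dead_set)          # 1
fp_buggy = len(found_set & alive_set)   # 0 -- the hallucination vanishes
fp_fixed = len(found_set - dead_set)    # 1 -- it counts against precision

precision_buggy = tp / (tp + fp_buggy)  # 1.0 (inflated)
precision_fixed = tp / (tp + fp_fixed)  # 0.5
```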

"problem_statement_enhanced", task["problem_statement"]
)
else:
task["problem_statement"] = task.get(

Bug: Mutating shared task dict breaks A/B comparisons.

This overwrites task["problem_statement"] in-place on the original dict returned by load_tasks(). If the same task is used for both baseline and enhanced conditions (standard A/B benchmarking), the second call reads the already-overwritten value — the baseline agent could get the enhanced prompt or vice versa.

Fix: Copy before mutating:

task = {**task}
if is_mcp:
    task["problem_statement"] = task.get("problem_statement_enhanced", task["problem_statement"])

Returns:
SHA of the first parent commit.
"""
result = subprocess.run(

Bug: Synchronous subprocess.run in async context blocks the event loop.

Every other git/subprocess call in this module uses asyncio.create_subprocess_exec. This one uses blocking subprocess.run and is called from async create_environment(). It will stall the event loop on slow GitHub API responses.

Also missing a timeout parameter — if gh hangs (rate limiting, auth prompt), it blocks the entire process indefinitely.

Fix: Use asyncio.create_subprocess_exec with asyncio.wait_for(..., timeout=30).
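A sketch of that pattern (assumes the call site can be awaited; the wrapper name is an assumption):

```python
import asyncio

async def run_git_async(args: list[str], cwd: str, timeout: float = 30) -> str:
    # Non-blocking replacement for subprocess.run: the event loop keeps
    # running while git works, and a hang is cut off by wait_for.
    proc = await asyncio.create_subprocess_exec(
        "git", *args, cwd=cwd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise RuntimeError(f"git {' '.join(args)} timed out after {timeout}s")
    if proc.returncode != 0:
        raise RuntimeError(f"git {' '.join(args)} failed: {stderr.decode().strip()}")
    return stdout.decode().strip()
```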


# Write all files including REPORT.json
for file_path, content in repo_content.items():
full_path = Path(host_workdir) / file_path

Security: Potential path traversal.

If corpus repo_content keys contain ../../etc/passwd, Path(host_workdir) / file_path resolves outside the workdir. No path sanitization or containment check.

Since the corpus is loaded from a cloneable git repo, this is exploitable if the corpus is tampered with.

Fix: Resolve and verify containment:

full_path = (Path(host_workdir) / file_path).resolve()
assert full_path.is_relative_to(Path(host_workdir).resolve())
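Since `assert` is stripped under `python -O`, a variant that raises is safer (hypothetical helper name):

```python
from pathlib import Path

def safe_join(root: Path, rel: str) -> Path:
    # Resolve both sides, then verify the target stays inside root.
    root = root.resolve()
    target = (root / rel).resolve()
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workdir: {rel!r}")
    return target
```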

f"Idempotency-Key: {idempotency_key}",
]
if api_key:
headers.extend(["-H", f"X-Api-Key: {api_key}"])

Security: API key visible in process table.

Shelling out to curl with -H "X-Api-Key: ..." exposes the key in ps aux / /proc/*/cmdline. Consider using httpx/aiohttp (already in the dep tree), or at minimum pass the header via --header @<(echo ...) or a temp file with restrictive permissions.
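If switching to httpx isn't immediate, the temp-file route can be sketched like this (`build_curl_cmd` is a hypothetical helper; `-H @file` needs curl 7.55+):

```python
import os
import stat
import tempfile

def build_curl_cmd(url: str, api_key: str) -> tuple[list[str], str]:
    # Write the sensitive header to a 0600 temp file and reference it as
    # -H @file, keeping the key out of /proc/*/cmdline. The caller must
    # unlink the returned header_file after the request completes.
    fd, header_file = tempfile.mkstemp(prefix="mcpbr_hdr_")
    os.write(fd, f"X-Api-Key: {api_key}\n".encode())
    os.close(fd)
    os.chmod(header_file, stat.S_IRUSR | stat.S_IWUSR)
    cmd = ["curl", "-sS", "-H", f"@{header_file}", url]
    return cmd, header_file
```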

# failure tags real dead code with the same reason as framework-
# wired code). See issue #676 for details.
root_files = metadata.get("rootFilesCount", 0) or 0
reason_filtered = 0

Misleading metric: reason_filtered is always 0.

Initialized here but never incremented anywhere. Reported in metadata as "reasonFiltered": reason_filtered which always shows 0. Either implement the filtering or remove the variable to avoid confusion.

than a false positive.
"""

def parse_api_response(self, response: dict, prefilter_types: bool = False) -> dict:

Dead parameter: prefilter_types is never True in practice.

Defaults to False and the only caller (benchmark.py) never passes True. The type/interface filtering that this would enable is done separately in create_environment() instead. The two filtering paths are duplicated but disconnected — maintaining both is a hazard.

Suggestion: Pick one location for the filtering logic and remove the other.

)

# Init git so the harness can track modifications
subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False)

Missing timeouts on git subprocess calls.

DeadCodeBenchmark correctly uses timeout=30 on all five git init/config/add/commit calls. These have no timeout. git add -A on a large repo can hang indefinitely.

pyproject.toml Outdated
"tests/**/*.py" = ["S", "T20"]
"infrastructure/**/*.py" = ["S603", "S607"]
"src/mcpbr/infrastructure/**/*.py" = ["S603", "S607", "S108"]
"src/mcpbr/benchmarks/**/*.py" = ["S608"]

Overly broad lint suppression.

S608 (SQL injection detection) is suppressed for the entire benchmarks/ directory. The PR description says this is for false positives on f-string prompt templates, but the suppression covers all current and future benchmark files. Should be scoped to the specific files that need it:

"src/mcpbr/benchmarks/deadcode.py" = ["S608"]
"src/mcpbr/benchmarks/supermodel/benchmark.py" = ["S608"]

raw_b = item.get(fb, "")
a = normalize_path(raw_a) if fa in path_like_fields else normalize_name(raw_a)
b = normalize_path(raw_b) if fb in path_like_fields else normalize_name(raw_b)
if a and b:

Silent data loss: items with empty fields are dropped.

If a ground truth or prediction item has an empty name (e.g. a default export, or an unnamed function), it's silently excluded from the comparison set. This could artificially inflate recall.

Consider logging a warning when items are dropped, or using a sentinel value for empty names.

…ment

The `is_mcp` flag was never passed from the harness (Protocol signature doesn't
include it), so the analysis JSON was never placed in the workdir and agents
always ran with the baseline prompt instead of the enhanced analysis prompt.

Remove the `is_mcp` parameter to match the Benchmark Protocol signature, and
always place the Supermodel analysis JSON and use the enhanced prompt. The
SupermodelBenchmark is always used in enhanced (MCP) mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
src/mcpbr/benchmarks/supermodel/benchmark.py (5)

29-33: ⚠️ Potential issue | 🟠 Major

Hardcoded "dead_code" key breaks other analysis types.

Line 30 and Line 704 force dead-code schema even when endpoint is impact_analysis, test_coverage, or circular_deps. This yields empty findings for non-dead-code modes.

Suggested fix
-REPORT_PLACEHOLDER = """{
-  "dead_code": [],
-  "analysis_complete": false
-}
-"""
+def _report_placeholder(findings_key: str) -> str:
+    return json.dumps({findings_key: [], "analysis_complete": False}, indent=2)
-        report_path.write_text(REPORT_PLACEHOLDER)
+        findings_key = self._endpoint.name()
+        report_path.write_text(_report_placeholder(findings_key))
-                agent_findings = report.get("dead_code", [])
+                findings_key = self._endpoint.name()
+                agent_findings = report.get(findings_key, [])

Also applies to: 704-704

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 29 - 33, The
REPORT_PLACEHOLDER currently hardcodes a dead-code schema (keys "dead_code" and
"analysis_complete") which forces that shape for all analysis endpoints and
causes empty findings for non-dead-code modes; update the logic that emits
REPORT_PLACEHOLDER (symbol REPORT_PLACEHOLDER) so it builds a placeholder
dynamically based on the active endpoint (e.g., "impact_analysis",
"test_coverage", "circular_deps") or remove the dead_code key and return a
generic empty result structure per endpoint, ensuring the placeholder structure
matches each analysis type's expected keys rather than always using "dead_code".

330-332: ⚠️ Potential issue | 🟠 Major

In-place task mutation can corrupt baseline/enhanced A/B runs.

Line 330 mutates the original task dict, so a second pass can inherit the wrong prompt condition. Use a shallow copy before rewriting problem_statement.

Suggested fix
-        # Always use the enhanced prompt (with analysis JSON)
-        task["problem_statement"] = task.get(
+        # Always use the enhanced prompt (with analysis JSON)
+        task = dict(task)
+        task["problem_statement"] = task.get(
             "problem_statement_enhanced", task["problem_statement"]
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 330 - 332, The
code mutates the original task dict by assigning to task["problem_statement"]
which can leak prompt conditions between runs; instead create a shallow copy of
the task before modifying it (e.g., new_task = task.copy()) and rewrite
new_task["problem_statement"] using the existing
task.get("problem_statement_enhanced", task["problem_statement"]) expression,
then use new_task for downstream processing (refer to the variable task and the
"problem_statement" / "problem_statement_enhanced" keys in benchmark.py).

467-467: ⚠️ Potential issue | 🟡 Minor

reasonFiltered is always reported as 0.

Line 467 initializes reason_filtered but it is never incremented, yet Line 505 and Line 562 report it as a meaningful metric.

Also applies to: 505-505, 562-563

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` at line 467, The variable
reason_filtered is initialized but never incremented, so update the code paths
that determine a filtered reason (the branches that later report reason_filtered
at the points corresponding to the logic around where reports are made) to
increment reason_filtered whenever an item is filtered for a reason;
specifically, locate the conditional branches that set or log the filtering
reason (referencing the reason_filtered variable and the report sites around
lines where reports occur) and add reason_filtered += 1 in each branch (or
ensure the increment happens centrally where a filter decision is made), taking
care to respect scope (declare nonlocal/global if inside a nested function) so
the counter reflects actual filtered items.

605-629: ⚠️ Potential issue | 🟠 Major

Git subprocess setup needs timeouts and stronger execution safety.

At Line 605–629, all git commands run with check=False and no timeout, so hangs/failures are silently ignored. Also using "git" relies on PATH lookup (S607).

Suggested fix
+        git_bin = shutil.which("git") or "git"
-        subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False)
+        subprocess.run([git_bin, "init"], cwd=host_workdir, capture_output=True, check=True, timeout=30)
@@
-            ["git", "config", "user.email", "mcpbr@test.com"],
+            [git_bin, "config", "user.email", "mcpbr@test.com"],
             cwd=host_workdir,
             capture_output=True,
-            check=False,
+            check=True,
+            timeout=30,
@@
-            ["git", "config", "user.name", "MCPBR"],
+            [git_bin, "config", "user.name", "MCPBR"],
             cwd=host_workdir,
             capture_output=True,
-            check=False,
+            check=True,
+            timeout=30,
@@
-            ["git", "add", "-A"],
+            [git_bin, "add", "-A"],
             cwd=host_workdir,
             capture_output=True,
-            check=False,
+            check=True,
+            timeout=30,
@@
-            ["git", "commit", "-m", "Initial"],
+            [git_bin, "commit", "-m", "Initial"],
             cwd=host_workdir,
             capture_output=True,
-            check=False,
+            check=True,
+            timeout=30,
#!/bin/bash
# Verify current git subprocess patterns and compare with other benchmark implementations
rg -n 'subprocess\.run\(' src/mcpbr/benchmarks/supermodel/benchmark.py -C2
rg -n 'subprocess\.run\(' src/mcpbr/benchmarks/deadcode.py -C2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 605 - 629, The git
invocations in the subprocess.run block (the git init/config/add/commit calls
using host_workdir) currently use check=False and no timeout and rely on "git"
via PATH; update these calls to (1) resolve the git binary with
shutil.which("git") and raise a clear error if not found, (2) use check=True (or
call subprocess.run inside a try/except that catches
subprocess.CalledProcessError) and (3) add a reasonable timeout (e.g. a few
seconds) so failures and hangs surface instead of being silently ignored; apply
this change to the sequence of subprocess.run calls that perform git init, git
config, git add, and git commit in benchmark.py and ensure errors are
logged/propagated appropriately.

83-83: ⚠️ Potential issue | 🟠 Major

Temp workdir is leaked (no cleanup path).

Line 83 creates a process-lifetime temp dir and never removes it. This accumulates clones/zips across runs.

Suggested fix
 class SupermodelBenchmark:
@@
     def __init__(...):
@@
         self._work_dir = Path(tempfile.mkdtemp(prefix="mcpbr_supermodel_"))
+
+    def close(self) -> None:
+        if self._work_dir.exists():
+            shutil.rmtree(self._work_dir, ignore_errors=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` at line 83, The temp workdir
created by tempfile.mkdtemp at self._work_dir is leaked; change initialization
to use a managed TemporaryDirectory or add an explicit cleanup path: create a
tempfile.TemporaryDirectory() and store the context object (e.g.,
self._tmpdir_obj) and set self._work_dir = Path(self._tmpdir_obj.name), or keep
the mkdtemp but implement a cleanup method (e.g., close/cleanup on your
Benchmark class) that calls shutil.rmtree(self._work_dir) and register it with
atexit.register or ensure callers invoke it (and update __del__ or
context-manager support if appropriate) so the directory is removed when the
process or object ends; reference symbols: self._work_dir, tempfile.mkdtemp,
tempfile.TemporaryDirectory, shutil.rmtree, and add a cleanup/close method on
the class.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 565-566: Replace the broad "except Exception as e" that logs
"Failed to get Supermodel analysis for {instance_id}" with narrow exception
handlers for the expected errors (for example catch RuntimeError, OSError and
json.JSONDecodeError explicitly) and log the same message inside those handlers
(referencing instance_id and logger.error); do not swallow other exceptions—let
unexpected exceptions propagate (or re-raise) so genuine bugs surface. Ensure
the handlers reference the same logging message and include the caught exception
details.

---

Duplicate comments:
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 29-33: The REPORT_PLACEHOLDER currently hardcodes a dead-code
schema (keys "dead_code" and "analysis_complete") which forces that shape for
all analysis endpoints and causes empty findings for non-dead-code modes; update
the logic that emits REPORT_PLACEHOLDER (symbol REPORT_PLACEHOLDER) so it builds
a placeholder dynamically based on the active endpoint (e.g., "impact_analysis",
"test_coverage", "circular_deps") or remove the dead_code key and return a
generic empty result structure per endpoint, ensuring the placeholder structure
matches each analysis type's expected keys rather than always using "dead_code".
- Around line 330-332: The code mutates the original task dict by assigning to
task["problem_statement"] which can leak prompt conditions between runs; instead
create a shallow copy of the task before modifying it (e.g., new_task =
task.copy()) and rewrite new_task["problem_statement"] using the existing
task.get("problem_statement_enhanced", task["problem_statement"]) expression,
then use new_task for downstream processing (refer to the variable task and the
"problem_statement" / "problem_statement_enhanced" keys in benchmark.py).
- Line 467: The variable reason_filtered is initialized but never incremented,
so update the code paths that determine a filtered reason (the branches that
later report reason_filtered at the points corresponding to the logic around
where reports are made) to increment reason_filtered whenever an item is
filtered for a reason; specifically, locate the conditional branches that set or
log the filtering reason (referencing the reason_filtered variable and the
report sites around lines where reports occur) and add reason_filtered += 1 in
each branch (or ensure the increment happens centrally where a filter decision
is made), taking care to respect scope (declare nonlocal/global if inside a
nested function) so the counter reflects actual filtered items.
- Around line 605-629: The git invocations in the subprocess.run block (the git
init/config/add/commit calls using host_workdir) currently use check=False and
no timeout and rely on "git" via PATH; update these calls to (1) resolve the git
binary with shutil.which("git") and raise a clear error if not found, (2) use
check=True (or call subprocess.run inside a try/except that catches
subprocess.CalledProcessError) and (3) add a reasonable timeout (e.g. a few
seconds) so failures and hangs surface instead of being silently ignored; apply
this change to the sequence of subprocess.run calls that perform git init, git
config, git add, and git commit in benchmark.py and ensure errors are
logged/propagated appropriately.
- Line 83: The temp workdir created by tempfile.mkdtemp at self._work_dir is
leaked; change initialization to use a managed TemporaryDirectory or add an
explicit cleanup path: create a tempfile.TemporaryDirectory() and store the
context object (e.g., self._tmpdir_obj) and set self._work_dir =
Path(self._tmpdir_obj.name), or keep the mkdtemp but implement a cleanup method
(e.g., close/cleanup on your Benchmark class) that calls
shutil.rmtree(self._work_dir) and register it with atexit.register or ensure
callers invoke it (and update __del__ or context-manager support if appropriate)
so the directory is removed when the process or object ends; reference symbols:
self._work_dir, tempfile.mkdtemp, tempfile.TemporaryDirectory, shutil.rmtree,
and add a cleanup/close method on the class.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 53afd5d4-a7e1-480c-b6b3-1d5d22be9b3f

📥 Commits

Reviewing files that changed from the base of the PR and between c05fa91 and f071523.

📒 Files selected for processing (1)
  • src/mcpbr/benchmarks/supermodel/benchmark.py

Comment on lines +565 to +566
except Exception as e:
logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")

⚠️ Potential issue | 🟡 Minor

Catch narrower exceptions here.

Line 565 catches Exception, which can mask programming bugs and make retries/debugging noisy. Catch expected failures (RuntimeError, OSError, json.JSONDecodeError) and let unexpected ones surface.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 565-565: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 565-566, replace
the broad "except Exception as e" that logs "Failed to get Supermodel analysis
for {instance_id}" with narrow exception handlers for the expected errors (for
example catch RuntimeError, OSError and json.JSONDecodeError explicitly) and log
the same message inside those handlers (referencing instance_id and
logger.error); do not swallow other exceptions—let unexpected exceptions
propagate (or re-raise) so genuine bugs surface. Ensure the handlers reference
the same logging message and include the caught exception details.
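The narrowed handler can be sketched as below — `fetch_analysis` and its `get_analysis` callable are hypothetical stand-ins for the real call site in benchmark.py:

```python
import json
import logging

logger = logging.getLogger(__name__)


def fetch_analysis(get_analysis, instance_id: str):
    """Catch only the expected failure modes; let genuine bugs propagate."""
    try:
        return get_analysis(instance_id)
    except (RuntimeError, OSError, json.JSONDecodeError) as e:
        # Expected failures: API errors, network/filesystem issues, bad JSON.
        logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")
        return None
    # Anything else (TypeError, KeyError, ...) is not caught, so programming
    # errors surface instead of being logged as analysis failures.
```

The point is the tuple in the `except` clause: the log message and fallback behavior stay the same, but an unexpected exception type now crashes loudly.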

Critical fixes:
- Fix FP calculation in DeadCodeBenchmark: fp = len(found_set - dead_set)
  instead of len(found_set & alive_set) which undercounted false positives
- Fix task dict mutation in SupermodelBenchmark.create_environment that
  broke A/B baseline/enhanced comparisons
- Make get_pre_merge_commit async with timeout (was blocking event loop)
- Add path containment check for corpus file writing (path traversal)
- Pass API key via curl --config file instead of CLI arg (process table)

Bug fixes:
- Parameterize hardcoded "dead_code" key via endpoint.findings_key so
  non-dead-code analysis types (impact, circular-deps, etc.) work
- Add timeouts to git subprocess calls in SupermodelBenchmark
- Add json.loads error handling in API client (non-JSON responses)
- Expand file glob from *.ts-only to include .tsx/.js/.jsx/.mjs by language
- Add _work_dir temp directory cleanup via atexit handler
- Log warning on silent git pull failure in corpus update

Cleanup:
- Extract shared utilities (_bench_utils.py): findings parser, git init,
  safe file writing — removes duplication between both benchmarks
- Remove dead code: _score_and_cap_candidates, reason_filtered, unused
  prefilter_types parameter on DeadCodePlugin.parse_api_response
- Narrow S608 lint suppression to specific files
- Add dropped-item logging in evaluation.py build_comparison_set
- Move inline imports (tempfile, os, Counter) to module level

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (2)
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py (1)

13-20: ⚠️ Potential issue | 🟠 Major

Ground truth is still counting type-only removals as dead code.

Lines 17-19 let _parse_diff() capture removed interface / type / enum exports, and Line 248 returns them as GT items unchanged. That conflicts with the prompt contract and with the later prefiltering in src/mcpbr/benchmarks/supermodel/benchmark.py, so an agent can follow the instructions and still lose recall.

Suggested fix
 TS_PATTERNS = [
     (r"^-\s*export\s+(?:async\s+)?function\s+(\w+)", "function"),
     (r"^-\s*export\s+class\s+(\w+)", "class"),
     (r"^-\s*export\s+const\s+(\w+)\s*[=:]", "const"),
-    (r"^-\s*export\s+interface\s+(\w+)", "interface"),
-    (r"^-\s*export\s+type\s+(\w+)\s*[=<{]", "type"),
-    (r"^-\s*export\s+enum\s+(\w+)", "enum"),
     (r"^-\s*export\s+default\s+(?:async\s+)?function\s+(\w+)", "function"),
     (r"^-\s*export\s+default\s+class\s+(\w+)", "class"),
 ]

Also applies to: 237-248

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 13-20,
the ground-truth generation is treating removed TypeScript-only symbols as dead
code (TS_PATTERNS includes "interface"/"type"/"enum" and _parse_diff returns
them as GT items), which conflicts with later prefiltering; update the logic so
type-only removals are excluded: either remove the ("interface","type","enum")
entries from TS_PATTERNS or add a filter inside _parse_diff to skip items whose
kind is one of {"interface","type","enum"} before they are added to/returned as
GT; ensure the change is applied where GT items are assembled/returned so
benchmark prefiltering and _parse_diff agree.
src/mcpbr/benchmarks/supermodel/benchmark.py (1)

365-385: ⚠️ Potential issue | 🔴 Critical

Enhanced artifacts are always mounted, and failures are silently downgraded.

Unlike DeadCodeBenchmark.create_environment(..., is_mcp: bool), this path always tries to materialize the Supermodel analysis in the workspace. Then Lines 522-528 just log and continue if that setup fails. So a baseline run can still see graph data when setup works, and an enhanced run can silently lose it when setup breaks. Both cases make the comparison noisy; this needs an explicit MCP/baseline gate and a fail-fast path for enhanced setup failures.

Also applies to: 522-529

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 365-385, this
always-attempts-to-materialize behavior should be gated and fail-fast: only
attempt to load or call self._get_analysis when the environment is the
enhanced/MCP path (use the is_mcp boolean from
DeadCodeBenchmark.create_environment(..., is_mcp: bool) or the equivalent flag),
and treat failures during that materialization as fatal for enhanced runs
(raise/propagate an exception) instead of silently logging and continuing; for
baseline runs skip materialization entirely so both baseline and enhanced runs
are symmetric. Locate the logic around task.get("cached_analysis") and the call
to self._get_analysis and add an explicit is_mcp check before trying to
materialize, and change the error handling for the enhanced path to fail-fast
rather than downgrading.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mcpbr/benchmarks/_bench_utils.py`:
- Around line 27-33: The parser currently uses text.find(marker) which picks the
first occurrence (often a removed/`-` array); change the logic in the function
that computes marker/start/arr_start to locate the last valid array after the
key (use text.rfind(marker) or loop to find the last marker occurrence and then
search forward for the next '['), and similarly update the subsequent parsing
used in the block handling lines 38-60 so it selects the array that follows the
last marker (or prefers lines starting with '+' indicating added content) before
extracting the JSON array of findings.

In `@src/mcpbr/benchmarks/deadcode.py`:
- Around line 398-405: The reported "Found" count and returned found metric use
len(agent_findings) while scoring uses the deduplicated found_set; update all
places that log or return the found count to use len(found_set) instead of
len(agent_findings) (e.g., in the block that computes found_set, dead_set,
tp/fp/fn and where precision/recall are calculated), and apply the same
replacement for the later similar block (around the code that computes
precision/recall and calls compute_prf1()) so the logged/returned found value
matches the deduped set used for TP/FP/FN.
- Around line 220-224: The code in _load_raw_tasks pins corpus loading to a
single fixture by calling _load_corpus_task(self._corpus_dir,
"typescript-express-app"); change this to enumerate the task descriptors found
under the cloned corpus .benchmark/ directory (use self._corpus_dir from
_clone_or_update_corpus(self.corpus_path)), build tasks by calling
_load_corpus_task for each descriptor, and assign the resulting list to
self._tasks so sampling and filters apply across the whole corpus; ensure you
preserve existing behavior for filters/sample_size applied later in the method.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 207-265: The _generate_baseline_problem_statement method currently
hard-codes a dead-code template and ignores self._endpoint.baseline_prompt;
change it to first obtain the baseline text from self._endpoint.baseline_prompt
(use that when present) and only construct a fallback language-specific template
when baseline_prompt is None/empty; keep existing language detection (language,
ext) and language-specific hints but merge them into or append to the
endpoint-provided prompt rather than replacing it, and ensure the returned
string respects the ext variable and CRITICAL RULES format expected by
DeadCodePlugin.baseline_prompt.
- Around line 430-445: The entryPoints consumed from analysis_json are not
normalized the same way as candidates, so ep_set ((ep.get("file", ""),
ep.get("name", ""))) can contain paths like "src/foo/bar.ts" while candidates
have had scope_prefix removed (see scope_prefix handling around the
normalization code), causing mismatches; update the entry-point path
normalization to mirror candidate normalization: for each ep file in
entryPoints, strip the same scope_prefix using pathlib/Path semantics (not a raw
startswith) so you remove only a true directory prefix (use Path.relative_to or
compare parts) and produce the same file string format used when building
all_candidates, then rebuild ep_set from those normalized (file,name) tuples so
the filtering of all_candidates by ep_set works correctly.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py`:
- Around line 25-38: The subprocess wait calls using asyncio.wait_for around
proc.communicate() (see proc variable in git clone/zip/gh invocations) need
timeout handling to avoid orphaned child processes: wrap each await
asyncio.wait_for(proc.communicate(), timeout=300) in a try/except
asyncio.TimeoutError, call proc.kill(), await proc.wait() to ensure the child is
reaped, then raise a clear RuntimeError (e.g., "Clone timed out" / "Archive
timed out" / "GH timed out") so the caller sees the timeout; apply the same
pattern to all similar subprocess invocations in this file (the ones using proc,
proc.communicate(), and proc.returncode checks).

---

Duplicate comments:
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 365-385: This always-attempts-to-materialize behavior should be
gated and fail-fast: only attempt to load or call self._get_analysis when the
environment is the enhanced/MCP path (use the is_mcp boolean from
DeadCodeBenchmark.create_environment(..., is_mcp: bool) or the equivalent flag),
and treat failures during that materialization as fatal for enhanced runs
(raise/propagate an exception) instead of silently logging and continuing; for
baseline runs skip materialization entirely so both baseline and enhanced runs
are symmetric. Locate the logic around task.get("cached_analysis") and the call
to self._get_analysis and add an explicit is_mcp check before trying to
materialize, and change the error handling for the enhanced path to fail-fast
rather than downgrading.

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:
- Around line 13-20: The ground-truth generation is treating removed
TypeScript-only symbols as dead code (TS_PATTERNS includes
"interface"/"type"/"enum" and _parse_diff returns them as GT items), which
conflicts with later prefiltering; update the logic so type-only removals are
excluded: either remove the ("interface","type","enum") entries from TS_PATTERNS
or add a filter inside _parse_diff to skip items whose kind is one of
{"interface","type","enum"} before they are added to/returned as GT; ensure the
change is applied where GT items are assembled/returned so benchmark
prefiltering and _parse_diff agree.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad5facef-0a89-4d95-b99b-a06d83f04f59

📥 Commits

Reviewing files that changed from the base of the PR and between f071523 and 06db08a.

📒 Files selected for processing (10)
  • pyproject.toml
  • src/mcpbr/benchmarks/_bench_utils.py
  • src/mcpbr/benchmarks/codegraph.py
  • src/mcpbr/benchmarks/deadcode.py
  • src/mcpbr/benchmarks/supermodel/api_client.py
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/base.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/benchmarks/supermodel/evaluation.py
  • src/mcpbr/benchmarks/supermodel/git_utils.py
✅ Files skipped from review due to trivial changes (3)
  • src/mcpbr/benchmarks/codegraph.py
  • src/mcpbr/benchmarks/supermodel/evaluation.py
  • src/mcpbr/benchmarks/supermodel/api_client.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml

Comment on lines +27 to +33
marker = f'"{findings_key}"'
start = text.find(marker)
if start == -1:
return findings
arr_start = text.find("[", start)
if arr_start == -1:
return findings

⚠️ Potential issue | 🟠 Major

Diff fallback will usually read the deleted REPORT.json array.

This parser grabs the first "{findings_key}" array it sees. In a normal unified diff, that is often the removed placeholder like - "dead_code": [], so fallback scoring returns an empty list even when the patch later adds real findings. Please prefer the last valid array after the key, or explicitly handle added diff lines.

Also applies to: 38-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/_bench_utils.py` around lines 27-33, the parser
currently uses text.find(marker) which picks the first occurrence (often a
removed/`-` array); change the logic in the function that computes
marker/start/arr_start to locate the last valid array after the key (use
text.rfind(marker) or loop to find the last marker occurrence and then search
forward for the next '['), and similarly update the subsequent parsing used in
the block handling lines 38-60 so it selects the array that follows the last
marker (or prefers lines starting with '+' indicating added content) before
extracting the JSON array of findings.
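The `rfind`-based fix can be sketched as below. The bracket-matching loop is an assumption about how _bench_utils.py extracts the array; note that this handles the common single-line case, while multi-line arrays inside `+`-prefixed diff bodies would additionally need the prefix stripped before `json.loads`:

```python
import json


def parse_findings_from_diff(text: str, findings_key: str) -> list:
    """Extract the findings array anchored at the LAST key occurrence,
    so the removed placeholder (e.g. '- "dead_code": []') is skipped."""
    marker = f'"{findings_key}"'
    start = text.rfind(marker)  # last occurrence, not the first
    if start == -1:
        return []
    arr_start = text.find("[", start)
    if arr_start == -1:
        return []
    # Scan forward to the bracket that closes the array.
    depth = 0
    for i in range(arr_start, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[arr_start : i + 1])
                except json.JSONDecodeError:
                    return []
    return []
```

With `find` instead of `rfind`, the sample diff in the test below would match the deleted `[]` placeholder and score zero findings.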

Comment on lines +220 to +224
# Load from corpus repository
try:
self._corpus_dir = _clone_or_update_corpus(self.corpus_path)
task = _load_corpus_task(self._corpus_dir, "typescript-express-app")
self._tasks = [task]

⚠️ Potential issue | 🟠 Major

Corpus mode is pinned to a single fixture.

_load_raw_tasks() clones the corpus and then always loads typescript-express-app. That means the rest of the corpus never participates, and sample_size / task filters mostly just slice one item. If this benchmark is meant to represent the corpus, it should enumerate the task descriptors under .benchmark/ instead of hard-coding one case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/deadcode.py` around lines 220-224, the code in
_load_raw_tasks pins corpus loading to a single fixture by calling
_load_corpus_task(self._corpus_dir, "typescript-express-app"); change this to
enumerate the task descriptors found under the cloned corpus .benchmark/
directory (use self._corpus_dir from _clone_or_update_corpus(self.corpus_path)),
build tasks by calling _load_corpus_task for each descriptor, and assign the
resulting list to self._tasks so sampling and filters apply across the whole
corpus; ensure you preserve existing behavior for filters/sample_size applied
later in the method.
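The enumeration step can be sketched as follows — `load_corpus_task` stands in for the real `_load_corpus_task`, and one-directory-per-task under `.benchmark/` is an assumption about the corpus layout:

```python
from pathlib import Path


def enumerate_corpus_tasks(corpus_dir: Path, load_corpus_task) -> list[dict]:
    """Build a task per descriptor under .benchmark/ instead of pinning
    a single fixture like "typescript-express-app"."""
    bench_dir = corpus_dir / ".benchmark"
    tasks = []
    for descriptor in sorted(bench_dir.iterdir()):  # deterministic order
        if descriptor.is_dir():
            tasks.append(load_corpus_task(corpus_dir, descriptor.name))
    return tasks
```

With the full list assigned to `self._tasks`, the existing `sample_size` and filter logic later in `_load_raw_tasks` applies across the whole corpus unchanged.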

Comment on lines +398 to +405
found_set = {(f.get("file", ""), f.get("name", "")) for f in agent_findings}
dead_set = {(d.get("file", ""), d.get("name", "")) for d in expected_dead}

tp = len(found_set & dead_set)
fp = len(found_set - dead_set)
fn = len(dead_set - found_set)

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

⚠️ Potential issue | 🟡 Minor

found should use the deduped prediction set.

The actual scoring uses found_set, so duplicates do not change TP/FP/FN. But the log output and returned found metric still use len(agent_findings). A response that reports the same symbol twice will print a bigger "Found" count than the one used for precision/recall. Reporting len(found_set) keeps the metrics internally consistent and aligned with compute_prf1().

Also applies to: 412-433

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/deadcode.py` around lines 398-405, the reported
"Found" count and returned found metric use len(agent_findings) while scoring
uses the deduplicated found_set; update all places that log or return the found
count to use len(found_set) instead of len(agent_findings) (e.g., in the block
that computes found_set, dead_set, tp/fp/fn and where precision/recall are
calculated), and apply the same replacement for the later similar block (around
the code that computes precision/recall and calls compute_prf1()) so the
logged/returned found value matches the deduped set used for TP/FP/FN.
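A consistent version of the scoring block can be sketched as follows — a minimal extraction of the logic quoted above, not the full `deadcode.py` method:

```python
def score(agent_findings: list[dict], expected_dead: list[dict]) -> dict:
    """Report 'found' from the same deduplicated set used for TP/FP/FN."""
    found_set = {(f.get("file", ""), f.get("name", "")) for f in agent_findings}
    dead_set = {(d.get("file", ""), d.get("name", "")) for d in expected_dead}

    tp = len(found_set & dead_set)
    fp = len(found_set - dead_set)
    fn = len(dead_set - found_set)

    return {
        "found": len(found_set),  # deduped, NOT len(agent_findings)
        "precision": tp / (tp + fp) if (tp + fp) > 0 else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) > 0 else 0.0,
    }
```

A response that lists the same symbol twice now reports `found == 1`, matching the counts that feed precision/recall.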

Comment on lines +207 to +265
def _generate_baseline_problem_statement(self, task_cfg: dict) -> str:
"""Generate problem statement for the baseline (manual analysis) condition.

The agent must find dead code by reading and searching the codebase directly.
"""
language = task_cfg.get("language", "typescript")

ext = ".ts" if language == "typescript" else ".py"
if language == "python":
lang_hints = """- Functions in __all__ that are never actually imported by other modules
- Cleanup/utility functions whose associated state is never populated"""
else:
lang_hints = """- Exported functions/classes that are never imported by any other module
- Middleware or handlers that are defined but never registered with the router
- Methods on classes where the class itself is never instantiated from live code"""

return f"""You are a code analyst. Find dead code in this {language} codebase.

Dead code = exported functions, classes, methods, interfaces, and constants that
are defined but never used in any meaningful execution path.

{lang_hints}

== STRATEGY ==

STEP 1: Get an overview of the codebase structure.
- List the top-level directories and key source files.
- Identify the main source directories (exclude node_modules, dist, build, tests).

STEP 2: Scan source files for exported symbols.
- Focus on non-test, non-generated source files.
- For each file, note exported functions, classes, interfaces, constants.

STEP 3: For each exported symbol, grep the codebase for references.
- If it only appears in its own definition file (and possibly tests or
barrel/index re-exports), it is likely dead.
- Barrel re-exports (index.ts) do NOT count as real usage.
- Type-only imports do NOT count as real usage.

STEP 4: Write REPORT.json EARLY (after scanning even a few files).
- Write what you have so far, then continue scanning and UPDATE the file.
- This ensures you always produce output even if you run out of iterations.

REPORT.json format:
{{
"dead_code": [
{{"file": "path/to/file{ext}", "name": "unusedFunc", "type": "function", "reason": "no callers found"}},
...
],
"analysis_complete": true
}}

CRITICAL RULES:
- Type should be one of: function, class, method, const, interface, variable.
- When in doubt about whether something is dead, INCLUDE it.
- False positives are acceptable. Missing real dead code is NOT acceptable.
- Write REPORT.json after analyzing each batch — do NOT wait until the end.
- Prioritize breadth over depth: scan ALL source files before deep-diving any one."""


⚠️ Potential issue | 🟠 Major

The baseline prompt is still hard-coded to dead-code rules.

This bypasses self._endpoint.baseline_prompt entirely. Even for analysis_type="dead-code", Line 225 tells the model to include interfaces, which contradicts DeadCodePlugin.baseline_prompt; for any other endpoint, it is just the wrong task. The benchmark should build the baseline prompt from the selected plugin instead of a dead-code-specific template.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 207-265, the
_generate_baseline_problem_statement method currently hard-codes a dead-code
template and ignores self._endpoint.baseline_prompt; change it to first obtain
the baseline text from self._endpoint.baseline_prompt (use that when present)
and only construct a fallback language-specific template when baseline_prompt is
None/empty; keep existing language detection (language, ext) and
language-specific hints but merge them into or append to the endpoint-provided
prompt rather than replacing it, and ensure the returned string respects the ext
variable and CRITICAL RULES format expected by DeadCodePlugin.baseline_prompt.

Comment on lines +430 to +445
# Build entry point set for cross-reference filtering
ep_set = set()
for ep in analysis_json.get("entryPoints", []):
ep_file = ep.get("file", "")
ep_name = ep.get("name", "")
if ep_file and ep_name:
ep_set.add((ep_file, ep_name))

# Drop candidates that match entry points
if ep_set:
before_ep = len(all_candidates)
all_candidates = [
c
for c in all_candidates
if (c.get("file", ""), c.get("name", "")) not in ep_set
]

⚠️ Potential issue | 🟠 Major

Scoped runs are normalizing candidates and entry points differently.

Line 604 strips scope_prefix from candidate file paths, but the entryPoints consumed at Lines 432-445 keep the old paths. A candidate can become foo/bar.ts while the matching entry point stays src/foo/bar.ts, so the live symbol no longer filters out. Also, raw startswith() will happily strip src/app from src/application/...; use a real directory boundary or Path logic here.

Also applies to: 604-611

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 430-445, the
entryPoints consumed from analysis_json are not normalized the same way as
candidates, so ep_set ((ep.get("file", ""), ep.get("name", ""))) can contain
paths like "src/foo/bar.ts" while candidates have had scope_prefix removed (see
scope_prefix handling around the normalization code), causing mismatches; update
the entry-point path normalization to mirror candidate normalization: for each
ep file in entryPoints, strip the same scope_prefix using pathlib/Path semantics
(not a raw startswith) so you remove only a true directory prefix (use
Path.relative_to or compare parts) and produce the same file string format used
when building all_candidates, then rebuild ep_set from those normalized
(file,name) tuples so the filtering of all_candidates by ep_set works correctly.

Comment on lines +25 to +38
proc = await asyncio.create_subprocess_exec(
"git",
"clone",
"--quiet",
"--depth",
"1",
clone_url,
dest,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
_, stderr = await asyncio.wait_for(proc.communicate(), timeout=300)
if proc.returncode != 0:
raise RuntimeError(f"Clone failed: {stderr.decode()}")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Python asyncio, if asyncio.wait_for(proc.communicate(), timeout=...) times out for a subprocess created with asyncio.create_subprocess_exec, is the child process automatically terminated, or does it need to be killed explicitly?

💡 Result:

No. asyncio.wait_for(proc.communicate(), timeout=...) only cancels the awaiting Task when the timeout expires—it does not automatically terminate the underlying OS subprocess, so the child can keep running unless you stop it yourself. The docs explicitly describe wait_for() as cancelling the task on timeout, not killing external resources. [2]

So on timeout you should explicitly terminate/kill the process (and then await proc.wait() to reap it), e.g.:

try:
    out, err = await asyncio.wait_for(proc.communicate(), timeout=5)
except TimeoutError:
    proc.kill()          # or proc.terminate()
    await proc.wait()
    raise

asyncio provides Process.terminate() / Process.kill() for this purpose. [1] Reports/discussion around asyncio subprocess timeouts also note that a wait_for() timeout doesn’t kill the subprocess automatically. [3]

Sources: [1] asyncio subprocess Process.kill/terminate docs, [2] asyncio.wait_for() timeout/cancellation docs, [3] Python issue discussion on subprocess timeout behavior.


Add cleanup for subprocesses that timeout.

When asyncio.wait_for() times out, it cancels the waiting task—but it doesn't automatically kill the child process. So if git clone (or gh or zip) hangs and hits the 300-second timeout, the process keeps running in the background forever. On a long benchmark batch, you'll pile up zombie processes and run out of file descriptors. Bad times.

Here's the fix: wrap proc.communicate() in a try-except, catch the timeout, kill the process, wait for it to finish, then re-raise:

try:
    _, stderr = await asyncio.wait_for(proc.communicate(), timeout=300)
except asyncio.TimeoutError:
    proc.kill()
    await proc.wait()
    raise RuntimeError("Clone timed out")

This applies to all subprocess calls in the file: lines 25–38, 40–54, 56–67, 80–92, 136–145, 185–193.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 25-38, the
subprocess wait calls using asyncio.wait_for around proc.communicate() (see proc
variable in git clone/zip/gh invocations) need timeout handling to avoid
orphaned child processes: wrap each await asyncio.wait_for(proc.communicate(),
timeout=300) in a try/except asyncio.TimeoutError, call proc.kill(), await
proc.wait() to ensure the child is reaped, then raise a clear RuntimeError
(e.g., "Clone timed out" / "Archive timed out" / "GH timed out") so the caller
sees the timeout; apply the same pattern to all similar subprocess invocations
in this file (the ones using proc, proc.communicate(), and proc.returncode
checks).

greynewell and others added 3 commits March 24, 2026 14:59
Precision is not a fair gate for dead-code tasks: the API surfaces many
valid candidates beyond the PR's ground-truth set, so max achievable
precision is bounded by |GT| / |candidates| — often well below 80% for
large repos regardless of agent quality. Recall (did the agent find the
items the PR deleted?) is the meaningful signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The enhanced_prompt_v2 filter script was dropping any candidate whose
reason contained "no references", which is the primary signal for dead
exports. This caused the MCP agent to discard ~99% of valid candidates,
reporting only 14 out of 1,900 for typescript_pr56817 (6.5% recall).

Only drop pure type/interface candidates, which have elevated FP rates
from TypeScript structural typing. "No references" on its own means the
symbol is genuinely unused and should be kept.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove interface/type/enum from TS_PATTERNS so GT only contains
  runtime symbols (functions, classes, consts) — aligns with the prompt
  contract that tells agents not to report type-only declarations, and
  eliminates a class of GT/agent-output mismatches that lower recall
- Add ge/le constraints on resolved_threshold in config (must be 0–1)
- Update resolved_threshold description to reflect recall-only criterion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
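The TS_PATTERNS change can be sketched as follows; the exact regexes and the diff-line matching scheme are assumptions, shown only to illustrate restricting GT to runtime symbols:

```python
import re

# Patterns for deleted exported runtime symbols in a unified diff.
# interface/type/enum are intentionally excluded: type-only declarations
# are not part of the ground truth under the recall-only criterion.
TS_PATTERNS = [
    re.compile(r"^-\s*export\s+(?:async\s+)?function\s+(\w+)"),
    re.compile(r"^-\s*export\s+class\s+(\w+)"),
    re.compile(r"^-\s*export\s+const\s+(\w+)"),
]


def extract_deleted_symbols(diff_lines: list[str]) -> list[str]:
    """Collect names of deleted runtime symbols from diff removal lines."""
    names = []
    for line in diff_lines:
        for pat in TS_PATTERNS:
            m = pat.match(line)
            if m:
                names.append(m.group(1))
    return names


deleted = extract_deleted_symbols([
    "- export function oldHelper() {",
    "- export interface LegacyProps {",
])  # keeps oldHelper, skips the interface
```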

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
src/mcpbr/benchmarks/supermodel/benchmark.py (3)

209-266: ⚠️ Potential issue | 🟠 Major

The baseline prompt is still hardcoded and doesn't use self._endpoint.baseline_prompt.

This was flagged in a previous review. The issue is:

  1. Lines 225-228 tell the agent to find "interfaces" as dead code
  2. But DeadCodePlugin.baseline_prompt (lines 90-91 in dead_code.py) explicitly says "Do NOT include: Type definitions, interfaces, or enums"

So the baseline agent gets conflicting instructions depending on which prompt actually gets used. Since _generate_baseline_problem_statement completely ignores self._endpoint.baseline_prompt, you're not leveraging the endpoint's prompt at all.

A simple fix would be to delegate to the endpoint:

🔧 Suggested fix
 def _generate_baseline_problem_statement(self, task_cfg: dict) -> str:
-    """Generate problem statement for the baseline (manual analysis) condition.
-
-    The agent must find dead code by reading and searching the codebase directly.
-    """
-    language = task_cfg.get("language", "typescript")
-
-    ext = ".ts" if language == "typescript" else ".py"
-    # ... rest of hardcoded template ...
+    """Generate problem statement for the baseline (manual analysis) condition."""
+    # Use the endpoint's baseline prompt if available
+    scope_prefix = task_cfg.get("scope_prefix")
+    return self._endpoint.scope_prompt(self._endpoint.baseline_prompt, scope_prefix)

If you need language-specific customization on top of the endpoint prompt, you could append those hints rather than replacing the entire prompt.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 209 - 266, The
baseline prompt is hardcoded in _generate_baseline_problem_statement which
conflicts with DeadCodePlugin.baseline_prompt; change
_generate_baseline_problem_statement to start from
self._endpoint.baseline_prompt (or use it as the base string) and then append
any language-specific hints/extension (e.g., the lang_hints for
typescript/python) instead of replacing or duplicating instructions, ensuring
you remove or reconcile the hardcoded "interfaces" line so it doesn't contradict
DeadCodePlugin.baseline_prompt; locate the method
_generate_baseline_problem_statement and DeadCodePlugin.baseline_prompt to
implement this delegation/append behavior.

432-447: ⚠️ Potential issue | 🟠 Major

Path normalization mismatch between entry points and candidates in scoped runs.

This was flagged in a previous review and appears to still be an issue. Here's the problem in simple terms:

What happens:

  1. Lines 608-613: Candidate file paths get scope_prefix stripped (e.g., src/app/foo.ts → app/foo.ts)
  2. Lines 434-438: Entry point paths are NOT stripped (still src/app/foo.ts)

The bug:
When you check if (c.get("file", ""), c.get("name", "")) not in ep_set at line 446, a candidate with path app/foo.ts won't match an entry point with path src/app/foo.ts, so live symbols won't get filtered out.

Also, startswith(prefix) at line 612 is risky—it would incorrectly strip src/app from src/application/utils.ts.

🔧 Suggested fix for safer prefix stripping
 if scope_prefix and strip_prefix:
-    prefix = scope_prefix.rstrip("/") + "/"
+    # Use path parts comparison to avoid partial directory name matches
+    from pathlib import PurePosixPath
+    prefix_parts = PurePosixPath(scope_prefix).parts
     for key in ("deadCodeCandidates", "candidates", "items"):
         if key in result:
             for item in result[key]:
                 fp = item.get("file", "")
-                if fp.startswith(prefix):
-                    item["file"] = fp[len(prefix):]
+                fp_path = PurePosixPath(fp)
+                if fp_path.parts[:len(prefix_parts)] == prefix_parts:
+                    item["file"] = str(PurePosixPath(*fp_path.parts[len(prefix_parts):]))

Also applies to: 604-613

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 432 - 447,
Entry-point file paths are not normalized the same as candidate files, so
(file,name) tuples in ep_set won't match candidates when scope_prefix was
stripped; update the ep_set-building loop that reads
analysis_json.get("entryPoints", []) to apply the same normalization used for
candidates (use the same scope_prefix variable and the same path normalization
logic), and change any naive startswith(prefix) checks to a safe strip that only
removes the prefix when it matches a directory boundary (e.g., check prefix ==
path or path.startswith(prefix + os.sep) after os.path.normpath/os.path.sep
normalization) so entries in ep_set and all_candidates compare against identical
normalized file paths. Ensure you reference and reuse the same normalization
utility or code used where candidates are processed (the logic around
scope_prefix and all_candidates) so both sides are consistent.

524-530: ⚠️ Potential issue | 🟡 Minor

Catch narrower exceptions instead of bare Exception.

The static analysis tool flagged this, and it was also mentioned in a previous review. Catching Exception can hide bugs like KeyError or AttributeError that indicate programming errors.

The likely exceptions here are:

  • RuntimeError from API failures
  • OSError / IOError from file operations
  • json.JSONDecodeError from parsing
🔧 Suggested fix
-        except Exception as e:
+        except (RuntimeError, OSError, json.JSONDecodeError) as e:
             logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")

This way, if there's an unexpected TypeError or KeyError from a bug in your code, it'll bubble up and you'll know about it instead of silently continuing with a broken state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 524 - 530, The
current broad except block that logs "Failed to get Supermodel analysis" catches
all Exceptions (e.g., the line using "except Exception as e:") which can hide
programming errors; replace it with a narrowed catch that only handles expected
failures such as RuntimeError, OSError (or IOError) and json.JSONDecodeError
around the Supermodel analysis retrieval (reference the variables instance_id,
logger, traceback in this block), e.g., catch those specific exception types and
log/print the error, and allow any other exception (TypeError, KeyError,
AttributeError, etc.) to propagate so they surface during testing; if you must
explicitly handle unexpected exceptions, re-raise them after logging.
🧹 Nitpick comments (3)
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py (1)

199-232: Consider precompiling the regex patterns for better performance.

Right now, framework_names and framework_files are compiled fresh on every call to parse_api_response(). If you're processing many candidates across multiple tasks, moving these to module-level constants saves repeated compilation:

♻️ Optional refactor
+# Module-level compiled patterns for parse_api_response
+FRAMEWORK_NAMES_RE = re.compile(
+    r"^(execute|up|down|Template|Story|stories|test|Test|Mock|mock|"
+    r"Fixture|Spec|Suite|describe|it|expect|beforeEach|afterEach|"
+    r"setUp|tearDown|default|module|exports|require)$"
+)
+FRAMEWORK_FILES_RE = re.compile(
+    r"(\.test\.|\.spec\.|\.stories\.|__tests__|__mocks__|"
+    r"\.storybook|\.e2e\.|migrations/|\.d\.ts$)"
+)
+
 class DeadCodePlugin(EndpointPlugin):
     # ...
     def parse_api_response(self, response: dict) -> dict:
         candidates = response.get("deadCodeCandidates", [])

-        framework_names = re.compile(...)
-        framework_files = re.compile(...)
-
         filtered = []
         for c in candidates:
             name = c.get("name", "")
             filepath = c.get("file", "")
-            if framework_names.match(name):
+            if FRAMEWORK_NAMES_RE.match(name):
                 continue
-            if framework_files.search(filepath):
+            if FRAMEWORK_FILES_RE.search(filepath):
                 continue
             filtered.append(c)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 199 -
232, The regex patterns framework_names and framework_files are being compiled
on every call to parse_api_response which is wasteful; move those
re.compile(...) calls out of parse_api_response into module-level constants
(e.g. FRAMEWORK_NAMES, FRAMEWORK_FILES) and then update parse_api_response to
reference those constants instead of recreating the patterns so regex
compilation happens once at import time.
src/mcpbr/config.py (1)

877-913: LGTM on the new Supermodel config fields, with one optional suggestion.

The validation on resolved_threshold using Pydantic's ge/le constraints is solid. The docstrings are clear and the defaults make sense.

One thing to consider: analysis_type accepts any string, but based on the PR description, only specific values are valid (dead-code, impact, test-coverage, circular-deps). You could add a field validator similar to validate_benchmark to catch typos early:

VALID_ANALYSIS_TYPES = ("dead-code", "impact", "test-coverage", "circular-deps")

@field_validator("analysis_type")
@classmethod
def validate_analysis_type(cls, v: str | None) -> str | None:
    if v is not None and v not in VALID_ANALYSIS_TYPES:
        raise ValueError(f"Invalid analysis_type: {v}. Must be one of: {', '.join(VALID_ANALYSIS_TYPES)}")
    return v

This way if someone types "dead_code" (underscore) instead of "dead-code" (hyphen), they get a helpful error instead of a confusing runtime failure later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/config.py` around lines 877 - 913, Add strict validation for the
analysis_type field: define a VALID_ANALYSIS_TYPES tuple containing "dead-code",
"impact", "test-coverage", "circular-deps" and add a field validator (named e.g.
validate_analysis_type) decorated with @field_validator("analysis_type") and
@classmethod that checks if analysis_type is not None and is in
VALID_ANALYSIS_TYPES, raising a ValueError with a helpful message if not; mirror
the style of the existing validate_benchmark implementation so typos like
"dead_code" are caught at config validation time.
src/mcpbr/benchmarks/supermodel/benchmark.py (1)

84-162: Heads up: filter_category filters on the language field, which might be confusing.

At lines 155-156:

if filter_category:
    category_set = set(filter_category)
    tasks = [t for t in tasks if t.get("language", "typescript") in category_set]

So if someone passes --filter-category python, they're actually filtering by language. The naming is a bit misleading. Consider either:

  1. Renaming this to filter_language if that's the intent
  2. Adding a separate category field to tasks if you want to support both

Not blocking, just something to be aware of for future maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 84 - 162, The
current filter_category parameter in load_tasks actually filters the task
"language" key, which is confusing; either rename the parameter to
filter_language everywhere (function signature of load_tasks, any callers, CLI
flags, and docs) and adjust local variable names (e.g., category_set ->
language_set) so the intent matches the code, or add a real "category" field
into the task dict (in the task creation block where keys like "language" and
"description" are set) and change the filtering to check t.get("category") for
filter_category; update corresponding tests/CLI callers accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fe28a310-fdcb-490d-933c-5942ffdca6fe

📥 Commits

Reviewing files that changed from the base of the PR and between 06db08a and 6f247d9.

📒 Files selected for processing (3)
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/config.py

@greynewell greynewell merged commit 6c76da8 into main Mar 24, 2026
13 checks passed
