feat: add Supermodel dead-code benchmark plugin #4
Adds a first-class Supermodel benchmark to mcpbr that evaluates dead-code
detection using GitHub PRs as ground truth and the Supermodel API for
pre-computed dependency graph analysis.
## New files
- `benchmarks/supermodel/` — benchmark package
- `benchmark.py` — SupermodelBenchmark; PR-based GT extraction,
Supermodel API integration, chunked analysis packaging (v3)
- `api_client.py` — async polling client for Supermodel API
- `evaluation.py` — P/R/F1 scoring
- `git_utils.py` — clone_repo_at_commit, zip_repo helpers
- `endpoints/` — pluggable endpoint system
- `dead_code.py` — DeadCodePlugin with enhanced_prompt_v2
- stubs for impact/test-coverage/circular-deps
- `benchmarks/deadcode.py` — corpus-based DeadCodeBenchmark variant
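For context, the P/R/F1 scoring in `evaluation.py` presumably reduces to the standard set-overlap formulas; here is a minimal sketch (the `compute_prf1(predictions, ground_truth)` name appears in the sequence diagram, but the exact signature and identifier format are assumptions):

```python
def compute_prf1(predictions: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Precision/recall/F1 over two sets of finding identifiers."""
    tp = len(predictions & ground_truth)   # predicted and actually dead
    fp = len(predictions - ground_truth)   # predicted but not in ground truth
    fn = len(ground_truth - predictions)   # dead code the agent missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A task then counts as "resolved" when both precision and recall clear `resolved_threshold`.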
## Changes to existing files
- `benchmarks/__init__.py` — register SupermodelBenchmark + DeadCodeBenchmark
- `config.py` — add "supermodel" + "dead-code" to VALID_BENCHMARKS; add
supermodel config fields (analysis_type, tasks, supermodel_api_base,
supermodel_api_key, supermodel_api_timeout, resolved_threshold,
ground_truth_dir)
- `harness.py` — wire supermodel/dead-code benchmark_kwargs
- `models.py` — add claude-opus-4-6, claude-sonnet-4-6
- `pricing.py` — add 4.6 model pricing
- `cli.py` — use microsecond timestamps for output dirs to avoid collisions
- `docker_env.py` + `Dockerfile` — use MCR mirror for python:3.11-slim base
to work around Docker Hub rate limiting
- `pyproject.toml` — add S608 per-file-ignore for benchmarks/ (false positive
on prompt templates)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Walkthrough
Adds two new benchmarks ("dead-code" and "supermodel") with Supermodel endpoint/plugin infrastructure, an async API client, git/zip helpers, evaluation utilities, and benchmark registration; updates config/harness, the Docker fallback image mirror, and Claude 4.6 models/pricing, and bumps the package version to 0.14.1.
## Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Harness
    participant Bench as SupermodelBenchmark
    participant GitHub
    participant SupermodelAPI as Supermodel<br/>API
    participant Docker as Docker<br/>Container
    participant Eval as Evaluation<br/>Module
    User->>Harness: run_evaluation(config)
    Harness->>Bench: create_benchmark(..., analysis_type, tasks, api_key, ...)
    Bench->>GitHub: (optional) gh api / fetch PR diff
    GitHub-->>Bench: diff text
    Bench->>Bench: extract_ground_truth / load cached GT
    Bench->>Bench: prepare workspace + REPORT.json
    Bench->>Bench: zip_repo(workspace)
    Bench->>SupermodelAPI: call_supermodel_api(zip) (upload + poll)
    SupermodelAPI-->>Bench: analysis result
    Bench->>Docker: write analysis files into container workspace
    Harness->>Docker: start agent (agent may update REPORT.json)
    Docker-->>Harness: solution / REPORT.json
    Harness->>Bench: evaluate(env, task, solution)
    Bench->>Eval: compute_prf1(predictions, ground_truth)
    Eval-->>Bench: metrics
    Bench-->>Harness: {metrics, resolved}
```

Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
🧹 Nitpick comments (17)
src/mcpbr/config.py (1)

877-911: Supermodel config fields look well-structured. A few observations:

- `resolved_threshold` (line 903): The default of `0.8` means a task is "resolved" only if both precision and recall are ≥ 80%. That's a reasonable bar. You might want to add a validator to ensure it's in the `[0.0, 1.0]` range — right now nothing stops someone from setting `resolved_threshold: 1.5`, which would make everything fail.
- `supermodel_api_timeout` (line 898): 900 seconds (15 min) is generous but reasonable for large codebases. The `SupermodelBenchmark` class already handles this correctly.
- `supermodel_api_key` (line 893): Good that this is optional, since the benchmark falls back to the `SUPERMODEL_API_KEY` env var.

🔧 Optional: Add validation for `resolved_threshold`

```diff
+    @field_validator("resolved_threshold")
+    @classmethod
+    def validate_resolved_threshold(cls, v: float) -> float:
+        """Validate resolved_threshold is between 0 and 1."""
+        if not 0.0 <= v <= 1.0:
+            raise ValueError("resolved_threshold must be between 0.0 and 1.0")
+        return v
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/config.py` around lines 877-911: add a Pydantic validator to enforce `resolved_threshold` is within [0.0, 1.0]: implement a `validate_resolved_threshold` classmethod on the config model that declares the `resolved_threshold` field, decorate it with `@field_validator("resolved_threshold")`, check the value is >= 0.0 and <= 1.0, and raise ValueError if not, so assignments like `resolved_threshold: 1.5` are rejected at model validation time.

src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py (1)
1-26: Looks good — stub follows the pattern nicely.

This is a clean stub implementation that matches the other endpoint plugins. One small thing: the `extract_ground_truth` method on line 25 is missing the parameter and return type hints that the base class defines. Not a blocker, but adding them keeps things consistent:

```python
def extract_ground_truth(
    self,
    repo: str,
    pr_number: int,
    language: str = "typescript",
    scope_prefix: str | None = None,
) -> list[dict]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py` around lines 1-26: the `extract_ground_truth` method in TestCoveragePlugin is missing type annotations; update its signature to match the base class by adding parameter and return type hints (`repo: str, pr_number: int, language: str = "typescript", scope_prefix: str | None = None`) and a return type of `list[dict]` so the method signature aligns with the base EndpointPlugin definition.

src/mcpbr/benchmarks/supermodel/api_client.py (2)
89-94: Lazy imports work but are a bit unusual.

Moving `import tempfile` and `import os` to the top of the file would be more conventional. Not a blocker — the current approach works fine and has minimal overhead, since the import only happens once per polling sequence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/api_client.py` around lines 89-94: move the lazy imports out of the polling block: add `import tempfile` (and `import os` if used elsewhere) to the module-level imports at the top of the file, then remove the local `import tempfile` inside the poll_dummy_path handling so the code that creates the NamedTemporaryFile (the poll_dummy.write and poll_dummy.name usage) uses the top-level tempfile import instead.
74: Consider wrapping the JSON decode in a try/except for better error messages.

If the API returns non-JSON (like an HTML error page from a gateway timeout), `json.loads(stdout.decode())` will raise a `JSONDecodeError` with a confusing message. A small wrapper could help:

💡 Suggested improvement

```diff
-response = json.loads(stdout.decode())
+try:
+    response = json.loads(stdout.decode())
+except json.JSONDecodeError as e:
+    raise RuntimeError(
+        f"Supermodel API returned invalid JSON: {stdout.decode()[:500]}"
+    ) from e
```

The same applies to line 126 inside the polling loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/api_client.py` at line 74: wrap the `json.loads(stdout.decode())` calls (the one assigning to `response` and the identical call inside the polling loop) in a try/except catching `json.JSONDecodeError`, include the raw decoded stdout (or a snippet) and the original exception in the error message, and either raise a clearer exception (e.g., ValueError with context) or log and re-raise so callers see the API response that caused the parse failure.

src/mcpbr/benchmarks/supermodel/endpoints/base.py (1)
91-94: Minor: `should_skip_file` could raise on invalid regex patterns.

If someone puts a malformed regex in `skip_patterns`, `re.search(p, filepath)` will raise `re.error`. In practice this is unlikely since patterns are config-controlled, but a try/except or pre-validation could make debugging easier:

```python
# Example: this would blow up
skip_patterns = ["[invalid"]  # unclosed bracket
```

Not a blocker — just something to keep in mind if you expose this to less careful config authors.
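One defensive shape for this, sketched as an assumption (the real `should_skip_file` is a method with its own signature): precompile the patterns once, warn on the malformed ones, and match only against the survivors:

```python
import logging
import re

logger = logging.getLogger(__name__)

def compile_skip_patterns(skip_patterns: list[str]) -> list[re.Pattern]:
    """Precompile skip patterns, logging and dropping malformed ones."""
    compiled = []
    for pattern in skip_patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            logger.warning("Ignoring invalid skip pattern %r: %s", pattern, exc)
    return compiled

def should_skip_file(filepath: str, patterns: list[re.Pattern]) -> bool:
    """True if any precompiled pattern matches the path."""
    return any(p.search(filepath) for p in patterns)
```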
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/endpoints/base.py` around lines 91-94: should_skip_file currently calls re.search(p, filepath) for each pattern and can raise re.error on malformed regexes; update should_skip_file to either precompile skip_patterns (using re.compile) and catch re.error during compilation, or wrap the per-pattern re.search in a try/except catching re.error, then handle invalid patterns by logging a warning and skipping that pattern (or re-raising a clearer ValueError with context) so a bad regex in skip_patterns doesn't crash the function.

src/mcpbr/pricing.py (1)
34-54: Pricing entries look correct and consistent.

The new Claude 4.6 pricing follows the same structure as existing entries. One tiny thing: line 6 says "Pricing is per million tokens (MTok) and is current as of January 2026", but you're adding March 2026 models here. Might want to bump that date to keep the header accurate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/pricing.py` around lines 34-54: update the header that states "Pricing is per million tokens (MTok) and is current as of January 2026" to reflect March 2026 so it matches the newly added Claude 4.6 entries (e.g., the ModelPricing entries "claude-opus-4-6" and "claude-sonnet-4-6"); locate the top-of-file header/comment that describes the pricing date and change "January 2026" to "March 2026".

src/mcpbr/benchmarks/supermodel/git_utils.py (3)
71-88: Consider adding a timeout to `get_pre_merge_commit`.

This is the only subprocess call in the file without a timeout. If the GitHub API is slow or hangs, this could block indefinitely. The async functions use `asyncio.wait_for` with timeouts, but this sync function doesn't have the same protection.

🔧 Add a timeout for consistency

```diff
 result = subprocess.run(
     ["gh", "api", f"repos/{repo}/commits/{merge_commit}", "--jq", ".parents[0].sha"],
     capture_output=True,
     text=True,
+    timeout=60,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 71-88: the synchronous `get_pre_merge_commit` function currently calls subprocess.run without a timeout; add a timeout to the subprocess.run invocation so the call cannot hang indefinitely, and handle the TimeoutExpired exception by raising a RuntimeError with a clear message including the merge_commit and the exception details, so behavior matches the async functions' timeout protection.
169-179: Edge case: exclude patterns that are absolute paths won't work correctly.

The logic here is: if there's a `scope_prefix` and a pattern doesn't already start with it, prepend the prefix. But if someone passes an absolute path pattern like `/node_modules/*`, it gets the prefix prepended and becomes something like `packages/api//node_modules/*`, which probably isn't what they wanted.

In practice, the `base_excludes` list uses relative patterns, so this is fine for the defaults. Just something to be aware of if external callers pass unusual patterns.

💡 Optional: add a check for absolute patterns

```diff
 prefixed_excludes = []
 for pattern in base_excludes + (exclude_patterns or []):
-    if scope_prefix and not pattern.startswith(scope_prefix):
+    if scope_prefix and not pattern.startswith(scope_prefix) and not pattern.startswith("/"):
         prefixed_excludes.append(f"{scope_prefix}/{pattern}")
     else:
         prefixed_excludes.append(pattern)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 169-179: the loop that builds prefixed_excludes blindly prepends scope_prefix to patterns (base_excludes and exclude_patterns), which corrupts absolute paths like "/node_modules/*"; update the logic to first detect absolute patterns (use os.path.isabs(pattern) or equivalent) and leave those unchanged, otherwise apply the existing scope_prefix-prefixing check; ensure the code that extends cmd with "-x" uses the adjusted prefixed_excludes.
122-141: `git archive` ignores `exclude_patterns` - is that intentional?

When the repo has a `.git` directory, you use `git archive`, which is great because it respects `.gitignore`. But the `exclude_patterns` parameter is only used in the fallback path (`_zip_repo_fallback`), not in `_zip_repo_git_archive`.

If a caller explicitly passes `exclude_patterns`, they might expect those to apply regardless of whether it's a git repo. Since `git archive` naturally excludes untracked files anyway, this might be fine in practice, but it's a slight API inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 122-141: `_zip_repo_git_archive` currently ignores any exclude_patterns; update it to accept an exclude_patterns parameter (same shape as used by `_zip_repo_fallback`) and apply those exclusions even when using git. Implement by listing tracked files with `git ls-files -z` (run from repo_dir), filtering that list against exclude_patterns, and then creating the zip from the remaining paths (either by passing the filtered paths to git archive if possible, or by building the zip in Python from those files), so caller behavior is consistent and exclusions are applied in both code paths.

src/mcpbr/benchmarks/supermodel/benchmark.py (5)
151-157: `filter_category` compares against the `language` field - might be confusing.

On line 157, you filter by `filter_category` but compare against `t.get("language", "typescript")`. This works, but the naming is a bit asymmetric - a user might expect `filter_category` to filter on a `category` field, not `language`.

For example, calling `load_tasks(filter_category=["python"])` filters tasks where `language == "python"`. Makes sense in context, but the naming could trip someone up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 151-157: the code filters tasks using filter_category but checks `t.get("language", ...)`, which is confusing; either change the key to "category" or rename the filter to match "language". Update the filtering line (the list comprehension over tasks referencing `t.get("language", "typescript")`) accordingly, or rename the parameter filter_category -> filter_language and update all callers and the comprehension so the intent and field name align.
615-639: Git subprocess calls are missing timeouts.

These `subprocess.run` calls for initializing the git repo don't have timeouts. The `DeadCodeBenchmark` in the other file does add timeouts (e.g., `timeout=30` for `git init`). For consistency and reliability, consider adding timeouts here too.

In practice, `git init` and `git config` are near-instant, but `git add -A` on a large repo could potentially hang if there are filesystem issues.

🔧 Add timeouts for consistency

```diff
-subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False)
+subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False, timeout=30)
 subprocess.run(
     ["git", "config", "user.email", "mcpbr@test.com"],
     cwd=host_workdir,
     capture_output=True,
     check=False,
+    timeout=10,
 )
 # ... same for the other calls
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 615-639: the git subprocess.run invocations that initialize and configure the repo (the calls using ["git", "init"], ["git", "config", ...], ["git", "add", "-A"], and ["git", "commit", "-m", "Initial"] with cwd=host_workdir) lack timeouts; update each subprocess.run to include a reasonable timeout (e.g., timeout=30) to match the pattern used in DeadCodeBenchmark and prevent hangs, keeping capture_output=True and check=False unchanged.
574-581: Catching broad `Exception` is intentional here, but consider logging the stack trace.

I see the static analysis flagged this. In a benchmark runner, it makes sense to catch all exceptions so one failing task doesn't kill the whole run. You're already logging the error and printing the traceback, which is good.

One thing: `logger.error` on line 575 only records the error message; consider `logger.exception()` instead, which automatically includes the traceback.

🔧 Use logger.exception for full traceback in logs

```diff
 except Exception as e:
-    logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")
+    logger.exception(f"Failed to get Supermodel analysis for {instance_id}")
     print(
         f"\n*** SUPERMODEL ANALYSIS FAILED for {instance_id} ***\n"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 574-581: in the except block that currently logs "Failed to get Supermodel analysis for {instance_id}: {e}" (the handler catching Exception as e for Supermodel analysis), change the log call to include the traceback — either replace logger.error(...) with logger.exception(...) or call logger.error(..., exc_info=True) — so the full stack trace for the failure of instance_id is recorded in the logs (leave the stderr print as-is).
83: Temp directory is created but never explicitly cleaned up.

`self._work_dir` is created in `__init__` using `tempfile.mkdtemp()`, but there's no corresponding cleanup in a destructor or context manager. For short-lived benchmark runs this is fine, but if you run multiple benchmarks in a loop or have a long-running process, these temp directories will accumulate.

💡 Consider adding cleanup

You could either:

- Add a `cleanup()` method that users call explicitly
- Use `tempfile.TemporaryDirectory` and keep a reference to it
- Add a `__del__` method (though those can be unreliable)

```python
# Option 1: Explicit cleanup method
def cleanup(self) -> None:
    """Remove temporary work directory."""
    if self._work_dir.exists():
        shutil.rmtree(self._work_dir)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` at line 83: the temp work directory created in `__init__` as self._work_dir via tempfile.mkdtemp() is never cleaned up; add explicit cleanup by either (a) switching to tempfile.TemporaryDirectory and storing the object (e.g. self._tmpdir) so the directory is removed when the object is closed or deleted, or (b) adding a public cleanup() method that calls shutil.rmtree(self._work_dir) (guarded by exists()) and calling it from any teardown path; make sure to import shutil if using rmtree, and update any code that relied on self._work_dir to use self._tmpdir.name if you choose the TemporaryDirectory approach.
361-365: Accessing private methods on `docker_manager`.

Calling `docker_manager._ensure_fallback_image()` and later `docker_manager._temp_dirs.append(temp_dir)` accesses private attributes (indicated by the `_` prefix). This creates tight coupling to the internal implementation of `DockerEnvironmentManager`.

This might be intentional if the benchmark protocol expects benchmarks to manage their own temp directories this way, but it means changes to `DockerEnvironmentManager`'s internals could break this code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 361-365: this code accesses DockerEnvironmentManager internals by calling docker_manager._ensure_fallback_image() and appending to docker_manager._temp_dirs; instead call a public API (e.g., docker_manager.ensure_fallback_image()) and use a public method to register temp dirs (e.g., docker_manager.register_temp_dir(temp_dir)); if those methods don't exist, add them to DockerEnvironmentManager (a public ensure_fallback_image() that wraps _ensure_fallback_image and a register_temp_dir(temp_dir) that appends to the internal list) and replace the direct accesses with those public methods, still using docker_manager.FALLBACK_IMAGE and temp_dir/instance_id as before.

src/mcpbr/benchmarks/deadcode.py (2)
455-481: Nice fallback extraction logic - duplicated from `SupermodelBenchmark`.

The `_extract_findings_from_text` method here is nearly identical to the one in `SupermodelBenchmark`. This is copy-paste duplication.

Consider extracting this to a shared utility, maybe in `evaluation.py` or a new `utils.py`:

```python
# In evaluation.py or utils.py
def extract_dead_code_from_text(text: str) -> list[dict]:
    """Extract dead_code array from text containing JSON."""
    ...
```

Not urgent since it's only in two places, but if you add more benchmarks that need this pattern, it'd be worth consolidating.
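A possible shape for that shared helper, assuming agents emit a JSON object containing a `dead_code` array (the exact report format is the benchmark's own, so treat the brace-matching fallback as illustrative — it is naive about braces inside string literals):

```python
import json

def extract_dead_code_from_text(text: str) -> list[dict]:
    """Extract a `dead_code` array from text that may contain extra prose.

    Tries a straight parse first, then scans for balanced {...} spans.
    Naive about braces inside JSON string literals.
    """
    try:
        data = json.loads(text)
        if isinstance(data, dict):
            return data.get("dead_code", [])
    except json.JSONDecodeError:
        pass
    # Fallback: scan for balanced top-level {...} blocks and parse each
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        data = json.loads(text[start : i + 1])
                        if isinstance(data, dict) and "dead_code" in data:
                            return data["dead_code"]
                    except json.JSONDecodeError:
                        pass
                    break
        start = text.find("{", start + 1)
    return []
```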
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/deadcode.py` around lines 455 - 481, The _extract_findings_from_text method is duplicated (also in SupermodelBenchmark); refactor by extracting the JSON-array-parsing logic into a shared function like extract_dead_code_from_text(text: str) -> list[dict[str, Any]] placed in evaluation.py or a new utils.py, move the extraction code (including the try/except for json.JSONDecodeError/ValueError and the same bracket-matching logic) into that function, update _extract_findings_from_text in this class and the SupermodelBenchmark to call the new extract_dead_code_from_text, and add the necessary import where used so both benchmarks reuse the single implementation.
49-50: `shutil.rmtree` could fail on permission issues.

If `corpus_dir` exists but has permission problems (e.g., some files are read-only), `shutil.rmtree` will raise an exception. This would crash the task loading process rather than gracefully degrading.

Since this is a cache directory that the user might not care about, you could wrap this in a try/except:

🔧 Handle rmtree failures gracefully

```diff
 if corpus_dir.exists():
-    shutil.rmtree(corpus_dir)
+    try:
+        shutil.rmtree(corpus_dir)
+    except OSError as e:
+        # If we can't remove the old cache, try cloning anyway
+        import logging
+        logging.getLogger("mcpbr.deadcode").warning(f"Could not remove old cache: {e}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/deadcode.py` around lines 49-50: wrap the call to shutil.rmtree(corpus_dir) in a try/except so permission errors don't crash loading; surround the existing `if corpus_dir.exists(): shutil.rmtree(corpus_dir)` with a try block, catch OSError, and log a non-fatal warning that includes the corpus_dir path and the exception (e.g. via logging.getLogger(__name__).warning(...)) so the code continues if deletion fails; alternatively, use shutil.rmtree(..., onerror=...) to attempt os.chmod before retrying, but ensure failure is handled gracefully and does not propagate.

src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py (1)
202-248: Method signature differs from base class - could cause polymorphic call issues.

The base `EndpointPlugin.parse_api_response(self, response: dict) -> dict` takes just `response`, but `DeadCodePlugin.parse_api_response` adds `prefilter_types: bool = False`.

This works because of the default value, but if the benchmark code ever calls `endpoint.parse_api_response(response, True)` on a generic `EndpointPlugin`, you'd get a `TypeError` for other endpoint implementations.

Looking at `benchmark.py` line 676, it calls `self._endpoint.parse_api_response(raw_response)` without the extra arg, so you're safe for now. Just something to keep in mind if you add more endpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 202-248: DeadCodePlugin.parse_api_response currently adds a second parameter prefilter_types, which diverges from the base EndpointPlugin.parse_api_response signature and can break polymorphic calls; change DeadCodePlugin.parse_api_response to match the base signature (parse_api_response(self, response: dict) -> dict) and accept the optional flag via kwargs or response metadata instead (e.g. extract prefilter_types = kwargs.get("prefilter_types", False) if you add **kwargs, or read response.get("_prefilter_types", False)), update references to prefilter_types inside the method accordingly, and ensure callers that previously passed the flag pass it via kwargs or set it on the response so behavior stays the same while preserving polymorphism with EndpointPlugin.parse_api_response.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@src/mcpbr/benchmarks/deadcode.py`:
- Around lines 420-422: DeadCodeBenchmark currently computes false positives as `fp = len(found_set & alive_set)`, which only counts findings explicitly labeled alive; change it to count any predicted item not in the dead_set (consistent with SupermodelBenchmark/compute_prf1) by computing `fp = len(found_set - dead_set)`, so that any prediction outside the ground-truth dead_set is treated as an FP; update references around the fp calculation in the DeadCodeBenchmark function/block (variables: found_set, dead_set, alive_set) and ensure downstream metrics use the new fp value.
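To make the flagged difference concrete, a tiny illustrative example (hypothetical identifiers, not from the repo):

```python
found_set = {"foo", "bar", "baz"}  # agent's predictions
dead_set = {"foo"}                 # ground-truth dead code
alive_set = {"bar"}                # findings explicitly labeled alive

# Current computation: only predictions explicitly labeled alive count as FPs,
# so "baz" (unknown to ground truth) is silently ignored.
fp_current = len(found_set & alive_set)   # 1

# Proposed computation: anything predicted outside the dead set is an FP,
# matching compute_prf1's convention.
fp_proposed = len(found_set - dead_set)   # 2
```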
---
Nitpick comments:
In `@src/mcpbr/benchmarks/deadcode.py`:
- Around line 455-481: The _extract_findings_from_text method is duplicated
(also in SupermodelBenchmark); refactor by extracting the JSON-array-parsing
logic into a shared function like extract_dead_code_from_text(text: str) ->
list[dict[str, Any]] placed in evaluation.py or a new utils.py, move the
extraction code (including the try/except for json.JSONDecodeError/ValueError
and the same bracket-matching logic) into that function, update
_extract_findings_from_text in this class and the SupermodelBenchmark to call
the new extract_dead_code_from_text, and add the necessary import where used so
both benchmarks reuse the single implementation.
- Around line 49-50: Wrap the call to shutil.rmtree(corpus_dir) in a try/except
so permission errors don't crash loading: in src/mcpbr/benchmarks/deadcode.py
surround the existing if corpus_dir.exists(): shutil.rmtree(corpus_dir) with a
try block, catch OSError/Exception, and log a non-fatal warning that includes
the corpus_dir path and the exception (e.g. via
logging.getLogger(__name__).warning(...)) so the code continues if deletion
fails; alternatively, you can use shutil.rmtree(..., onerror=...) to attempt
os.chmod before retrying, but ensure failure is handled gracefully and does not
propagate.
In `@src/mcpbr/benchmarks/supermodel/api_client.py`:
- Around line 89-94: Move the lazy imports out of the polling block: add "import
tempfile" (and "import os" if used elsewhere) to the module-level imports at the
top of src/mcpbr/benchmarks/supermodel/api_client.py, then remove the local
"import tempfile" inside the poll_dummy_path handling so the code that creates
the NamedTemporaryFile (poll_dummy.write and poll_dummy.name usage) uses the
top-level tempfile import instead.
- Line 74: Wrap the json.loads(stdout.decode()) calls (the one assigning to
response and the one inside the polling loop) in a try/except catching
json.JSONDecodeError, include the raw decoded stdout (or a snippet) and the
original exception in the error message, and either raise a clearer exception
(e.g., ValueError with context) or log and re-raise so callers see the API
response that caused the parse failure; specifically change the
json.loads(stdout.decode()) usage that sets response and the identical call at
the polling loop to this guarded pattern.
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 151-157: The code filters tasks using filter_category but checks
t.get("language", ...), which is confusing; either change the key to "category"
or rename the filter to match "language". Update the filtering line inside the
function that uses filter_category (the list comprehension over tasks
referencing t.get("language", "typescript")) to t.get("category",
"default_category") if you want to keep the filter name, or rename the parameter
filter_category -> filter_language and update all callers and the comprehension
to use the new name so the intent and field name align.
- Around line 615-639: The git subprocess.run invocations that initialize and
configure the repo (the calls using ["git", "init"], ["git", "config", ...],
["git", "add", "-A"], and ["git", "commit", "-m", "Initial"] with
cwd=host_workdir) lack timeouts; update each subprocess.run to include a
reasonable timeout (e.g., timeout=30) to match the pattern used in
DeadCodeBenchmark and prevent hangs, keeping capture_output=True and check=False
unchanged.
- Around line 574-581: In the except block that currently logs "Failed to get
Supermodel analysis for {instance_id}: {e}" (the handler catching Exception as e
for Supermodel analysis), change the log call to include the traceback—either
replace logger.error(...) with logger.exception(...) or call logger.error(...,
exc_info=True)—so the full stack trace for the failure of instance_id is
recorded in the logs (leave the stderr print as-is).
- Line 83: The temp work directory created in __init__ as self._work_dir via
tempfile.mkdtemp() is never cleaned up; add explicit cleanup by either (a)
switching to tempfile.TemporaryDirectory and storing the object (e.g.
self._tmpdir) so the directory is removed when the object is closed or deleted,
or (b) add a public cleanup() method that calls shutil.rmtree(self._work_dir)
(guarded by exists()), and call cleanup from any teardown path; make sure to
import shutil if using rmtree and update any code that relied on self._work_dir
to refer to self._tmpdir.name if you choose the TemporaryDirectory approach.
- Around line 361-365: This code accesses DockerEnvironmentManager internals by
calling docker_manager._ensure_fallback_image() and appending to
docker_manager._temp_dirs; instead call a public API (e.g.,
docker_manager.ensure_fallback_image()) and use a public method to register temp
dirs (e.g., docker_manager.register_temp_dir(temp_dir) or
docker_manager.create_temp_dir(prefix)) or, if those methods don't exist, add
them to DockerEnvironmentManager (implement ensure_fallback_image() that wraps
_ensure_fallback_image and register_temp_dir(temp_dir) that appends to the
internal list) and then replace the direct accesses to _ensure_fallback_image
and _temp_dirs with those public methods while still using
docker_manager.FALLBACK_IMAGE and temp_dir/instance_id as before.
In `@src/mcpbr/benchmarks/supermodel/endpoints/base.py`:
- Around line 91-94: should_skip_file currently calls re.search(p, filepath) for
each pattern and can raise re.error on malformed regexes; update
should_skip_file to either precompile skip_patterns (using re.compile) and catch
re.error during compilation or wrap the per-pattern re.search in a try/except
catching re.error, then handle invalid patterns by logging a warning or skipping
that pattern and continuing (or re-raising a clearer ValueError with context) so
a bad regex in skip_patterns doesn't crash the function.
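The precompile-and-warn variant can be sketched like this (function names here follow the prompt's wording and are assumptions, not the module's actual API):

```python
import logging
import re

logger = logging.getLogger(__name__)

def compile_skip_patterns(skip_patterns: list[str]) -> list[re.Pattern]:
    """Precompile patterns once, dropping (and logging) malformed ones."""
    compiled = []
    for p in skip_patterns:
        try:
            compiled.append(re.compile(p))
        except re.error as exc:
            logger.warning("Ignoring invalid skip pattern %r: %s", p, exc)
    return compiled

def should_skip_file(filepath: str, compiled: list[re.Pattern]) -> bool:
    # All patterns are already valid, so no per-call try/except is needed.
    return any(c.search(filepath) for c in compiled)
```

Compiling once at plugin construction time also avoids re-validating the same patterns on every file.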
In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:
- Around line 202-248: DeadCodePlugin.parse_api_response currently adds a second
parameter prefilter_types which diverges from the base
EndpointPlugin.parse_api_response signature and can break polymorphic calls;
change DeadCodePlugin.parse_api_response to match the base signature
(parse_api_response(self, response: dict) -> dict) and accept the optional flag
via kwargs or response metadata instead (e.g. extract prefilter_types =
kwargs.get("prefilter_types", False) if you add **kwargs, or read
response.get("_prefilter_types", False)), update references to prefilter_types
inside the method accordingly, and ensure calls that previously passed the flag
either pass it via kwargs or set it on the response/metadata so behavior remains
the same while preserving polymorphism with EndpointPlugin.parse_api_response.
In `@src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py`:
- Around line 1-26: The extract_ground_truth method in TestCoveragePlugin is
missing type annotations; update TestCoveragePlugin.extract_ground_truth
signature to match the base class by adding parameter and return type hints
(repo: str, pr_number: int, language: str = "typescript", scope_prefix: str |
None = None) and a return type of list[dict] so the method signature aligns with
the base EndpointPlugin definition.
In `@src/mcpbr/benchmarks/supermodel/git_utils.py`:
- Around line 71-88: The synchronous get_pre_merge_commit function currently
calls subprocess.run without a timeout; update the subprocess.run invocation in
get_pre_merge_commit to include a timeout (e.g., timeout=30) so the call cannot
hang indefinitely, and handle the TimeoutExpired exception by raising a
RuntimeError with a clear message including the merge_commit and the exception
details; ensure you modify only the subprocess.run call and add the
TimeoutExpired exception handling near get_pre_merge_commit so behavior matches
the async functions' timeout protection.
- Around line 169-179: The current loop that builds prefixed_excludes blindly
prepends scope_prefix to patterns (base_excludes and exclude_patterns), which
corrupts absolute paths like "/node_modules/*"; update the logic in the block
that constructs prefixed_excludes (the loop referencing scope_prefix,
base_excludes, exclude_patterns and variable pattern) to first detect absolute
patterns (use os.path.isabs(pattern) or equivalent) and skip prefixing for
those, i.e., if pattern is absolute leave it unchanged, otherwise apply the
existing scope_prefix-prefixing check; ensure the rest of the code that extends
cmd with "-x" uses the adjusted prefixed_excludes.
- Around line 122-141: _zip_repo_git_archive currently ignores any
exclude_patterns; update it to accept an exclude_patterns parameter (same shape
as used by _zip_repo_fallback) and apply those exclusions even when using git.
Implement by listing tracked files with "git ls-files -z" (run from repo_dir),
filter that list against exclude_patterns, and then create the zip from the
remaining paths (either by passing the filtered paths to git archive if possible
or by building the zip in Python from those files); reference
_zip_repo_git_archive and _zip_repo_fallback so the caller behavior is
consistent and exclusions are applied in both code paths.
In `@src/mcpbr/config.py`:
- Around line 877-911: Add a Pydantic validator to enforce resolved_threshold is
within [0.0, 1.0]: implement a method (e.g. validate_resolved_threshold) on the
same config model that declares the resolved_threshold Field, use
`@validator("resolved_threshold")` to check the value is >= 0.0 and <= 1.0 and
raise ValueError if not, so assignments like resolved_threshold = 1.5 are
rejected at model validation time.
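The prompt suggests Pydantic v1's `@validator`; on Pydantic v2 the equivalent is `field_validator` (or, more simply, `Field(ge=0.0, le=1.0)`). A v2-style sketch, with illustrative model/field names other than `resolved_threshold`:

```python
from pydantic import BaseModel, field_validator

class SupermodelSettings(BaseModel):
    """Sketch of the config slice involved; only resolved_threshold is from the PR."""

    resolved_threshold: float = 0.8

    @field_validator("resolved_threshold")
    @classmethod
    def _check_resolved_threshold(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("resolved_threshold must be within [0.0, 1.0]")
        return v
```

The declarative form `resolved_threshold: float = Field(0.8, ge=0.0, le=1.0)` achieves the same rejection of `resolved_threshold = 1.5` with less code, if a custom error message is not needed.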
In `@src/mcpbr/pricing.py`:
- Around line 34-54: Update the header that states "Pricing is per million
tokens (MTok) and is current as of January 2026" to reflect March 2026 so it
matches the newly added Claude 4.6 entries (e.g., the ModelPricing entries
"claude-opus-4-6" and "claude-sonnet-4-6"); locate the top-of-file
header/comment in the same module that describes the pricing date and change
"January 2026" to "March 2026".
ℹ️ Review info
⚙️ Run configuration — Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: 43bdb941-4a9a-45b7-8d00-46b0623b59a9
⛔ Files ignored due to path filters (1): `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (21):
Dockerfile, pyproject.toml, src/mcpbr/__init__.py, src/mcpbr/benchmarks/__init__.py, src/mcpbr/benchmarks/deadcode.py, src/mcpbr/benchmarks/supermodel/__init__.py, src/mcpbr/benchmarks/supermodel/api_client.py, src/mcpbr/benchmarks/supermodel/benchmark.py, src/mcpbr/benchmarks/supermodel/endpoints/__init__.py, src/mcpbr/benchmarks/supermodel/endpoints/base.py, src/mcpbr/benchmarks/supermodel/endpoints/circular_deps.py, src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py, src/mcpbr/benchmarks/supermodel/endpoints/impact_analysis.py, src/mcpbr/benchmarks/supermodel/endpoints/test_coverage.py, src/mcpbr/benchmarks/supermodel/evaluation.py, src/mcpbr/benchmarks/supermodel/git_utils.py, src/mcpbr/config.py, src/mcpbr/docker_env.py, src/mcpbr/harness.py, src/mcpbr/models.py, src/mcpbr/pricing.py
jonathanpopham left a comment
Code Review — Ghost spectral analysis
+2284 / -382 across 22 files. Architecturally sound plugin-based design with solid separation of PR-based vs. corpus-based benchmarks.
Critical (fix before merge)
- FP calculation wrong in `DeadCodeBenchmark` — undercounts false positives, inflates precision
- Task dict mutation — breaks A/B baseline/enhanced comparisons
- Sync `subprocess.run` in async context — blocks event loop, no timeout
- Path traversal in corpus file writing — no containment check
- API key in process table — `curl -H` exposes the key in `ps aux`
Important
- Hardcoded `"dead_code"` key makes multi-analysis-type support non-functional
- `_work_dir` temp directory leaks on every run
- `json.loads` on raw curl output with no error handling
- Only globs `*.ts` — misses `.tsx`, `.js`, `.jsx`, `.mjs`
- Git subprocess calls missing timeouts (unlike `DeadCodeBenchmark`, which has them)
Minor / Cleanup
- `_score_and_cap_candidates` is dead code
- `prefilter_types` parameter is dead — duplicated filtering logic
- `reason_filtered` always reports 0
- Broad S608 lint suppression covers all benchmark files
- Significant code duplication between the two benchmarks: `_extract_findings_from_text`, `REPORT_PLACEHOLDER`, git init sequence, Docker manager private attribute access — worth extracting to shared utilities
src/mcpbr/benchmarks/deadcode.py
Outdated
```python
alive_set = {(a.get("file", ""), a.get("name", "")) for a in expected_alive}
...
tp = len(found_set & dead_set)
fp = len(found_set & alive_set)
```
Bug: FP calculation undercounts false positives — corrupts precision metric.

`fp = len(found_set & alive_set)` only counts FPs that happen to be in the explicit `alive_set`. If the agent reports a symbol that's in neither `dead_set` nor `alive_set` (hallucinated, or simply absent from ground truth), it's silently ignored — not counted as TP or FP. This inflates precision.

SupermodelBenchmark does it correctly via `compute_prf1`: `fp = len(pred_set - gt_set)`. The two benchmarks produce incomparable precision metrics for the same concept.
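For reference, the set-difference scoring can be sketched as follows (a reconstruction of the idea, not the project's actual `compute_prf1`):

```python
def compute_prf1(pred_set: set, gt_set: set) -> dict:
    """Set-difference scoring: anything predicted outside ground truth is a FP."""
    tp = len(pred_set & gt_set)
    fp = len(pred_set - gt_set)  # includes hallucinated / out-of-GT symbols
    fn = len(gt_set - pred_set)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

With this definition, a hallucinated symbol always costs precision, which is the behavior the intersection-with-`alive_set` version fails to provide.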
Fix:

```python
fp = len(found_set - dead_set)
```

```python
    "problem_statement_enhanced", task["problem_statement"]
)
else:
    task["problem_statement"] = task.get(
```
Bug: Mutating shared task dict breaks A/B comparisons.

This overwrites `task["problem_statement"]` in-place on the original dict returned by `load_tasks()`. If the same task is used for both baseline and enhanced conditions (standard A/B benchmarking), the second call reads the already-overwritten value — the baseline agent could get the enhanced prompt or vice versa.

Fix: Copy before mutating:

```python
task = {**task}
if is_mcp:
    task["problem_statement"] = task.get("problem_statement_enhanced", task["problem_statement"])
```

```python
Returns:
    SHA of the first parent commit.
"""
result = subprocess.run(
```
Bug: Synchronous subprocess.run in async context blocks the event loop.
Every other git/subprocess call in this module uses asyncio.create_subprocess_exec. This one uses blocking subprocess.run and is called from async create_environment(). It will stall the event loop on slow GitHub API responses.
Also missing a timeout parameter — if gh hangs (rate limiting, auth prompt), it blocks the entire process indefinitely.
Fix: Use asyncio.create_subprocess_exec with asyncio.wait_for(..., timeout=30).
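That fix can be sketched generically as below; `run_async_checked` is an illustrative helper, not code from the PR:

```python
import asyncio

async def run_async_checked(args: list[str], timeout: float = 30.0) -> str:
    """Non-blocking subprocess with a hard timeout, killing the child on expiry."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # reap the killed child
        raise RuntimeError(f"{args[0]} timed out after {timeout}s")
    if proc.returncode != 0:
        raise RuntimeError(f"{args[0]} failed: {stderr.decode().strip()}")
    return stdout.decode().strip()
```

Unlike `subprocess.run`, `asyncio.wait_for` does not kill the child for you, hence the explicit `proc.kill()` / `proc.wait()` in the timeout path.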
src/mcpbr/benchmarks/deadcode.py
Outdated
```python
# Write all files including REPORT.json
for file_path, content in repo_content.items():
    full_path = Path(host_workdir) / file_path
```
Security: Potential path traversal.

If corpus `repo_content` keys contain `../../etc/passwd`, `Path(host_workdir) / file_path` resolves outside the workdir. No path sanitization or containment check.

Since the corpus is loaded from a cloneable git repo, this is exploitable if the corpus is tampered with.

Fix: Resolve and verify containment:

```python
full_path = (Path(host_workdir) / file_path).resolve()
assert full_path.is_relative_to(Path(host_workdir).resolve())
```

```python
    f"Idempotency-Key: {idempotency_key}",
]
if api_key:
    headers.extend(["-H", f"X-Api-Key: {api_key}"])
```
Security: API key visible in process table.

Shelling out to curl with `-H "X-Api-Key: ..."` exposes the key in `ps aux` / `/proc/*/cmdline`. Consider using httpx/aiohttp (already in the dep tree), or at minimum pass the header via `--header @<(echo ...)` or a temp file with restrictive permissions.
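If curl is kept, the temp-file route can be sketched like this (`write_curl_config` is an invented helper name; curl reads `header = "..."` lines from a file passed via `--config`, keeping the key out of argv):

```python
import os
import tempfile

def write_curl_config(api_key: str, idempotency_key: str) -> str:
    """Write headers to a 0600 temp file so the key never appears in the process table.

    Pass the returned path to curl via --config and delete it afterwards.
    """
    fd, path = tempfile.mkstemp(prefix="curl_hdrs_", suffix=".cfg")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(f'header = "X-Api-Key: {api_key}"\n')
            f.write(f'header = "Idempotency-Key: {idempotency_key}"\n')
        os.chmod(path, 0o600)  # mkstemp already uses 0600; made explicit here
    except Exception:
        os.unlink(path)
        raise
    return path
```

Switching to httpx/aiohttp avoids the problem entirely, since headers sent from in-process HTTP clients never touch the command line.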
```python
# failure tags real dead code with the same reason as framework-
# wired code). See issue #676 for details.
root_files = metadata.get("rootFilesCount", 0) or 0
reason_filtered = 0
```

Misleading metric: `reason_filtered` is always 0.

Initialized here but never incremented anywhere. Reported in metadata as `"reasonFiltered": reason_filtered`, which always shows 0. Either implement the filtering or remove the variable to avoid confusion.
```python
    than a false positive.
    """

def parse_api_response(self, response: dict, prefilter_types: bool = False) -> dict:
```

Dead parameter: `prefilter_types` is never True in practice.

Defaults to False, and the only caller (benchmark.py) never passes True. The type/interface filtering this would enable is done separately in `create_environment()` instead. The two filtering paths are duplicated but disconnected — maintaining both is a hazard.

Suggestion: Pick one location for the filtering logic and remove the other.
```python
)

# Init git so the harness can track modifications
subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False)
```

Missing timeouts on git subprocess calls.

DeadCodeBenchmark correctly uses `timeout=30` on all five git init/config/add/commit calls. These have no timeout. `git add -A` on a large repo can hang indefinitely.
pyproject.toml
Outdated
```toml
"tests/**/*.py" = ["S", "T20"]
"infrastructure/**/*.py" = ["S603", "S607"]
"src/mcpbr/infrastructure/**/*.py" = ["S603", "S607", "S108"]
"src/mcpbr/benchmarks/**/*.py" = ["S608"]
```

Overly broad lint suppression.

S608 (SQL injection detection) is suppressed for the entire benchmarks/ directory. The PR description says this is for false positives on f-string prompt templates, but the suppression covers all current and future benchmark files. It should be scoped to the specific files that need it:

```toml
"src/mcpbr/benchmarks/deadcode.py" = ["S608"]
"src/mcpbr/benchmarks/supermodel/benchmark.py" = ["S608"]
```

```python
raw_b = item.get(fb, "")
a = normalize_path(raw_a) if fa in path_like_fields else normalize_name(raw_a)
b = normalize_path(raw_b) if fb in path_like_fields else normalize_name(raw_b)
if a and b:
```
Silent data loss: items with empty fields are dropped.
If a ground truth or prediction item has an empty name (e.g. a default export, or an unnamed function), it's silently excluded from the comparison set. This could artificially inflate recall.
Consider logging a warning when items are dropped, or using a sentinel value for empty names.
…ment

The `is_mcp` flag was never passed from the harness (Protocol signature doesn't include it), so the analysis JSON was never placed in the workdir and agents always ran with the baseline prompt instead of the enhanced analysis prompt.

Remove the `is_mcp` parameter to match the Benchmark Protocol signature, and always place the Supermodel analysis JSON and use the enhanced prompt. The SupermodelBenchmark is always used in enhanced (MCP) mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (5)
src/mcpbr/benchmarks/supermodel/benchmark.py (5)
29-33: ⚠️ Potential issue | 🟠 Major — Hardcoded `"dead_code"` key breaks other analysis types.

Line 30 and Line 704 force the dead-code schema even when the endpoint is `impact_analysis`, `test_coverage`, or `circular_deps`. This yields empty findings for non-dead-code modes.

Suggested fix:

```diff
-REPORT_PLACEHOLDER = """{
-  "dead_code": [],
-  "analysis_complete": false
-}
-"""
+def _report_placeholder(findings_key: str) -> str:
+    return json.dumps({findings_key: [], "analysis_complete": False}, indent=2)

-        report_path.write_text(REPORT_PLACEHOLDER)
+        findings_key = self._endpoint.name()
+        report_path.write_text(_report_placeholder(findings_key))

-        agent_findings = report.get("dead_code", [])
+        findings_key = self._endpoint.name()
+        agent_findings = report.get(findings_key, [])
```

Also applies to: 704-704
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 29 - 33, The REPORT_PLACEHOLDER currently hardcodes a dead-code schema (keys "dead_code" and "analysis_complete") which forces that shape for all analysis endpoints and causes empty findings for non-dead-code modes; update the logic that emits REPORT_PLACEHOLDER (symbol REPORT_PLACEHOLDER) so it builds a placeholder dynamically based on the active endpoint (e.g., "impact_analysis", "test_coverage", "circular_deps") or remove the dead_code key and return a generic empty result structure per endpoint, ensuring the placeholder structure matches each analysis type's expected keys rather than always using "dead_code".
330-332: ⚠️ Potential issue | 🟠 Major — In-place task mutation can corrupt baseline/enhanced A/B runs.

Line 330 mutates the original task dict, so a second pass can inherit the wrong prompt condition. Use a shallow copy before rewriting `problem_statement`.

Suggested fix:

```diff
-    # Always use the enhanced prompt (with analysis JSON)
-    task["problem_statement"] = task.get(
+    # Always use the enhanced prompt (with analysis JSON)
+    task = dict(task)
+    task["problem_statement"] = task.get(
         "problem_statement_enhanced", task["problem_statement"]
     )
```
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 330 - 332, The code mutates the original task dict by assigning to task["problem_statement"] which can leak prompt conditions between runs; instead create a shallow copy of the task before modifying it (e.g., new_task = task.copy()) and rewrite new_task["problem_statement"] using the existing task.get("problem_statement_enhanced", task["problem_statement"]) expression, then use new_task for downstream processing (refer to the variable task and the "problem_statement" / "problem_statement_enhanced" keys in benchmark.py).
467-467: ⚠️ Potential issue | 🟡 Minor — `reasonFiltered` is always reported as 0.

Line 467 initializes `reason_filtered` but it is never incremented, yet Line 505 and Line 562 report it as a meaningful metric.

Also applies to: 505-505, 562-563
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` at line 467, The variable reason_filtered is initialized but never incremented, so update the code paths that determine a filtered reason (the branches that later report reason_filtered at the points corresponding to the logic around where reports are made) to increment reason_filtered whenever an item is filtered for a reason; specifically, locate the conditional branches that set or log the filtering reason (referencing the reason_filtered variable and the report sites around lines where reports occur) and add reason_filtered += 1 in each branch (or ensure the increment happens centrally where a filter decision is made), taking care to respect scope (declare nonlocal/global if inside a nested function) so the counter reflects actual filtered items.
605-629: ⚠️ Potential issue | 🟠 Major — Git subprocess setup needs timeouts and stronger execution safety.

At Lines 605–629, all git commands run with `check=False` and no timeout, so hangs/failures are silently ignored. Also, using `"git"` relies on PATH lookup (S607).

Suggested fix:

```diff
+    git_bin = shutil.which("git") or "git"
-    subprocess.run(["git", "init"], cwd=host_workdir, capture_output=True, check=False)
+    subprocess.run([git_bin, "init"], cwd=host_workdir, capture_output=True, check=True, timeout=30)
@@
-        ["git", "config", "user.email", "mcpbr@test.com"],
+        [git_bin, "config", "user.email", "mcpbr@test.com"],
         cwd=host_workdir,
         capture_output=True,
-        check=False,
+        check=True,
+        timeout=30,
@@
-        ["git", "config", "user.name", "MCPBR"],
+        [git_bin, "config", "user.name", "MCPBR"],
         cwd=host_workdir,
         capture_output=True,
-        check=False,
+        check=True,
+        timeout=30,
@@
-        ["git", "add", "-A"],
+        [git_bin, "add", "-A"],
         cwd=host_workdir,
         capture_output=True,
-        check=False,
+        check=True,
+        timeout=30,
@@
-        ["git", "commit", "-m", "Initial"],
+        [git_bin, "commit", "-m", "Initial"],
         cwd=host_workdir,
         capture_output=True,
-        check=False,
+        check=True,
+        timeout=30,
```

```bash
#!/bin/bash
# Verify current git subprocess patterns and compare with other benchmark implementations
rg -n 'subprocess\.run\(' src/mcpbr/benchmarks/supermodel/benchmark.py -C2
rg -n 'subprocess\.run\(' src/mcpbr/benchmarks/deadcode.py -C2
```
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 605 - 629, The git invocations in the subprocess.run block (the git init/config/add/commit calls using host_workdir) currently use check=False and no timeout and rely on "git" via PATH; update these calls to (1) resolve the git binary with shutil.which("git") and raise a clear error if not found, (2) use check=True (or call subprocess.run inside a try/except that catches subprocess.CalledProcessError) and (3) add a reasonable timeout (e.g. a few seconds) so failures and hangs surface instead of being silently ignored; apply this change to the sequence of subprocess.run calls that perform git init, git config, git add, and git commit in benchmark.py and ensure errors are logged/propagated appropriately.
83-83: ⚠️ Potential issue | 🟠 Major — Temp workdir is leaked (no cleanup path).

Line 83 creates a process-lifetime temp dir and never removes it. This accumulates clones/zips across runs.

Suggested fix:

```diff
 class SupermodelBenchmark:
 @@
     def __init__(...):
 @@
         self._work_dir = Path(tempfile.mkdtemp(prefix="mcpbr_supermodel_"))
+
+    def close(self) -> None:
+        if self._work_dir.exists():
+            shutil.rmtree(self._work_dir, ignore_errors=True)
```
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` at line 83, The temp workdir created by tempfile.mkdtemp at self._work_dir is leaked; change initialization to use a managed TemporaryDirectory or add an explicit cleanup path: create a tempfile.TemporaryDirectory() and store the context object (e.g., self._tmpdir_obj) and set self._work_dir = Path(self._tmpdir_obj.name), or keep the mkdtemp but implement a cleanup method (e.g., close/cleanup on your Benchmark class) that calls shutil.rmtree(self._work_dir) and register it with atexit.register or ensure callers invoke it (and update __del__ or context-manager support if appropriate) so the directory is removed when the process or object ends; reference symbols: self._work_dir, tempfile.mkdtemp, tempfile.TemporaryDirectory, shutil.rmtree, and add a cleanup/close method on the class.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 565-566: Replace the broad "except Exception as e" that logs
"Failed to get Supermodel analysis for {instance_id}" with narrow exception
handlers for the expected errors (for example catch RuntimeError, OSError and
json.JSONDecodeError explicitly) and log the same message inside those handlers
(referencing instance_id and logger.error); do not swallow other exceptions—let
unexpected exceptions propagate (or re-raise) so genuine bugs surface. Ensure
the handlers reference the same logging message and include the caught exception
details.
---
Duplicate comments:
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 29-33: The REPORT_PLACEHOLDER currently hardcodes a dead-code
schema (keys "dead_code" and "analysis_complete") which forces that shape for
all analysis endpoints and causes empty findings for non-dead-code modes; update
the logic that emits REPORT_PLACEHOLDER (symbol REPORT_PLACEHOLDER) so it builds
a placeholder dynamically based on the active endpoint (e.g., "impact_analysis",
"test_coverage", "circular_deps") or remove the dead_code key and return a
generic empty result structure per endpoint, ensuring the placeholder structure
matches each analysis type's expected keys rather than always using "dead_code".
- Around line 330-332: The code mutates the original task dict by assigning to
task["problem_statement"] which can leak prompt conditions between runs; instead
create a shallow copy of the task before modifying it (e.g., new_task =
task.copy()) and rewrite new_task["problem_statement"] using the existing
task.get("problem_statement_enhanced", task["problem_statement"]) expression,
then use new_task for downstream processing (refer to the variable task and the
"problem_statement" / "problem_statement_enhanced" keys in benchmark.py).
- Line 467: The variable reason_filtered is initialized but never incremented,
so update the code paths that determine a filtered reason (the branches that
later report reason_filtered at the points corresponding to the logic around
where reports are made) to increment reason_filtered whenever an item is
filtered for a reason; specifically, locate the conditional branches that set or
log the filtering reason (referencing the reason_filtered variable and the
report sites around lines where reports occur) and add reason_filtered += 1 in
each branch (or ensure the increment happens centrally where a filter decision
is made), taking care to respect scope (declare nonlocal/global if inside a
nested function) so the counter reflects actual filtered items.
- Around line 605-629: The git invocations in the subprocess.run block (the git
init/config/add/commit calls using host_workdir) currently use check=False and
no timeout and rely on "git" via PATH; update these calls to (1) resolve the git
binary with shutil.which("git") and raise a clear error if not found, (2) use
check=True (or call subprocess.run inside a try/except that catches
subprocess.CalledProcessError) and (3) add a reasonable timeout (e.g. a few
seconds) so failures and hangs surface instead of being silently ignored; apply
this change to the sequence of subprocess.run calls that perform git init, git
config, git add, and git commit in benchmark.py and ensure errors are
logged/propagated appropriately.
- Line 83: The temp workdir created by tempfile.mkdtemp at self._work_dir is
leaked; change initialization to use a managed TemporaryDirectory or add an
explicit cleanup path: create a tempfile.TemporaryDirectory() and store the
context object (e.g., self._tmpdir_obj) and set self._work_dir =
Path(self._tmpdir_obj.name), or keep the mkdtemp but implement a cleanup method
(e.g., close/cleanup on your Benchmark class) that calls
shutil.rmtree(self._work_dir) and register it with atexit.register or ensure
callers invoke it (and update __del__ or context-manager support if appropriate)
so the directory is removed when the process or object ends; reference symbols:
self._work_dir, tempfile.mkdtemp, tempfile.TemporaryDirectory, shutil.rmtree,
and add a cleanup/close method on the class.
ℹ️ Review info
⚙️ Run configuration — Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: 53afd5d4-a7e1-480c-b6b3-1d5d22be9b3f
📒 Files selected for processing (1): src/mcpbr/benchmarks/supermodel/benchmark.py
```python
except Exception as e:
    logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")
```
Catch narrower exceptions here.
Line 565 catches Exception, which can mask programming bugs and make retries/debugging noisy. Catch expected failures (RuntimeError, OSError, json.JSONDecodeError) and let unexpected ones surface.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 565-565: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 565 - 566, Replace
the broad "except Exception as e" that logs "Failed to get Supermodel analysis
for {instance_id}" with narrow exception handlers for the expected errors (for
example catch RuntimeError, OSError and json.JSONDecodeError explicitly) and log
the same message inside those handlers (referencing instance_id and
logger.error); do not swallow other exceptions—let unexpected exceptions
propagate (or re-raise) so genuine bugs surface. Ensure the handlers reference
the same logging message and include the caught exception details.
Critical fixes:
- Fix FP calculation in DeadCodeBenchmark: `fp = len(found_set - dead_set)` instead of `len(found_set & alive_set)`, which undercounted false positives
- Fix task dict mutation in SupermodelBenchmark.create_environment that broke A/B baseline/enhanced comparisons
- Make get_pre_merge_commit async with timeout (was blocking event loop)
- Add path containment check for corpus file writing (path traversal)
- Pass API key via curl --config file instead of CLI arg (process table)

Bug fixes:
- Parameterize hardcoded "dead_code" key via endpoint.findings_key so non-dead-code analysis types (impact, circular-deps, etc.) work
- Add timeouts to git subprocess calls in SupermodelBenchmark
- Add json.loads error handling in API client (non-JSON responses)
- Expand file glob from *.ts-only to include .tsx/.js/.jsx/.mjs by language
- Add _work_dir temp directory cleanup via atexit handler
- Log warning on silent git pull failure in corpus update

Cleanup:
- Extract shared utilities (_bench_utils.py): findings parser, git init, safe file writing — removes duplication between both benchmarks
- Remove dead code: _score_and_cap_candidates, reason_filtered, unused prefilter_types parameter on DeadCodePlugin.parse_api_response
- Narrow S608 lint suppression to specific files
- Add dropped-item logging in evaluation.py build_comparison_set
- Move inline imports (tempfile, os, Counter) to module level
Actionable comments posted: 6
♻️ Duplicate comments (2)
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py (1)
13-20: ⚠️ Potential issue | 🟠 Major — Ground truth is still counting type-only removals as dead code.

Lines 17-19 let `_parse_diff()` capture removed `interface`/`type`/`enum` exports, and Line 248 returns them as GT items unchanged. That conflicts with the prompt contract and with the later prefiltering in `src/mcpbr/benchmarks/supermodel/benchmark.py`, so an agent can follow the instructions and still lose recall.

Suggested fix:

```diff
 TS_PATTERNS = [
     (r"^-\s*export\s+(?:async\s+)?function\s+(\w+)", "function"),
     (r"^-\s*export\s+class\s+(\w+)", "class"),
     (r"^-\s*export\s+const\s+(\w+)\s*[=:]", "const"),
-    (r"^-\s*export\s+interface\s+(\w+)", "interface"),
-    (r"^-\s*export\s+type\s+(\w+)\s*[=<{]", "type"),
-    (r"^-\s*export\s+enum\s+(\w+)", "enum"),
     (r"^-\s*export\s+default\s+(?:async\s+)?function\s+(\w+)", "function"),
     (r"^-\s*export\s+default\s+class\s+(\w+)", "class"),
 ]
```

Also applies to: 237-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 13 - 20, The ground-truth generation is treating removed TypeScript-only symbols as dead code (TS_PATTERNS includes "interface"/"type"/"enum" and _parse_diff returns them as GT items), which conflicts with later prefiltering; update the logic so type-only removals are excluded: either remove the ("interface","type","enum") entries from TS_PATTERNS or add a filter inside _parse_diff to skip items whose kind is one of {"interface","type","enum"} before they are added to/returned as GT; ensure the change is applied where GT items are assembled/returned so benchmark prefiltering and _parse_diff agree.
src/mcpbr/benchmarks/supermodel/benchmark.py (1)
365-385: ⚠️ Potential issue | 🔴 Critical — Enhanced artifacts are always mounted, and failures are silently downgraded.

Unlike `DeadCodeBenchmark.create_environment(..., is_mcp: bool)`, this path always tries to materialize the Supermodel analysis in the workspace. Then Lines 522-528 just log and continue if that setup fails. So a baseline run can still see graph data when setup works, and an enhanced run can silently lose it when setup breaks. Both cases make the comparison noisy; this needs an explicit MCP/baseline gate and a fail-fast path for enhanced setup failures.

Also applies to: 522-529
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 365 - 385, This always-attempts-to-materialize behavior should be gated and fail-fast: only attempt to load or call self._get_analysis when the environment is the enhanced/MCP path (use the is_mcp boolean from DeadCodeBenchmark.create_environment(..., is_mcp: bool) or the equivalent flag), and treat failures during that materialization as fatal for enhanced runs (raise/propagate an exception) instead of silently logging and continuing; for baseline runs skip materialization entirely so both baseline and enhanced runs are symmetric. Locate the logic around task.get("cached_analysis") and the call to self._get_analysis and add an explicit is_mcp check before trying to materialize, and change the error handling for the enhanced path to fail-fast rather than downgrading.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mcpbr/benchmarks/_bench_utils.py`:
- Around line 27-33: The parser currently uses text.find(marker) which picks the
first occurrence (often a removed/`-` array); change the logic in the function
that computes marker/start/arr_start to locate the last valid array after the
key (use text.rfind(marker) or loop to find the last marker occurrence and then
search forward for the next '['), and similarly update the subsequent parsing
used in the block handling lines 38-60 so it selects the array that follows the
last marker (or prefers lines starting with '+' indicating added content) before
extracting the JSON array of findings.
In `@src/mcpbr/benchmarks/deadcode.py`:
- Around line 398-405: The reported "Found" count and returned found metric use
len(agent_findings) while scoring uses the deduplicated found_set; update all
places that log or return the found count to use len(found_set) instead of
len(agent_findings) (e.g., in the block that computes found_set, dead_set,
tp/fp/fn and where precision/recall are calculated), and apply the same
replacement for the later similar block (around the code that computes
precision/recall and calls compute_prf1()) so the logged/returned found value
matches the deduped set used for TP/FP/FN.
- Around line 220-224: The code in _load_raw_tasks pins corpus loading to a
single fixture by calling _load_corpus_task(self._corpus_dir,
"typescript-express-app"); change this to enumerate the task descriptors found
under the cloned corpus .benchmark/ directory (use self._corpus_dir from
_clone_or_update_corpus(self.corpus_path)), build tasks by calling
_load_corpus_task for each descriptor, and assign the resulting list to
self._tasks so sampling and filters apply across the whole corpus; ensure you
preserve existing behavior for filters/sample_size applied later in the method.
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 207-265: The _generate_baseline_problem_statement method currently
hard-codes a dead-code template and ignores self._endpoint.baseline_prompt;
change it to first obtain the baseline text from self._endpoint.baseline_prompt
(use that when present) and only construct a fallback language-specific template
when baseline_prompt is None/empty; keep existing language detection (language,
ext) and language-specific hints but merge them into or append to the
endpoint-provided prompt rather than replacing it, and ensure the returned
string respects the ext variable and CRITICAL RULES format expected by
DeadCodePlugin.baseline_prompt.
- Around line 430-445: The entryPoints consumed from analysis_json are not
normalized the same way as candidates, so ep_set ((ep.get("file", ""),
ep.get("name", ""))) can contain paths like "src/foo/bar.ts" while candidates
have had scope_prefix removed (see scope_prefix handling around the
normalization code), causing mismatches; update the entry-point path
normalization to mirror candidate normalization: for each ep file in
entryPoints, strip the same scope_prefix using pathlib/Path semantics (not a raw
startswith) so you remove only a true directory prefix (use Path.relative_to or
compare parts) and produce the same file string format used when building
all_candidates, then rebuild ep_set from those normalized (file,name) tuples so
the filtering of all_candidates by ep_set works correctly.
In `@src/mcpbr/benchmarks/supermodel/git_utils.py`:
- Around line 25-38: The subprocess wait calls using asyncio.wait_for around
proc.communicate() (see proc variable in git clone/zip/gh invocations) need
timeout handling to avoid orphaned child processes: wrap each await
asyncio.wait_for(proc.communicate(), timeout=300) in a try/except
asyncio.TimeoutError, call proc.kill(), await proc.wait() to ensure the child is
reaped, then raise a clear RuntimeError (e.g., "Clone timed out" / "Archive
timed out" / "GH timed out") so the caller sees the timeout; apply the same
pattern to all similar subprocess invocations in this file (the ones using proc,
proc.communicate(), and proc.returncode checks).
---
Duplicate comments:
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 365-385: This always-attempts-to-materialize behavior should be
gated and fail-fast: only attempt to load or call self._get_analysis when the
environment is the enhanced/MCP path (use the is_mcp boolean from
DeadCodeBenchmark.create_environment(..., is_mcp: bool) or the equivalent flag),
and treat failures during that materialization as fatal for enhanced runs
(raise/propagate an exception) instead of silently logging and continuing; for
baseline runs skip materialization entirely so both baseline and enhanced runs
are symmetric. Locate the logic around task.get("cached_analysis") and the call
to self._get_analysis and add an explicit is_mcp check before trying to
materialize, and change the error handling for the enhanced path to fail-fast
rather than downgrading.
In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:
- Around line 13-20: The ground-truth generation is treating removed
TypeScript-only symbols as dead code (TS_PATTERNS includes
"interface"/"type"/"enum" and _parse_diff returns them as GT items), which
conflicts with later prefiltering; update the logic so type-only removals are
excluded: either remove the ("interface","type","enum") entries from TS_PATTERNS
or add a filter inside _parse_diff to skip items whose kind is one of
{"interface","type","enum"} before they are added to/returned as GT; ensure the
change is applied where GT items are assembled/returned so benchmark
prefiltering and _parse_diff agree.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ad5facef-0a89-4d95-b99b-a06d83f04f59
📒 Files selected for processing (10)
- pyproject.toml
- src/mcpbr/benchmarks/_bench_utils.py
- src/mcpbr/benchmarks/codegraph.py
- src/mcpbr/benchmarks/deadcode.py
- src/mcpbr/benchmarks/supermodel/api_client.py
- src/mcpbr/benchmarks/supermodel/benchmark.py
- src/mcpbr/benchmarks/supermodel/endpoints/base.py
- src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
- src/mcpbr/benchmarks/supermodel/evaluation.py
- src/mcpbr/benchmarks/supermodel/git_utils.py
✅ Files skipped from review due to trivial changes (3)
- src/mcpbr/benchmarks/codegraph.py
- src/mcpbr/benchmarks/supermodel/evaluation.py
- src/mcpbr/benchmarks/supermodel/api_client.py
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
```python
marker = f'"{findings_key}"'
start = text.find(marker)
if start == -1:
    return findings
arr_start = text.find("[", start)
if arr_start == -1:
    return findings
```
Diff fallback will usually read the deleted REPORT.json array.
This parser grabs the first "{findings_key}" array it sees. In a normal unified diff, that is often the removed placeholder like - "dead_code": [], so fallback scoring returns an empty list even when the patch later adds real findings. Please prefer the last valid array after the key, or explicitly handle added diff lines.
Also applies to: 38-60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/_bench_utils.py` around lines 27 - 33, The parser
currently uses text.find(marker) which picks the first occurrence (often a
removed/`-` array); change the logic in the function that computes
marker/start/arr_start to locate the last valid array after the key (use
text.rfind(marker) or loop to find the last marker occurrence and then search
forward for the next '['), and similarly update the subsequent parsing used in
the block handling lines 38-60 so it selects the array that follows the last
marker (or prefers lines starting with '+' indicating added content) before
extracting the JSON array of findings.
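For illustration, here is a minimal sketch of the last-occurrence parsing the prompt describes. The function name and surrounding shape are assumptions, not the actual helper in `_bench_utils.py`; the point is using `rfind` so a `-`-prefixed placeholder earlier in the diff cannot shadow the added array.

```python
import json


def extract_findings_array(text: str, findings_key: str) -> list:
    """Parse the LAST occurrence of '"<key>": [...]' in text.

    In a unified diff, the first occurrence is often the removed
    placeholder (e.g. '-  "dead_code": []'), so we anchor on the last
    marker instead. Bracket matching here is naive (it would miscount
    brackets inside string values), which is acceptable for flat
    findings arrays.
    """
    marker = f'"{findings_key}"'
    start = text.rfind(marker)  # last occurrence, not the first
    if start == -1:
        return []
    arr_start = text.find("[", start)
    if arr_start == -1:
        return []
    depth = 0
    for i in range(arr_start, len(text)):
        ch = text[i]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[arr_start : i + 1])
                except json.JSONDecodeError:
                    return []
    return []
```

With a diff like `-  "dead_code": [],` followed by `+  "dead_code": [{"file": "a.ts", "name": "x"}],`, the `rfind` anchor lands on the added line, so the real findings survive.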
```python
# Load from corpus repository
try:
    self._corpus_dir = _clone_or_update_corpus(self.corpus_path)
    task = _load_corpus_task(self._corpus_dir, "typescript-express-app")
    self._tasks = [task]
```
Corpus mode is pinned to a single fixture.
_load_raw_tasks() clones the corpus and then always loads typescript-express-app. That means the rest of the corpus never participates, and sample_size / task filters mostly just slice one item. If this benchmark is meant to represent the corpus, it should enumerate the task descriptors under .benchmark/ instead of hard-coding one case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/deadcode.py` around lines 220 - 224, The code in
_load_raw_tasks pins corpus loading to a single fixture by calling
_load_corpus_task(self._corpus_dir, "typescript-express-app"); change this to
enumerate the task descriptors found under the cloned corpus .benchmark/
directory (use self._corpus_dir from _clone_or_update_corpus(self.corpus_path)),
build tasks by calling _load_corpus_task for each descriptor, and assign the
resulting list to self._tasks so sampling and filters apply across the whole
corpus; ensure you preserve existing behavior for filters/sample_size applied
later in the method.
```python
found_set = {(f.get("file", ""), f.get("name", "")) for f in agent_findings}
dead_set = {(d.get("file", ""), d.get("name", "")) for d in expected_dead}

tp = len(found_set & dead_set)
fp = len(found_set - dead_set)
fn = len(dead_set - found_set)

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
```
found should use the deduped prediction set.
The actual scoring uses found_set, so duplicates do not change TP/FP/FN. But the log output and returned found metric still use len(agent_findings). A response that reports the same symbol twice will print a bigger "Found" count than the one used for precision/recall. Reporting len(found_set) keeps the metrics internally consistent and aligned with compute_prf1().
Also applies to: 412-433
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/deadcode.py` around lines 398 - 405, The reported
"Found" count and returned found metric use len(agent_findings) while scoring
uses the deduplicated found_set; update all places that log or return the found
count to use len(found_set) instead of len(agent_findings) (e.g., in the block
that computes found_set, dead_set, tp/fp/fn and where precision/recall are
calculated), and apply the same replacement for the later similar block (around
the code that computes precision/recall and calls compute_prf1()) so the
logged/returned found value matches the deduped set used for TP/FP/FN.
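To illustrate the discrepancy, here is a self-contained sketch of the deduped scoring. `compute_prf1` is a stand-in with an assumed signature (the review only names the helper); the sample findings are hypothetical.

```python
def compute_prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision/recall/F1 from set-based counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


agent_findings = [
    {"file": "src/a.ts", "name": "unusedFn"},
    {"file": "src/a.ts", "name": "unusedFn"},  # duplicate report of the same symbol
    {"file": "src/b.ts", "name": "liveFn"},
]
expected_dead = [{"file": "src/a.ts", "name": "unusedFn"}]

found_set = {(f["file"], f["name"]) for f in agent_findings}
dead_set = {(d["file"], d["name"]) for d in expected_dead}
tp = len(found_set & dead_set)
fp = len(found_set - dead_set)
fn = len(dead_set - found_set)

# Log the deduped count, not len(agent_findings), so the reported
# "Found" number matches what TP/FP/FN are computed from.
print(f"Found: {len(found_set)}")  # 2, while len(agent_findings) is 3
```

The duplicate entry inflates `len(agent_findings)` to 3 but leaves `found_set` at 2, which is the count the precision/recall math actually uses.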
```python
def _generate_baseline_problem_statement(self, task_cfg: dict) -> str:
    """Generate problem statement for the baseline (manual analysis) condition.

    The agent must find dead code by reading and searching the codebase directly.
    """
    language = task_cfg.get("language", "typescript")

    ext = ".ts" if language == "typescript" else ".py"
    if language == "python":
        lang_hints = """- Functions in __all__ that are never actually imported by other modules
- Cleanup/utility functions whose associated state is never populated"""
    else:
        lang_hints = """- Exported functions/classes that are never imported by any other module
- Middleware or handlers that are defined but never registered with the router
- Methods on classes where the class itself is never instantiated from live code"""

    return f"""You are a code analyst. Find dead code in this {language} codebase.

Dead code = exported functions, classes, methods, interfaces, and constants that
are defined but never used in any meaningful execution path.

{lang_hints}

== STRATEGY ==

STEP 1: Get an overview of the codebase structure.
- List the top-level directories and key source files.
- Identify the main source directories (exclude node_modules, dist, build, tests).

STEP 2: Scan source files for exported symbols.
- Focus on non-test, non-generated source files.
- For each file, note exported functions, classes, interfaces, constants.

STEP 3: For each exported symbol, grep the codebase for references.
- If it only appears in its own definition file (and possibly tests or
  barrel/index re-exports), it is likely dead.
- Barrel re-exports (index.ts) do NOT count as real usage.
- Type-only imports do NOT count as real usage.

STEP 4: Write REPORT.json EARLY (after scanning even a few files).
- Write what you have so far, then continue scanning and UPDATE the file.
- This ensures you always produce output even if you run out of iterations.

REPORT.json format:
{{
  "dead_code": [
    {{"file": "path/to/file{ext}", "name": "unusedFunc", "type": "function", "reason": "no callers found"}},
    ...
  ],
  "analysis_complete": true
}}

CRITICAL RULES:
- Type should be one of: function, class, method, const, interface, variable.
- When in doubt about whether something is dead, INCLUDE it.
- False positives are acceptable. Missing real dead code is NOT acceptable.
- Write REPORT.json after analyzing each batch — do NOT wait until the end.
- Prioritize breadth over depth: scan ALL source files before deep-diving any one."""
```
The baseline prompt is still hard-coded to dead-code rules.
This bypasses self._endpoint.baseline_prompt entirely. Even for analysis_type="dead-code", Line 225 tells the model to include interfaces, which contradicts DeadCodePlugin.baseline_prompt; for any other endpoint, it is just the wrong task. The benchmark should build the baseline prompt from the selected plugin instead of a dead-code-specific template.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 207 - 265, The
_generate_baseline_problem_statement method currently hard-codes a dead-code
template and ignores self._endpoint.baseline_prompt; change it to first obtain
the baseline text from self._endpoint.baseline_prompt (use that when present)
and only construct a fallback language-specific template when baseline_prompt is
None/empty; keep existing language detection (language, ext) and
language-specific hints but merge them into or append to the endpoint-provided
prompt rather than replacing it, and ensure the returned string respects the ext
variable and CRITICAL RULES format expected by DeadCodePlugin.baseline_prompt.
```python
# Build entry point set for cross-reference filtering
ep_set = set()
for ep in analysis_json.get("entryPoints", []):
    ep_file = ep.get("file", "")
    ep_name = ep.get("name", "")
    if ep_file and ep_name:
        ep_set.add((ep_file, ep_name))

# Drop candidates that match entry points
if ep_set:
    before_ep = len(all_candidates)
    all_candidates = [
        c
        for c in all_candidates
        if (c.get("file", ""), c.get("name", "")) not in ep_set
    ]
```
Scoped runs are normalizing candidates and entry points differently.
Line 604 strips scope_prefix from candidate file paths, but the entryPoints consumed at Lines 432-445 keep the old paths. A candidate can become foo/bar.ts while the matching entry point stays src/foo/bar.ts, so the live symbol no longer filters out. Also, raw startswith() will happily strip src/app from src/application/...; use a real directory boundary or Path logic here.
Also applies to: 604-611
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 430 - 445, The
entryPoints consumed from analysis_json are not normalized the same way as
candidates, so ep_set ((ep.get("file", ""), ep.get("name", ""))) can contain
paths like "src/foo/bar.ts" while candidates have had scope_prefix removed (see
scope_prefix handling around the normalization code), causing mismatches; update
the entry-point path normalization to mirror candidate normalization: for each
ep file in entryPoints, strip the same scope_prefix using pathlib/Path semantics
(not a raw startswith) so you remove only a true directory prefix (use
Path.relative_to or compare parts) and produce the same file string format used
when building all_candidates, then rebuild ep_set from those normalized
(file,name) tuples so the filtering of all_candidates by ep_set works correctly.
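A small sketch of the directory-boundary-aware strip the prompt suggests. The helper name is illustrative (the benchmark's actual function is not named in the review); the key idea is comparing `PurePosixPath.parts` so a prefix like `src/app` cannot eat into `src/application`.

```python
from pathlib import PurePosixPath


def strip_scope_prefix(path: str, scope_prefix: str) -> str:
    """Remove scope_prefix from path only when it is a true directory prefix.

    Illustrative helper: applying the same function to both entry-point
    files and candidate files keeps ep_set and all_candidates comparable.
    """
    p = PurePosixPath(path)
    prefix = PurePosixPath(scope_prefix)
    if p.parts[: len(prefix.parts)] == prefix.parts:
        return str(PurePosixPath(*p.parts[len(prefix.parts):]))
    return path


# A raw startswith("src/app") would wrongly strip from "src/application/...":
assert strip_scope_prefix("src/app/foo.ts", "src/app") == "foo.ts"
assert strip_scope_prefix("src/application/utils.ts", "src/app") == "src/application/utils.ts"
```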
```python
proc = await asyncio.create_subprocess_exec(
    "git",
    "clone",
    "--quiet",
    "--depth",
    "1",
    clone_url,
    dest,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
)
_, stderr = await asyncio.wait_for(proc.communicate(), timeout=300)
if proc.returncode != 0:
    raise RuntimeError(f"Clone failed: {stderr.decode()}")
```
🧩 Analysis chain
🌐 Web query:
In Python asyncio, if `asyncio.wait_for(proc.communicate(), timeout=...)` times out for a subprocess created with `asyncio.create_subprocess_exec`, is the child process automatically terminated, or does it need to be killed explicitly?
💡 Result:
No. asyncio.wait_for(proc.communicate(), timeout=...) only cancels the awaiting Task when the timeout expires—it does not automatically terminate the underlying OS subprocess, so the child can keep running unless you stop it yourself. The docs explicitly describe wait_for() as cancelling the task on timeout, not killing external resources. [2]
So on timeout you should explicitly terminate/kill the process (and then await proc.wait() to reap it), e.g.:
```python
try:
    out, err = await asyncio.wait_for(proc.communicate(), timeout=5)
except TimeoutError:
    proc.kill()  # or proc.terminate()
    await proc.wait()
    raise
```

asyncio provides Process.terminate() / Process.kill() for this purpose. [1] Reports/discussion around asyncio subprocess timeouts also note that a wait_for() timeout doesn't kill the subprocess automatically. [3]
Sources: [1] asyncio subprocess Process.kill/terminate docs, [2] asyncio.wait_for() timeout/cancellation docs, [3] Python issue discussion on subprocess timeout behavior.
Add cleanup for subprocesses that timeout.
When asyncio.wait_for() times out, it cancels the waiting task—but it doesn't automatically kill the child process. So if git clone (or gh or zip) hangs and hits the 300-second timeout, the process keeps running in the background forever. On a long benchmark batch, you'll pile up zombie processes and run out of file descriptors. Bad times.
Here's the fix: wrap proc.communicate() in a try-except, catch the timeout, kill the process, wait for it to finish, then re-raise:
```python
try:
    _, stderr = await asyncio.wait_for(proc.communicate(), timeout=300)
except asyncio.TimeoutError:
    proc.kill()
    await proc.wait()
    raise RuntimeError("Clone timed out")
```

This applies to all subprocess calls in the file: lines 25–38, 40–54, 56–67, 80–92, 136–145, 185–193.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/supermodel/git_utils.py` around lines 25 - 38, The
subprocess wait calls using asyncio.wait_for around proc.communicate() (see proc
variable in git clone/zip/gh invocations) need timeout handling to avoid
orphaned child processes: wrap each await asyncio.wait_for(proc.communicate(),
timeout=300) in a try/except asyncio.TimeoutError, call proc.kill(), await
proc.wait() to ensure the child is reaped, then raise a clear RuntimeError
(e.g., "Clone timed out" / "Archive timed out" / "GH timed out") so the caller
sees the timeout; apply the same pattern to all similar subprocess invocations
in this file (the ones using proc, proc.communicate(), and proc.returncode
checks).
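The full pattern the prompt describes can be sketched as one reusable helper. The function name, the `what` label, and the 300-second default are assumptions for illustration, not the actual `git_utils.py` API.

```python
import asyncio


async def run_checked(*argv: str, timeout: float = 300, what: str = "command") -> bytes:
    """Run a subprocess, killing and reaping it if it exceeds the timeout.

    Sketch of the reviewed fix: wait_for() only cancels the awaiting task
    on timeout, so the child must be killed explicitly or it leaks.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()        # the child keeps running unless we stop it
        await proc.wait()  # reap it so it doesn't linger as a zombie
        raise RuntimeError(f"{what} timed out after {timeout}s")
    if proc.returncode != 0:
        raise RuntimeError(f"{what} failed: {stderr.decode()}")
    return stdout
```

Each call site (`git clone`, `zip`, `gh`) would then become a one-liner, e.g. `await run_checked("git", "clone", "--depth", "1", url, dest, what="Clone")`, instead of repeating the try/except block six times.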
Precision is not a fair gate for dead-code tasks: the API surfaces many valid candidates beyond the PR's ground-truth set, so max achievable precision is bounded by |GT| / |candidates| — often well below 80% for large repos regardless of agent quality. Recall (did the agent find the items the PR deleted?) is the meaningful signal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
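The bound is simple arithmetic. With illustrative numbers (the candidate count echoes the ~1,900 figure reported for typescript_pr56817 below; the ground-truth size is hypothetical):

```python
# Upper bound on precision when the candidate pool dwarfs the PR ground truth:
# even an agent that correctly reports every valid candidate can match at
# most |GT| of them, so precision <= |GT| / |candidates|.
gt_size = 30          # symbols the PR actually deleted (hypothetical)
candidates = 1900     # valid-looking candidates surfaced by the API
max_precision = gt_size / candidates
print(f"max precision: {max_precision:.1%}")  # ~1.6%; an 80% gate is unreachable
```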
The enhanced_prompt_v2 filter script was dropping any candidate whose reason contained "no references", which is the primary signal for dead exports. This caused the MCP agent to discard ~99% of valid candidates, reporting only 14 out of 1,900 for typescript_pr56817 (6.5% recall). Only drop pure type/interface candidates, which have elevated FP rates from TypeScript structural typing. "No references" on its own means the symbol is genuinely unused and should be kept. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove interface/type/enum from TS_PATTERNS so GT only contains runtime symbols (functions, classes, consts) — aligns with the prompt contract that tells agents not to report type-only declarations, and eliminates a class of GT/agent-output mismatches that lower recall
- Add ge/le constraints on resolved_threshold in config (must be 0–1)
- Update resolved_threshold description to reflect recall-only criterion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (3)
src/mcpbr/benchmarks/supermodel/benchmark.py (3)
209-266: ⚠️ Potential issue | 🟠 Major

The baseline prompt is still hardcoded and doesn't use self._endpoint.baseline_prompt.

This was flagged in a previous review. The issue is:

- Lines 225-228 tell the agent to find "interfaces" as dead code
- But DeadCodePlugin.baseline_prompt (lines 90-91 in dead_code.py) explicitly says "Do NOT include: Type definitions, interfaces, or enums"

So the baseline agent gets conflicting instructions depending on which prompt actually gets used. Since _generate_baseline_problem_statement completely ignores self._endpoint.baseline_prompt, you're not leveraging the endpoint's prompt at all.

A simple fix would be to delegate to the endpoint:

🔧 Suggested fix

```diff
 def _generate_baseline_problem_statement(self, task_cfg: dict) -> str:
-    """Generate problem statement for the baseline (manual analysis) condition.
-
-    The agent must find dead code by reading and searching the codebase directly.
-    """
-    language = task_cfg.get("language", "typescript")
-
-    ext = ".ts" if language == "typescript" else ".py"
-    # ... rest of hardcoded template ...
+    """Generate problem statement for the baseline (manual analysis) condition."""
+    # Use the endpoint's baseline prompt if available
+    scope_prefix = task_cfg.get("scope_prefix")
+    return self._endpoint.scope_prompt(self._endpoint.baseline_prompt, scope_prefix)
```

If you need language-specific customization on top of the endpoint prompt, you could append those hints rather than replacing the entire prompt.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 209 - 266, the baseline prompt is hardcoded in _generate_baseline_problem_statement, which conflicts with DeadCodePlugin.baseline_prompt; change _generate_baseline_problem_statement to start from self._endpoint.baseline_prompt (or use it as the base string) and then append any language-specific hints/extension (e.g., the lang_hints for typescript/python) instead of replacing or duplicating instructions, ensuring you remove or reconcile the hardcoded "interfaces" line so it doesn't contradict DeadCodePlugin.baseline_prompt; locate the method _generate_baseline_problem_statement and DeadCodePlugin.baseline_prompt to implement this delegation/append behavior.
432-447: ⚠️ Potential issue | 🟠 Major

Path normalization mismatch between entry points and candidates in scoped runs.

This was flagged in a previous review and appears to still be an issue. Here's the problem in simple terms:

What happens:

- Lines 608-613: candidate file paths get scope_prefix stripped (e.g., src/app/foo.ts → app/foo.ts)
- Lines 434-438: entry point paths are NOT stripped (still src/app/foo.ts)

The bug: when you check `if (c.get("file", ""), c.get("name", "")) not in ep_set` at line 446, a candidate with path app/foo.ts won't match an entry point with path src/app/foo.ts, so live symbols won't get filtered out.

Also, startswith(prefix) at line 612 is risky: it would incorrectly strip src/app from src/application/utils.ts.

🔧 Suggested fix for safer prefix stripping

```diff
 if scope_prefix and strip_prefix:
-    prefix = scope_prefix.rstrip("/") + "/"
+    # Use path parts comparison to avoid partial directory name matches
+    from pathlib import PurePosixPath
+    prefix_parts = PurePosixPath(scope_prefix).parts
     for key in ("deadCodeCandidates", "candidates", "items"):
         if key in result:
             for item in result[key]:
                 fp = item.get("file", "")
-                if fp.startswith(prefix):
-                    item["file"] = fp[len(prefix):]
+                fp_path = PurePosixPath(fp)
+                if fp_path.parts[:len(prefix_parts)] == prefix_parts:
+                    item["file"] = str(PurePosixPath(*fp_path.parts[len(prefix_parts):]))
```

Also applies to: 604-613

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 432 - 447, entry-point file paths are not normalized the same as candidate files, so (file, name) tuples in ep_set won't match candidates when scope_prefix was stripped; update the ep_set-building loop that reads analysis_json.get("entryPoints", []) to apply the same normalization used for candidates (use the same scope_prefix variable and the same path normalization logic), and change any naive startswith(prefix) checks to a safe strip that only removes the prefix when it matches a directory boundary (e.g., check prefix == path or path.startswith(prefix + os.sep) after os.path.normpath/os.path.sep normalization) so entries in ep_set and all_candidates compare against identical normalized file paths. Ensure you reference and reuse the same normalization utility or code used where candidates are processed (the logic around scope_prefix and all_candidates) so both sides are consistent.
524-530: ⚠️ Potential issue | 🟡 Minor

Catch narrower exceptions instead of bare Exception.

The static analysis tool flagged this, and it was also mentioned in a previous review. Catching Exception can hide bugs like KeyError or AttributeError that indicate programming errors.

The likely exceptions here are:

- RuntimeError from API failures
- OSError/IOError from file operations
- json.JSONDecodeError from parsing

🔧 Suggested fix

```diff
-        except Exception as e:
+        except (RuntimeError, OSError, json.JSONDecodeError) as e:
             logger.error(f"Failed to get Supermodel analysis for {instance_id}: {e}")
```

This way, if there's an unexpected TypeError or KeyError from a bug in your code, it'll bubble up and you'll know about it instead of silently continuing with a broken state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 524 - 530, the current broad except block that logs "Failed to get Supermodel analysis" catches all Exceptions (the line using "except Exception as e:"), which can hide programming errors; replace it with a narrowed catch that only handles expected failures such as RuntimeError, OSError (or IOError) and json.JSONDecodeError around the Supermodel analysis retrieval (reference the variables instance_id and logger in this block), and allow any other exception (TypeError, KeyError, AttributeError, etc.) to propagate so they surface during testing; if you must explicitly handle unexpected exceptions, re-raise them after logging.
🧹 Nitpick comments (3)
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py (1)
199-232: Consider precompiling the regex patterns for better performance.

Right now, framework_names and framework_files are compiled fresh on every call to parse_api_response(). If you're processing many candidates across multiple tasks, moving these to module-level constants saves repeated compilation:

♻️ Optional refactor

```diff
+# Module-level compiled patterns for parse_api_response
+FRAMEWORK_NAMES_RE = re.compile(
+    r"^(execute|up|down|Template|Story|stories|test|Test|Mock|mock|"
+    r"Fixture|Spec|Suite|describe|it|expect|beforeEach|afterEach|"
+    r"setUp|tearDown|default|module|exports|require)$"
+)
+FRAMEWORK_FILES_RE = re.compile(
+    r"(\.test\.|\.spec\.|\.stories\.|__tests__|__mocks__|"
+    r"\.storybook|\.e2e\.|migrations/|\.d\.ts$)"
+)
+
 class DeadCodePlugin(EndpointPlugin):
     # ...
     def parse_api_response(self, response: dict) -> dict:
         candidates = response.get("deadCodeCandidates", [])
-        framework_names = re.compile(...)
-        framework_files = re.compile(...)
-
         filtered = []
         for c in candidates:
             name = c.get("name", "")
             filepath = c.get("file", "")
-            if framework_names.match(name):
+            if FRAMEWORK_NAMES_RE.match(name):
                 continue
-            if framework_files.search(filepath):
+            if FRAMEWORK_FILES_RE.search(filepath):
                 continue
             filtered.append(c)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py` around lines 199 - 232, the regex patterns framework_names and framework_files are being compiled on every call to parse_api_response, which is wasteful; move those re.compile(...) calls out of parse_api_response into module-level constants (e.g. FRAMEWORK_NAMES, FRAMEWORK_FILES) and then update parse_api_response to reference those constants instead of recreating the patterns so regex compilation happens once at import time.

src/mcpbr/config.py (1)
877-913: LGTM on the new Supermodel config fields, with one optional suggestion.

The validation on resolved_threshold using Pydantic's ge/le constraints is solid. The docstrings are clear and the defaults make sense.

One thing to consider: analysis_type accepts any string, but based on the PR description, only specific values are valid (dead-code, impact, test-coverage, circular-deps). You could add a field validator similar to validate_benchmark to catch typos early:

```python
VALID_ANALYSIS_TYPES = ("dead-code", "impact", "test-coverage", "circular-deps")

@field_validator("analysis_type")
@classmethod
def validate_analysis_type(cls, v: str | None) -> str | None:
    if v is not None and v not in VALID_ANALYSIS_TYPES:
        raise ValueError(
            f"Invalid analysis_type: {v}. Must be one of: {', '.join(VALID_ANALYSIS_TYPES)}"
        )
    return v
```

This way, if someone types "dead_code" (underscore) instead of "dead-code" (hyphen), they get a helpful error instead of a confusing runtime failure later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/config.py` around lines 877 - 913, add strict validation for the analysis_type field: define a VALID_ANALYSIS_TYPES tuple containing "dead-code", "impact", "test-coverage", "circular-deps" and add a field validator (named e.g. validate_analysis_type) decorated with `@field_validator`("analysis_type") and `@classmethod` that checks if analysis_type is not None and is in VALID_ANALYSIS_TYPES, raising a ValueError with a helpful message if not; mirror the style of the existing validate_benchmark implementation so typos like "dead_code" are caught at config validation time.

src/mcpbr/benchmarks/supermodel/benchmark.py (1)
84-162: Heads up:filter_categoryfilters onlanguagefield, which might be confusing.At line 155-156:
if filter_category: category_set = set(filter_category) tasks = [t for t in tasks if t.get("language", "typescript") in category_set]So if someone passes
--filter-category python, they're actually filtering by language. The naming is a bit misleading. Consider either:
- Renaming this to
filter_languageif that's the intent- Adding a separate
categoryfield to tasks if you want to support bothNot blocking, just something to be aware of for future maintainability.
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/mcpbr/benchmarks/supermodel/benchmark.py` around lines 84-162: the `filter_category` parameter in `load_tasks` actually filters on the task's `"language"` key, which is confusing. Either rename the parameter to `filter_language` everywhere (the `load_tasks` signature, any callers, CLI flags, and docs) and adjust local variable names (e.g. `category_set` -> `language_set`) so the intent matches the code, or add a real `"category"` field to the task dict (in the task-creation block where keys like `"language"` and `"description"` are set) and change the filter to check `t.get("category")`. Update corresponding tests and CLI callers accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around lines 209-266: The baseline prompt is hardcoded in `_generate_baseline_problem_statement`, which conflicts with `DeadCodePlugin.baseline_prompt`. Change `_generate_baseline_problem_statement` to start from `self._endpoint.baseline_prompt` (or use it as the base string) and then append any language-specific hints (e.g. the `lang_hints` for typescript/python) instead of replacing or duplicating instructions. Remove or reconcile the hardcoded "interfaces" line so it doesn't contradict `DeadCodePlugin.baseline_prompt`.
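The delegate-then-append behavior can be sketched as a standalone helper (in the benchmark this would read `self._endpoint.baseline_prompt`; the hint strings below are illustrative placeholders, not the PR's actual wording):

```python
def build_baseline_statement(endpoint_prompt: str, language: str) -> str:
    """Start from the endpoint's canonical baseline prompt and append a
    language-specific hint, rather than redefining the instructions.

    The endpoint prompt stays the single source of truth; only the
    per-language extension varies.
    """
    lang_hints = {  # illustrative hint text
        "typescript": "Watch for re-exports through barrel (index.ts) files.",
        "python": "Treat __all__ entries and dunder methods as live code.",
    }
    hint = lang_hints.get(language, "")
    return f"{endpoint_prompt}\n\n{hint}" if hint else endpoint_prompt
```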
- Around lines 432-447: Entry-point file paths are not normalized the same way as candidate files, so `(file, name)` tuples in `ep_set` won't match candidates whose `scope_prefix` was stripped. Update the `ep_set`-building loop that reads `analysis_json.get("entryPoints", [])` to apply the same normalization used for candidates (the same `scope_prefix` variable and path-normalization logic), and change any naive `startswith(prefix)` check to a safe strip that only removes the prefix at a directory boundary (e.g. accept only `prefix == path` or `path.startswith(prefix + os.sep)` after `os.path.normpath` normalization), so that entries in `ep_set` and `all_candidates` compare identical normalized file paths. Reuse the same normalization code used where candidates are processed so both sides stay consistent.
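A boundary-safe strip could look like this (the helper name is hypothetical; the PR's actual normalization code may differ):

```python
import os


def strip_scope_prefix(path: str, scope_prefix: str) -> str:
    """Strip scope_prefix from path only at a directory boundary,
    so a prefix of "src" never mangles "src2/main.py"."""
    norm_path = os.path.normpath(path).replace(os.sep, "/")
    norm_prefix = os.path.normpath(scope_prefix).replace(os.sep, "/")
    if norm_path == norm_prefix:
        return ""
    if norm_path.startswith(norm_prefix + "/"):
        return norm_path[len(norm_prefix) + 1 :]
    return norm_path
```

Applying the same helper both when building `ep_set` from `entryPoints` and when collecting `all_candidates` keeps the two sides comparing identical normalized paths.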
- Around lines 524-530: The broad `except Exception as e:` block that logs "Failed to get Supermodel analysis" catches everything, which can hide programming errors. Narrow it to the expected failures around the Supermodel analysis retrieval (`RuntimeError`, `OSError`, and `json.JSONDecodeError`), log the error (the block already has `instance_id`, `logger`, and `traceback` in scope), and let anything else (`TypeError`, `KeyError`, `AttributeError`, etc.) propagate so it surfaces during testing; if you must handle unexpected exceptions explicitly, re-raise them after logging.
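A sketch of the narrowed handler (the `fetch` callable stands in for the actual Supermodel retrieval call; names are illustrative):

```python
import json
import logging
import traceback

logger = logging.getLogger(__name__)


def get_analysis_or_none(instance_id: str, fetch):
    """Catch only expected failures; programming errors propagate.

    RuntimeError/OSError/JSONDecodeError cover API, I/O, and parse
    failures; TypeError, KeyError, etc. surface during testing.
    """
    try:
        return fetch(instance_id)
    except (RuntimeError, OSError, json.JSONDecodeError) as e:
        logger.error(
            "Failed to get Supermodel analysis for %s: %s\n%s",
            instance_id,
            e,
            traceback.format_exc(),
        )
        return None
```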
---
Nitpick comments:

In `src/mcpbr/benchmarks/supermodel/benchmark.py`:

- Around lines 84-162: The `filter_category` parameter in `load_tasks` actually filters on the task's `"language"` key, which is confusing. Either rename the parameter to `filter_language` everywhere (the `load_tasks` signature, any callers, CLI flags, and docs) and adjust local variable names (e.g. `category_set` -> `language_set`) so the intent matches the code, or add a real `"category"` field to the task dict (in the task-creation block where keys like `"language"` and `"description"` are set) and change the filter to check `t.get("category")`. Update corresponding tests and CLI callers accordingly.
In `src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:

- Around lines 199-232: The regex patterns `framework_names` and `framework_files` are compiled on every call to `parse_api_response`, which is wasteful. Move those `re.compile(...)` calls out of `parse_api_response` into module-level constants (e.g. `FRAMEWORK_NAMES`, `FRAMEWORK_FILES`) and update `parse_api_response` to reference those constants, so regex compilation happens once at import time.
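A sketch of the hoisted patterns (the regexes and the helper are illustrative stand-ins, not the plugin's actual patterns):

```python
import re

# Compiled once at import time instead of on every parse_api_response call.
FRAMEWORK_NAMES = re.compile(r"^(use[A-Z]\w*|get(Static|Server)Props)$")
FRAMEWORK_FILES = re.compile(r"(^|/)(pages|app)/.*\.(tsx?|jsx?)$")


def is_framework_symbol(name: str, file_path: str) -> bool:
    """parse_api_response would consult the module-level constants
    rather than rebuilding the patterns on each call."""
    return bool(FRAMEWORK_NAMES.match(name) or FRAMEWORK_FILES.search(file_path))
```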
In `src/mcpbr/config.py`:

- Around lines 877-913: Add strict validation for the `analysis_type` field: define a `VALID_ANALYSIS_TYPES` tuple containing `"dead-code"`, `"impact"`, `"test-coverage"`, and `"circular-deps"`, and add a field validator (named e.g. `validate_analysis_type`) decorated with `@field_validator("analysis_type")` and `@classmethod` that checks whether `analysis_type` is not `None` and is in `VALID_ANALYSIS_TYPES`, raising a `ValueError` with a helpful message if not. Mirror the style of the existing `validate_benchmark` implementation so typos like `"dead_code"` are caught at config validation time.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fe28a310-fdcb-490d-933c-5942ffdca6fe
📒 Files selected for processing (3)
- src/mcpbr/benchmarks/supermodel/benchmark.py
- src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
- src/mcpbr/config.py
Summary

- `SupermodelBenchmark` — a PR-based dead-code detection benchmark that uses GitHub PRs as ground truth and the Supermodel API for pre-computed dependency graph analysis
- `DeadCodeBenchmark` — a corpus-based variant for offline evaluation

New files

- `src/mcpbr/benchmarks/supermodel/` — full benchmark package
  - `benchmark.py` — SupermodelBenchmark class (PR-based GT extraction, Supermodel API, chunked analysis v3)
  - `api_client.py` — async polling client for the Supermodel API
  - `evaluation.py` — P/R/F1 scoring
  - `git_utils.py` — `clone_repo_at_commit`, `zip_repo` helpers
  - `endpoints/` — pluggable endpoint system (`dead_code` implemented; `impact`, `test-coverage`, `circular-deps` stubbed)
- `src/mcpbr/benchmarks/deadcode.py` — corpus-based `DeadCodeBenchmark` variant

Config fields added

Other changes

- `models.py` / `pricing.py` — add `claude-opus-4-6` and `claude-sonnet-4-6`
- `cli.py` — microsecond timestamps for output dirs (avoids collisions on concurrent runs)
- `docker_env.py` + `Dockerfile` — use MCR mirror for `python:3.11-slim` to work around Docker Hub rate limiting
- `pyproject.toml` — `S608` per-file ignore for `benchmarks/` (false positive on prompt template f-strings)

Test plan

- `python3 -m mcpbr run -c <config> --mcp-only -t directus_pr26311 --trial-mode -vv` completes without error
- `benchmark: "dead-code"` and `benchmark: "supermodel"` configs both validate

🤖 Generated with Claude Code