fix(supermodel): remove GT and analysis caching — always run fresh (#714) #9
Conversation
Caution: Review failed — the pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration: Organization UI | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (7)
Walkthrough

This PR removes ground truth and analysis caching from the Supermodel benchmark, shifting reliance to server-side idempotency-key deduplication. The API client is streamlined with direct API-key headers instead of temporary config files. Dead-code ground-truth extraction now applies feature-removal false-positive filtering to exclude symbols co-removed with consumers while retaining orphaned deletions. File-path handling differentiates between MCP and baseline runs, and evaluation metrics now support alive-code-based false-positive computation.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/mcpbr/harness.py (1)
277-281: Re-serialize extra eval fields before passing them through.
`eval_result` has already gone through `dict_to_namespace()`, so a nested extra metric like `{"precision": {"raw": 0.9}}` reaches this loop as nested `SimpleNamespace` objects. Copying that verbatim makes the new pass-through path non-JSON-safe the moment a benchmark returns structured extras.

♻️ Possible fix
```diff
+def _to_plain_json(value: Any) -> Any:
+    if isinstance(value, SimpleNamespace):
+        return {k: _to_plain_json(v) for k, v in vars(value).items()}
+    if isinstance(value, list):
+        return [_to_plain_json(v) for v in value]
+    return value
+
 # Pass through any extra fields (e.g. precision/recall from custom benchmarks)
 _known = {"resolved", "patch_applied", "fail_to_pass", "pass_to_pass", "error"}
 for k, v in vars(eval_result).items():
     if k not in _known and k not in data:
-        data[k] = v
+        data[k] = _to_plain_json(v)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/harness.py` around lines 277 - 281, The extra fields from eval_result are still Namespace/SimpleNamespace objects (because dict_to_namespace was applied) and must be converted to JSON-safe structures before adding to data; in the loop that iterates vars(eval_result) (the block referencing _known and for k,v in vars(eval_result).items()), replace the direct assignment data[k] = v with a JSON-safe serialization step: detect Namespace/SimpleNamespace (or other non-primitive types) and recursively convert them to dict/list/primitive types (e.g. via a small helper like namespace_to_dict or using json.dumps+json.loads with a default that calls vars()) so nested metrics (e.g. {"precision": {"raw": 0.9}}) become plain dicts/lists and are safe to pass through.
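A self-contained version of that conversion, runnable outside the harness (function and variable names mirror the suggested diff but are illustrative):

```python
from types import SimpleNamespace
from typing import Any


def to_plain_json(value: Any) -> Any:
    """Recursively convert SimpleNamespace trees into plain dicts/lists."""
    if isinstance(value, SimpleNamespace):
        return {k: to_plain_json(v) for k, v in vars(value).items()}
    if isinstance(value, list):
        return [to_plain_json(v) for v in value]
    return value


# A nested metric as it would look after dict_to_namespace():
metric = SimpleNamespace(precision=SimpleNamespace(raw=0.9), labels=["tp", "fp"])
plain = to_plain_json(metric)
```

After conversion, `plain` is `{"precision": {"raw": 0.9}, "labels": ["tp", "fp"]}` and survives a `json.dumps` round-trip, which is exactly the property the pass-through path needs.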
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 10-27: Add a new [Unreleased] CHANGELOG.md entry describing the
user-visible result-semantics change: note that Supermodel now can set fp_mode
to "vs alive set" and that the `resolved` field is no longer gated by recall
(i.e., it may be set regardless of recall), and indicate these affect run output
and result JSON; place this entry alongside the existing cache/GT fixes so users
see both behavioral changes and the caching removal.
In `@src/mcpbr/benchmarks/supermodel/api_client.py`:
- Around line 54-58: The current code appends the API key into the curl command
arguments (via the headers list and upload_cmd) exposing it in process argv;
replace the subprocess/curl usage with a Python HTTP client so the API key is
only set in an in-memory header. Locate the places constructing headers and
upload_cmd and the polling loop that builds curl commands (references: headers,
upload_cmd, zip_path, url, and the polling logic around line ~105) and
reimplement the POST file upload and subsequent status poll using an async HTTP
client (httpx or aiohttp) or synchronous requests, setting "X-Api-Key" in the
request headers rather than any command-line, and remove all
subprocess.Popen/args construction that included the key.
In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 607-627: The current evaluate() implementation reads entryPoints
from a workspace file under env.host_workdir (via analysis_path/hidden_path and
self._endpoint.analysis_filename) which is agent-writable and truncated; instead
load the full, trusted alive set from host-side state prepared in
create_environment() (e.g. store the complete entryPoints there) and use that
when computing metrics. Concretely: stop opening files under env.host_workdir in
evaluate(), remove the analysis_path/hidden_path logic, read alive_code from the
host-side attribute you populate in create_environment() (e.g.
self._analysis_data or a new self._endpoint.host_alive_entrypoints) and pass
that alive_code into compute_prf1(agent_findings, ground_truth, key_fields,
alive_code=alive_code).
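The host-side-state pattern the prompt describes, in miniature; the class and attribute names below are hypothetical, the point is only that evaluation never re-reads agent-writable workspace files:

```python
class DeadCodeEndpoint:
    """Sketch: keep the trusted alive set in host-side state."""

    def __init__(self) -> None:
        self.host_alive_entrypoints: list[dict] = []

    def create_environment(self, analysis_data: dict) -> None:
        # Capture the full, trusted entryPoints on the host before any
        # (agent-writable, possibly truncated) workspace copy is written.
        self.host_alive_entrypoints = list(analysis_data.get("entryPoints", []))

    def alive_code_for_eval(self) -> list[dict]:
        # evaluate() would read this instead of files under env.host_workdir,
        # then call compute_prf1(..., alive_code=self.alive_code_for_eval()).
        return self.host_alive_entrypoints


ep = DeadCodeEndpoint()
ep.create_environment({"entryPoints": [{"symbol": "main"}], "other": "ignored"})
```

The agent can rewrite anything in its workspace; only this host-side copy feeds the metric.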
In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:
- Around line 19-21: extract_ground_truth() now emits interfaces/types/enums but
baseline_prompt still instructs agents to exclude them and
SupermodelBenchmark.create_environment() strips those candidates, causing GT vs
prompt/filter mismatch; either remove those symbols from extract_ground_truth()
or (recommended) update baseline_prompt to permit interfaces/types/enums and
stop stripping them in SupermodelBenchmark.create_environment() so GT items are
attainable — specifically, modify baseline_prompt text to remove the “Do NOT
include interface/type/enum” wording and change the filtering logic in
SupermodelBenchmark.create_environment() that currently removes
interface/type/enum candidates (and any regex/filters tied to patterns like
r"^-\s*export\s+interface\s+(\w+)" / r"^-\s*export\s+type\s+(\w+)\s*[={<]" /
r"^-\s*export\s+(?:const\s+)?enum\s+(\w+)") so those symbols are retained when
composing the MCP analysis payload.
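For reference, a quick check of what those three quoted patterns actually strip; the diff lines below are made-up examples:

```python
import re

# The three candidate-stripping patterns quoted above, verbatim.
PATTERNS = [
    re.compile(r"^-\s*export\s+interface\s+(\w+)"),
    re.compile(r"^-\s*export\s+type\s+(\w+)\s*[={<]"),
    re.compile(r"^-\s*export\s+(?:const\s+)?enum\s+(\w+)"),
]


def stripped_symbols(diff_lines: list[str]) -> list[str]:
    """Return symbol names these filters would currently remove."""
    hits = []
    for line in diff_lines:
        for pat in PATTERNS:
            m = pat.match(line)
            if m:
                hits.append(m.group(1))
    return hits


diff = [
    "- export interface WorkflowSettings {",
    "- export type NodeId = string",
    "- export const enum RunState {",
    "- export function helper() {",
]
print(stripped_symbols(diff))  # ['WorkflowSettings', 'NodeId', 'RunState']
```

Interface, type, and enum deletions all match and are dropped; function deletions pass through, which is the GT-vs-filter mismatch the comment flags.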
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e666fe8d-40d5-4333-8c95-cb80378d77ce
📒 Files selected for processing (8)
- CHANGELOG.md
- src/mcpbr/benchmarks/supermodel/api_client.py
- src/mcpbr/benchmarks/supermodel/benchmark.py
- src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
- src/mcpbr/benchmarks/supermodel/evaluation.py
- src/mcpbr/benchmarks/supermodel/git_utils.py
- src/mcpbr/harness.py
- tests/test_supermodel_dead_code.py
```markdown
### Fixed

- **Remove GT and analysis caching from SupermodelBenchmark** (supermodeltools/supermodel-public-api#714):
  Both caches had no invalidation mechanism, causing stale data to persist silently across runs.
  The GT cache bypassed all fixes applied to `extract_ground_truth` (FP filters, pattern additions).
  The analysis cache was keyed by zip hash, so server-side idempotency key version bumps did not
  bust it — the server was never reached and old results were served indefinitely. Both caches are
  now removed. GT extraction is a single GitHub API call (cheap). The Supermodel API handles
  server-side deduplication via the idempotency key. Also removes the `DEFAULT_GT_DIR` constant
  and `ground_truth_dir` constructor parameter, which existed solely to support the caches.

- **Dead code benchmark: filter feature-removal false positives from ground truth** (supermodeltools/supermodel-public-api#714):
  The ground truth extractor now applies the existing `_is_feature_removal_fp` filter (which
  was implemented but never called). Symbols deleted in a PR that are also imported by other
  files deleted in the same PR are excluded from GT — they were live code co-removed with
  their consumers, not dead code. Genuinely orphaned symbols with no deleted importer are
  kept. This fixes 0-recall scores for PRs like n8n #23572 and prisma #28485 where whole
  files were removed as part of a feature deletion.
```
Document the result-semantics change too.
This PR also changes what Supermodel reports back to users: `fp_mode` can now be `"vs alive set"`, and `resolved` is no longer recall-gated. That shows up in run output and result JSON, so it should have its own `[Unreleased]` entry alongside the cache/GT fixes.
Based on learnings "Applies to CHANGELOG.md : Update CHANGELOG.md with ALL user-visible changes: new features, bug fixes, breaking changes, deprecations, security fixes, performance improvements, and CLI/configuration changes".
```diff
 if api_key:
-    with tempfile.NamedTemporaryFile(
-        mode="w", suffix=".cfg", prefix="mcpbr_curl_", delete=False
-    ) as api_key_fd:
-        api_key_fd.write(f'header = "X-Api-Key: {api_key}"\n')
-        api_key_config_path = api_key_fd.name
-    os.chmod(api_key_config_path, 0o600)
+    headers.extend(["-H", f"X-Api-Key: {api_key}"])

 # Initial request with file upload
 upload_cmd = ["curl", "-s", "-X", "POST", url, "-F", f"file=@{zip_path}", *headers]
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
import os
import subprocess
import time

proc = subprocess.Popen(
    ["python", "-c", "import time; time.sleep(5)", "X-Api-Key: demo-secret"]
)
time.sleep(1)
os.system(f"ps -o command= -p {proc.pid}")
proc.terminate()
proc.wait()
PY
```

Repository: supermodeltools/mcpbr — Length of output: 85

🏁 Script executed:

```bash
find src -name "api_client.py" -type f
```

Repository: supermodeltools/mcpbr — Length of output: 110

🏁 Script executed:

```bash
cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '45,70p'
```

Repository: supermodeltools/mcpbr — Length of output: 1082

🏁 Script executed:

```bash
cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '1,65p'
```

Repository: supermodeltools/mcpbr — Length of output: 2434

🏁 Script executed:

```bash
cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '65,100p'
```

Repository: supermodeltools/mcpbr — Length of output: 1640

🏁 Script executed:

```bash
cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '97,130p'
```

Repository: supermodeltools/mcpbr — Length of output: 1459
Move API key out of curl command-line arguments.
Right now, the API key gets passed to curl as a `-H` flag (lines 54-55, repeated at line 105 in the polling loop). That means the key ends up in the process command line. While the subprocess is running, anything that inspects running processes — like `ps`, system monitoring tools, or CI/job logs — can see your secret in plain text.
Here's the simple reason why: when you call `subprocess.Popen(["curl", "-s", ..., f"X-Api-Key: {api_key}"])`, the API key becomes part of the process's argv. On Linux, anyone with access can read `/proc/[pid]/cmdline`. On any Unix system, `ps aux` shows it. That's a security regression.
Fix options:

1. **curl config file**: Write headers to a temp config file with restrictive permissions, then pass `--config` to curl. The key stays in the file, not argv.
2. **stdin**: Use `echo "header: value" | curl --config -` to feed config from stdin.
3. **Python HTTP client**: Use `requests` or `httpx` instead of subprocess — the key never touches argv.

I'd suggest option 3 (Python HTTP client) since you're already in async context. Otherwise, option 1 (temp config file) is safer than option 2.
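Option 2 can also be driven from Python while keeping the subprocess/curl approach: build the config line in memory and feed it to curl's stdin, so argv stays clean. A sketch (the URL and paths are placeholders; the command is constructed but not executed here):

```python
def build_curl_upload(url: str, zip_path: str, api_key: str) -> tuple[list[str], str]:
    """Build a curl invocation whose secret travels via stdin, not argv."""
    # curl reads extra options from stdin when --config is given "-".
    config = f'header = "X-Api-Key: {api_key}"\n'
    cmd = [
        "curl", "-s", "--config", "-",
        "-X", "POST", url,
        "-F", f"file=@{zip_path}",
    ]
    # To actually run it (not done in this sketch):
    #   subprocess.run(cmd, input=config, text=True, check=True)
    return cmd, config


cmd, config = build_curl_upload(
    "https://api.example.com/upload", "bundle.zip", "demo-secret"
)
```

Only the harmless parts of the request appear in `cmd`; the header carrying the key exists solely in the `config` string piped over stdin.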
Benchmark evaluate() methods can return arbitrary fields beyond the standard resolved/patch_applied/fail_to_pass set. Pass these through to the stored result dict so metrics like precision, recall, f1_score, true_positives, etc. are captured in incremental_results.jsonl. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e filter

Two changes to eliminate misleading precision numbers:

1. `compute_prf1` now accepts an `alive_code` list. When provided, false positives are pred ∩ alive_set (agent flagged something confirmed alive) rather than pred − gt_set (agent found real dead code the PR didn't happen to remove). `evaluate()` reads entryPoints from the workspace analysis file to populate it.
2. The `enhanced_prompt_v2` filter script now skips medium/low confidence candidates. All GT items across sampled tasks are high-confidence; medium/low are noise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…de label

Three changes to make benchmark results credible for comparison:

1. benchmark.py: baseline contamination fix — the analysis file is written to a hidden path (`.supermodel_dead_code_analysis.json`) when `is_mcp=False` so the baseline agent cannot see pre-computed results. `evaluate()` checks both paths. An `fp_mode` field is added to the result dict to clarify which FP definition was used.
2. harness.py: duck-typed `is_mcp` — uses `inspect.signature` to pass `is_mcp` only when the benchmark's `create_environment` supports it, so other benchmarks remain unaffected without Protocol changes.
3. evaluation.py: fallback to standard FP when `alive_set` is empty — prevents trivial precision=1.0 on analysis failure (empty entryPoints returns fp=0, which makes the metric meaningless).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures the post-deploy benchmark run gets fresh dead-code analysis results from the API rather than serving cached v2 responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ruth (#714)

`_is_feature_removal_fp` and `_parse_deleted_imports` were implemented but never called. `extract_ground_truth` now applies them to remove symbols from GT that were imported by other files also deleted in the same PR — these are feature-removal co-deletions, not dead code, so no static analysis tool would ever report them pre-merge. Genuinely orphaned symbols (no deleted importer) are correctly kept in GT and are detectable by the API when analysing the pre-merge commit.

Adds 16 unit tests covering:
- `_parse_deleted_imports`: single-line, multi-line, and default import forms
- `_is_feature_removal_fp`: FP detection keyed to file basename to avoid spurious suppression of unrelated same-named symbols
- Integration: full filter pipeline with the n8n (#23572) and prisma (#28485) benchmark cases from the issue

Fixes supermodeltools/supermodel-public-api#714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
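The co-deletion filter described above, reduced to a toy version; the real `_parse_deleted_imports` handles more import forms than this sketch:

```python
import re
from pathlib import PurePosixPath


def parse_deleted_imports(deleted_file_diffs: dict[str, str]) -> dict[str, set[str]]:
    """Map each deleted file to the named imports it contained (toy parser)."""
    pattern = re.compile(r"import\s*{([^}]*)}\s*from")
    imports = {}
    for path, text in deleted_file_diffs.items():
        names: set[str] = set()
        for m in pattern.finditer(text):
            names.update(n.strip() for n in m.group(1).split(",") if n.strip())
        imports[path] = names
    return imports


def is_feature_removal_fp(symbol: str, defining_file: str,
                          deleted_imports: dict[str, set[str]]) -> bool:
    """True if some *other* deleted file imported the symbol: it was live code
    co-removed with its consumer, not dead code. Keying on the defining file's
    basename avoids suppressing an unrelated same-named symbol."""
    for path, names in deleted_imports.items():
        if PurePosixPath(path).name != PurePosixPath(defining_file).name and symbol in names:
            return True
    return False


deleted = {
    "src/feature/consumer.ts": "import { PythonSandbox } from './sandbox';",
    "src/feature/sandbox.ts": "export class PythonSandbox {}",
}
imports = parse_deleted_imports(deleted)
```

With this input, `PythonSandbox` is flagged as a feature-removal FP (its consumer was deleted in the same PR), while a symbol with no deleted importer stays in GT.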
…714)

Both caches had no invalidation mechanism:

GT cache (`~/.cache/mcpbr/supermodel_ground_truth/dead_code_*.json`):
- Written once, read forever; bypassed every fix applied to `extract_ground_truth` (FP filters added in c579a18/918e3a1, pattern additions for interface/type/enum, etc.)
- n8n and prisma GT files still contained `PythonSandbox`/`serializeDatasources` as benchmark targets even after the FP filter was implemented

Analysis cache (`*_analysis_{zip_hash}.json`):
- Keyed by zip hash; server-side idempotency key bumps (v1→v2→v3) did not invalidate it — the server was never reached, old results served
- The n8n analysis cache showed 3,834 results with zero Pyodide hits, meaning it was computed from the post-merge commit (before the pre-merge fix) and would serve that result forever on re-runs

Fix: remove both caches entirely. GT extraction is one GitHub API call. The Supermodel API handles server-side dedup via the idempotency key. Also removes the `DEFAULT_GT_DIR` constant and `ground_truth_dir` constructor param (existed solely to support the now-deleted caches).

Fixes supermodeltools/supermodel-public-api#714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Completes the local-cache purge started in the previous commit. The cached_analysis field allowed a task YAML to point at a pre-fetched analysis file on disk; this path bypassed the API entirely and shared all the same stale-data problems as the zip-hash cache. Now create_environment always calls _get_analysis directly. The Supermodel API handles deduplication server-side via the idempotency key. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
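For illustration, one common way to build such an idempotency key is to hash the upload content together with a scheme version, so a version bump (v1→v2→v3) changes every key and forces fresh server-side computation. This is a hedged sketch, not the actual key scheme used by the Supermodel API:

```python
import hashlib


def idempotency_key(zip_bytes: bytes, version: str = "v3") -> str:
    """Derive a deterministic idempotency key from content plus scheme version."""
    # Same content + same version -> same key -> server-side dedup.
    # Bumping the version changes every key, invalidating prior results.
    digest = hashlib.sha256(zip_bytes).hexdigest()[:16]
    return f"{version}-{digest}"


k1 = idempotency_key(b"payload", "v2")
k2 = idempotency_key(b"payload", "v3")
```

Because the key is derived server-visibly per request, the client needs no local cache at all: repeated uploads of the same content under the same version deduplicate on the server.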
Force-pushed: db54b68 to 731cea6
Summary

Removes both local caches from `SupermodelBenchmark`. They had no invalidation mechanism and silently served stale data through every fix we applied, which is the actual root cause of the 0-recall scores in supermodeltools/supermodel-public-api#714.

Root cause (confirmed)

GT cache (`~/.cache/mcpbr/supermodel_ground_truth/dead_code_*.json`):
- `if gt_path.exists(): return` meant the filter added in 918e3a1 was bypassed for every previously cached task
- `dead_code_n8n_pr23572.json` still had `PythonSandbox` (should be filtered); `dead_code_prisma_pr28485.json` still had `serializeDatasources` (same)

Analysis cache (`*_analysis_{zip_hash}.json`):
- `n8n_pr23572_analysis_7e9bae86b22f.json` had 3,834 results with zero Pyodide/PythonSandbox hits, indicating it was computed from the post-merge commit (before the pre-merge fix in e35a179) and served that result on every subsequent run

What this removes
- `DEFAULT_GT_DIR` constant
- `ground_truth_dir` constructor parameter
- `self.gt_dir` and its `mkdir` call
- `_load_ground_truth`
- Cache logic in `_get_analysis` (including the `hashlib` import)

Why this is safe
- GT extraction is a single `gh api` call to fetch a PR diff — takes ~1s, produces correct results every time

Test plan
- `uv run pytest -m "not integration" --ignore=tests/test_benchmarks.py -q` passes

Fixes supermodeltools/supermodel-public-api#714
🤖 Generated with Claude Code