
fix(supermodel): remove GT and analysis caching — always run fresh (#714) #9

Merged
greynewell merged 7 commits into main from feat/supermodel-dead-code-fixes
Mar 26, 2026

Conversation

@greynewell (Contributor) commented Mar 26, 2026

Summary

Removes both local caches from SupermodelBenchmark. They had no invalidation mechanism and silently served stale data through every fix we applied, which is the actual root cause of the 0-recall scores in supermodeltools/supermodel-public-api#714.

Root cause (confirmed)

GT cache (~/.cache/mcpbr/supermodel_ground_truth/dead_code_*.json):

  • Written once, read forever — line 174: if gt_path.exists(): return
  • The FP filter added in 918e3a1 was bypassed for every previously cached task
  • Confirmed: dead_code_n8n_pr23572.json still had PythonSandbox (should be filtered), dead_code_prisma_pr28485.json still had serializeDatasources (same)
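The failure mode can be shown with a minimal sketch (a hypothetical simplification, not the removed mcpbr code):

```python
import json
import tempfile
from pathlib import Path

def load_ground_truth(gt_path: Path, extract_fn) -> list[str]:
    # Write-once cache: once the file exists it is returned forever,
    # so later fixes to extract_fn (e.g. FP filters) are silently bypassed.
    if gt_path.exists():  # no invalidation check of any kind
        return json.loads(gt_path.read_text())
    gt = extract_fn()
    gt_path.write_text(json.dumps(gt))
    return gt

gt_file = Path(tempfile.mkdtemp()) / "dead_code_demo.json"

# First run caches GT produced by the unfiltered extractor...
v1 = load_ground_truth(gt_file, lambda: ["PythonSandbox", "realDeadSymbol"])
# ...and a later run with the fixed extractor still gets the stale file back.
v2 = load_ground_truth(gt_file, lambda: ["realDeadSymbol"])
assert v2 == ["PythonSandbox", "realDeadSymbol"]
```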

Analysis cache (*_analysis_{zip_hash}.json):

  • Keyed by zip hash of the repo, not by idempotency key version
  • Bumping the server key from v1→v2→v3 forces fresh server analysis but the local cache is never checked against the key — the server is never reached
  • Confirmed: n8n_pr23572_analysis_7e9bae86b22f.json had 3,834 results with zero Pyodide/PythonSandbox hits, indicating it was computed from the post-merge commit (before the pre-merge fix in e35a179) and served that result on every subsequent run
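The key mismatch can be sketched as follows (hypothetical names; the real cache used the zip hash in the same way):

```python
import hashlib

IDEMPOTENCY_KEY = "supermodel_dead_code_v2"  # hypothetical server-side key

def local_cache_name(zip_bytes: bytes) -> str:
    # Cache key is derived only from the zip content; the idempotency
    # key version is not mixed in, so bumping it never busts this cache.
    zip_hash = hashlib.sha256(zip_bytes).hexdigest()[:12]
    return f"analysis_{zip_hash}.json"

repo_zip = b"identical repo contents"
name_v2 = local_cache_name(repo_zip)
IDEMPOTENCY_KEY = "supermodel_dead_code_v3"  # server-side bump has no effect
name_v3 = local_cache_name(repo_zip)
assert name_v2 == name_v3  # stale local result served; server never reached
```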

What this removes

  • DEFAULT_GT_DIR constant
  • ground_truth_dir constructor parameter
  • self.gt_dir and its mkdir call
  • GT cache read/write in _load_ground_truth
  • Analysis cache read/write in _get_analysis (including hashlib import)
  • Stale cache files deleted from disk

Why this is safe

  • GT extraction: single gh api call to fetch a PR diff — takes ~1s, produces correct results every time
  • Analysis: Supermodel API handles deduplication server-side via the idempotency key; no local caching needed

Test plan

  • uv run pytest -m "not integration" --ignore=tests/test_benchmarks.py -q passes
  • Re-run n8n and prisma tasks — expect non-zero recall now that stale GT and stale analysis are both gone

Fixes supermodeltools/supermodel-public-api#714

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Extended dead code detection to recognize additional exported symbol types including interfaces and enum declarations.
    • Added alive code set analysis for more precise false-positive evaluation metrics.
  • Bug Fixes

    • Fixed feature-removal false-positive filtering in ground truth extraction.
    • Improved accuracy in identifying truly orphaned deleted symbols.
  • Refactor

    • Removed local ground truth and analysis caching; analysis now runs fresh with server-side deduplication.

coderabbitai bot commented Mar 26, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between db54b68 and 731cea6.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • src/mcpbr/benchmarks/supermodel/api_client.py
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/benchmarks/supermodel/evaluation.py
  • src/mcpbr/harness.py
  • tests/test_supermodel_dead_code.py

Walkthrough

This PR removes ground truth and analysis caching from the Supermodel benchmark, shifting reliance to server-side idempotency-key deduplication. The API client is streamlined with direct API-key headers instead of temporary config files. Dead-code ground-truth extraction now applies feature-removal false-positive filtering to exclude symbols co-removed with consumers while retaining orphaned deletions. File-path handling differentiates between MCP and baseline runs, and evaluation metrics now support alive-code-based false-positive computation.

Changes

Cohort / File(s) — Summary

  • API Client Updates — src/mcpbr/benchmarks/supermodel/api_client.py
    Idempotency key bumped to v3, API key now passed directly via -H headers instead of a temp config file, JSON parsing simplified by removing try/except wrapping, and return type narrowed to dict[str, Any].
  • Benchmark Core Logic — src/mcpbr/benchmarks/supermodel/benchmark.py
    Removed GT/analysis caching (DEFAULT_GT_DIR, gt_dir, _load_ground_truth(), zip-hash caching); added an is_mcp parameter to create_environment() so analysis JSON is written to a visible path (MCP) or a hidden path (baseline); updated evaluate() to load entryPoints from the analysis and pass alive_code to compute_prf1(); forced resolved=True unconditionally and added fp_mode to the output.
  • Dead Code Endpoint — src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
    Expanded TypeScript export detection (added interface, type, enum); narrowed enhanced_prompt_v2 filtering to require confidence == "high"; moved feature-removal FP filtering after scope_prefix filtering.
  • Evaluation Metrics — src/mcpbr/benchmarks/supermodel/evaluation.py
    Added an optional alive_code parameter to compute_prf1(); when provided, false positives are computed as overlap with the alive set rather than set difference from GT; falls back to the standard FP definition when the alive set is empty.
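A rough sketch of that alive-set FP rule (hypothetical simplification: symbols are matched as bare strings here, whereas the real compute_prf1 matches findings on key fields):

```python
def compute_prf1(pred, gt, alive_code=None):
    pred_set, gt_set = set(pred), set(gt)
    tp = len(pred_set & gt_set)
    if alive_code:
        # FP = agent flagged a symbol the analysis confirmed alive.
        fp = len(pred_set & set(alive_code))
    else:
        # Fallback (also hit when the alive set is empty): classic FP,
        # which avoids a trivial precision=1.0 when analysis fails.
        fp = len(pred_set - gt_set)
    fn = len(gt_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# extraDead is real dead code the PR didn't happen to remove: not punished as FP.
p, r, f1 = compute_prf1(
    ["deadA", "aliveX", "extraDead"], ["deadA", "deadB"], alive_code=["aliveX"]
)
assert (p, r) == (0.5, 0.5)
```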
  • Harness Integration — src/mcpbr/harness.py
    Added runtime inspection of the create_environment signature to conditionally pass the is_mcp parameter; extended agent_result_to_dict() to pass additional attributes from eval_result through into the serialized output.
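The duck-typed dispatch can be sketched like this (hypothetical benchmark classes; the real harness inspects the benchmark object the same way):

```python
import inspect

class LegacyBenchmark:
    def create_environment(self, task):
        return f"env({task})"

class SupermodelBenchmark:
    def create_environment(self, task, is_mcp=False):
        return f"env({task}, mcp={is_mcp})"

def make_env(benchmark, task, is_mcp):
    # Pass is_mcp only when the benchmark's create_environment accepts it,
    # so existing benchmarks keep working without any Protocol change.
    params = inspect.signature(benchmark.create_environment).parameters
    if "is_mcp" in params:
        return benchmark.create_environment(task, is_mcp=is_mcp)
    return benchmark.create_environment(task)

assert make_env(LegacyBenchmark(), "t1", True) == "env(t1)"
assert make_env(SupermodelBenchmark(), "t1", True) == "env(t1, mcp=True)"
```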
  • Tests & Documentation — tests/test_supermodel_dead_code.py, CHANGELOG.md
    New test suite validating dead-code ground-truth extraction (import parsing, feature-removal FP classification); changelog entries documenting the caching removal and the FP filtering fix.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

No caches cluttering the way, 🚀
The server dupes what we once saved all day,
Dead symbols now filtered with care—
Feature-removed ghosts banished from the air! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 42.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately summarizes the main change: removing GT and analysis caching to fix the stale-data issues causing 0-recall scores.

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/mcpbr/harness.py (1)

277-281: Re-serialize extra eval fields before passing them through.

eval_result has already gone through dict_to_namespace(), so a nested extra metric like {"precision": {"raw": 0.9}} reaches this loop as nested SimpleNamespace objects. Copying that verbatim makes the new pass-through path non-JSON-safe the moment a benchmark returns structured extras.

♻️ Possible fix
+def _to_plain_json(value: Any) -> Any:
+    if isinstance(value, SimpleNamespace):
+        return {k: _to_plain_json(v) for k, v in vars(value).items()}
+    if isinstance(value, list):
+        return [_to_plain_json(v) for v in value]
+    return value
+
         # Pass through any extra fields (e.g. precision/recall from custom benchmarks)
         _known = {"resolved", "patch_applied", "fail_to_pass", "pass_to_pass", "error"}
         for k, v in vars(eval_result).items():
             if k not in _known and k not in data:
-                data[k] = v
+                data[k] = _to_plain_json(v)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/harness.py` around lines 277 - 281, The extra fields from
eval_result are still Namespace/SimpleNamespace objects (because
dict_to_namespace was applied) and must be converted to JSON-safe structures
before adding to data; in the loop that iterates vars(eval_result) (the block
referencing _known and for k,v in vars(eval_result).items()), replace the direct
assignment data[k] = v with a JSON-safe serialization step: detect
Namespace/SimpleNamespace (or other non-primitive types) and recursively convert
them to dict/list/primitive types (e.g. via a small helper like
namespace_to_dict or using json.dumps+json.loads with a default that calls
vars()) so nested metrics (e.g. {"precision": {"raw": 0.9}}) become plain
dicts/lists and are safe to pass through.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 10-27: Add a new [Unreleased] CHANGELOG.md entry describing the
user-visible result-semantics change: note that Supermodel now can set fp_mode
to "vs alive set" and that the `resolved` field is no longer gated by recall
(i.e., it may be set regardless of recall), and indicate these affect run output
and result JSON; place this entry alongside the existing cache/GT fixes so users
see both behavioral changes and the caching removal.

In `@src/mcpbr/benchmarks/supermodel/api_client.py`:
- Around line 54-58: The current code appends the API key into the curl command
arguments (via the headers list and upload_cmd) exposing it in process argv;
replace the subprocess/curl usage with a Python HTTP client so the API key is
only set in an in-memory header. Locate the places constructing headers and
upload_cmd and the polling loop that builds curl commands (references: headers,
upload_cmd, zip_path, url, and the polling logic around line ~105) and
reimplement the POST file upload and subsequent status poll using an async HTTP
client (httpx or aiohttp) or synchronous requests, setting "X-Api-Key" in the
request headers rather than any command-line, and remove all
subprocess.Popen/args construction that included the key.

In `@src/mcpbr/benchmarks/supermodel/benchmark.py`:
- Around line 607-627: The current evaluate() implementation reads entryPoints
from a workspace file under env.host_workdir (via analysis_path/hidden_path and
self._endpoint.analysis_filename) which is agent-writable and truncated; instead
load the full, trusted alive set from host-side state prepared in
create_environment() (e.g. store the complete entryPoints there) and use that
when computing metrics. Concretely: stop opening files under env.host_workdir in
evaluate(), remove the analysis_path/hidden_path logic, read alive_code from the
host-side attribute you populate in create_environment() (e.g.
self._analysis_data or a new self._endpoint.host_alive_entrypoints) and pass
that alive_code into compute_prf1(agent_findings, ground_truth, key_fields,
alive_code=alive_code).

In `@src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py`:
- Around line 19-21: extract_ground_truth() now emits interfaces/types/enums but
baseline_prompt still instructs agents to exclude them and
SupermodelBenchmark.create_environment() strips those candidates, causing GT vs
prompt/filter mismatch; either remove those symbols from extract_ground_truth()
or (recommended) update baseline_prompt to permit interfaces/types/enums and
stop stripping them in SupermodelBenchmark.create_environment() so GT items are
attainable — specifically, modify baseline_prompt text to remove the “Do NOT
include interface/type/enum” wording and change the filtering logic in
SupermodelBenchmark.create_environment() that currently removes
interface/type/enum candidates (and any regex/filters tied to patterns like
r"^-\s*export\s+interface\s+(\w+)" / r"^-\s*export\s+type\s+(\w+)\s*[={<]" /
r"^-\s*export\s+(?:const\s+)?enum\s+(\w+)") so those symbols are retained when
composing the MCP analysis payload.


📥 Commits

Reviewing files that changed from the base of the PR and between db40a67 and 87ad956.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • src/mcpbr/benchmarks/supermodel/api_client.py
  • src/mcpbr/benchmarks/supermodel/benchmark.py
  • src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py
  • src/mcpbr/benchmarks/supermodel/evaluation.py
  • src/mcpbr/benchmarks/supermodel/git_utils.py
  • src/mcpbr/harness.py
  • tests/test_supermodel_dead_code.py

Comment on lines +10 to +27
### Fixed

- **Remove GT and analysis caching from SupermodelBenchmark** (supermodeltools/supermodel-public-api#714):
Both caches had no invalidation mechanism, causing stale data to persist silently across runs.
The GT cache bypassed all fixes applied to `extract_ground_truth` (FP filters, pattern additions).
The analysis cache was keyed by zip hash, so server-side idempotency key version bumps did not
bust it — the server was never reached and old results were served indefinitely. Both caches are
now removed. GT extraction is a single GitHub API call (cheap). The Supermodel API handles
server-side deduplication via the idempotency key. Also removes the `DEFAULT_GT_DIR` constant
and `ground_truth_dir` constructor parameter, which existed solely to support the caches.

- **Dead code benchmark: filter feature-removal false positives from ground truth** (supermodeltools/supermodel-public-api#714):
The ground truth extractor now applies the existing `_is_feature_removal_fp` filter (which
was implemented but never called). Symbols deleted in a PR that are also imported by other
files deleted in the same PR are excluded from GT — they were live code co-removed with
their consumers, not dead code. Genuinely orphaned symbols with no deleted importer are
kept. This fixes 0-recall scores for PRs like n8n #23572 and prisma #28485 where whole
files were removed as part of a feature deletion.

⚠️ Potential issue | 🟡 Minor

Document the result-semantics change too.

This PR also changes what Supermodel reports back to users: fp_mode can now be "vs alive set", and resolved is no longer recall-gated. That shows up in run output and result JSON, so it should have its own [Unreleased] entry alongside the cache/GT fixes.

Based on learnings "Applies to CHANGELOG.md : Update CHANGELOG.md with ALL user-visible changes: new features, bug fixes, breaking changes, deprecations, security fixes, performance improvements, and CLI/configuration changes".


Comment on lines 54 to 58
if api_key:
with tempfile.NamedTemporaryFile(
mode="w", suffix=".cfg", prefix="mcpbr_curl_", delete=False
) as api_key_fd:
api_key_fd.write(f'header = "X-Api-Key: {api_key}"\n')
api_key_config_path = api_key_fd.name
os.chmod(api_key_config_path, 0o600)
headers.extend(["-H", f"X-Api-Key: {api_key}"])

# Initial request with file upload
upload_cmd = ["curl", "-s", "-X", "POST", url, "-F", f"file=@{zip_path}", *headers]

⚠️ Potential issue | 🟠 Major



Move API key out of curl command-line arguments.

Right now, the API key gets passed to curl as a -H flag (lines 54-55, repeated at line 105 in the polling loop). That means the key ends up in the process command-line. While the subprocess is running, anything that inspects running processes—like ps, system monitoring tools, or CI/job logs—can see your secret in plain text.

Here's the simple reason why: when you call subprocess.Popen(["curl", "-s", ..., f"X-Api-Key: {api_key}"]), the API key becomes part of the process's argv. On Linux, anyone with access can read /proc/[pid]/cmdline. On any Unix system, ps aux shows it. That's a security regression.

Fix options:

  1. curl config file: Write headers to a temp config file with restrictive permissions, then pass --config to curl. The key stays in the file, not argv.
  2. stdin: Use echo "header: value" | curl --config - to feed config from stdin.
  3. Python HTTP client: Use requests or httpx instead of subprocess—the key never touches argv.

I'd suggest option 3 (Python HTTP client) since you're already in async context. Otherwise, option 1 (temp config file) is safer than option 2.
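A dependency-free sketch of the principle behind the fix (the comment suggests httpx/requests; stdlib urllib.request is shown here just to illustrate, and the URL is a placeholder):

```python
import urllib.request

def build_upload_request(url: str, api_key: str) -> urllib.request.Request:
    # The key lives only in the in-memory header dict of the Request
    # object; it never becomes part of any process's argv.
    req = urllib.request.Request(url, method="POST")
    req.add_header("X-Api-Key", api_key)
    return req

req = build_upload_request("https://api.example.com/v1/analyze", "demo-secret")
# urllib normalizes header names via str.capitalize()
assert req.get_header("X-api-key") == "demo-secret"
```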


greynewell and others added 7 commits March 26, 2026 13:22
Benchmark evaluate() methods can return arbitrary fields beyond the
standard resolved/patch_applied/fail_to_pass set. Pass these through
to the stored result dict so metrics like precision, recall, f1_score,
true_positives, etc. are captured in incremental_results.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e filter

Two changes to eliminate misleading precision numbers:

1. compute_prf1 now accepts an alive_code list. When provided, false positives
   are pred ∩ alive_set (agent flagged something confirmed alive) rather than
   pred − gt_set (agent found real dead code the PR didn't happen to remove).
   evaluate() reads entryPoints from the workspace analysis file to populate it.

2. enhanced_prompt_v2 filter script now skips medium/low confidence candidates.
   All GT items across sampled tasks are high-confidence; medium/low are noise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…de label

Three changes to make benchmark results credible for comparison:

1. benchmark.py: baseline contamination fix — analysis file written to hidden
   path (.supermodel_dead_code_analysis.json) when is_mcp=False so the baseline
   agent cannot see pre-computed results. evaluate() checks both paths.
   fp_mode field added to result dict to clarify which FP definition was used.

2. harness.py: duck-typed is_mcp — uses inspect.signature to pass is_mcp only
   when the benchmark's create_environment supports it, so other benchmarks
   remain unaffected without Protocol changes.

3. evaluation.py: fallback to standard FP when alive_set is empty — prevents
   trivial precision=1.0 on analysis failure (empty entryPoints returns fp=0
   which makes the metric meaningless).
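The visible/hidden path split from point 1 can be sketched as (the filename constant here is hypothetical):

```python
from pathlib import Path

ANALYSIS_FILENAME = "supermodel_dead_code_analysis.json"  # hypothetical name

def analysis_path(workdir: Path, is_mcp: bool) -> Path:
    # MCP runs get a visible file; baseline runs get a dotfile so the
    # baseline agent cannot read pre-computed results out of the workspace.
    name = ANALYSIS_FILENAME if is_mcp else f".{ANALYSIS_FILENAME}"
    return workdir / name

assert analysis_path(Path("/ws"), True).name == ANALYSIS_FILENAME
assert analysis_path(Path("/ws"), False).name.startswith(".")
```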

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures the post-deploy benchmark run gets fresh dead-code analysis
results from the API rather than serving cached v2 responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ruth (#714)

_is_feature_removal_fp and _parse_deleted_imports were implemented but
never called. extract_ground_truth now applies them to remove symbols
from GT that were imported by other files also deleted in the same PR —
these are feature-removal co-deletions, not dead code, so no static
analysis tool would ever report them pre-merge.

Genuinely orphaned symbols (no deleted importer) are correctly kept in
GT and are detectable by the API when analysing the pre-merge commit.

Adds 16 unit tests covering:
- _parse_deleted_imports: single-line, multi-line, and default import forms
- _is_feature_removal_fp: FP detection keyed to file basename to avoid
  spurious suppression of unrelated same-named symbols
- Integration: full filter pipeline with the n8n (#23572) and
  prisma (#28485) benchmark cases from the issue
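The filter this commit wires in can be sketched roughly as follows (simplified to single-line named imports; the real helpers also handle multi-line and default imports and key on file basenames):

```python
import re

def parse_deleted_imports(deleted_file_diffs: dict[str, str]) -> dict[str, set[str]]:
    # Map each imported symbol name to the deleted files that imported it.
    imported_by: dict[str, set[str]] = {}
    for fname, diff in deleted_file_diffs.items():
        for m in re.finditer(r"^-\s*import\s*\{([^}]*)\}", diff, re.MULTILINE):
            for sym in m.group(1).split(","):
                imported_by.setdefault(sym.strip(), set()).add(fname)
    return imported_by

def is_feature_removal_fp(symbol: str, defining_file: str, imported_by) -> bool:
    # A deleted symbol imported by *another* deleted file was live code
    # co-removed with its consumer — not dead code, so drop it from GT.
    return bool(imported_by.get(symbol, set()) - {defining_file})

diffs = {
    "Runner.ts": "-import { PythonSandbox } from './PythonSandbox'\n-run()",
    "PythonSandbox.ts": "-export class PythonSandbox {}",
    "orphan.ts": "-export function orphanHelper() {}",
}
imported_by = parse_deleted_imports(diffs)
assert is_feature_removal_fp("PythonSandbox", "PythonSandbox.ts", imported_by)
assert not is_feature_removal_fp("orphanHelper", "orphan.ts", imported_by)
```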

Fixes supermodeltools/supermodel-public-api#714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…714)

Both caches had no invalidation mechanism:

GT cache (~/.cache/mcpbr/supermodel_ground_truth/dead_code_*.json):
- Written once, read forever; bypassed every fix applied to
  extract_ground_truth (FP filters added in c579a18/918e3a1, pattern
  additions for interface/type/enum, etc.)
- n8n and prisma GT files still contained PythonSandbox/serializeDatasources
  as benchmark targets even after the FP filter was implemented

Analysis cache (*_analysis_{zip_hash}.json):
- Keyed by zip hash; server-side idempotency key bumps (v1→v2→v3) did
  not invalidate it — the server was never reached, old results served
- n8n analysis cache showed 3,834 results with zero Pyodide hits, meaning
  it was computed from the post-merge commit (before the pre-merge fix)
  and would serve that result forever on re-runs

Fix: remove both caches entirely. GT extraction is one GitHub API call.
The Supermodel API handles server-side dedup via the idempotency key.
Also removes DEFAULT_GT_DIR constant and ground_truth_dir constructor
param (existed solely to support the now-deleted caches).

Fixes supermodeltools/supermodel-public-api#714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Completes the local-cache purge started in the previous commit.
The cached_analysis field allowed a task YAML to point at a pre-fetched
analysis file on disk; this path bypassed the API entirely and shared
all the same stale-data problems as the zip-hash cache.

Now create_environment always calls _get_analysis directly. The Supermodel
API handles deduplication server-side via the idempotency key.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
greynewell force-pushed the feat/supermodel-dead-code-fixes branch from db54b68 to 731cea6 on March 26, 2026 17:22
greynewell merged commit 69fe34d into main Mar 26, 2026 — 12 checks passed
greynewell deleted the feat/supermodel-dead-code-fixes branch March 26, 2026 17:22