20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Fixed

- **Remove GT and analysis caching from SupermodelBenchmark** (supermodeltools/supermodel-public-api#714):
Both caches had no invalidation mechanism, causing stale data to persist silently across runs.
The GT cache bypassed all fixes applied to `extract_ground_truth` (FP filters, pattern additions).
The analysis cache was keyed by zip hash, so server-side idempotency key version bumps did not
bust it — the server was never reached and old results were served indefinitely. Both caches are
now removed. GT extraction is a single GitHub API call (cheap). The Supermodel API handles
server-side deduplication via the idempotency key. Also removes the `DEFAULT_GT_DIR` constant,
`ground_truth_dir` constructor parameter, and the `cached_analysis` task config field — all of
which existed solely to support the now-removed caching paths.

- **Dead code benchmark: filter feature-removal false positives from ground truth** (supermodeltools/supermodel-public-api#714):
The ground truth extractor now applies the existing `_is_feature_removal_fp` filter (which
was implemented but never called). Symbols deleted in a PR that are also imported by other
files deleted in the same PR are excluded from GT — they were live code co-removed with
their consumers, not dead code. Genuinely orphaned symbols with no deleted importer are
kept. This fixes 0-recall scores for PRs like n8n #23572 and prisma #28485 where whole
files were removed as part of a feature deletion.
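
A minimal sketch of the filtering rule described in the entry above, not the project's actual `_is_feature_removal_fp` implementation. The `DeletedSymbol` type, the `filter_feature_removal_fps` name, and the shape of `deleted_file_imports` are all assumptions made for illustration; the file and symbol names in the usage lines are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletedSymbol:
    """A symbol removed by the PR (hypothetical shape, for illustration only)."""

    name: str
    file: str


def filter_feature_removal_fps(
    deleted_symbols: list[DeletedSymbol],
    deleted_file_imports: dict[str, set[str]],
) -> list[DeletedSymbol]:
    """Keep only deleted symbols that no *other* deleted file imported.

    deleted_file_imports maps each deleted file path to the set of symbol
    names that file imported (an assumed input format).
    """
    kept = []
    for sym in deleted_symbols:
        co_removed_consumer = any(
            sym.name in imports
            for path, imports in deleted_file_imports.items()
            if path != sym.file
        )
        # A deleted importer means the symbol was live code removed together
        # with its consumer, so it does not belong in the ground truth.
        if not co_removed_consumer:
            kept.append(sym)
    return kept


# Invented example data: renderWidget is imported by a page deleted in the
# same PR, legacyHelper has no deleted importer.
symbols = [
    DeletedSymbol("renderWidget", "src/widget.ts"),
    DeletedSymbol("legacyHelper", "src/util.ts"),
]
imports = {"src/widget_page.ts": {"renderWidget"}, "src/widget.ts": set()}
ground_truth = filter_feature_removal_fps(symbols, imports)
```

Under these assumptions only `legacyHelper` stays in the ground truth, which matches the "genuinely orphaned symbols are kept" behavior the entry describes.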
Comment on lines +10 to +28

⚠️ Potential issue | 🟡 Minor

Document the result-semantics change too.

This PR also changes what Supermodel reports back to users: `fp_mode` can now be `"vs alive set"`, and `resolved` is no longer gated on recall. Both changes show up in run output and result JSON, so they should have their own [Unreleased] entry alongside the cache/GT fixes.

Based on learnings "Applies to CHANGELOG.md : Update CHANGELOG.md with ALL user-visible changes: new features, bug fixes, breaking changes, deprecations, security fixes, performance improvements, and CLI/configuration changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 10 - 27, Add a new [Unreleased] CHANGELOG.md entry
describing the user-visible result-semantics change: note that Supermodel can
now set fp_mode to "vs alive set" and that the `resolved` field is no longer
gated by recall (i.e., it may be set regardless of recall), and indicate that
both changes affect run output and result JSON; place this entry alongside the
existing cache/GT fixes so users see both behavioral changes and the caching removal.


## [0.14.0] - 2026-02-13

### Added
43 changes: 12 additions & 31 deletions src/mcpbr/benchmarks/supermodel/api_client.py
@@ -4,10 +4,9 @@
import hashlib
import json
import logging
import os
import sys
import tempfile
import time
from typing import Any

logger = logging.getLogger("mcpbr.supermodel")

@@ -44,29 +43,19 @@ async def call_supermodel_api(
with open(zip_path, "rb") as f:
zip_hash = hashlib.sha256(f.read()).hexdigest()[:12]
ep_name = endpoint_path.strip("/").replace("/", "-")
idempotency_key = f"bench:{ep_name}:{zip_hash}:v2"
idempotency_key = f"bench:{ep_name}:{zip_hash}:v3"

headers = [
"-H",
"Accept: application/json",
"-H",
f"Idempotency-Key: {idempotency_key}",
]

# Pass API key via curl config file to avoid exposure in process table (ps aux)
api_key_config_path: str | None = None
if api_key:
with tempfile.NamedTemporaryFile(
mode="w", suffix=".cfg", prefix="mcpbr_curl_", delete=False
) as api_key_fd:
api_key_fd.write(f'header = "X-Api-Key: {api_key}"\n')
api_key_config_path = api_key_fd.name
os.chmod(api_key_config_path, 0o600)
headers.extend(["-H", f"X-Api-Key: {api_key}"])

# Initial request with file upload
upload_cmd = ["curl", "-s", "-X", "POST", url, "-F", f"file=@{zip_path}", *headers]
Comment on lines 54 to 58

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import os
import subprocess
import time

proc = subprocess.Popen(
    ["python", "-c", "import time; time.sleep(5)", "X-Api-Key: demo-secret"]
)
time.sleep(1)
os.system(f"ps -o command= -p {proc.pid}")
proc.terminate()
proc.wait()
PY

Repository: supermodeltools/mcpbr

Length of output: 85


🏁 Script executed:

find src -name "api_client.py" -type f

Repository: supermodeltools/mcpbr

Length of output: 110


🏁 Script executed:

cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '45,70p'

Repository: supermodeltools/mcpbr

Length of output: 1082


🏁 Script executed:

cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '1,65p'

Repository: supermodeltools/mcpbr

Length of output: 2434


🏁 Script executed:

cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '65,100p'

Repository: supermodeltools/mcpbr

Length of output: 1640


🏁 Script executed:

cat -n src/mcpbr/benchmarks/supermodel/api_client.py | sed -n '97,130p'

Repository: supermodeltools/mcpbr

Length of output: 1459


Move API key out of curl command-line arguments.

Right now, the API key is passed to curl as a -H argument (lines 54-55, and again at line 105 in the polling loop). That puts the key in the process command line. While the subprocess is running, anything that inspects running processes (ps, system monitoring tools, CI/job logs that record commands) can see the secret in plain text.

The reason is simple: when you call subprocess.Popen(["curl", "-s", ..., f"X-Api-Key: {api_key}"]), the API key becomes part of the process's argv. On Linux, /proc/[pid]/cmdline is readable by other local users by default, and on any Unix system ps aux shows it. The code this PR removes passed the key through a 0600 curl config file specifically to avoid that exposure, so this is a security regression.

Fix options:

  1. curl config file: Write headers to a temp config file with restrictive permissions, then pass --config to curl. The key stays in the file, not argv.
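  2. stdin: Pass --config - so curl reads its config from stdin, and write the header = "X-Api-Key: <key>" line to the subprocess's stdin. The key then touches neither argv nor disk.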
  3. Python HTTP client: Use requests or httpx instead of subprocess—the key never touches argv.

I'd suggest option 3 (Python HTTP client) since you're already in an async context. Otherwise, option 1 (temp config file) is safer than option 2.
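
For reference, option 1 can be sketched roughly as follows. This mirrors the config-file approach visible in the removed lines of the diff, but the helper names `write_curl_header_config` and `build_upload_cmd` and the example URL are hypothetical. It assumes a POSIX system and an API key that contains no double quotes.

```python
import os
import stat
import tempfile


def write_curl_header_config(api_key: str) -> str:
    """Write the API key header to a private curl config file; return its path."""
    # mkstemp creates the file readable and writable by the owner only.
    fd, path = tempfile.mkstemp(prefix="mcpbr_curl_", suffix=".cfg")
    with os.fdopen(fd, "w") as cfg:
        cfg.write(f'header = "X-Api-Key: {api_key}"\n')
    os.chmod(path, 0o600)  # explicit, in case the default ever differs
    return path


def build_upload_cmd(url: str, zip_path: str, config_path: str) -> list[str]:
    """Build the curl argv; the key is referenced only through --config."""
    return [
        "curl", "-s", "-X", "POST", url,
        "-F", f"file=@{zip_path}",
        "-H", "Accept: application/json",
        "--config", config_path,
    ]


# Usage sketch: the URL and zip name below are placeholders, not real values.
config_path = write_curl_header_config("demo-secret")
try:
    cmd = build_upload_cmd("https://api.example.invalid/upload", "repo.zip", config_path)
    key_in_argv = any("demo-secret" in arg for arg in cmd)
    mode = stat.S_IMODE(os.stat(config_path).st_mode)
finally:
    # Always remove the config file so the key does not linger on disk.
    os.unlink(config_path)
```

The same config file can be reused for the polling requests, as long as the cleanup stays in a `finally` block that runs after the last poll.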

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/supermodel/api_client.py` around lines 54 - 58, The
current code appends the API key into the curl command arguments (via the
headers list and upload_cmd) exposing it in process argv; replace the
subprocess/curl usage with a Python HTTP client so the API key is only set in an
in-memory header. Locate the places constructing headers and upload_cmd and the
polling loop that builds curl commands (references: headers, upload_cmd,
zip_path, url, and the polling logic around line ~105) and reimplement the POST
file upload and subsequent status poll using an async HTTP client (httpx or
aiohttp) or synchronous requests, setting "X-Api-Key" in the request headers
rather than any command-line, and remove all subprocess.Popen/args construction
that included the key.

if api_key_config_path:
upload_cmd.extend(["--config", api_key_config_path])

start_time = time.time()
print(
@@ -83,10 +72,7 @@ async def call_supermodel_api(
if proc.returncode != 0:
raise RuntimeError(f"Supermodel API request failed: {stderr.decode()}")

try:
response = json.loads(stdout.decode())
except json.JSONDecodeError as e:
raise RuntimeError(f"Non-JSON response from Supermodel API: {stdout.decode()[:500]}") from e
response = json.loads(stdout.decode())

# Poll if async — use lightweight requests (1-byte dummy file instead of
# re-uploading the full zip). The API recognizes the idempotency key and
@@ -102,6 +88,8 @@ async def call_supermodel_api(

# Create poll dummy on first iteration only
if poll_dummy_path is None:
import tempfile

with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as poll_dummy:
poll_dummy.write(b"\n")
poll_dummy_path = poll_dummy.name
@@ -116,8 +104,6 @@ async def call_supermodel_api(
f"file=@{poll_dummy_path}",
*headers,
]
if api_key_config_path:
poll_cmd.extend(["--config", api_key_config_path])

retry_after = response.get("retryAfter", 10)
poll_count += 1
@@ -138,17 +124,12 @@ async def call_supermodel_api(

if proc.returncode != 0:
raise RuntimeError(f"Supermodel API poll failed: {stderr.decode()}")
try:
response = json.loads(stdout.decode())
except json.JSONDecodeError as e:
raise RuntimeError(
f"Non-JSON poll response from Supermodel API: {stdout.decode()[:500]}"
) from e
response = json.loads(stdout.decode())
finally:
if poll_dummy_path is not None:
os.unlink(poll_dummy_path)
if api_key_config_path is not None:
os.unlink(api_key_config_path)
import os as _os

_os.unlink(poll_dummy_path)

elapsed = time.time() - start_time

@@ -161,6 +142,6 @@ async def call_supermodel_api(
if isinstance(status, int) and status >= 400:
raise RuntimeError(f"Supermodel API HTTP {status}: {response.get('message', response)}")

api_result = response.get("result", response)
api_result: dict[str, Any] = response.get("result", response)
print(f" Supermodel API: completed in {elapsed:.1f}s", file=sys.stderr, flush=True)
return dict(api_result)
return api_result