Dead code benchmark results — Feb 17, 2026 blog post data #2
Context
Ran the dead code benchmark suite on the feat/supermodel-benchmark branch to generate data for a blog post on AI-assisted dead code detection. This issue captures the full context, results, and observations.
Runs Completed
Run 1: Baseline Configuration (deadcode-baseline.yaml)
- Task: typescript-express-app (synthetic corpus, 35 files, 102 dead functions)
- MCP: Filesystem MCP with grep-based approach
- Result: MCP 90%P/92%R/91%F1 vs Baseline 92%P/85%R/88%F1 — both PASS
Run 2: Precomputed Analysis (deadcode-precomputed.yaml)
- Task: Same synthetic corpus
- MCP: Pre-generated .supermodel/dead-code-analysis.json
- Result: MCP 91%P/99%R/95%F1 vs Baseline 90%P/84%R/87%F1 — both PASS
- Key insight: Precomputed analysis achieves 99% recall with only 5 tool calls
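With a precomputed analysis, the MCP agent mostly transcribes candidates rather than searching for them, which explains the low tool-call count. A minimal sketch of such a loader, assuming a hypothetical schema with a deadFunctions list (the real .supermodel/dead-code-analysis.json format is not documented in this issue):

```python
import json

def load_dead_code_candidates(analysis_json: str) -> list[dict]:
    """Parse a precomputed dead-code analysis blob into (file, symbol) candidates.

    The "deadFunctions" schema below is a hypothetical illustration;
    the actual .supermodel/dead-code-analysis.json format may differ.
    """
    data = json.loads(analysis_json)
    return [
        {"file": item["file"], "symbol": item["symbol"]}
        for item in data.get("deadFunctions", [])
    ]

# Hypothetical example payload mirroring the synthetic corpus shape.
sample = json.dumps({
    "deadFunctions": [
        {"file": "src/routes/users.ts", "symbol": "legacyLookup"},
        {"file": "src/utils/format.ts", "symbol": "padLeftOld"},
    ]
})

print(load_dead_code_candidates(sample))
```

The point of the sketch is that a single file read plus a transcription pass needs only a handful of tool calls, versus the dozens of grep/read iterations the baseline agent spends discovering the same candidates.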
Run 3: Real PR (supermodel-deadcode-pr.yaml -t tyr_pr258)
- Task: tyr_pr258 — uncovering-world/track-your-regions PR #258 (22 ground truth items)
- MCP: Supermodel staging API (95.5% API recall)
- Result: MCP 53%P/41%R/46%F1 vs Baseline 14%P/5%R/7%F1 — both FAIL
Results Summary Table
| Run | Task | MCP P/R/F1 | Baseline P/R/F1 | Resolved |
|---|---|---|---|---|
| 01 Baseline | ts-express-app | 90/92/91% | 92/85/88% | Both YES |
| 02 Precomputed | ts-express-app | 91/99/95% | 90/84/87% | Both YES |
| 03 Real PR | tyr_pr258 | 53/41/46% | 14/5/7% | Both NO |
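For reference, the P/R/F1 figures above follow the standard definitions over true positives, false positives, and false negatives against the ground truth set. A minimal sketch (the tp/fp/fn example values are illustrative, not from any run):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative: an agent reporting 9 correct items, 1 spurious, missing 2.
p, r, f1 = precision_recall_f1(tp=9, fp=1, fn=2)
print(f"{p:.0%} P / {r:.0%} R / {f1:.0%} F1")  # → 90% P / 82% R / 86% F1
```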
Key Findings
1. Pre-computed analysis dramatically improves recall
Run 02 achieved 99% recall with 5 tool calls ($0.47, 1m 55s) vs baseline's 84% recall with 88 tool calls ($0.62, 3m 26s).
2. Agent-API gap on real-world codebases
The Supermodel API has 95.5% recall on tyr_pr258, but the agent only achieves 41% recall. Possible causes:
- Analysis JSON file size (204 KB) causing truncation
- Agent selectively filtering instead of faithfully transcribing
- Path normalization issues between agent output and ground truth
- Context window limits on tool output
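One way to test the path-normalization hypothesis is to canonicalize both sides before matching and see whether recall recovers. A sketch, assuming items are (path, symbol) pairs; the helper name and normalization rules are illustrative, not the evaluation harness's actual logic:

```python
from pathlib import PurePosixPath

def normalize_item(path: str, symbol: str, repo_root: str = "") -> tuple[str, str]:
    """Canonicalize a (path, symbol) pair so superficial differences
    (backslashes, leading './', an absolute repo prefix) do not break matching.
    Illustrative only: not the harness's real normalization.
    """
    p = path.replace("\\", "/")
    if repo_root and p.startswith(repo_root):
        p = p[len(repo_root):]
    # PurePosixPath collapses './' and duplicate slashes without touching the filesystem.
    p = str(PurePosixPath(p.lstrip("/")))
    return p, symbol.strip()

# An agent-reported item and a ground-truth item that differ only in form:
agent = normalize_item("./frontend/src/api/regions.ts", "fetchLegacyRegions ")
truth = normalize_item("/repo/frontend/src/api/regions.ts", "fetchLegacyRegions",
                       repo_root="/repo")
print(agent == truth)  # → True
```

If rescoring tyr_pr258 under a normalization like this closes part of the 41% vs 95.5% gap, the cause is matching strictness rather than agent behavior; if not, truncation or filtering is the better suspect.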
3. Baseline agents give up on complex repos
The baseline agent on tyr_pr258 only ran 5 iterations and found 7 items (1 TP). It essentially surrendered early.
4. Scaling pattern confirmed
Consistent with earlier manual tests (antiwork/helper):
- Synthetic (35 files): Baseline ~85% recall, MCP ~92-99% recall
- Real-world (100+ files): Baseline <5% recall, MCP ~41% recall
- Baseline performance degrades with size; MCP advantage grows
Cost Summary
- Run 01: $1.26 (MCP $0.69 + BL $0.57)
- Run 02: $1.09 (MCP $0.47 + BL $0.62)
- Run 03: $3.41 (MCP $1.77 + BL $1.64)
- Total: ~$5.76
Environment
- mcpbr: v0.13.4 on feat/supermodel-benchmark
- Model: claude-sonnet-4-20250514
- Agent Harness: claude-code
- Docker: v28.1.1
- Python: 3.14.0
- Platform: macOS Darwin 25.2.0
Artifacts Location
All run artifacts (configs, logs, ground truth, evaluation results) are saved in:
~/Downloads/blogpost-deadcode/
Includes:
- 6 runs (01-03 without verbose logging, 04-06 with -vv verbose logging)
- Reference data from earlier manual runs (antiwork/helper benchmark, dead-code-endpoint-analysis)
- Full agent transcripts for each condition
- Ground truth files and Supermodel API analysis caches
Open Questions
- Investigate the agent-API gap on tyr_pr258 (41% agent recall vs 95.5% API recall)
- Consider adding verbose logging to configs by default
- Should the evaluation normalization be more lenient on path matching?
- Would a prompt emphasizing "transcribe ALL candidates without filtering" improve MCP agent recall?
Related
- Branch: feat/supermodel-benchmark
- Earlier manual benchmark: ~/Downloads/supermodel-benchmark-results/supermodel-dead-code-benchmark.md
- Dead code corpus: supermodeltools/dead-code-benchmark-corpus