
Dead code benchmark results — Feb 17, 2026 blog post data #2

@jonathanpopham


Context

Ran the dead code benchmark suite on the feat/supermodel-benchmark branch to generate data for a blog post on AI-assisted dead code detection. This issue captures the full context, results, and observations.

Runs Completed

Run 1: Baseline Configuration (deadcode-baseline.yaml)

  • Task: typescript-express-app (synthetic corpus, 35 files, 102 dead functions)
  • MCP: Filesystem MCP with grep-based approach
  • Result: MCP 90%P/92%R/91%F1 vs Baseline 92%P/85%R/88%F1 — both PASS

Run 2: Precomputed Analysis (deadcode-precomputed.yaml)

  • Task: Same synthetic corpus
  • MCP: Pre-generated .supermodel/dead-code-analysis.json
  • Result: MCP 91%P/99%R/95%F1 vs Baseline 90%P/84%R/87%F1 — both PASS
  • Key insight: Precomputed analysis achieves 99% recall with only 5 tool calls
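The mechanics of the precomputed run can be sketched as follows. The JSON schema here is an assumption for illustration only (the actual `.supermodel/dead-code-analysis.json` format is not documented in this issue): the agent loads the file once and transcribes every candidate verbatim, which is why so few tool calls are needed.

```python
import json

# ASSUMED schema for illustration:
# {"dead_functions": [{"file": "...", "name": "...", "line": 1}, ...]}
def transcribe_candidates(analysis_path: str) -> list[str]:
    """Read a precomputed dead-code analysis and emit every candidate
    verbatim, without filtering (selective filtering by the agent is one
    suspected cause of recall loss; see Key Findings below)."""
    with open(analysis_path) as f:
        analysis = json.load(f)
    return [
        f"{item['file']}:{item.get('line', '?')} {item['name']}"
        for item in analysis.get("dead_functions", [])
    ]
```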

Run 3: Real PR (supermodel-deadcode-pr.yaml -t tyr_pr258)

  • Task: tyr_pr258 (uncovering-world/track-your-regions PR #258, 22 ground-truth items)
  • MCP: Supermodel staging API (95.5% API recall)
  • Result: MCP 53%P/41%R/46%F1 vs Baseline 14%P/5%R/7%F1 — both FAIL

Results Summary Table

| Run            | Task           | MCP P/R/F1 | Baseline P/R/F1 | Resolved |
|----------------|----------------|------------|-----------------|----------|
| 01 Baseline    | ts-express-app | 90/92/91%  | 92/85/88%       | Both YES |
| 02 Precomputed | ts-express-app | 91/99/95%  | 90/84/87%       | Both YES |
| 03 Real PR     | tyr_pr258      | 53/41/46%  | 14/5/7%         | Both NO  |

Key Findings

1. Pre-computed analysis dramatically improves recall

Run 02 achieved 99% recall with 5 tool calls ($0.47, 1m 55s) vs baseline's 84% recall with 88 tool calls ($0.62, 3m 26s).

2. Agent-API gap on real-world codebases

The Supermodel API has 95.5% recall on tyr_pr258, but the agent only achieves 41% recall. Possible causes:

  • Analysis JSON file size (204 KB) causing truncation
  • Agent selectively filtering instead of faithfully transcribing
  • Path normalization issues between agent output and ground truth
  • Context window limits on tool output
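The path-normalization hypothesis is easy to probe offline. Below is a minimal sketch of a lenient matcher (a hypothetical helper, not part of mcpbr or the existing evaluation) that strips common prefix noise before comparing agent output to ground truth:

```python
from pathlib import PurePosixPath

def normalize(path: str) -> str:
    """Lenient normalization: drop './' components and collapse separators
    so paths from different roots can be compared structurally."""
    return "/".join(p for p in PurePosixPath(path).parts if p not in (".", "/"))

def lenient_match(agent_path: str, truth_path: str) -> bool:
    """Accept an exact match, or a suffix match when one side carries an
    extra repo-root prefix the other side omits."""
    a, t = normalize(agent_path), normalize(truth_path)
    return a == t or a.endswith("/" + t) or t.endswith("/" + a)

# Example: agent reports a repo-rooted path, ground truth is repo-relative
print(lenient_match("./track-your-regions/src/utils.ts", "src/utils.ts"))  # True
```

If rescoring Run 03 with a matcher like this recovers a meaningful chunk of the missing recall, the gap is an evaluation artifact rather than an agent failure.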

3. Baseline agents give up on complex repos

The baseline agent on tyr_pr258 only ran 5 iterations and found 7 items (1 TP). It essentially surrendered early.
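As a sanity check, the baseline row in the summary table can be recomputed from the raw counts in this paragraph (1 true positive among 7 reported items, 22 ground-truth items):

```python
def prf1(tp: int, reported: int, ground_truth: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, guarding against
    division by zero when an agent reports nothing."""
    precision = tp / reported if reported else 0.0
    recall = tp / ground_truth if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Baseline agent on tyr_pr258: 1 TP among 7 reported, 22 ground-truth items
p, r, f1 = prf1(tp=1, reported=7, ground_truth=22)
print(f"P={p:.0%} R={r:.0%} F1={f1:.0%}")  # → P=14% R=5% F1=7%
```

This reproduces the 14/5/7% baseline row in the summary table, so the reported percentages are internally consistent.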

4. Scaling pattern confirmed

Consistent with earlier manual tests (antiwork/helper):

  • Synthetic (35 files): Baseline ~85% recall, MCP ~92-99% recall
  • Real-world (100+ files): Baseline <5% recall, MCP ~41% recall
  • Baseline performance degrades with size; MCP advantage grows

Cost Summary

  • Run 01: $1.26 (MCP $0.69 + BL $0.57)
  • Run 02: $1.09 (MCP $0.47 + BL $0.62)
  • Run 03: $3.41 (MCP $1.77 + BL $1.64)
  • Total: $5.76

Environment

  • mcpbr: v0.13.4 on feat/supermodel-benchmark
  • Model: claude-sonnet-4-20250514
  • Agent Harness: claude-code
  • Docker: v28.1.1
  • Python: 3.14.0
  • Platform: macOS Darwin 25.2.0

Artifacts Location

All run artifacts (configs, logs, ground truth, evaluation results) are saved in:
~/Downloads/blogpost-deadcode/

Includes:

  • 6 runs (01-03 without verbose logging, 04-06 with -vv verbose logging)
  • Reference data from earlier manual runs (antiwork/helper benchmark, dead-code-endpoint-analysis)
  • Full agent transcripts for each condition
  • Ground truth files and Supermodel API analysis caches

Open Questions

  • Investigate the agent-API gap on tyr_pr258 (41% agent recall vs 95.5% API recall)
  • Consider adding verbose logging to configs by default
  • Should the evaluation normalization be more lenient on path matching?
  • Would a prompt emphasizing "transcribe ALL candidates without filtering" improve MCP agent recall?

Related

  • Branch: feat/supermodel-benchmark
  • Earlier manual benchmark: ~/Downloads/supermodel-benchmark-results/supermodel-dead-code-benchmark.md
  • Dead code corpus: supermodeltools/dead-code-benchmark-corpus
