Update benchmark results: 94.1% F1, 100% precision, 156x cheaper#1
Update benchmark results: 94.1% F1, 100% precision, 156x cheaper#1jonathanpopham wants to merge 1 commit intomainfrom
Conversation
Updated blog post with latest benchmark results from March 30, 2026: - MCP avg F1: 94.1% vs Baseline 52.0% (up from 10.1% in March 9 run) - 100% precision across all 14 tasks (zero false positives) - 90% average recall - 156x cheaper ($1.40 vs $219), 11x faster (28 min vs 306 min) - Head-to-head: MCP wins 11, Baseline wins 0, 3 ties - Model upgraded to Claude Opus 4.6 - 14 real-world tasks from major OSS repos (Next.js, Cal.com, Storybook, etc.) - MCP agent: 28 tool calls total (2/task) vs baseline 4,079
WalkthroughA blog post about AI agents and dead code detection was updated with new benchmark data and methodology. The model switched from Claude Sonnet 4 to Claude Opus 4.6, the analysis approach was reframed to use MCP (Model Context Protocol), and results were regenerated from an expanded 60+ task set across 14 repositories, replacing earlier metrics and journey timeline. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
blog/2026-03-09-dead-code-graphs-ai-agents.md (2)
263-263: Tiny wording polish: hyphenate compound modifier.At Line 263, “false-positive root causes” reads a bit cleaner than “false positive root causes.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` at line 263, Replace the unhyphenated compound "false positive root causes" with the hyphenated form "false-positive root causes" in the blog post text (look for the occurrence in the sentence discussing "100% precision and 90% recall across 14 diverse repositories") to apply the compound modifier correctly.
327-331: Make scope differences explicit in the trajectory table.Line 327 through Line 331 compares periods with different benchmark scope (e.g., 10 vs 14 tasks). Add a “Tasks / Dataset scope” column so readers don’t interpret this as a strict apples-to-apples trend.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` around lines 327 - 331, The trajectory table header row ("Period | Avg F1 | Avg Precision | Avg Recall | Key Change") should add a new column "Tasks / Dataset scope" and each data row (the Feb 20, Mar 9, Mar 30 rows) must include the corresponding scope value (e.g., "10 tasks" for Feb 20, explicit scope for Mar 9 or "n/a/unknown" if you don't have the exact number, and "14 tasks" for Mar 30) so readers see apples-to-apples differences; update the header row and each row in that Markdown table and add a short note in the Key Change cell if needed to clarify any differences.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md`:
- Around line 131-165: Add a single explicit sentence after the metrics table
(the MCP vs Baseline summary block and likewise where the summary is reiterated
later) linking directly to the Mar 30 run artifact and the aggregation script;
for example reference the artifact path
runs/mar-30-final-run/incremental_results.jsonl and the aggregation script used
to compute the table so readers can verify the 94.1% F1, 28 tool calls, etc.
Update the markdown in blog/2026-03-09-dead-code-graphs-ai-agents.md to include
that sentence (near the metrics table and the later summary) with proper
hyperlinks to the artifact and the aggregation script in the benchmark repo.
Ensure the sentence is concise and uses the exact artifact filename
incremental_results.jsonl and the script name so it’s unambiguous.
---
Nitpick comments:
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md`:
- Line 263: Replace the unhyphenated compound "false positive root causes" with
the hyphenated form "false-positive root causes" in the blog post text (look for
the occurrence in the sentence discussing "100% precision and 90% recall across
14 diverse repositories") to apply the compound modifier correctly.
- Around line 327-331: The trajectory table header row ("Period | Avg F1 | Avg
Precision | Avg Recall | Key Change") should add a new column "Tasks / Dataset
scope" and each data row (the Feb 20, Mar 9, Mar 30 rows) must include the
corresponding scope value (e.g., "10 tasks" for Feb 20, explicit scope for Mar 9
or "n/a/unknown" if you don't have the exact number, and "14 tasks" for Mar 30)
so readers see apples-to-apples differences; update the header row and each row
in that Markdown table and add a short note in the Key Change cell if needed to
clarify any differences.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8159b64f-0464-4d12-9fe7-867b4232d166
📒 Files selected for processing (1)
blog/2026-03-09-dead-code-graphs-ai-agents.md
| | Metric | MCP (Graph) | Baseline (grep) | Difference | | ||
| |--------|------------|-----------------|------------| | ||
| | **Avg F1** | **94.1%** | 52.0% | **+42pp** | | ||
| | **Precision** | **100%** | varies | **zero false positives** | | ||
| | **Avg Recall** | **90%** | varies | | | ||
| | **Total Cost** | **$1.40** | $219 | **156x cheaper** | | ||
| | **Total Time** | **28 min** | 306 min | **11x faster** | | ||
| | **Tool Calls** | **28** (2/task) | 4,079 | **146x fewer** | | ||
| | **Head-to-head** | **11 wins** | 0 wins | 3 ties | | ||
|
|
||
| The graph-enhanced agent found 11 of 12 confirmed dead code items with zero false positives. The baseline found 2. Both runs reproduced identically on recall and precision. **This task was "resolved"** (both precision and recall above our 80% bar), making it the only real-world task where any agent cleared that threshold. | ||
| The MCP agent achieved 100% precision across all 14 tasks: every single item it reported was genuinely dead code. It did this while maintaining 90% average recall, finding the vast majority of confirmed dead code items in each repository. | ||
|
|
||
| What happened? The baseline agent spent 184 tool calls grepping through 576 files trying to build a mental model of the call graph at runtime. The graph-enhanced agent read one JSON file, wrote a small Python script to extract the candidates, and was done in 6 tool calls. **The graph pre-computes the expensive work, so the agent doesn't have to.** | ||
| The efficiency numbers are striking. The MCP agent made just 28 tool calls total across 14 tasks -- an average of 2 calls per task. It read the pre-computed analysis, reported the candidates, and was done. The baseline agent made 4,079 tool calls, spending hundreds of iterations grepping through codebases trying to build a mental model of call graphs at runtime. **The graph pre-computes the expensive work so the agent doesn't have to.** | ||
|
|
||
| The single missed item (`SUBSCRIPTION_FREE_TRIAL_USAGE_LIMIT`) was a constant used only in template literals. A known gap in the parser, not a limitation of the approach. | ||
| ### Per-Task Breakdown | ||
|
|
||
| ### Synthetic Codebases: Near-Perfect | ||
| Every task was a real merged PR where a developer identified and removed dead code. The MCP agent matched or exceeded the baseline on every single task. | ||
|
|
||
| On a synthetic 35-file TypeScript Express app with 102 planted dead code items: | ||
| | Task | Repository | MCP F1 | Baseline F1 | | ||
| |------|-----------|--------|-------------| | ||
| | tyr_pr258 | track-your-regions | **97.6%** | 81.6% | | ||
| | podman_pr16084 | Podman Desktop | **100%** | 67.7% | | ||
| | gemini_cli_pr18681 | Gemini CLI | **80%** | 42.9% | | ||
| | jslpsolver_pr159 | jsLPSolver | 78.3% | **78.6%** | | ||
| | strapi_pr24327 | Strapi | **100%** | **100%** | | ||
| | mimir_pr3613 | Mimir | **100%** | **100%** | | ||
| | otel_js_pr5444 | OpenTelemetry JS | **100%** | 17.6% | | ||
| | tanstack_router_pr6735 | TanStack Router | **100%** | 12% | | ||
| | latitude_pr2300 | Latitude LLM | **92.3%** | 35.3% | | ||
| | storybook_pr34168 | Storybook | **100%** | 0% | | ||
| | maskbook_pr12361 | Maskbook | **81%** | 68.4% | | ||
| | directus_pr26311 | Directus | **100%** | 14.3% | | ||
| | calcom_pr26222 | Cal.com | **100%** | 57.1% | | ||
| | nextjs_pr87149 | Next.js | **88.9%** | CRASH | | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find candidate run artifacts for Mar 30
fd -a "incremental_results.jsonl"
# 2) Show Mar 30 candidate paths (if present)
fd -a "incremental_results.jsonl" | rg "20260330|mar-30|mar_30" || true
# 3) If a Mar 30 JSONL exists, recompute key MCP aggregates quickly.
# Replace FILE with the discovered path before running.
FILE="$(fd -a "incremental_results.jsonl" | rg "20260330|mar-30|mar_30" | head -n1 || true)"
if [ -n "${FILE:-}" ]; then
echo "Using: $FILE"
python - << 'PY'
import json, statistics, sys, os
file = os.environ.get("FILE")
rows=[]
with open(file, "r", encoding="utf-8") as f:
for line in f:
line=line.strip()
if not line:
continue
obj=json.loads(line)
if obj.get("type")=="task_result":
d=obj["data"]
rows.append(d)
mcp=[r.get("mcp") for r in rows if r.get("mcp")]
base=[r.get("baseline") for r in rows if r.get("baseline")]
def avg(xs):
return sum(xs)/len(xs) if xs else float("nan")
mcp_f1=avg([x["f1_score"] for x in mcp if "f1_score" in x])
mcp_prec=avg([x["precision"] for x in mcp if "precision" in x])
mcp_rec=avg([x["recall"] for x in mcp if "recall" in x])
mcp_cost=sum([x.get("cost",0) for x in mcp])
mcp_time=sum([x.get("runtime_seconds",0) for x in mcp])
mcp_calls=sum([x.get("tool_calls",0) for x in mcp])
base_f1=avg([x["f1_score"] for x in base if isinstance(x,dict) and "f1_score" in x])
base_cost=sum([x.get("cost",0) for x in base if isinstance(x,dict)])
base_time=sum([x.get("runtime_seconds",0) for x in base if isinstance(x,dict)])
base_calls=sum([x.get("tool_calls",0) for x in base if isinstance(x,dict)])
print("tasks:", len(rows))
print("mcp_avg_f1:", round(mcp_f1*100,2))
print("mcp_avg_precision:", round(mcp_prec*100,2))
print("mcp_avg_recall:", round(mcp_rec*100,2))
print("mcp_cost:", round(mcp_cost,2))
print("mcp_time_min:", round(mcp_time/60,2))
print("mcp_tool_calls:", mcp_calls)
print("baseline_avg_f1:", round(base_f1*100,2) if base else "n/a")
print("baseline_cost:", round(base_cost,2))
print("baseline_time_min:", round(base_time/60,2))
print("baseline_tool_calls:", base_calls)
PY
else
echo "No Mar 30 JSONL path found. Add explicit run artifact links in the post."
fiRepository: supermodeltools/dead-code-benchmark-blog
Length of output: 156
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Check overall repo structure
echo "=== Repo structure ==="
fd -a -t d -d 2 | head -20
# 2) Look for any runs/ or artifacts/ directories
echo ""
echo "=== Looking for runs or artifacts directories ==="
fd -a -t d "runs|artifacts|data|results" -d 3 | head -20
# 3) Read the blog post to see what it says about metrics
echo ""
echo "=== Blog post file exists? ==="
ls -lh blog/2026-03-09-dead-code-graphs-ai-agents.md 2>/dev/null || echo "File not found"
# 4) Check line count
echo ""
echo "=== Blog post line count ==="
wc -l blog/2026-03-09-dead-code-graphs-ai-agents.md 2>/dev/null || echo "Cannot read"Repository: supermodeltools/dead-code-benchmark-blog
Length of output: 552
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 5) Read lines around 131-165 to see what the post actually says
echo "=== Lines 131-165 of blog post ==="
sed -n '131,165p' blog/2026-03-09-dead-code-graphs-ai-agents.md
echo ""
echo "=== Context around lines 365-377 ==="
sed -n '360,380p' blog/2026-03-09-dead-code-graphs-ai-agents.mdRepository: supermodeltools/dead-code-benchmark-blog
Length of output: 3377
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 6) Search for any mentions of run artifacts, data sources, or reproducibility in the post
echo "=== Searching for links/references to run data in post ==="
grep -i "link\|github\|artifact\|run\|reproduce\|data" blog/2026-03-09-dead-code-graphs-ai-agents.md | head -20 || echo "No matches"
# 7) Look for any files that might document how to access the run data
echo ""
echo "=== Looking for README or docs that explain the repo ==="
fd -a -i "readme|docs|guide" -t f | head -10Repository: supermodeltools/dead-code-benchmark-blog
Length of output: 4466
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 8) List all jsonl files to see what run data exists
echo "=== All JSONL files in repo ==="
fd -a "\.jsonl$" | head -20
# 9) Check for any other benchmark result files
echo ""
echo "=== All JSON files that might be results ==="
fd -a "result|metric|benchmark" -e json | head -20Repository: supermodeltools/dead-code-benchmark-blog
Length of output: 161
Add a direct link to the specific Mar 30 run artifact that produced these metrics.
The post mentions that "all runs are logged with timestamps, configs, and structured metrics" and links to the benchmark repository, which is great. But right now, someone reading lines 131–165 (and 365–377) would have to dig around to find which exact run produced the 94.1% F1, 28 tool calls, and other numbers in the table.
Here's what would make it bulletproof: a single sentence like:
Results from runs/mar-30-final-run/incremental_results.jsonl — computed via this aggregation script.
That way, readers don't have to guess. They can click through and verify the numbers themselves, which is especially important for a benchmark post where credibility depends on traceability. Think of it like showing your work on a math problem—the answer matters, but being able to trace it back to the source data matters more.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` around lines 131 - 165, Add a
single explicit sentence after the metrics table (the MCP vs Baseline summary
block and likewise where the summary is reiterated later) linking directly to
the Mar 30 run artifact and the aggregation script; for example reference the
artifact path runs/mar-30-final-run/incremental_results.jsonl and the
aggregation script used to compute the table so readers can verify the 94.1% F1,
28 tool calls, etc. Update the markdown in
blog/2026-03-09-dead-code-graphs-ai-agents.md to include that sentence (near the
metrics table and the later summary) with proper hyperlinks to the artifact and
the aggregation script in the benchmark repo. Ensure the sentence is concise and
uses the exact artifact filename incremental_results.jsonl and the script name
so it’s unambiguous.
Summary
Key numbers
Test plan
Summary by CodeRabbit