This repository was archived by the owner on Apr 2, 2026. It is now read-only.

Update benchmark results: 94.1% F1, 100% precision, 156x cheaper#1

Draft
jonathanpopham wants to merge 1 commit into main from
feat/updated-benchmark-results

Conversation

@jonathanpopham
Collaborator

@jonathanpopham jonathanpopham commented Mar 31, 2026

Summary

  • Updated blog post with latest benchmark results (March 30, 2026 run with Claude Opus 4.6)
  • MCP avg F1: 94.1% vs Baseline 52.0% -- up from 10.1% F1 in the March 9 run
  • 100% precision across all 14 tasks (zero false positives)
  • 156x cheaper ($1.40 vs $219), 11x faster (28 min vs 306 min)
  • Head-to-head: MCP wins 11, Baseline wins 0, 3 ties
  • Expanded from 12 to 14 real-world tasks (added Strapi, TanStack Router, Storybook, Gemini CLI, Latitude LLM, Cal.com)
  • Updated model from Claude Sonnet 4 to Claude Opus 4.6
  • Per-task F1 table showing MCP vs Baseline across all 14 repos
  • Updated improvement trajectory table (Feb 20 -> Mar 9 -> Mar 30)
  • Revised failure modes section to reflect what's been fixed
  • Updated methodology notes with new run statistics

Key numbers

| Metric | MCP | Baseline |
|--------|-----|----------|
| Avg F1 | 94.1% | 52.0% |
| Precision | 100% | varies |
| Avg Recall | 90% | varies |
| Cost | $1.40 | $219 |
| Time | 28 min | 306 min |
| Tool calls | 28 | 4,079 |
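One thing worth noting when reading these numbers: with 100% precision and 90% recall, the F1 of the pooled averages would be 2·1.0·0.9/1.9 ≈ 94.7%, so the reported 94.1% is presumably a macro average of per-task F1 scores, which weights every task equally. A minimal sketch of the distinction, using hypothetical per-task (precision, recall) pairs rather than the actual benchmark data:

```python
# Macro-averaged F1 vs. F1 computed from averaged precision/recall.
# The per-task (precision, recall) pairs below are illustrative only.
tasks = [(1.0, 0.9), (1.0, 1.0), (1.0, 0.8)]

def f1(p, r):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Macro average: compute F1 per task, then average the F1 scores.
macro_f1 = sum(f1(p, r) for p, r in tasks) / len(tasks)

# Pooled-average variant: average precision and recall first, then take F1.
avg_p = sum(p for p, _ in tasks) / len(tasks)
avg_r = sum(r for _, r in tasks) / len(tasks)
pooled_f1 = f1(avg_p, avg_r)

print(round(macro_f1, 4), round(pooled_f1, 4))  # the two aggregates differ
```

The two aggregates diverge whenever per-task scores vary, so a benchmark post should state which one it reports.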

Test plan

  • Review all updated numbers for accuracy against raw benchmark data
  • Verify per-task F1 table matches run results
  • Check that narrative flow is coherent with new results woven in
  • Confirm no stale references to old numbers remain in the post

Summary by CodeRabbit

  • Documentation
    • Updated benchmark blog post with results from 60+ runs across 14 merged PR tasks (expanded from previous 40+ runs).
    • Updated tested model to Claude Opus 4.6.
    • Revised methodology to incorporate MCP (Model Context Protocol) integration.
    • Added new performance metrics and per-task breakdown analysis.

Updated blog post with latest benchmark results from March 30, 2026:
- MCP avg F1: 94.1% vs Baseline 52.0% (up from 10.1% in March 9 run)
- 100% precision across all 14 tasks (zero false positives)
- 90% average recall
- 156x cheaper ($1.40 vs $219), 11x faster (28 min vs 306 min)
- Head-to-head: MCP wins 11, Baseline wins 0, 3 ties
- Model upgraded to Claude Opus 4.6
- 14 real-world tasks from major OSS repos (Next.js, Cal.com, Storybook, etc.)
- MCP agent: 28 tool calls total (2/task) vs baseline 4,079
@coderabbitai

coderabbitai bot commented Mar 31, 2026

Walkthrough

A blog post about AI agents and dead code detection was updated with new benchmark data and methodology. The model switched from Claude Sonnet 4 to Claude Opus 4.6, the analysis approach was reframed to use MCP (Model Context Protocol), and results were regenerated from an expanded 60+ task set across 14 repositories, replacing earlier metrics and journey timeline.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Blog Post Content Update**<br>`blog/2026-03-09-dead-code-graphs-ai-agents.md` | Model and approach changed (Sonnet 4 → Opus 4.6, pre-computed graphs → MCP delivery). Benchmark scope expanded (40+ runs/12 repos → 60+ runs/14 merged PR tasks). Results section heavily revised with new aggregated metrics (Avg F1 94.1%, Precision 100%, Recall 90%), updated efficiency figures, and per-task breakdown table. Benchmark methodology, journey section, and failure mode documentation all restructured with new timeline and comparison data. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

📊 Opus arrives with precision's bright gleam,
MCP channels the tool-calling dream,
Fourteen repos dance through the night,
Ninety-four F1s shining so bright! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: updated benchmark results with concrete metrics (94.1% F1, 100% precision, 156x cost reduction). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/updated-benchmark-results

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
blog/2026-03-09-dead-code-graphs-ai-agents.md (2)

263-263: Tiny wording polish: hyphenate compound modifier.

At Line 263, “false-positive root causes” reads a bit cleaner than “false positive root causes.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` at line 263, Replace the
unhyphenated compound "false positive root causes" with the hyphenated form
"false-positive root causes" in the blog post text (look for the occurrence in
the sentence discussing "100% precision and 90% recall across 14 diverse
repositories") to apply the compound modifier correctly.

327-331: Make scope differences explicit in the trajectory table.

Lines 327 through 331 compare periods with different benchmark scope (e.g., 10 vs 14 tasks). Add a “Tasks / Dataset scope” column so readers don’t interpret this as a strict apples-to-apples trend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` around lines 327 - 331, The
trajectory table header row ("Period | Avg F1 | Avg Precision | Avg Recall | Key
Change") should add a new column "Tasks / Dataset scope" and each data row (the
Feb 20, Mar 9, Mar 30 rows) must include the corresponding scope value (e.g.,
"10 tasks" for Feb 20, explicit scope for Mar 9 or "n/a/unknown" if you don't
have the exact number, and "14 tasks" for Mar 30) so readers see
apples-to-apples differences; update the header row and each row in that
Markdown table and add a short note in the Key Change cell if needed to clarify
any differences.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md`:
- Around line 131-165: Add a single explicit sentence after the metrics table
(the MCP vs Baseline summary block and likewise where the summary is reiterated
later) linking directly to the Mar 30 run artifact and the aggregation script;
for example reference the artifact path
runs/mar-30-final-run/incremental_results.jsonl and the aggregation script used
to compute the table so readers can verify the 94.1% F1, 28 tool calls, etc.
Update the markdown in blog/2026-03-09-dead-code-graphs-ai-agents.md to include
that sentence (near the metrics table and the later summary) with proper
hyperlinks to the artifact and the aggregation script in the benchmark repo.
Ensure the sentence is concise and uses the exact artifact filename
incremental_results.jsonl and the script name so it’s unambiguous.

---

Nitpick comments:
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md`:
- Line 263: Replace the unhyphenated compound "false positive root causes" with
the hyphenated form "false-positive root causes" in the blog post text (look for
the occurrence in the sentence discussing "100% precision and 90% recall across
14 diverse repositories") to apply the compound modifier correctly.
- Around line 327-331: The trajectory table header row ("Period | Avg F1 | Avg
Precision | Avg Recall | Key Change") should add a new column "Tasks / Dataset
scope" and each data row (the Feb 20, Mar 9, Mar 30 rows) must include the
corresponding scope value (e.g., "10 tasks" for Feb 20, explicit scope for Mar 9
or "n/a/unknown" if you don't have the exact number, and "14 tasks" for Mar 30)
so readers see apples-to-apples differences; update the header row and each row
in that Markdown table and add a short note in the Key Change cell if needed to
clarify any differences.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8159b64f-0464-4d12-9fe7-867b4232d166

📥 Commits

Reviewing files that changed from the base of the PR and between 264cc34 and 08c5748.

📒 Files selected for processing (1)
  • blog/2026-03-09-dead-code-graphs-ai-agents.md

Comment on lines +131 to 165
| Metric | MCP (Graph) | Baseline (grep) | Difference |
|--------|------------|-----------------|------------|
| **Avg F1** | **94.1%** | 52.0% | **+42pp** |
| **Precision** | **100%** | varies | **zero false positives** |
| **Avg Recall** | **90%** | varies | |
| **Total Cost** | **$1.40** | $219 | **156x cheaper** |
| **Total Time** | **28 min** | 306 min | **11x faster** |
| **Tool Calls** | **28** (2/task) | 4,079 | **146x fewer** |
| **Head-to-head** | **11 wins** | 0 wins | 3 ties |

The graph-enhanced agent found 11 of 12 confirmed dead code items with zero false positives. The baseline found 2. Both runs reproduced identically on recall and precision. **This task was "resolved"** (both precision and recall above our 80% bar), making it the only real-world task where any agent cleared that threshold.
The MCP agent achieved 100% precision across all 14 tasks: every single item it reported was genuinely dead code. It did this while maintaining 90% average recall, finding the vast majority of confirmed dead code items in each repository.

What happened? The baseline agent spent 184 tool calls grepping through 576 files trying to build a mental model of the call graph at runtime. The graph-enhanced agent read one JSON file, wrote a small Python script to extract the candidates, and was done in 6 tool calls. **The graph pre-computes the expensive work, so the agent doesn't have to.**
The efficiency numbers are striking. The MCP agent made just 28 tool calls total across 14 tasks -- an average of 2 calls per task. It read the pre-computed analysis, reported the candidates, and was done. The baseline agent made 4,079 tool calls, spending hundreds of iterations grepping through codebases trying to build a mental model of call graphs at runtime. **The graph pre-computes the expensive work so the agent doesn't have to.**

The single missed item (`SUBSCRIPTION_FREE_TRIAL_USAGE_LIMIT`) was a constant used only in template literals. A known gap in the parser, not a limitation of the approach.
### Per-Task Breakdown

### Synthetic Codebases: Near-Perfect
Every task was a real merged PR where a developer identified and removed dead code. The MCP agent matched or exceeded the baseline on every single task.

On a synthetic 35-file TypeScript Express app with 102 planted dead code items:
| Task | Repository | MCP F1 | Baseline F1 |
|------|-----------|--------|-------------|
| tyr_pr258 | track-your-regions | **97.6%** | 81.6% |
| podman_pr16084 | Podman Desktop | **100%** | 67.7% |
| gemini_cli_pr18681 | Gemini CLI | **80%** | 42.9% |
| jslpsolver_pr159 | jsLPSolver | 78.3% | **78.6%** |
| strapi_pr24327 | Strapi | **100%** | **100%** |
| mimir_pr3613 | Mimir | **100%** | **100%** |
| otel_js_pr5444 | OpenTelemetry JS | **100%** | 17.6% |
| tanstack_router_pr6735 | TanStack Router | **100%** | 12% |
| latitude_pr2300 | Latitude LLM | **92.3%** | 35.3% |
| storybook_pr34168 | Storybook | **100%** | 0% |
| maskbook_pr12361 | Maskbook | **81%** | 68.4% |
| directus_pr26311 | Directus | **100%** | 14.3% |
| calcom_pr26222 | Cal.com | **100%** | 57.1% |
| nextjs_pr87149 | Next.js | **88.9%** | CRASH |



⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find candidate run artifacts for Mar 30
fd -a "incremental_results.jsonl"

# 2) Show Mar 30 candidate paths (if present)
fd -a "incremental_results.jsonl" | rg "20260330|mar-30|mar_30" || true

# 3) If a Mar 30 JSONL exists, recompute key MCP aggregates quickly.
# Replace FILE with the discovered path before running.
FILE="$(fd -a "incremental_results.jsonl" | rg "20260330|mar-30|mar_30" | head -n1 || true)"
if [ -n "${FILE:-}" ]; then
  echo "Using: $FILE"
  export FILE  # export so the Python heredoc below can read it via os.environ
  python - << 'PY'
import json, statistics, sys, os
file = os.environ.get("FILE")
rows=[]
with open(file, "r", encoding="utf-8") as f:
    for line in f:
        line=line.strip()
        if not line: 
            continue
        obj=json.loads(line)
        if obj.get("type")=="task_result":
            d=obj["data"]
            rows.append(d)

mcp=[r.get("mcp") for r in rows if r.get("mcp")]
base=[r.get("baseline") for r in rows if r.get("baseline")]

def avg(xs): 
    return sum(xs)/len(xs) if xs else float("nan")

mcp_f1=avg([x["f1_score"] for x in mcp if "f1_score" in x])
mcp_prec=avg([x["precision"] for x in mcp if "precision" in x])
mcp_rec=avg([x["recall"] for x in mcp if "recall" in x])
mcp_cost=sum([x.get("cost",0) for x in mcp])
mcp_time=sum([x.get("runtime_seconds",0) for x in mcp])
mcp_calls=sum([x.get("tool_calls",0) for x in mcp])

base_f1=avg([x["f1_score"] for x in base if isinstance(x,dict) and "f1_score" in x])
base_cost=sum([x.get("cost",0) for x in base if isinstance(x,dict)])
base_time=sum([x.get("runtime_seconds",0) for x in base if isinstance(x,dict)])
base_calls=sum([x.get("tool_calls",0) for x in base if isinstance(x,dict)])

print("tasks:", len(rows))
print("mcp_avg_f1:", round(mcp_f1*100,2))
print("mcp_avg_precision:", round(mcp_prec*100,2))
print("mcp_avg_recall:", round(mcp_rec*100,2))
print("mcp_cost:", round(mcp_cost,2))
print("mcp_time_min:", round(mcp_time/60,2))
print("mcp_tool_calls:", mcp_calls)
print("baseline_avg_f1:", round(base_f1*100,2) if base else "n/a")
print("baseline_cost:", round(base_cost,2))
print("baseline_time_min:", round(base_time/60,2))
print("baseline_tool_calls:", base_calls)
PY
else
  echo "No Mar 30 JSONL path found. Add explicit run artifact links in the post."
fi

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Check overall repo structure
echo "=== Repo structure ==="
fd -a -t d -d 2 | head -20

# 2) Look for any runs/ or artifacts/ directories
echo ""
echo "=== Looking for runs or artifacts directories ==="
fd -a -t d "runs|artifacts|data|results" -d 3 | head -20

# 3) Read the blog post to see what it says about metrics
echo ""
echo "=== Blog post file exists? ==="
ls -lh blog/2026-03-09-dead-code-graphs-ai-agents.md 2>/dev/null || echo "File not found"

# 4) Check line count
echo ""
echo "=== Blog post line count ==="
wc -l blog/2026-03-09-dead-code-graphs-ai-agents.md 2>/dev/null || echo "Cannot read"

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 552


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 5) Read lines around 131-165 to see what the post actually says
echo "=== Lines 131-165 of blog post ==="
sed -n '131,165p' blog/2026-03-09-dead-code-graphs-ai-agents.md

echo ""
echo "=== Context around lines 365-377 ==="
sed -n '360,380p' blog/2026-03-09-dead-code-graphs-ai-agents.md

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 3377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 6) Search for any mentions of run artifacts, data sources, or reproducibility in the post
echo "=== Searching for links/references to run data in post ==="
grep -i "link\|github\|artifact\|run\|reproduce\|data" blog/2026-03-09-dead-code-graphs-ai-agents.md | head -20 || echo "No matches"

# 7) Look for any files that might document how to access the run data
echo ""
echo "=== Looking for README or docs that explain the repo ==="
fd -a -i "readme|docs|guide" -t f | head -10

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 4466


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 8) List all jsonl files to see what run data exists
echo "=== All JSONL files in repo ==="
fd -a "\.jsonl$" | head -20

# 9) Check for any other benchmark result files
echo ""
echo "=== All JSON files that might be results ==="
fd -a "result|metric|benchmark" -e json | head -20

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 161


Add a direct link to the specific Mar 30 run artifact that produced these metrics.

The post mentions that "all runs are logged with timestamps, configs, and structured metrics" and links to the benchmark repository, which is great. But right now, someone reading lines 131–165 (and 365–377) would have to dig around to find which exact run produced the 94.1% F1, 28 tool calls, and other numbers in the table.

Here's what would make it bulletproof: a single sentence like:

Results from runs/mar-30-final-run/incremental_results.jsonl — computed via this aggregation script.

That way, readers don't have to guess. They can click through and verify the numbers themselves, which is especially important for a benchmark post where credibility depends on traceability. Think of it like showing your work on a math problem—the answer matters, but being able to trace it back to the source data matters more.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` around lines 131 - 165, Add a
single explicit sentence after the metrics table (the MCP vs Baseline summary
block and likewise where the summary is reiterated later) linking directly to
the Mar 30 run artifact and the aggregation script; for example reference the
artifact path runs/mar-30-final-run/incremental_results.jsonl and the
aggregation script used to compute the table so readers can verify the 94.1% F1,
28 tool calls, etc. Update the markdown in
blog/2026-03-09-dead-code-graphs-ai-agents.md to include that sentence (near the
metrics table and the later summary) with proper hyperlinks to the artifact and
the aggregation script in the benchmark repo. Ensure the sentence is concise and
uses the exact artifact filename incremental_results.jsonl and the script name
so it’s unambiguous.

