This repository was archived by the owner on Apr 2, 2026. It is now read-only.

Update benchmark results: 94.1% F1, 100% precision, 156x cheaper#1

Draft
jonathanpopham wants to merge 1 commit into main from
feat/updated-benchmark-results

Conversation

@jonathanpopham
Collaborator

@jonathanpopham jonathanpopham commented Mar 31, 2026

Summary

  • Updated blog post with latest benchmark results (March 30, 2026 run with Claude Opus 4.6)
  • MCP avg F1: 94.1% vs Baseline 52.0% -- up from 10.1% F1 in the March 9 run
  • 100% precision across all 14 tasks (zero false positives)
  • 156x cheaper ($1.40 vs $219), 11x faster (28 min vs 306 min)
  • Head-to-head: MCP wins 11, Baseline wins 0, 3 ties
  • Expanded from 12 to 14 real-world tasks (added Strapi, TanStack Router, Storybook, Gemini CLI, Latitude LLM, Cal.com)
  • Updated model from Claude Sonnet 4 to Claude Opus 4.6
  • Per-task F1 table showing MCP vs Baseline across all 14 repos
  • Updated improvement trajectory table (Feb 20 -> Mar 9 -> Mar 30)
  • Revised failure modes section to reflect what's been fixed
  • Updated methodology notes with new run statistics

Key numbers

| Metric | MCP | Baseline |
|--------|-----|----------|
| Avg F1 | 94.1% | 52.0% |
| Precision | 100% | varies |
| Avg Recall | 90% | varies |
| Cost | $1.40 | $219 |
| Time | 28 min | 306 min |
| Tool calls | 28 | 4,079 |
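One thing worth noting when reading these numbers: with 100% precision and 90% recall, the F1 of the pooled averages would be 2·1.0·0.9/1.9 ≈ 94.7%, so the reported 94.1% is presumably a macro average of per-task F1 scores, which weights every task equally. A minimal sketch of the distinction, using hypothetical per-task (precision, recall) pairs rather than the actual benchmark data:

```python
# Macro-averaged F1 vs. F1 computed from averaged precision/recall.
# The per-task (precision, recall) pairs below are illustrative only.
tasks = [(1.0, 0.9), (1.0, 1.0), (1.0, 0.8)]

def f1(p, r):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Macro average: compute F1 per task, then average the F1 scores.
macro_f1 = sum(f1(p, r) for p, r in tasks) / len(tasks)

# Pooled-average variant: average precision and recall first, then take F1.
avg_p = sum(p for p, _ in tasks) / len(tasks)
avg_r = sum(r for _, r in tasks) / len(tasks)
pooled_f1 = f1(avg_p, avg_r)

print(round(macro_f1, 4), round(pooled_f1, 4))  # the two aggregates differ
```

The two aggregates diverge whenever per-task scores vary, so a benchmark post should state which one it reports.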

Test plan

  • Review all updated numbers for accuracy against raw benchmark data
  • Verify per-task F1 table matches run results
  • Check that narrative flow is coherent with new results woven in
  • Confirm no stale references to old numbers remain in the post

Summary by CodeRabbit

  • Documentation
    • Updated benchmark blog post with results from 60+ runs across 14 merged PR tasks (expanded from previous 40+ runs).
    • Updated tested model to Claude Opus 4.6.
    • Revised methodology to incorporate MCP (Model Context Protocol) integration.
    • Added new performance metrics and per-task breakdown analysis.

Updated blog post with latest benchmark results from March 30, 2026:
- MCP avg F1: 94.1% vs Baseline 52.0% (up from 10.1% in March 9 run)
- 100% precision across all 14 tasks (zero false positives)
- 90% average recall
- 156x cheaper ($1.40 vs $219), 11x faster (28 min vs 306 min)
- Head-to-head: MCP wins 11, Baseline wins 0, 3 ties
- Model upgraded to Claude Opus 4.6
- 14 real-world tasks from major OSS repos (Next.js, Cal.com, Storybook, etc.)
- MCP agent: 28 tool calls total (2/task) vs baseline 4,079
@coderabbitai

coderabbitai bot commented Mar 31, 2026

Walkthrough

A blog post about AI agents and dead code detection was updated with new benchmark data and methodology. The model switched from Claude Sonnet 4 to Claude Opus 4.6, the analysis approach was reframed to use MCP (Model Context Protocol), and results were regenerated from an expanded 60+ task set across 14 repositories, replacing earlier metrics and journey timeline.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Blog Post Content Update**<br>`blog/2026-03-09-dead-code-graphs-ai-agents.md` | Model and approach changed (Sonnet 4 → Opus 4.6, pre-computed graphs → MCP delivery). Benchmark scope expanded (40+ runs/12 repos → 60+ runs/14 merged PR tasks). Results section heavily revised with new aggregated metrics (Avg F1 94.1%, Precision 100%, Recall 90%), updated efficiency figures, and per-task breakdown table. Benchmark methodology, journey section, and failure mode documentation all restructured with new timeline and comparison data. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

📊 Opus arrives with precision's bright gleam,
MCP channels the tool-calling dream,
Fourteen repos dance through the night,
Ninety-four F1s shining so bright! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: updated benchmark results with concrete metrics (94.1% F1, 100% precision, 156x cost reduction). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/updated-benchmark-results

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
blog/2026-03-09-dead-code-graphs-ai-agents.md (2)

263-263: Tiny wording polish: hyphenate compound modifier.

At Line 263, “false-positive root causes” reads a bit cleaner than “false positive root causes.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` at line 263, Replace the
unhyphenated compound "false positive root causes" with the hyphenated form
"false-positive root causes" in the blog post text (look for the occurrence in
the sentence discussing "100% precision and 90% recall across 14 diverse
repositories") to apply the compound modifier correctly.

327-331: Make scope differences explicit in the trajectory table.

Lines 327 through 331 compare periods with different benchmark scope (e.g., 10 vs 14 tasks). Add a “Tasks / Dataset scope” column so readers don’t interpret this as a strict apples-to-apples trend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` around lines 327 - 331, The
trajectory table header row ("Period | Avg F1 | Avg Precision | Avg Recall | Key
Change") should add a new column "Tasks / Dataset scope" and each data row (the
Feb 20, Mar 9, Mar 30 rows) must include the corresponding scope value (e.g.,
"10 tasks" for Feb 20, explicit scope for Mar 9 or "n/a/unknown" if you don't
have the exact number, and "14 tasks" for Mar 30) so readers see
apples-to-apples differences; update the header row and each row in that
Markdown table and add a short note in the Key Change cell if needed to clarify
any differences.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md`:
- Around line 131-165: Add a single explicit sentence after the metrics table
(the MCP vs Baseline summary block and likewise where the summary is reiterated
later) linking directly to the Mar 30 run artifact and the aggregation script;
for example reference the artifact path
runs/mar-30-final-run/incremental_results.jsonl and the aggregation script used
to compute the table so readers can verify the 94.1% F1, 28 tool calls, etc.
Update the markdown in blog/2026-03-09-dead-code-graphs-ai-agents.md to include
that sentence (near the metrics table and the later summary) with proper
hyperlinks to the artifact and the aggregation script in the benchmark repo.
Ensure the sentence is concise and uses the exact artifact filename
incremental_results.jsonl and the script name so it’s unambiguous.

---

Nitpick comments:
In `@blog/2026-03-09-dead-code-graphs-ai-agents.md`:
- Line 263: Replace the unhyphenated compound "false positive root causes" with
the hyphenated form "false-positive root causes" in the blog post text (look for
the occurrence in the sentence discussing "100% precision and 90% recall across
14 diverse repositories") to apply the compound modifier correctly.
- Around line 327-331: The trajectory table header row ("Period | Avg F1 | Avg
Precision | Avg Recall | Key Change") should add a new column "Tasks / Dataset
scope" and each data row (the Feb 20, Mar 9, Mar 30 rows) must include the
corresponding scope value (e.g., "10 tasks" for Feb 20, explicit scope for Mar 9
or "n/a/unknown" if you don't have the exact number, and "14 tasks" for Mar 30)
so readers see apples-to-apples differences; update the header row and each row
in that Markdown table and add a short note in the Key Change cell if needed to
clarify any differences.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8159b64f-0464-4d12-9fe7-867b4232d166

📥 Commits

Reviewing files that changed from the base of the PR and between 264cc34 and 08c5748.

📒 Files selected for processing (1)
  • blog/2026-03-09-dead-code-graphs-ai-agents.md

Comment on lines +131 to 165
| Metric | MCP (Graph) | Baseline (grep) | Difference |
|--------|------------|-----------------|------------|
| **Avg F1** | **94.1%** | 52.0% | **+42pp** |
| **Precision** | **100%** | varies | **zero false positives** |
| **Avg Recall** | **90%** | varies | |
| **Total Cost** | **$1.40** | $219 | **156x cheaper** |
| **Total Time** | **28 min** | 306 min | **11x faster** |
| **Tool Calls** | **28** (2/task) | 4,079 | **146x fewer** |
| **Head-to-head** | **11 wins** | 0 wins | 3 ties |

The graph-enhanced agent found 11 of 12 confirmed dead code items with zero false positives. The baseline found 2. Both runs reproduced identically on recall and precision. **This task was "resolved"** (both precision and recall above our 80% bar), making it the only real-world task where any agent cleared that threshold.
The MCP agent achieved 100% precision across all 14 tasks: every single item it reported was genuinely dead code. It did this while maintaining 90% average recall, finding the vast majority of confirmed dead code items in each repository.

What happened? The baseline agent spent 184 tool calls grepping through 576 files trying to build a mental model of the call graph at runtime. The graph-enhanced agent read one JSON file, wrote a small Python script to extract the candidates, and was done in 6 tool calls. **The graph pre-computes the expensive work, so the agent doesn't have to.**
The efficiency numbers are striking. The MCP agent made just 28 tool calls total across 14 tasks -- an average of 2 calls per task. It read the pre-computed analysis, reported the candidates, and was done. The baseline agent made 4,079 tool calls, spending hundreds of iterations grepping through codebases trying to build a mental model of call graphs at runtime. **The graph pre-computes the expensive work so the agent doesn't have to.**

The single missed item (`SUBSCRIPTION_FREE_TRIAL_USAGE_LIMIT`) was a constant used only in template literals. A known gap in the parser, not a limitation of the approach.
### Per-Task Breakdown

### Synthetic Codebases: Near-Perfect
Every task was a real merged PR where a developer identified and removed dead code. The MCP agent matched or exceeded the baseline on every single task.

On a synthetic 35-file TypeScript Express app with 102 planted dead code items:
| Task | Repository | MCP F1 | Baseline F1 |
|------|-----------|--------|-------------|
| tyr_pr258 | track-your-regions | **97.6%** | 81.6% |
| podman_pr16084 | Podman Desktop | **100%** | 67.7% |
| gemini_cli_pr18681 | Gemini CLI | **80%** | 42.9% |
| jslpsolver_pr159 | jsLPSolver | 78.3% | **78.6%** |
| strapi_pr24327 | Strapi | **100%** | **100%** |
| mimir_pr3613 | Mimir | **100%** | **100%** |
| otel_js_pr5444 | OpenTelemetry JS | **100%** | 17.6% |
| tanstack_router_pr6735 | TanStack Router | **100%** | 12% |
| latitude_pr2300 | Latitude LLM | **92.3%** | 35.3% |
| storybook_pr34168 | Storybook | **100%** | 0% |
| maskbook_pr12361 | Maskbook | **81%** | 68.4% |
| directus_pr26311 | Directus | **100%** | 14.3% |
| calcom_pr26222 | Cal.com | **100%** | 57.1% |
| nextjs_pr87149 | Next.js | **88.9%** | CRASH |



⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find candidate run artifacts for Mar 30
fd -a "incremental_results.jsonl"

# 2) Show Mar 30 candidate paths (if present)
fd -a "incremental_results.jsonl" | rg "20260330|mar-30|mar_30" || true

# 3) If a Mar 30 JSONL exists, recompute key MCP aggregates quickly.
# Replace FILE with the discovered path before running.
FILE="$(fd -a "incremental_results.jsonl" | rg "20260330|mar-30|mar_30" | head -n1 || true)"
if [ -n "${FILE:-}" ]; then
  echo "Using: $FILE"
  export FILE  # export so the Python heredoc below can read it via os.environ
  python - << 'PY'
import json, statistics, sys, os
file = os.environ.get("FILE")
rows=[]
with open(file, "r", encoding="utf-8") as f:
    for line in f:
        line=line.strip()
        if not line: 
            continue
        obj=json.loads(line)
        if obj.get("type")=="task_result":
            d=obj["data"]
            rows.append(d)

mcp=[r.get("mcp") for r in rows if r.get("mcp")]
base=[r.get("baseline") for r in rows if r.get("baseline")]

def avg(xs): 
    return sum(xs)/len(xs) if xs else float("nan")

mcp_f1=avg([x["f1_score"] for x in mcp if "f1_score" in x])
mcp_prec=avg([x["precision"] for x in mcp if "precision" in x])
mcp_rec=avg([x["recall"] for x in mcp if "recall" in x])
mcp_cost=sum([x.get("cost",0) for x in mcp])
mcp_time=sum([x.get("runtime_seconds",0) for x in mcp])
mcp_calls=sum([x.get("tool_calls",0) for x in mcp])

base_f1=avg([x["f1_score"] for x in base if isinstance(x,dict) and "f1_score" in x])
base_cost=sum([x.get("cost",0) for x in base if isinstance(x,dict)])
base_time=sum([x.get("runtime_seconds",0) for x in base if isinstance(x,dict)])
base_calls=sum([x.get("tool_calls",0) for x in base if isinstance(x,dict)])

print("tasks:", len(rows))
print("mcp_avg_f1:", round(mcp_f1*100,2))
print("mcp_avg_precision:", round(mcp_prec*100,2))
print("mcp_avg_recall:", round(mcp_rec*100,2))
print("mcp_cost:", round(mcp_cost,2))
print("mcp_time_min:", round(mcp_time/60,2))
print("mcp_tool_calls:", mcp_calls)
print("baseline_avg_f1:", round(base_f1*100,2) if base else "n/a")
print("baseline_cost:", round(base_cost,2))
print("baseline_time_min:", round(base_time/60,2))
print("baseline_tool_calls:", base_calls)
PY
else
  echo "No Mar 30 JSONL path found. Add explicit run artifact links in the post."
fi

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Check overall repo structure
echo "=== Repo structure ==="
fd -a -t d -d 2 | head -20

# 2) Look for any runs/ or artifacts/ directories
echo ""
echo "=== Looking for runs or artifacts directories ==="
fd -a -t d "runs|artifacts|data|results" -d 3 | head -20

# 3) Read the blog post to see what it says about metrics
echo ""
echo "=== Blog post file exists? ==="
ls -lh blog/2026-03-09-dead-code-graphs-ai-agents.md 2>/dev/null || echo "File not found"

# 4) Check line count
echo ""
echo "=== Blog post line count ==="
wc -l blog/2026-03-09-dead-code-graphs-ai-agents.md 2>/dev/null || echo "Cannot read"

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 552


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 5) Read lines around 131-165 to see what the post actually says
echo "=== Lines 131-165 of blog post ==="
sed -n '131,165p' blog/2026-03-09-dead-code-graphs-ai-agents.md

echo ""
echo "=== Context around lines 365-377 ==="
sed -n '360,380p' blog/2026-03-09-dead-code-graphs-ai-agents.md

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 3377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 6) Search for any mentions of run artifacts, data sources, or reproducibility in the post
echo "=== Searching for links/references to run data in post ==="
grep -i "link\|github\|artifact\|run\|reproduce\|data" blog/2026-03-09-dead-code-graphs-ai-agents.md | head -20 || echo "No matches"

# 7) Look for any files that might document how to access the run data
echo ""
echo "=== Looking for README or docs that explain the repo ==="
fd -a -i "readme|docs|guide" -t f | head -10

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 4466


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 8) List all jsonl files to see what run data exists
echo "=== All JSONL files in repo ==="
fd -a "\.jsonl$" | head -20

# 9) Check for any other benchmark result files
echo ""
echo "=== All JSON files that might be results ==="
fd -a "result|metric|benchmark" -e json | head -20

Repository: supermodeltools/dead-code-benchmark-blog

Length of output: 161


Add a direct link to the specific Mar 30 run artifact that produced these metrics.

The post mentions that "all runs are logged with timestamps, configs, and structured metrics" and links to the benchmark repository, which is great. But right now, someone reading lines 131–165 (and 365–377) would have to dig around to find which exact run produced the 94.1% F1, 28 tool calls, and other numbers in the table.

Here's what would make it bulletproof: a single sentence like:

Results from runs/mar-30-final-run/incremental_results.jsonl — computed via this aggregation script.

That way, readers don't have to guess. They can click through and verify the numbers themselves, which is especially important for a benchmark post where credibility depends on traceability. Think of it like showing your work on a math problem—the answer matters, but being able to trace it back to the source data matters more.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blog/2026-03-09-dead-code-graphs-ai-agents.md` around lines 131 - 165, Add a
single explicit sentence after the metrics table (the MCP vs Baseline summary
block and likewise where the summary is reiterated later) linking directly to
the Mar 30 run artifact and the aggregation script; for example reference the
artifact path runs/mar-30-final-run/incremental_results.jsonl and the
aggregation script used to compute the table so readers can verify the 94.1% F1,
28 tool calls, etc. Update the markdown in
blog/2026-03-09-dead-code-graphs-ai-agents.md to include that sentence (near the
metrics table and the later summary) with proper hyperlinks to the artifact and
the aggregation script in the benchmark repo. Ensure the sentence is concise and
uses the exact artifact filename incremental_results.jsonl and the script name
so it’s unambiguous.

