Releases · jackjin1997/AgentBench-Live

v0.2.0 — Variance reporting + reframed narrative

TL;DR

Reframed the leaderboard around three reality checks the v0.1 single-trial format was hiding.

What changed

The hook: README and live leaderboard now lead with run-to-run variance instead of "who wins overall", because re-running the same agent on the same task can swing the score by 70 percentage points. (Claude Code on tool-001: trial 1 = 0.0, trial 2 = 0.7. Same prompt. Same Docker sandbox.)

Three findings now front and center:

"Tied overall" hides 7× per-axis gaps. Claude Code 0.63 vs Gemini CLI 0.52 looks close overall, but on the Tool Use axis Claude Code scores 7× higher. Pick the agent for the task, not the leaderboard.
Code tasks are commodity. Both agents score 1.00 on every code task. The interesting differences live elsewhere.
Same agent, same task, 70-point swing. Most agent leaderboards quietly publish single-trial numbers. We don't.
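
To make the variance finding concrete, here is a minimal sketch of how a per-task spread could be computed from multi-trial scores. The `trials` layout and the `summarize` helper are assumptions for illustration, not the project's actual reporting code; only the tool-001 scores come from the notes above.

```python
from statistics import mean

# Hypothetical layout: trial scores on a 0.0-1.0 scale, keyed by (agent, task).
# Only the tool-001 pair (0.0, 0.7) is taken from the release notes.
trials = {
    ("claude-code", "tool-001"): [0.0, 0.7],
}


def summarize(scores):
    """Return mean, min, max and spread (max - min) for one agent/task pair."""
    lo, hi = min(scores), max(scores)
    return {"mean": mean(scores), "min": lo, "max": hi, "spread": hi - lo}


for (agent, task), scores in trials.items():
    s = summarize(scores)
    # Spread is reported in percentage points to match the wording above.
    print(f"{agent} {task}: mean={s['mean']:.2f}, spread={s['spread'] * 100:.0f} points")
```

A single-trial leaderboard would publish either 0.0 or 0.7 for this task; reporting the spread next to the mean is what exposes the difference.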

Schema additions (no behavior change):

EvalScore now carries optional CostMetrics and LatencyMetrics fields. v0.3 will populate them.
methodology.md documents the cost/latency roadmap and acknowledges sample-size limitations (n=2-3 for the variance findings).
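
As a rough illustration of "optional fields, no behavior change", the sketch below models the schema with dataclasses. Only the names `EvalScore`, `CostMetrics`, and `LatencyMetrics` come from the notes above; every field name, and the use of dataclasses at all, is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostMetrics:
    # Hypothetical fields; the roadmap only says cost means token usage.
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class LatencyMetrics:
    # Hypothetical field; the roadmap only says latency means wall-clock time.
    wall_clock_seconds: float = 0.0


@dataclass
class EvalScore:
    score: float
    # Defaulting to None keeps existing call sites valid.
    cost: Optional[CostMetrics] = None
    latency: Optional[LatencyMetrics] = None


# Existing code that only passes a score keeps working unchanged.
old_style = EvalScore(score=0.7)

# A later release could start filling in the new axes.
new_style = EvalScore(
    score=0.7,
    cost=CostMetrics(input_tokens=1200, output_tokens=300),
    latency=LatencyMetrics(wall_clock_seconds=42.5),
)
```

Because both new fields default to `None`, consumers can check for their presence instead of assuming every result carries cost and latency data.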

Docs:

docs/findings.md — data-driven backstop for every public claim
docs/launch-copy.md — 8-channel launch copy library
docs/launch-prep.md — HN Q&A drafts, demo video script, KOL outreach list, failure mode playbook
docs/social-card-v2.png — 1200×630 OG image, dual-color trial comparison

Tooling:

scripts/gen_launch_card.py — reproducible social card generation

What's still pending (v0.3 targets)

Multi-trial sweep across all 4 agents × 10 tasks × ≥3 trials (Claude Code, Gemini CLI, Codex CLI, Aider)
Disentangle agent variance from LLM-judge variance (judge-the-same-output-K-times protocol)
Populate cost (token usage) and latency (wall-clock) axes
Bigger task set (community task PRs welcome — see docs/task-authoring.md)
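
One way the judge-the-same-output-K-times protocol could separate the two variance sources is sketched below. The function name and data layout are hypothetical, and the decomposition is a simplification: the variance of per-output means still contains a small share of judge noise (roughly the judge variance divided by K).

```python
from statistics import mean, pvariance


def split_variance(judge_scores):
    """Split score variance into an agent part and a judge part.

    judge_scores[i][k] is the score the judge gave to agent output i
    on its k-th pass over that same, unchanged output.
    """
    per_output_means = [mean(row) for row in judge_scores]
    # Judge variance: how much scores move when the output is held fixed.
    judge_var = mean([pvariance(row) for row in judge_scores])
    # Agent variance (approximate): how much the averaged score moves
    # from one agent output to the next.
    agent_var = pvariance(per_output_means)
    return agent_var, judge_var


# Judge is perfectly consistent, the agent's outputs differ.
print(split_variance([[0.0, 0.0, 0.0], [0.7, 0.7, 0.7]]))

# Outputs get the same average score, but the judge wobbles on each one.
print(split_variance([[0.3, 0.4], [0.3, 0.4]]))
```

If the first pattern dominates, the swing comes from the agent; if the second dominates, part of the reported variance belongs to the LLM judge.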

Tests

191 passed.

Open source, MIT. Add an agent in 15 lines: see CONTRIBUTING.md.

v0.1.0 — First engineering release

What's New

160 tests, 90% code coverage — comprehensive test suite

Evaluator split — AutoEvaluator, LLMJudgeEvaluator (structured output), CompositeEvaluator

Adapter template method — add a new agent in ~15 lines of Python

Config system — agentbench.yaml + environment variable overrides

LangSmith integration — optional tracing, dataset export, evaluator upload

Logging — all modules instrumented, no more silent exceptions
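
For readers unfamiliar with the template method pattern behind the adapter item above, here is a generic sketch. The class and method names (`AgentAdapter`, `build_command`, `execute`, `parse_output`, `EchoAdapter`) are invented for illustration and are not the project's real interface; see CONTRIBUTING.md for that.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class AgentAdapter(ABC):
    """The base class owns the fixed workflow; subclasses supply agent-specific steps."""

    def run(self, prompt: str) -> str:
        # Template method: the order of steps is fixed here.
        command = self.build_command(prompt)
        raw = self.execute(command)
        return self.parse_output(raw)

    @abstractmethod
    def build_command(self, prompt: str) -> list:
        """Return the argv list that invokes the agent."""

    def execute(self, command: list) -> str:
        # Shared step: run the command and capture its stdout.
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        return result.stdout

    def parse_output(self, raw: str) -> str:
        # Default hook; subclasses may override it.
        return raw.strip()


class EchoAdapter(AgentAdapter):
    """Stand-in 'agent' that echoes the prompt, so the sketch runs anywhere."""

    def build_command(self, prompt: str) -> list:
        return [sys.executable, "-c", "import sys; print(sys.argv[1])", prompt]


print(EchoAdapter().run("hello"))
```

With this shape a new adapter only has to describe how to invoke its agent, which is how a per-agent implementation can stay within a handful of lines.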

Benchmark Results (10 real-world tasks)

Agent	Avg Score	Pass Rate
Claude Code	0.74	5/10
Gemini CLI	0.52	3/10

Quick Start

pip install agentbench-live
agentbench run --agent claude-code --domain all

Release list

v0.2.0 — Variance reporting + reframed narrative

v0.1.0 — First engineering release
