Releases: jackjin1997/AgentBench-Live
v0.2.0 — Variance reporting + reframed narrative
TL;DR
Reframed the leaderboard around three reality checks the v0.1 single-trial format was hiding.
What changed
The hook: README and live leaderboard now lead with run-to-run variance instead of "who wins overall" — because re-running the same agent on the same task can swing 70 percentage points. (Claude Code on tool-001: trial 1 = 0.0, trial 2 = 0.7. Same prompt. Same Docker sandbox.)
Three findings now front and center:
- "Tied overall" hides 7× per-axis gaps. Claude Code 0.63 vs Gemini CLI 0.52 looks close, but on Tool Use, Claude is 7× better. Pick the agent for the task, not the leaderboard.
- Code tasks are commodity. Both agents score 1.00 on every code task. The interesting differences live elsewhere.
- Same agent, same task, 70-point swing. Most agent leaderboards quietly publish single-trial numbers. We don't.
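To make the variance finding concrete, here is a minimal sketch of summarizing per-trial scores for one agent on one task. The function name `summarize_trials` and the returned keys are illustrative assumptions, not part of the AgentBench-Live API; the two scores are the tool-001 trials quoted above.

```python
from statistics import mean


def summarize_trials(scores):
    """Summarize repeated trial scores (0.0-1.0) for one agent on one task.

    Hypothetical helper for illustration; not the project's actual reporting code.
    """
    if not scores:
        raise ValueError("need at least one trial score")
    lo, hi = min(scores), max(scores)
    return {
        "n": len(scores),
        "mean": mean(scores),
        "min": lo,
        "max": hi,
        "spread": hi - lo,  # run-to-run swing on the same task
    }


# Claude Code on tool-001: trial 1 = 0.0, trial 2 = 0.7 (from the notes above).
summary = summarize_trials([0.0, 0.7])
print(f"mean = {summary['mean']:.2f}, spread = {summary['spread'] * 100:.0f} percentage points")
```

A single-trial leaderboard would have reported either 0.0 or 0.7 for this cell; reporting the mean together with the spread shows how little one run tells you.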
Schema additions (no behavior change):
- EvalScore now carries optional CostMetrics and LatencyMetrics fields. v0.3 will populate them.
- methodology.md documents the cost/latency roadmap and acknowledges sample-size limitations (n=2-3 for the variance findings).
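As a rough sketch of what "optional fields, no behavior change" can look like: only the three class names come from these notes. Every field name below, and the choice of plain dataclasses, is an assumption for illustration; the real schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostMetrics:
    # Hypothetical fields; the notes only say cost means token usage.
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class LatencyMetrics:
    # Hypothetical field; the notes only say latency means wall-clock time.
    wall_clock_seconds: float = 0.0


@dataclass
class EvalScore:
    score: float
    # Both default to None, so code that builds EvalScore(score=...) keeps working.
    cost: Optional[CostMetrics] = None
    latency: Optional[LatencyMetrics] = None


legacy = EvalScore(score=0.7)  # v0.2-style: metrics left unpopulated
populated = EvalScore(
    score=0.7,
    cost=CostMetrics(input_tokens=1200, output_tokens=300),  # made-up numbers
    latency=LatencyMetrics(wall_clock_seconds=42.5),         # made-up number
)
```

Defaulting the new fields to `None` is what keeps the change purely additive: existing callers and stored results remain valid until v0.3 starts filling them in.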
Docs:
- docs/findings.md: data-driven backstop for every public claim
- docs/launch-copy.md: 8-channel launch copy library
- docs/launch-prep.md: HN Q&A drafts, demo video script, KOL outreach list, failure mode playbook
- docs/social-card-v2.png: 1200×630 OG image, dual-color trial comparison
Tooling:
- scripts/gen_launch_card.py: reproducible social card generation
What's still pending (v0.3 targets)
- Multi-trial sweep across all 4 agents × 10 tasks × ≥3 trials (Claude Code, Gemini CLI, Codex CLI, Aider)
- Disentangle agent variance from LLM-judge variance (judge-the-same-output-K-times protocol)
- Populate cost (token usage) and latency (wall-clock) axes
- Bigger task set (community task PRs welcome; see docs/task-authoring.md)
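One way the judge-the-same-output-K-times protocol could separate the two variance sources is sketched below. This is an assumption about the approach, not the planned v0.3 implementation: `decompose_variance` is a hypothetical name, and the scores are made-up illustrative values, not benchmark results.

```python
from statistics import mean, pvariance


def decompose_variance(judge_scores_by_trial):
    """Split score variance into judge noise and between-trial variation.

    judge_scores_by_trial: one inner list per agent trial, holding the K
    scores the LLM judge gave when re-scoring that same trial output.
    Hypothetical helper for illustration only.
    """
    if not judge_scores_by_trial or any(not s for s in judge_scores_by_trial):
        raise ValueError("need at least one trial, each with at least one judge score")
    # Judge variance: spread of scores for a fixed output, averaged over trials.
    judge_var = mean([pvariance(s) for s in judge_scores_by_trial])
    # Between-trial variance: spread of per-trial mean scores. It still contains
    # roughly judge_var / K of judge noise, so it is an upper bound on agent variance.
    trial_means = [mean(s) for s in judge_scores_by_trial]
    between_var = pvariance(trial_means)
    return {"judge_variance": judge_var, "between_trial_variance": between_var}


# Illustrative only: two agent trials, each output judged K=3 times.
example = decompose_variance([[0.0, 0.1, 0.0], [0.7, 0.6, 0.7]])
print(example)
```

If the judge variance is small relative to the between-trial variance, the swing is mostly the agent's; if the two are comparable, the judge itself needs more repetitions or a stricter rubric.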
Tests
191 passed.
Open source, MIT. Add an agent in 15 lines: see CONTRIBUTING.md.
v0.1.0 — First engineering release
What's New
- 160 tests, 90% code coverage — comprehensive test suite
- Evaluator split — AutoEvaluator, LLMJudgeEvaluator (structured output), CompositeEvaluator
- Adapter template method — add a new agent in ~15 lines of Python
- Config system — agentbench.yaml + environment variable overrides
- LangSmith integration — optional tracing, dataset export, evaluator upload
- Logging — all modules instrumented, no more silent exceptions
Benchmark Results (10 real-world tasks)
| Agent | Avg Score | Pass Rate |
|---|---|---|
| Claude Code | 0.74 | 5/10 |
| Gemini CLI | 0.52 | 3/10 |
Quick Start
pip install agentbench-live
agentbench run --agent claude-code --domain all
Full changelog: https://github.com/jackjin1997/AgentBench-Live/commits/main