
Releases: jackjin1997/AgentBench-Live

v0.2.0 — Variance reporting + reframed narrative


@jackjin1997 released this 14 May 07:01

TL;DR

Reframed the leaderboard around three reality checks the v0.1 single-trial format was hiding.

What changed

The hook: README and live leaderboard now lead with run-to-run variance instead of "who wins overall", because re-running the same agent on the same task can swing the score by 70 percentage points. Example: Claude Code on tool-001 scored 0.0 in trial 1 and 0.7 in trial 2, with the same prompt and the same Docker sandbox.

Three findings now front and center:

  1. "Tied overall" hides 7× per-axis gaps. Claude Code 0.63 vs Gemini CLI 0.52 looks close, but on Tool Use, Claude is 7× better. Pick the agent for the task, not the leaderboard.
  2. Code tasks are commodity. Both agents score 1.00 on every code task. The interesting differences live elsewhere.
  3. Same agent, same task, 70-point swing. Most agent leaderboards quietly publish single-trial numbers. We don't.
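
To make the third finding concrete, here is a minimal sketch of how a per-task swing could be computed from multi-trial scores. Only the tool-001 scores (0.0 and 0.7) come from these notes; the `trials` layout and the `summarize` helper are hypothetical and are not part of the AgentBench-Live codebase.

```python
from statistics import mean

# Per-trial scores keyed by (agent, task). Only the tool-001 pair is taken
# from the release notes; the dict layout is an assumption for illustration.
trials = {
    ("claude-code", "tool-001"): [0.0, 0.7],
}


def summarize(scores):
    """Summarize one agent/task pair: trial count, mean, range and swing."""
    lo, hi = min(scores), max(scores)
    return {
        "n": len(scores),
        "mean": mean(scores),
        "min": lo,
        "max": hi,
        # Swing in percentage points, assuming scores lie on a 0-1 scale.
        "swing_points": round((hi - lo) * 100),
    }


report = {key: summarize(scores) for key, scores in trials.items()}
for (agent, task), row in report.items():
    print(f"{agent} {task}: mean={row['mean']:.2f} swing={row['swing_points']} pts (n={row['n']})")
```

Reporting the range next to the mean keeps a single lucky or unlucky trial from standing in for the agent's typical behavior, which is the point of the finding.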

Schema additions (no behavior change):

  • EvalScore now carries optional CostMetrics and LatencyMetrics fields. v0.3 will populate them.
  • methodology.md documents the cost/latency roadmap and acknowledges sample-size limitations (n=2-3 for the variance findings).
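
As a rough illustration of why optional fields imply no behavior change, the sketch below models the schema with dataclasses. The class names `EvalScore`, `CostMetrics` and `LatencyMetrics` come from these notes; every field name and the use of dataclasses are assumptions, and the real definitions in the repository may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostMetrics:
    # Hypothetical fields: the notes only say cost will mean token usage.
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class LatencyMetrics:
    # Hypothetical field: the notes only say latency will mean wall-clock time.
    wall_clock_seconds: float = 0.0


@dataclass
class EvalScore:
    score: float
    # Both default to None, so code that builds EvalScore(score=...) the
    # old way keeps working until a later release populates them.
    cost: Optional[CostMetrics] = None
    latency: Optional[LatencyMetrics] = None


old_style = EvalScore(score=0.7)
new_style = EvalScore(score=0.7, cost=CostMetrics(input_tokens=1200, output_tokens=300))
```

Because both new fields default to `None`, existing callers and stored results would not need to change.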

Docs:

  • docs/findings.md — data-driven backstop for every public claim
  • docs/launch-copy.md — 8-channel launch copy library
  • docs/launch-prep.md — HN Q&A drafts, demo video script, KOL outreach list, failure mode playbook
  • docs/social-card-v2.png — 1200×630 OG image, dual-color trial comparison

Tooling:

  • scripts/gen_launch_card.py — reproducible social card generation

What's still pending (v0.3 targets)

  • Multi-trial sweep across all 4 agents × 10 tasks × ≥3 trials (Claude Code, Gemini CLI, Codex CLI, Aider)
  • Disentangle agent variance from LLM-judge variance (judge-the-same-output-K-times protocol)
  • Populate cost (token usage) and latency (wall-clock) axes
  • Bigger task set (community task PRs welcome — see docs/task-authoring.md)
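
The judge-the-same-output-K-times protocol could be analyzed along these lines. This is a sketch under stated assumptions, not the project's method: the `decompose` helper is hypothetical, and the numbers are synthetic values chosen to show the two extremes, not measurements.

```python
from statistics import mean, pvariance


def decompose(judgments):
    """Split score variance into an agent part and a judge part.

    judgments[i][k] is the score the judge gave agent output i on its
    k-th pass over that same, unchanged output.
    """
    per_output_means = [mean(row) for row in judgments]
    # Judge variance: how much scores move when the output is held fixed.
    judge_var = mean([pvariance(row) for row in judgments])
    # Agent variance: how much the judged mean moves across outputs.
    # With small K this still contains some residual judge noise.
    agent_var = pvariance(per_output_means)
    return agent_var, judge_var


# Synthetic case 1: the judge is perfectly stable, the outputs differ.
agent_only = decompose([[0.0, 0.0, 0.0], [0.7, 0.7, 0.7]])

# Synthetic case 2: the outputs score alike on average, the judge is noisy.
judge_only = decompose([[0.2, 0.8], [0.2, 0.8]])

print("agent-driven:", agent_only)
print("judge-driven:", judge_only)
```

If the second pattern dominated in real data, a large trial-to-trial swing would say more about the LLM judge than about the agent.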

Tests

191 passed.


Open source, MIT. Add an agent in 15 lines: see CONTRIBUTING.md.

v0.1.0 — First engineering release


@jackjin1997 released this 19 Mar 03:51

What's New

  • 160 tests, 90% code coverage — comprehensive test suite
  • Evaluator split — AutoEvaluator, LLMJudgeEvaluator (structured output), CompositeEvaluator
  • Adapter template method — add a new agent in ~15 lines of Python
  • Config system — agentbench.yaml + environment variable overrides
  • LangSmith integration — optional tracing, dataset export, evaluator upload
  • Logging — all modules instrumented, no more silent exceptions
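
For readers unfamiliar with the pattern, the adapter template method might look roughly like the sketch below. The names `AgentAdapter`, `build_command`, `parse_output` and `EchoAdapter` are hypothetical and are not taken from the AgentBench-Live source; see CONTRIBUTING.md for the real interface.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class AgentAdapter(ABC):
    """Hypothetical base class; run() is the template method."""

    def run(self, prompt, timeout=600):
        # Fixed skeleton: build the command, execute it, parse the result.
        cmd = self.build_command(prompt)
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return self.parse_output(proc.stdout)

    @abstractmethod
    def build_command(self, prompt):
        """Return the argv list that invokes the agent for this prompt."""

    def parse_output(self, stdout):
        # Default hook; subclasses override it only when they need to.
        return stdout.strip()


class EchoAdapter(AgentAdapter):
    """Toy 'agent' that echoes its prompt, standing in for a real CLI."""

    def build_command(self, prompt):
        return [sys.executable, "-c", "import sys; print(sys.argv[1])", prompt]


print(EchoAdapter().run("hello"))
```

Since the base class owns the execution skeleton, a new adapter only supplies the parts that differ per agent, which is how an adapter can stay within a few lines.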

Benchmark Results (10 real-world tasks)

Agent        Avg Score  Pass Rate
Claude Code  0.74       5/10
Gemini CLI   0.52       3/10

Quick Start

pip install agentbench-live
agentbench run --agent claude-code --domain all

Full changelog: https://github.com/jackjin1997/AgentBench-Live/commits/main