AgentBench

Framework-agnostic CLI tool for benchmarking AI agents across standardized tasks.

Features

  • Minimal adapter interface — implement a single async function: (task: string) => Promise<string>
  • 25 built-in tasks across 5 categories: tool-use, reasoning, code-gen, research, multi-step
  • 3 scoring modes — exact-match, regex-match, LLM-judge
  • Framework adapters — LangChain, CrewAI, OpenAI Assistants
  • HTML reports — standalone reports with Chart.js visualizations
  • Agent comparison — compare two agents side-by-side with diff scores
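
The single-function contract in the first bullet can be written down as a type alias. The sketch below is inferred from that signature (the package's actual exported type may differ):

```typescript
// Inferred from the (task: string) => Promise<string> signature above;
// the type actually exported by @agentbench/cli may differ.
type AgentAdapter = (task: string) => Promise<string>;

// A trivial adapter that echoes the task back, useful as a smoke test
// before wiring up a real agent.
const echoAgent: AgentAdapter = async (task) => `echo: ${task}`;
```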

Quick Start

# Install
npm install -g @agentbench/cli

# Scaffold an adapter
agentbench init my-agent

# Run the benchmark
agentbench run -a ./my-agent.ts -n "My Agent" --framework openai --model gpt-4o

# Compare two runs
agentbench compare results-a.json results-b.json

# List available tasks
agentbench tasks

Agent Adapter

The simplest possible interface — your agent just needs to be an async function:

import type { AgentAdapter } from '@agentbench/cli';

const myAgent: AgentAdapter = async (task: string): Promise<string> => {
  // Call your agent here
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: task }],
    }),
  });
  if (!response.ok) {
    throw new Error(`OpenAI API error: ${response.status} ${response.statusText}`);
  }
  const data = await response.json();
  return data.choices[0].message.content;
};

export default myAgent;

Framework Adapters

// LangChain
import { createLangChainAdapter } from '@agentbench/cli';
const adapter = createLangChainAdapter(myChain);

// CrewAI
import { createCrewAIAdapter } from '@agentbench/cli';
const adapter = createCrewAIAdapter(myCrew);

// OpenAI Assistants
import { createOpenAIAssistantAdapter } from '@agentbench/cli';
const adapter = createOpenAIAssistantAdapter(client, assistantId);
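
Under the hood, each factory presumably normalizes the framework object into the plain `(task) => Promise<string>` shape. A hypothetical sketch, assuming a LangChain-style `invoke` method (illustrative only, not the published source):

```typescript
type AgentAdapter = (task: string) => Promise<string>;

// Hypothetical: anything exposing a LangChain-style invoke() method.
interface Invokable {
  invoke: (input: string) => Promise<string>;
}

// Wraps the framework object into the plain async-function adapter shape.
function wrapInvokable(runnable: Invokable): AgentAdapter {
  return async (task) => runnable.invoke(task);
}
```

The same pattern would apply to any framework: find the method that takes a prompt and returns text, then close over it.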

Task Categories

Category    Tasks  Description
Tool Use    4      Calculator, JSON parsing, pattern extraction, unit conversion
Reasoning   5      Logic puzzles, sequences, syllogisms, analogies, counterfactuals
Code Gen    5      FizzBuzz, palindromes, API fetch, SQL queries, algorithms
Research    5      Summarization, fact extraction, comparison, definitions
Multi-Step  6      Data pipelines, planning, text analysis, code review

Custom Tasks

Add your own tasks as YAML files:

- id: my-custom-task
  name: My Custom Task
  description: Test a specific capability
  prompt: "What is the capital of France?"
  category: reasoning
  difficulty: easy
  timeout_ms: 10000
  scoring_mode: exact-match
  expected_output: "Paris"

Then point the runner at your task directory:

agentbench run -a ./my-agent.ts --task-dir ./my-tasks/
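
The YAML fields map onto a task record like the following. This is an inferred shape, not the published schema; the optional fields follow from which scoring mode a task uses:

```typescript
// Inferred from the YAML example above; not the published schema.
interface Task {
  id: string;
  name: string;
  description: string;
  prompt: string;
  category: string;             // e.g. "reasoning"
  difficulty: string;           // e.g. "easy"
  timeout_ms: number;
  scoring_mode: 'exact-match' | 'regex-match' | 'llm-judge';
  expected_output?: string;     // used by exact-match
  expected_pattern?: string;    // used by regex-match
  evaluation_criteria?: string; // used by llm-judge
}

// The YAML example above, expressed as a Task record:
const myCustomTask: Task = {
  id: 'my-custom-task',
  name: 'My Custom Task',
  description: 'Test a specific capability',
  prompt: 'What is the capital of France?',
  category: 'reasoning',
  difficulty: 'easy',
  timeout_ms: 10000,
  scoring_mode: 'exact-match',
  expected_output: 'Paris',
};
```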

Scoring Modes

  • exact-match — Binary pass/fail against expected_output (whitespace-normalized)
  • regex-match — Binary pass/fail against expected_pattern
  • llm-judge — LLM scores 0–100 based on evaluation_criteria (falls back to heuristic if no API key)
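
The two binary modes can be sketched in a few lines. This is an illustration of the behavior described above, not AgentBench's actual scoring code; "whitespace-normalized" is assumed here to mean trimming and collapsing runs of whitespace:

```typescript
// Illustrative sketch; not the actual AgentBench implementation.
const normalize = (s: string): string => s.trim().replace(/\s+/g, ' ');

// exact-match: binary pass/fail against expected_output, whitespace-normalized.
function exactMatch(output: string, expectedOutput: string): boolean {
  return normalize(output) === normalize(expectedOutput);
}

// regex-match: binary pass/fail against expected_pattern.
function regexMatch(output: string, expectedPattern: string): boolean {
  return new RegExp(expectedPattern).test(output);
}
```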

CLI Commands

agentbench init [name]        Scaffold an adapter file and config
agentbench run                Execute benchmark against an agent
agentbench compare <a> <b>    Compare two result files side-by-side
agentbench tasks              List available benchmark tasks

agentbench run Options

Flag                       Description                          Default
-a, --adapter <path>       Path to adapter module               Required
-n, --name <name>          Agent name                           "unnamed"
--framework <name>         Framework identifier
--model <name>             Model identifier
-c, --categories <list>    Comma-separated categories to run    All
-d, --difficulties <list>  Comma-separated difficulties         All
--concurrency <n>          Max parallel tasks                   3
-o, --output <dir>         Output directory                     ./agentbench-results
--no-html                  Skip HTML report generation
--task-dir <path>          Additional task directory
--judge-key <key>          OpenAI API key for LLM judge

Leaderboard

Submit your benchmark results to the public leaderboard at aiagentdirectory.com/benchmark.

License

MIT
