OpenHands Benchmarks

This repository contains benchmark evaluation infrastructure for OpenHands agents. It provides standardized evaluation pipelines for testing agent capabilities across various real-world tasks.

⚠️ Migration in Progress: We are migrating the benchmarks from OpenHands V0 to the OpenHands Software Agent SDK infrastructure in V1.

Available Benchmarks

Benchmark  | Description                                                | Status
SWE-Bench  | Software engineering tasks from GitHub issues              | ✅ Active
GAIA       | General AI assistant tasks requiring multi-step reasoning  | ✅ Active

See the individual benchmark directories for detailed usage instructions.

Quick Start

Prerequisites

Before running any benchmarks, you need to set up the environment and ensure the local Agent SDK submodule is initialized.

make build
📦 Submodule & Environment Setup

🧩 1. Initialize the Agent SDK submodule

The Benchmarks project uses a local git submodule for the OpenHands Agent SDK.
This ensures your code runs against a specific, reproducible commit.

Run this once after cloning (make build already does this for you):

git submodule update --init --recursive

This command will:

  • clone the SDK into vendor/software-agent-sdk/
  • check out the exact commit pinned by this repo
  • make it available for local development (uv sync will install from the local folder)

If you ever clone this repository again, remember to re-initialize the submodule with the same command.
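To confirm the submodule was initialized and is sitting on the commit pinned by this repo, you can check its status (an optional sanity check using a standard git command):

git submodule status vendor/software-agent-sdk

A leading - in the output means the submodule has not been initialized yet; a leading + means its checked-out commit differs from the one recorded in this repository.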


🏗️ 2. Build the environment

Once the submodule is set up, install dependencies via uv:

make build

This runs:

uv sync

and ensures the openhands-* packages (SDK, tools, workspace, agent-server) are installed from the local workspace declared in pyproject.toml.
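If you want to double-check what was installed (optional; the exact package names are whatever the workspace declares), you can list the openhands packages in the project environment:

uv pip list | grep openhands

This simply lists the openhands-* packages that uv sync installed from the local workspace.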


🔄 3. Update the submodule (when SDK changes)

If you want to update to a newer version of the SDK:

cd vendor/software-agent-sdk
git fetch
git checkout <new_commit_or_branch>
cd ../..
git add vendor/software-agent-sdk
git commit -m "Update software-agent-sdk submodule to <new_commit_sha>"

Then re-run:

make build

to rebuild your environment with the new SDK code.
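If the submodule is configured to track a branch in .gitmodules, an alternative shortcut (a sketch; whether it applies depends on how this repo's submodule is configured) is to let git fetch the latest upstream commit directly:

git submodule update --remote vendor/software-agent-sdk
git add vendor/software-agent-sdk
git commit -m "Update software-agent-sdk submodule"

followed by make build as above.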

Configuration

Configure Your LLM

All benchmarks require an LLM configuration file. Define your LLM config as a JSON file whose fields follow the model fields of the LLM class.

Example (.llm_config/example.json):

{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY_HERE"
}
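Before invoking the bundled validator below, a quick way to catch syntax typos is to run the file through the standard-library JSON tool (this only checks that the file is well-formed JSON, not that the fields match the LLM class):

python -m json.tool .llm_config/example.json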

Validate your configuration:

uv run validate-cfg .llm_config/YOUR_CONFIG_PATH.json

Running Benchmarks

After setting up the environment and configuring your LLM, see the individual benchmark directories for specific usage instructions.
