Open-source benchmark suite for AI agent sandbox providers.
Measure what matters: time, cost, errors, friction, and capabilities when an AI agent uses your sandbox.
AI agents are the new power users. They don't read your UI — they parse your docs, call your API, and move on. Whether you're building RL training loops, agentic coding assistants, or autonomous sub-agent pipelines, the sandbox is the bottleneck. This benchmark measures how well sandbox providers serve AI agents across real-world workloads.
| Provider | Type | Status |
|---|---|---|
| E2B | Firecracker microVM | Supported |
| Daytona | Docker | Supported |
| Modal | Container | Supported |
| CodeSandbox | Docker | Supported |
| Fly.io Machines | Firecracker | Supported |
| VMVM | microVM (Meta internal) | Supported |
| Docker Image | Local container | Supported |
| MicroVM | Local microVM | Supported |
`sandbox-bench` runs modular test suites inside a single sandbox lifecycle. Pick what you need, or run them all.
| Suite | What it tests | Example phases |
|---|---|---|
| basic | Hello-world execution, file I/O | execute_hello, file_io |
| competitive | Baekjoon/CP-style: stdin piping, compilation, timeouts | stdin_piping, gcc, g++, exec_timeout |
| swe | SWE-bench-style: package install, git, pytest, network | network_access, pip_install, git_clone, pytest |
| environment | Complex onramp: Node.js, npm, venv, multi-step builds | nodejs, npm, project_clone, multi_step_build |
| performance | Agent spawn latency, warm start, file I/O throughput | agent_spawn, warm_start, rapid_exec, file_io_10mb |
| networking | Protocol support: TCP, UDP, HTTPS, WebSocket, SSH, inbound | tcp_outbound, udp_outbound, websocket, inbound_listen |
| mcp | MCP tool-use integration | tool_discovery, tool_invocation |
| training_batch | Concurrent sandbox scaling (256–4096) | tier1_256, tier2_1024, tier3_4096 |
| agentic_session | Snapshot/restore for long-running agents | snapshot, destroy_and_restore, verify_restore |
| full | All of the above | — |
The default suite is `basic`, for fast iteration. Use `--suite full` for comprehensive benchmarking.
Metrics are weighted and normalized to produce a 0–100 score.
Base weights (basic suite only):
| Metric | Weight | Lower is better? |
|---|---|---|
| Time | 30% | Yes |
| Errors | 20% | Yes |
| Friction | 15% | Yes |
| Tool Calls | 15% | Yes |
| Cost | 10% | Yes |
| Discoverability | 10% | No (higher = better) |
Full weights (when extended suites run and capabilities data is present):
| Metric | Weight | Description |
|---|---|---|
| Time | 25% | Seconds from API key to working sandbox |
| Errors | 20% | Errors encountered during the run |
| Friction | 15% | Manual steps or workarounds needed |
| Tool Calls | 10% | Number of API/SDK calls required |
| Cost | 10% | Provider-specific sandbox cost per run |
| Discoverability | 10% | How easy to find correct API usage (1–5) |
| Capabilities | 10% | Fraction of tested capabilities supported |
Grades: A (85–100), B (70–84), C (55–69), D (40–54), F (0–39)
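The weighting and grading above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual sandbox-bench implementation: the metric names, base weights, and grade cutoffs come from the tables, but the linear normalization against best/worst observed values is an assumption.

```python
# Illustrative scoring sketch. Weights and grade cutoffs match the tables
# above; the min-max normalization is an assumed, not actual, implementation.
BASE_WEIGHTS = {
    "time": 0.30, "errors": 0.20, "friction": 0.15,
    "tool_calls": 0.15, "cost": 0.10, "discoverability": 0.10,
}

def normalize(value, best, worst):
    """Linearly map a raw metric onto [0, 1]: best -> 1.0, worst -> 0.0.
    For lower-is-better metrics best < worst; for discoverability best > worst."""
    if best == worst:
        return 1.0
    return max(0.0, min(1.0, (value - worst) / (best - worst)))

def score(metrics, bounds):
    """metrics: raw value per metric; bounds: (best, worst) per metric."""
    return 100.0 * sum(
        w * normalize(metrics[name], *bounds[name])
        for name, w in BASE_WEIGHTS.items()
    )

def grade(s):
    """Map a 0-100 score to the letter grades listed above."""
    for cutoff, letter in ((85, "A"), (70, "B"), (55, "C"), (40, "D")):
        if s >= cutoff:
            return letter
    return "F"
```

Under this sketch, a provider that hits the best observed value on every metric scores 100 (grade A), and one at the worst observed value on every metric scores 0 (grade F).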
```bash
# Install
pip install sandbox-bench

# Or install from source
git clone https://github.com/zkwentz/sandbox-bench.git
cd sandbox-bench
pip install -e ".[e2b,daytona,modal]"
```
```bash
# List available providers and suites
sandbox-bench list
sandbox-bench suites

# Run basic suite against all providers with API keys
sandbox-bench run --all

# Run against a specific provider
sandbox-bench run -p e2b

# Run specific suites
sandbox-bench run -p e2b -s competitive -s swe

# Run everything
sandbox-bench run --all --suite full

# Output JSON results
sandbox-bench run --all -o results.json

# Control number of benchmark runs (default: 3)
sandbox-bench run --all -n 5
```

Set API keys via environment variables:
```bash
export E2B_API_KEY="..."
export DAYTONA_API_KEY="..."
export MODAL_TOKEN_ID="..."
export MODAL_TOKEN_SECRET="..."
export CODESANDBOX_API_KEY="..."
export FLY_API_TOKEN="..."
```

Or use a `.env` file:
```bash
sandbox-bench run --all --env-file .env
```

VMVM requires `vacli` on PATH, valid Meta TLS credentials, and a FaaS tenant ID. Set the tenant ID via environment variable:
```bash
export VMVM_TENANT_ID="your-tenant-id"
sandbox-bench run -p vmvm
```

For multi-suite runs with different tenants per suite, copy `.env.meta.example` to `.env.meta` and fill in your values. The helper scripts (`run-vmvm-bench.sh`, `run-vmvm-tier1.sh`) source this file automatically.
Implement the `SandboxProvider` interface:

```python
from sandbox_bench.provider import SandboxProvider, ProviderInfo, register_provider

class MyProvider(SandboxProvider):
    name = "my-provider"
    info = ProviderInfo(
        name="my-provider",
        description="My sandbox provider",
        docs_url="https://docs.example.com",
    )

    async def authenticate(self, api_key: str) -> None:
        """Connect to the provider."""
        ...

    async def create_sandbox(self, image=None, timeout_seconds=300) -> str:
        """Create a new sandbox, return its ID."""
        ...

    async def execute(self, sandbox_id, code, language="python", timeout_seconds=30):
        """Execute code, return (stdout, stderr, exit_code)."""
        ...

    async def execute_command(self, sandbox_id, command, timeout_seconds=30):
        """Execute a shell command, return (stdout, stderr, exit_code)."""
        ...

    async def write_file(self, sandbox_id, path, content) -> None:
        """Write a file to the sandbox."""
        ...

    async def read_file(self, sandbox_id, path):
        """Read a file from the sandbox."""
        ...

    async def destroy(self, sandbox_id) -> None:
        """Destroy the sandbox."""
        ...

register_provider(MyProvider)
```

`execute_command()` has a default implementation that delegates to `execute(code, language="sh")`, so you only need to override it if your provider has a more efficient shell execution path.
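To sanity-check a new implementation before running the full benchmark, you can drive it through the lifecycle manually. The sketch below is illustrative: `FakeProvider` and `smoke_test` are hypothetical names not part of sandbox-bench, though the method names and `(stdout, stderr, exit_code)` return shape match the interface above.

```python
# Hypothetical smoke test for a provider implementation. FakeProvider is an
# in-memory stand-in used here so the example is runnable; substitute your
# own provider instance in practice.
import asyncio
import contextlib
import io

class FakeProvider:
    async def authenticate(self, api_key): ...

    async def create_sandbox(self, image=None, timeout_seconds=300):
        self.files = {}  # simulate sandbox filesystem in memory
        return "sbx-1"

    async def execute(self, sandbox_id, code, language="python", timeout_seconds=30):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):  # capture print output as stdout
            exec(code, {})
        return buf.getvalue(), "", 0

    async def write_file(self, sandbox_id, path, content):
        self.files[path] = content

    async def read_file(self, sandbox_id, path):
        return self.files[path]

    async def destroy(self, sandbox_id): ...

async def smoke_test(provider, api_key="dummy"):
    """Exercise the full lifecycle: auth, create, file I/O, execute, destroy."""
    await provider.authenticate(api_key)
    sid = await provider.create_sandbox()
    try:
        await provider.write_file(sid, "/tmp/msg.txt", "hello")
        assert await provider.read_file(sid, "/tmp/msg.txt") == "hello"
        stdout, _, exit_code = await provider.execute(sid, "print(2 + 2)")
        return exit_code == 0 and stdout.strip() == "4"
    finally:
        await provider.destroy(sid)

ok = asyncio.run(smoke_test(FakeProvider()))
```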
The benchmark can run in "agent mode", where an AI agent attempts to use each provider from scratch:

```bash
sandbox-bench run --all --agent-mode --model claude-opus-4
```

This measures real-world agent experience, not just API performance.
```yaml
# .github/workflows/benchmark.yml
name: Sandbox Benchmark

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install sandbox-bench
      - run: sandbox-bench run --all --suite full --output results.json
        env:
          E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
          DAYTONA_API_KEY: ${{ secrets.DAYTONA_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.json
```

PRs welcome! Especially for:
- New provider implementations
- New test suites or phases
- Improved scoring algorithms
- Dashboard visualizations
MIT
Inspired by 2027.dev/arena — we wanted an open-source version anyone can run and extend.