myjung/agent-arena

Agent Arena

A reproducible harness for running Claude Code and Codex side-by-side in isolated Docker containers and comparing the outputs of a concrete coding task.

Each model gets an identical prompt, identical pre-installed tools, and a time-boxed workspace. Results — generated code plus full conversation logs — are archived per run for later comparison.

Requirements

| Requirement | Notes |
| --- | --- |
| Linux (recommended) or macOS / WSL2 | network_mode: host works natively on Linux. On macOS, Docker Desktop's userspace network stack means OAuth callbacks may behave differently. |
| Docker Engine ≥ 24 + Docker Compose v2 | docker compose (with a space, not docker-compose) must be available. Run docker compose version to confirm. |
| bash ≥ 4 | Required to run the arena CLI. macOS ships bash 3; upgrade via Homebrew: brew install bash. |
| rsync | Used by arena run to seed workspaces. Pre-installed on most Linux systems. |
| Python ≥ 3.10 (host) | arena status and arena archive call scripts/arena_status.py and scripts/arena_report.py from the host. |
| ~7 GB free disk | Two Docker images are ~3 GB each (Playwright Chromium bundled). |
| 4 GB RAM (recommended) | Both containers may run Playwright Chromium simultaneously. |
| Network | network_mode: host, so both containers share the host network stack. Required for OAuth callbacks and external web access. |

Repository layout

agent-arena/
├── arena                    # single CLI: build / up / login / verify / run / status / archive / report / reset / down / test
├── Dockerfile               # multi-stage: base → claude / codex
├── docker-compose.yml       # two parallel services
├── inputs/                  # evaluation inputs — one directory per test
│   ├── url-shortener/
│   │   └── prompt.md        # plain prompt, empty workspace
│   └── refactor-fastapi/
│       ├── prompt.md
│       └── initial_workspace/   # copied into workspace/<agent>/ before each run
│           ├── app.py
│           ├── tests/
│           ├── AGENTS.md        # agent-specific instructions (optional)
│           └── CLAUDE.md        # see caveat below
├── .env.sample              # environment variable template
├── scripts/
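│   ├── verify.sh            # in-container sanity check (mounted at /scripts/)
│   ├── arena_status.py      # called from the host by arena status / arena archive
│   └── arena_report.py      # called from the host by arena status / arena archive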
├── auth/                    # OAuth tokens — gitignored, persisted via volume
│   ├── claude/
│   └── codex/
├── workspace/               # live working directories — gitignored
│   ├── claude/
│   └── codex/
└── results/                 # archived run outputs — gitignored
    └── run-01/
        ├── claude/
        │   ├── workspace/   # files produced by Claude Code
        │   └── session.jsonl
        ├── codex/
        │   ├── workspace/   # files produced by Codex
        │   └── session.jsonl
        ├── meta.txt
        └── report.md        # markdown summary generated at archive time

Pre-installed tools (both containers)

| Category | Packages / binaries |
| --- | --- |
| Python 3.12 | requests, httpx, curl_cffi, fastapi, uvicorn, playwright, beautifulsoup4, lxml, pytest |
| Node.js 20 | npm |
| Python tooling | uv |
| Browser | Playwright Chromium at /opt/ms-playwright |
| Shell utilities | git, curl, wget, jq, ripgrep, vim, tree |

Everything is pre-installed so neither model wastes time on environment setup.

Setup

1. Copy the environment template

cp .env.sample .env

Leave both token/key fields blank for now — the next steps will populate them.

2. Build images (first time only, ~15–20 min)

./arena build

Each image is ~2–3 GB because Playwright Chromium is bundled.

3. Start containers

./arena up

Both containers idle on sleep infinity; login and task execution happen via docker exec.

4. Authenticate

Choose one method per model.

Claude Code — subscription OAuth (recommended for Pro/Max users):

./arena login claude
# 1. Visit the URL shown and sign in in your browser.
# 2. Copy the authorization code from the browser page.
# 3. Paste it at the terminal prompt.
# 4. A long-lived (1-year) token starting with `sk-ant-oat01-` is printed.
# 5. Paste that token into .env as `CLAUDE_CODE_OAUTH_TOKEN=...`
# 6. Run `./arena down && ./arena up` so the container picks up the env var.

Alternative — API key (pay-as-you-go): set ANTHROPIC_API_KEY in .env instead and restart the container.

Codex — ChatGPT OAuth (recommended for ChatGPT Plus users):

./arena login codex
# Follow the URL in the browser. Credentials are written to ./auth/codex/
# and persist across rebuilds — no manual copy/paste required.

Alternative — API key: set OPENAI_API_KEY in .env instead.
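
To see at a glance which credential variables a given .env actually sets, something like the sketch below may help. It is an illustration only: `parse_env` and `credentials_set` are hypothetical helpers, the parser handles just plain KEY=VALUE lines, and the token value in the sample is a placeholder. An empty list for codex is expected when ChatGPT OAuth is used, since those credentials live in ./auth/codex/ rather than in .env.

```python
# Variable names come from the setup steps above; the helpers are hypothetical.
CREDENTIAL_VARS = {
    "claude": ("CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"),
    "codex": ("OPENAI_API_KEY",),
}


def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def credentials_set(env):
    """Return, per agent, the credential variables that have a non-empty value."""
    return {
        agent: [key for key in keys if env.get(key)]
        for agent, keys in CREDENTIAL_VARS.items()
    }


sample = """\
# placeholder values for illustration
CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-placeholder
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
"""
print(credentials_set(parse_env(sample)))
# {'claude': ['CLAUDE_CODE_OAUTH_TOKEN'], 'codex': []}
```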

5. Verify the environment

./arena verify

The check passes when neither container's output contains a [FAIL] line.

Credentials survive ./arena down, ./arena up, and ./arena build — you should not need to re-authenticate after rebuilds.

Inputs

An input is a directory under inputs/ with:

  • prompt.md — the user-facing prompt, identical for both models (required)
  • initial_workspace/ — seed files copied into workspace/<agent>/ before the run starts. Use this for fixtures, partially-written code, or existing projects that the agents should modify (optional)

Seeding is additive (rsync without --delete) — run ./arena reset first if you want a clean workspace.
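
The additive behaviour is easy to misread, so here is a minimal sketch of the same idea in Python. It is not how arena seeds workspaces (that is done with rsync); `seed_workspace` is a hypothetical helper that approximates rsync without --delete using `shutil.copytree(..., dirs_exist_ok=True)`: seed files overwrite same-named files, and anything else already in the workspace is left in place.

```python
import shutil
import tempfile
from pathlib import Path


def seed_workspace(input_dir, workspace_dir):
    """Validate an input directory and additively copy its seed files.

    Hypothetical helper: returns the sorted relative paths of all files
    in the workspace after seeding.
    """
    input_dir, workspace_dir = Path(input_dir), Path(workspace_dir)
    if not (input_dir / "prompt.md").is_file():
        raise FileNotFoundError(f"{input_dir} has no prompt.md")
    workspace_dir.mkdir(parents=True, exist_ok=True)
    seed = input_dir / "initial_workspace"
    if seed.is_dir():  # optional directory
        # Overwrites same-named files, never deletes extras.
        shutil.copytree(seed, workspace_dir, dirs_exist_ok=True)
    return sorted(
        p.relative_to(workspace_dir).as_posix()
        for p in workspace_dir.rglob("*")
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    inp = Path(tmp, "inputs", "demo")
    (inp / "initial_workspace").mkdir(parents=True)
    (inp / "prompt.md").write_text("Refactor app.py")
    (inp / "initial_workspace" / "app.py").write_text("print('seed')")
    ws = Path(tmp, "workspace", "claude")
    ws.mkdir(parents=True)
    (ws / "stale.txt").write_text("left over from an earlier run")
    print(seed_workspace(inp, ws))  # ['app.py', 'stale.txt']
```

The leftover stale.txt in the output is the reason to run ./arena reset before a run that needs a clean workspace.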

Caveat — CLAUDE.md and AGENTS.md inside initial_workspace/. Claude Code reads CLAUDE.md and Codex reads AGENTS.md as agent-specific instructions. If the two files contain different content the agents are no longer solving the same problem, and the comparison loses its meaning. Keep them identical (or symlinked / cross-referenced, as in inputs/refactor-fastapi/) unless you deliberately want to probe per-agent instruction handling.
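
A quick way to guard against accidental drift between the two files is a byte-for-byte comparison before a run. The sketch below is an assumption about how one might check this, not something the harness does; `instruction_files_match` is a hypothetical helper. Symlinked files pass because reads follow the link, but a cross-referencing setup (one file that merely points at the other) will be reported as a mismatch and needs a manual look.

```python
from pathlib import Path


def instruction_files_match(initial_workspace):
    """True if CLAUDE.md and AGENTS.md are both absent, or both present and identical."""
    root = Path(initial_workspace)
    claude, agents = root / "CLAUDE.md", root / "AGENTS.md"
    if not claude.exists() and not agents.exists():
        return True  # neither agent gets extra instructions
    if claude.exists() != agents.exists():
        return False  # only one agent gets extra instructions
    return claude.read_bytes() == agents.read_bytes()
```

For example, `instruction_files_match("inputs/refactor-fastapi/initial_workspace")` could be run before `./arena run refactor-fastapi`.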

Running an evaluation

./arena run                              # both models, default input, 2h each
./arena run --only claude                # one model only
./arena run url-shortener                # explicit input by directory name
./arena run refactor-fastapi             # input with initial_workspace seed

Model and effort overrides

By default each agent uses whatever its CLI defaults to. Override per agent:

| Flag | Accepts | Passed to |
| --- | --- | --- |
| --claude-model <alias\|id> | opus, sonnet, or a full id like claude-opus-4-7 | claude --model |
| --claude-effort <level> | low, medium, high, xhigh, max | claude --effort |
| --codex-model <id> | e.g. gpt-5-codex | codex exec --model |
| --codex-effort <level> | minimal, low, medium, high | codex exec -c model_reasoning_effort=... |

Example:

./arena run refactor-fastapi \
  --claude-model opus --claude-effort high \
  --codex-model  gpt-5-codex --codex-effort high

Selected values are recorded in results/run-N/meta.txt alongside runtime and exit code for each agent.
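
The mapping in the table above, from arena flags to the arguments each agent CLI receives, can be sketched as follows. This is an illustration under the assumption that unset overrides add no arguments at all; `agent_overrides` is a hypothetical function, not the code inside the arena script.

```python
def agent_overrides(claude_model=None, claude_effort=None,
                    codex_model=None, codex_effort=None):
    """Map arena override values to the extra CLI arguments listed in the table."""
    claude, codex = [], []
    if claude_model:
        claude += ["--model", claude_model]
    if claude_effort:
        claude += ["--effort", claude_effort]
    if codex_model:
        codex += ["--model", codex_model]
    if codex_effort:
        # Codex takes effort as a config override rather than a dedicated flag.
        codex += ["-c", f"model_reasoning_effort={codex_effort}"]
    return {"claude": claude, "codex": codex}


print(agent_overrides(claude_model="opus", codex_effort="high"))
# {'claude': ['--model', 'opus'], 'codex': ['-c', 'model_reasoning_effort=high']}
```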

Headless logs land in claude-run.log and codex-run.log; live in-workspace activity is in workspace/<agent>/session.log.

For interactive debugging, attach to a container and drive the CLI yourself:

docker exec -it arena-claude bash
docker exec -it arena-codex  bash

Inside, run with full auto-approve against the same prompt:

claude --dangerously-skip-permissions -p "$(cat /inputs/url-shortener/prompt.md)"
codex  --dangerously-bypass-approvals-and-sandbox "$(cat /inputs/url-shortener/prompt.md)"

Monitoring a live run

Long runs (tool loops, multi-hour tasks) benefit from a health check without opening two docker exec shells. ./arena status parses the session JSONL that each agent writes inside its container and prints a one-line summary:

./arena status
# claude  elapsed=12m03s  events=47  tools=11  chars=3842  tokens=in:82.1k/out:4.2k  last=tool:Edit (3s ago)
# codex   elapsed=12m03s  events=119 tools=23  chars=5117  tokens=in:194.0k/out:8.8k last=tool:function_call (1s ago)

./arena status --watch          # refresh every 3s, Ctrl-C to exit
./arena status --watch 10       # refresh every 10s

./arena status --tail           # last 10 assistant text + tool calls, both agents
./arena status --tail claude    # one agent only
./arena status --tail --tail-n 30

The reader is read-only and touches only the JSONL files; it has no effect on the running agents.
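
For a sense of what such a read-only JSONL pass looks like, here is a minimal sketch. It is not scripts/arena_status.py: the event schema is an assumption (a `type` field with a `tool_use` value is hypothetical, and the two agents' real logs differ from each other), and `summarize` is a made-up name. What it does illustrate is tolerating a half-written final line, which is normal while an agent is still appending to the file.

```python
import json


def summarize(lines):
    """Count events and tool calls in an iterable of JSONL lines."""
    events = tools = 0
    last = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # partially written line; skip it and move on
        if not isinstance(event, dict):
            continue
        events += 1
        kind = event.get("type")  # hypothetical field name
        if kind == "tool_use":    # hypothetical value
            tools += 1
        last = kind
    return {"events": events, "tools": tools, "last": last}


sample = [
    '{"type": "text"}',
    '{"type": "tool_use"}',
    '{"type": "tool_',  # truncated mid-write
]
print(summarize(sample))  # {'events': 2, 'tools': 1, 'last': 'tool_use'}
```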

Archiving and resetting

./arena archive   # snapshot the latest run to results/run-N/
./arena report    # print results/run-N/report.md (latest run by default)
./arena report 03 # print a specific run (accepts "03", "run-03", or a path)
./arena reset     # wipe workspace + in-container session history

results/run-N/ contains:

  • claude/workspace/ — files produced by Claude Code
  • codex/workspace/ — files produced by Codex
  • claude/session.jsonl, codex/session.jsonl — raw conversation logs
  • meta.txt — input name, per-agent model/effort/runtime/exit, CLI versions, archive + start timestamps
  • report.md — markdown summary generated at archive time: per-agent model, runtime, tool/token counts, and the workspace file tree. ./arena report just cats this file.
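
The three argument forms that ./arena report accepts ("03", "run-03", or a path) could be normalized roughly as below. This is a sketch under assumptions: `resolve_run` is hypothetical, and two-digit zero padding is inferred from the run-01 directory name in the repository layout rather than confirmed from the arena script.

```python
from pathlib import Path


def resolve_run(spec, results_dir="results"):
    """Turn "03", "run-03", or a path into a run directory path."""
    results = Path(results_dir)
    if spec.isdigit():
        # Assumed naming scheme: run-NN with two-digit zero padding.
        return results / f"run-{int(spec):02d}"
    if spec.startswith("run-") and "/" not in spec:
        return results / spec
    return Path(spec)  # anything else is treated as a path


for spec in ("03", "run-03", "results/run-03"):
    print(resolve_run(spec))
```

All three calls resolve to the same directory, results/run-03.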

Key flags

Claude Code

| Flag / env var | Purpose |
| --- | --- |
| --dangerously-skip-permissions | Bypass all tool-use approval prompts (safe inside the container) |
| CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 | Environment variable that disables auto memory (set in the Dockerfile for evaluation isolation) |

Codex

| Flag | Purpose |
| --- | --- |
| --dangerously-bypass-approvals-and-sandbox | Bypass sandbox + approvals (the sandbox blocks network access by default, so this is required for web access) |
| --skip-git-repo-check | Allow starting in a non-git directory |

Notes

  1. Log in one at a time. ./arena login (no arg) handles this automatically by running claude first, then codex.
  2. network_mode: host — both OAuth callbacks and external web requests go through the host network stack.
  3. Codex ChatGPT Memory is a server-side feature and is not isolated by the container. If evaluation purity matters, clear ChatGPT Memory in account settings before each run.
  4. Playwright is pre-installed with --with-deps so system library dependencies are satisfied inside the container.

Cleanup

./arena down                    # stop and remove containers (keep images)
docker compose down --rmi all   # remove images as well
./arena reset                   # wipe workspace (keep auth tokens)

# Full reset — requires re-login
rm -rf auth/claude/* auth/codex/* workspace/claude/* workspace/codex/*

Testing

Lightweight CI (no credentials, no docker build)

Runs on every push via GitHub Actions. Also runnable locally:

bash tests/run-ci.sh

Covers:

  • arena bash syntax + error-path smoke tests
  • inputs/ directory structure validation
  • docker compose config (compose YAML parse, no daemon build needed)
  • pytest unit tests for arena_status.py and arena_report.py parsers

Full E2E test (credentials + running containers required)

Verifies the complete pipeline end-to-end: build → up → verify → run → archive → artifact check.

./arena test

This runs a trivial smoke prompt (inputs/_smoke/prompt.md) against both agents with a 10-minute timeout and checks that workspaces, session logs, and a report are all produced. Run once after first-time setup and after any structural changes to the harness.

Note: ./arena test leaves containers running and creates a results/run-N/ entry. Run ./arena down and ./arena reset afterwards if you want a clean state.
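
The artifact check at the end of the E2E test amounts to confirming that an archived run has the expected files. A minimal sketch of that idea, assuming the results/run-N/ layout described under "Archiving and resetting"; `missing_artifacts` is a hypothetical helper, not the check that ./arena test runs.

```python
from pathlib import Path

# Relative paths taken from the results/run-N/ description above.
EXPECTED = [
    "claude/workspace",
    "claude/session.jsonl",
    "codex/workspace",
    "codex/session.jsonl",
    "meta.txt",
    "report.md",
]


def missing_artifacts(run_dir):
    """Return the expected artifacts that do not exist under run_dir."""
    run = Path(run_dir)
    return [rel for rel in EXPECTED if not (run / rel).exists()]
```

An empty list from `missing_artifacts("results/run-01")` would mean every expected artifact is present; it says nothing about whether the agents' output is correct.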
