Agent Arena

A reproducible harness for running Claude Code and Codex side by side in isolated Docker containers and comparing their output on the same concrete coding task.

Each model gets an identical prompt, identical pre-installed tools, and a time-boxed workspace. Results — generated code plus full conversation logs — are archived per run for later comparison.

Requirements

Requirement	Notes
Linux (recommended) or macOS / WSL2	`network_mode: host` works natively on Linux. On macOS, Docker Desktop's userspace network stack means OAuth callbacks may behave differently.
Docker Engine ≥ 24 + Docker Compose v2	`docker compose` (with a space, not `docker-compose`) must be available. Run `docker compose version` to confirm.
bash ≥ 4	Required to run the `arena` CLI. macOS ships bash 3 — upgrade via Homebrew: `brew install bash`.
rsync	Used by `arena run` to seed workspaces. Pre-installed on most Linux systems.
Python ≥ 3.10 (host)	`arena status` and `arena archive` call `scripts/arena_status.py` and `scripts/arena_report.py` from the host.
~7 GB free disk	Two Docker images are ~3 GB each (Playwright Chromium bundled).
4 GB RAM (recommended)	Both containers may run Playwright Chromium simultaneously.
Network	`network_mode: host` — both containers share the host network stack. Required for OAuth callbacks and external web access.
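
The host-side requirements above can be checked before the first build. The following is a minimal sketch, not part of the harness: the helper names (`missing_tools`, `python_ok`, `free_disk_gb`) are hypothetical, and it only checks that the binaries are on PATH, not that bash is version 4 or that Compose is v2.

```python
import shutil
import sys

# Hypothetical preflight helpers; the arena CLI does not ship these.
REQUIRED_TOOLS = ["docker", "bash", "rsync"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """True when the interpreter meets the Python >= 3.10 host requirement."""
    return tuple(version_info[:2]) >= minimum


def free_disk_gb(path="."):
    """Free space on the filesystem that holds `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print("missing tools:", missing_tools(REQUIRED_TOOLS) or "none")
    print("python >= 3.10:", python_ok())
    print(f"free disk: {free_disk_gb():.1f} GB (about 7 GB needed)")
```

Running `bash --version` and `docker compose version` by hand covers the two version checks this sketch leaves out.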

Repository layout

agent-arena/
├── arena                    # single CLI: build / up / down / login / verify / run / status / archive / report / reset / test
├── Dockerfile               # multi-stage: base → claude / codex
├── docker-compose.yml       # two parallel services
├── inputs/                  # evaluation inputs — one directory per test
│   ├── url-shortener/
│   │   └── prompt.md        # plain prompt, empty workspace
│   └── refactor-fastapi/
│       ├── prompt.md
│       └── initial_workspace/   # copied into workspace/<agent>/ before each run
│           ├── app.py
│           ├── tests/
│           ├── AGENTS.md        # agent-specific instructions (optional)
│           └── CLAUDE.md        # see caveat below
├── .env.sample              # environment variable template
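├── scripts/
│   ├── verify.sh            # in-container sanity check (mounted at /scripts/)
│   ├── arena_status.py      # called from the host by arena status / arena archive
│   └── arena_report.py      # called from the host by arena status / arena archive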
├── auth/                    # OAuth tokens — gitignored, persisted via volume
│   ├── claude/
│   └── codex/
├── workspace/               # live working directories — gitignored
│   ├── claude/
│   └── codex/
└── results/                 # archived run outputs — gitignored
    └── run-01/
        ├── claude/
        │   ├── workspace/   # files produced by Claude Code
        │   └── session.jsonl
        ├── codex/
        │   ├── workspace/   # files produced by Codex
        │   └── session.jsonl
        ├── meta.txt
        └── report.md        # markdown summary generated at archive time

Pre-installed tools (both containers)

Category	Packages / binaries
Python 3.12	`requests`, `httpx`, `curl_cffi`, `fastapi`, `uvicorn`, `playwright`, `beautifulsoup4`, `lxml`, `pytest`
Node.js 20	`npm`
Python tooling	`uv`
Browser	Playwright Chromium at `/opt/ms-playwright`
Shell utilities	`git`, `curl`, `wget`, `jq`, `ripgrep`, `vim`, `tree`

Everything is pre-installed so neither model wastes time on environment setup.

Setup

1. Copy the environment template

cp .env.sample .env

Leave the token and API-key fields blank for now; step 4 fills in whichever ones you need.

2. Build images (first time only, ~15–20 min)

./arena build

Each image is ~2–3 GB because Playwright Chromium is bundled.

3. Start containers

./arena up

Both containers idle on sleep infinity; login and task execution happen via docker exec.

4. Authenticate

Choose one method per model.

Claude Code — subscription OAuth (recommended for Pro/Max users):

./arena login claude
# 1. Visit the URL shown and sign in in your browser.
# 2. Copy the authorization code from the browser page.
# 3. Paste it at the terminal prompt.
# 4. A long-lived (1-year) token starting with `sk-ant-oat01-` is printed.
# 5. Paste that token into .env as `CLAUDE_CODE_OAUTH_TOKEN=...`
# 6. Run `./arena down && ./arena up` so the container picks up the env var.

Alternative — API key (pay-as-you-go): set ANTHROPIC_API_KEY in .env instead and restart the container.

Codex — ChatGPT OAuth (recommended for ChatGPT Plus users):

./arena login codex
# Follow the URL in the browser. Credentials are written to ./auth/codex/
# and persist across rebuilds — no manual copy/paste required.

Alternative, API key: set OPENAI_API_KEY in .env instead and restart the container, as for Claude Code.

5. Verify the environment

./arena verify

Both models should report no [FAIL] lines.

Credentials survive ./arena down, ./arena up, and ./arena build — you should not need to re-authenticate after rebuilds.

Inputs

An input is a directory under inputs/ with:

prompt.md — the user-facing prompt, identical for both models (required)
initial_workspace/ — seed files copied into workspace/<agent>/ before the run starts. Use this for fixtures, partially-written code, or existing projects that the agents should modify (optional)

Seeding is additive (rsync without --delete) — run ./arena reset first if you want a clean workspace.
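
To illustrate what "additive" means here: seed files overwrite same-named files in the workspace, but anything left over from a previous run that the seed does not contain stays in place. The sketch below reproduces that behaviour with `shutil.copytree` instead of rsync; `seed_workspace` and `demo` are hypothetical names, and this is not the code `arena run` executes.

```python
import shutil
import tempfile
from pathlib import Path


def seed_workspace(seed: Path, workspace: Path) -> None:
    """Copy `seed` into `workspace` without deleting files already there."""
    shutil.copytree(seed, workspace, dirs_exist_ok=True)


def demo() -> dict:
    """Seed a workspace that still holds files from an earlier run."""
    with tempfile.TemporaryDirectory() as tmp:
        seed, workspace = Path(tmp, "seed"), Path(tmp, "workspace")
        seed.mkdir()
        workspace.mkdir()
        (seed / "app.py").write_text("seeded\n")
        (workspace / "app.py").write_text("from a previous run\n")
        (workspace / "leftover.txt").write_text("from a previous run\n")
        seed_workspace(seed, workspace)
        # Map each file name to its content after seeding.
        return {p.name: p.read_text() for p in sorted(workspace.iterdir())}


if __name__ == "__main__":
    for name, content in demo().items():
        print(f"{name}: {content.strip()}")
```

In the demo, app.py ends up with the seeded content while leftover.txt survives, which is exactly the stale state that ./arena reset clears.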

Caveat — CLAUDE.md and AGENTS.md inside initial_workspace/. Claude Code reads CLAUDE.md and Codex reads AGENTS.md as agent-specific instructions. If the two files contain different content the agents are no longer solving the same problem, and the comparison loses its meaning. Keep them identical (or symlinked / cross-referenced, as in inputs/refactor-fastapi/) unless you deliberately want to probe per-agent instruction handling.
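
A quick way to catch accidental drift between the two instruction files is to compare their bytes before a run. This is a minimal sketch under two assumptions that the harness itself does not state: a seed with only one of the two files counts as a mismatch, and "identical" means byte-for-byte equal. A symlinked pair passes; a cross-referenced pair (one file pointing at the other in prose) does not, so treat a failure as a prompt to look, not as proof of a problem. The function name is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

INSTRUCTION_FILES = ("CLAUDE.md", "AGENTS.md")


def instruction_files_match(initial_workspace: Path) -> bool:
    """True when both files are absent, or both exist with identical bytes."""
    paths = [initial_workspace / name for name in INSTRUCTION_FILES]
    present = [p for p in paths if p.is_file()]  # is_file() follows symlinks
    if not present:
        return True  # neither agent receives extra instructions
    if len(present) != len(paths):
        return False  # assumption: one-sided instructions skew the comparison
    digests = {hashlib.sha256(p.read_bytes()).hexdigest() for p in present}
    return len(digests) == 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "CLAUDE.md").write_text("Run pytest before finishing.\n")
        (ws / "AGENTS.md").write_text("Run pytest before finishing.\n")
        print("identical files match:", instruction_files_match(ws))
        (ws / "AGENTS.md").write_text("Skip the tests.\n")
        print("diverged files match:", instruction_files_match(ws))
```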

Running an evaluation

./arena run                              # both models, default input, 2h each
./arena run --only claude                # one model only
./arena run url-shortener                # explicit input by directory name
./arena run refactor-fastapi             # input with initial_workspace seed

Model and effort overrides

By default each agent uses whatever its CLI defaults to. Override per agent:

Flag	Accepts	Passed to
`--claude-model <alias\|id>`	`opus`, `sonnet`, or a full id like `claude-opus-4-7`	`claude --model`
`--claude-effort <level>`	`low`, `medium`, `high`, `xhigh`, `max`	`claude --effort`
`--codex-model <id>`	e.g. `gpt-5-codex`	`codex exec --model`
`--codex-effort <level>`	`minimal`, `low`, `medium`, `high`	`codex exec -c model_reasoning_effort=...`

Example:

./arena run refactor-fastapi \
  --claude-model opus --claude-effort high \
  --codex-model  gpt-5-codex --codex-effort high

Selected values are recorded in results/run-N/meta.txt alongside runtime and exit code for each agent.
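
The flag table above amounts to a small mapping from arena options to per-agent CLI arguments. The sketch below spells that mapping out, using only the flags and effort levels listed in the table; the function names are hypothetical and the arena script may assemble its commands differently.

```python
# Allowed effort levels, copied from the override table.
CLAUDE_EFFORTS = ("low", "medium", "high", "xhigh", "max")
CODEX_EFFORTS = ("minimal", "low", "medium", "high")


def _check_effort(effort, allowed, agent):
    """Reject effort levels the target CLI is not documented to accept."""
    if effort is not None and effort not in allowed:
        raise ValueError(f"{agent} effort must be one of {allowed}, got {effort!r}")


def claude_cli_args(model=None, effort=None):
    """Extra arguments for `claude`; empty when nothing is overridden."""
    _check_effort(effort, CLAUDE_EFFORTS, "claude")
    args = []
    if model is not None:
        args += ["--model", model]
    if effort is not None:
        args += ["--effort", effort]
    return args


def codex_cli_args(model=None, effort=None):
    """Extra arguments for `codex exec`; empty when nothing is overridden."""
    _check_effort(effort, CODEX_EFFORTS, "codex")
    args = []
    if model is not None:
        args += ["--model", model]
    if effort is not None:
        args += ["-c", f"model_reasoning_effort={effort}"]
    return args


if __name__ == "__main__":
    print(claude_cli_args("opus", "high"))
    print(codex_cli_args("gpt-5-codex", "high"))
```

Note that the two agents accept different effort vocabularies, so a level such as max is valid for Claude Code only.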

Headless logs land in claude-run.log and codex-run.log; live in-workspace activity is in workspace/<model>/session.log.

For interactive debugging, attach to a container and drive the CLI yourself:

docker exec -it arena-claude bash
docker exec -it arena-codex  bash

Inside, run with full auto-approve against the same prompt:

claude --dangerously-skip-permissions -p "$(cat /inputs/url-shortener/prompt.md)"
codex  --dangerously-bypass-approvals-and-sandbox "$(cat /inputs/url-shortener/prompt.md)"

Monitoring a live run

Long runs (tool loops, multi-hour tasks) benefit from a health check that does not require opening two docker exec shells. ./arena status parses the session JSONL that each agent writes inside its container and prints a one-line summary per agent:

./arena status
# claude  elapsed=12m03s  events=47  tools=11  chars=3842  tokens=in:82.1k/out:4.2k  last=tool:Edit (3s ago)
# codex   elapsed=12m03s  events=119 tools=23  chars=5117  tokens=in:194.0k/out:8.8k last=tool:function_call (1s ago)

./arena status --watch          # refresh every 3s, Ctrl-C to exit
./arena status --watch 10       # refresh every 10s

./arena status --tail           # last 10 assistant text + tool calls, both agents
./arena status --tail claude    # one agent only
./arena status --tail --tail-n 30

./arena status only reads the JSONL files, so it has no effect on the running agents.
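
The counting behind a summary line like the ones above can be sketched in a few lines. The schema here is an assumption made for illustration: one JSON object per line with a top-level "type" field, where "tool_use" and "function_call" mark tool calls. The real Claude Code and Codex session formats differ from each other and from this simplification, and `summarize` is not the function in scripts/arena_status.py.

```python
import json
from typing import Iterable

# Assumed record types that count as tool calls (hypothetical schema).
TOOL_TYPES = {"tool_use", "function_call"}


def summarize(lines: Iterable[str]) -> dict:
    """Count events and tool calls in JSONL text without modifying anything."""
    events = 0
    tools = 0
    last = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A live log can end in a half-written line; skip it.
            continue
        if not isinstance(record, dict):
            continue
        events += 1
        kind = record.get("type")
        if kind in TOOL_TYPES:
            tools += 1
        last = kind
    return {"events": events, "tools": tools, "last": last}


if __name__ == "__main__":
    sample = [
        '{"type": "message", "text": "Reading app.py"}',
        '{"type": "tool_use", "name": "Edit"}',
        '{"type": "mess',  # truncated final line, as on a file still being written
    ]
    print(summarize(sample))
```

Skipping undecodable lines matters for a reader that runs while the agent is still appending to the file.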

Archiving and resetting

./arena archive   # snapshot the latest run to results/run-N/
./arena report    # print results/run-N/report.md (latest run by default)
./arena report 03 # print a specific run (accepts "03", "run-03", or a path)
./arena reset     # wipe workspace + in-container session history

results/run-N/ contains:

claude/workspace/ — files produced by Claude Code
codex/workspace/ — files produced by Codex
claude/session.jsonl, codex/session.jsonl — raw conversation logs
meta.txt — input name, per-agent model/effort/runtime/exit, CLI versions, archive + start timestamps
report.md — markdown summary generated at archive time: per-agent model, runtime, tool/token counts, and the workspace file tree. ./arena report just cats this file.
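
./arena report accepts a run as "03", "run-03", or a path. One way such an argument could be normalised is sketched below; `resolve_run` is a hypothetical helper, and the two-digit zero padding is an assumption based on the run-01 and 03 examples in this README.

```python
from pathlib import Path


def resolve_run(spec: str, results_dir: Path = Path("results")) -> Path:
    """Map "03", "run-03", or a path to the run directory it refers to."""
    if spec.isdigit():
        # Bare number: pad to two digits (assumed naming scheme).
        return results_dir / f"run-{int(spec):02d}"
    if spec.startswith("run-") and spec[4:].isdigit():
        # Directory name relative to the results folder.
        return results_dir / spec
    # Anything else is treated as a path and used as given.
    return Path(spec)


if __name__ == "__main__":
    for spec in ("03", "run-03", "results/run-03"):
        print(spec, "->", resolve_run(spec) / "report.md")
```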

Key flags

Claude Code

Flag / variable	Purpose
`--dangerously-skip-permissions`	Bypass all tool-use approval prompts (safe inside the container)
`CLAUDE_CODE_DISABLE_AUTO_MEMORY=1`	Disable auto memory (set in Dockerfile for evaluation isolation)

Codex

Flag	Purpose
`--dangerously-bypass-approvals-and-sandbox`	Bypass sandbox + approvals (sandbox blocks network by default — required for web access)
`--skip-git-repo-check`	Allow starting in a non-git directory

Notes

Log in one at a time. ./arena login (no arg) handles this automatically by running claude first, then codex.
network_mode: host — both OAuth callbacks and external web requests go through the host network stack.
Codex ChatGPT Memory is a server-side feature and is not isolated by the container. If evaluation purity matters, clear ChatGPT Memory in account settings before each run.
Playwright is pre-installed with --with-deps so system library dependencies are satisfied inside the container.

Cleanup

./arena down                    # stop and remove containers (keep images)
docker compose down --rmi all   # remove images as well
./arena reset                   # wipe workspace (keep auth tokens)

# Full reset — requires re-login
rm -rf auth/claude/* auth/codex/* workspace/claude/* workspace/codex/*

Testing

Lightweight CI (no credentials, no docker build)

Runs on every push via GitHub Actions. Also runnable locally:

bash tests/run-ci.sh

Covers:

arena bash syntax + error-path smoke tests
inputs/ directory structure validation
docker compose config (compose YAML parse, no daemon build needed)
pytest unit tests for arena_status.py and arena_report.py parsers

Full E2E test (credentials + running containers required)

Verifies the complete pipeline end-to-end: build → up → verify → run → archive → artifact check.

./arena test

This runs a trivial smoke prompt (inputs/_smoke/prompt.md) against both agents with a 10-minute timeout and checks that workspaces, session logs, and a report are all produced. Run once after first-time setup and after any structural changes to the harness.

Note: ./arena test leaves containers running and creates a results/run-N/ entry. Run ./arena down and ./arena reset afterwards if you want a clean state.
