A reproducible harness for running Claude Code and Codex side-by-side in isolated Docker containers and comparing the outputs of a concrete coding task.
Each model gets an identical prompt, identical pre-installed tools, and a time-boxed workspace. Results — generated code plus full conversation logs — are archived per run for later comparison.
| Requirement | Notes |
|---|---|
| Linux (recommended) or macOS / WSL2 | network_mode: host works natively on Linux. On macOS, Docker Desktop's userspace network stack means OAuth callbacks may behave differently. |
| Docker Engine ≥ 24 + Docker Compose v2 | docker compose (with a space, not docker-compose) must be available. Run docker compose version to confirm. |
| bash ≥ 4 | Required to run the arena CLI. macOS ships bash 3 — upgrade via Homebrew: brew install bash. |
| rsync | Used by arena run to seed workspaces. Pre-installed on most Linux systems. |
| Python ≥ 3.10 (host) | arena status and arena archive call scripts/arena_status.py and scripts/arena_report.py from the host. |
| ~7 GB free disk | Two Docker images are ~3 GB each (Playwright Chromium bundled). |
| 4 GB RAM (recommended) | Both containers may run Playwright Chromium simultaneously. |
| Network | network_mode: host — both containers share the host network stack. Required for OAuth callbacks and external web access. |
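
The requirements above can be checked before building anything. The following is a minimal sketch, not part of the harness: `preflight` and its parameters are hypothetical names, and it only checks that the tools exist on `PATH`, that the host Python is new enough, and that roughly 7 GB of disk is free. It does not verify the Docker or bash versions.

```python
import shutil
import sys

# Tools the requirements table lists as needed on the host.
REQUIRED_TOOLS = ("docker", "bash", "rsync")
MIN_PYTHON = (3, 10)
MIN_FREE_GB = 7  # "~7 GB free disk" from the table


def preflight(path=".", which=shutil.which, free_bytes=None, python_version=None):
    """Return a list of human-readable problems; an empty list means OK.

    `which`, `free_bytes`, and `python_version` are injectable so the check
    can be exercised without touching the real host.
    """
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"missing tool: {tool}")
    version = tuple(python_version or sys.version_info[:2])
    if version < MIN_PYTHON:
        problems.append(f"python {version[0]}.{version[1]} is older than 3.10")
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    if free_bytes < MIN_FREE_GB * 1024**3:
        problems.append(f"less than {MIN_FREE_GB} GB free at {path}")
    return problems
```

Calling `preflight()` with no arguments inspects the current host; an empty list means none of these particular checks failed.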
agent-arena/
├── arena # single CLI: build / up / verify / run / archive / reset
├── Dockerfile # multi-stage: base → claude / codex
├── docker-compose.yml # two parallel services
├── inputs/ # evaluation inputs — one directory per test
│ ├── url-shortener/
│ │ └── prompt.md # plain prompt, empty workspace
│ └── refactor-fastapi/
│ ├── prompt.md
│ └── initial_workspace/ # copied into workspace/<agent>/ before each run
│ ├── app.py
│ ├── tests/
│ ├── AGENTS.md # agent-specific instructions (optional)
│ └── CLAUDE.md # see caveat below
├── .env.sample # environment variable template
├── scripts/
│ └── verify.sh # in-container sanity check (mounted at /scripts/)
├── auth/ # OAuth tokens — gitignored, persisted via volume
│ ├── claude/
│ └── codex/
├── workspace/ # live working directories — gitignored
│ ├── claude/
│ └── codex/
└── results/ # archived run outputs — gitignored
    └── run-01/
        ├── claude/
        │   ├── workspace/ # files produced by Claude Code
        │   └── session.jsonl
        ├── codex/
        │   ├── workspace/ # files produced by Codex
        │   └── session.jsonl
        └── meta.txt
| Category | Packages / binaries |
|---|---|
| Python 3.12 | requests, httpx, curl_cffi, fastapi, uvicorn, playwright, beautifulsoup4, lxml, pytest |
| Node.js 20 | npm |
| Python tooling | uv |
| Browser | Playwright Chromium at /opt/ms-playwright |
| Shell utilities | git, curl, wget, jq, ripgrep, vim, tree |
Everything is pre-installed so neither model wastes time on environment setup.
cp .env.sample .env

Leave both token/key fields blank for now; the next steps will populate them.

./arena build

Images are ~2–3 GB because Playwright Chromium is bundled.

./arena up

Both containers idle on sleep infinity. Login and task execution happen via exec.
Choose one method per model.
Claude Code — subscription OAuth (recommended for Pro/Max users):
./arena login claude
# 1. Visit the URL shown and sign in in your browser.
# 2. Copy the authorization code from the browser page.
# 3. Paste it at the terminal prompt.
# 4. A long-lived (1-year) token starting with `sk-ant-oat01-` is printed.
# 5. Paste that token into .env as `CLAUDE_CODE_OAUTH_TOKEN=...`
# 6. Run `./arena down && ./arena up` so the container picks up the env var.

Alternative (API key, pay-as-you-go): set ANTHROPIC_API_KEY in .env instead and restart the container.
Codex — ChatGPT OAuth (recommended for ChatGPT Plus users):
./arena login codex
# Follow the URL in the browser. Credentials are written to ./auth/codex/
# and persist across rebuilds; no manual copy/paste required.

Alternative (API key): set OPENAI_API_KEY in .env instead.

./arena verify

Both models should report no [FAIL] lines.
Credentials survive ./arena down, ./arena up, and ./arena build — you
should not need to re-authenticate after rebuilds.
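
To see at a glance which Claude Code method a given `.env` selects, a small parser is enough. This is a sketch under assumptions: `parse_env` and `claude_auth_method` are hypothetical helpers, the file is assumed to use plain `KEY=VALUE` lines, and preferring the OAuth token when both variables are set is a choice made here for illustration, not documented behavior of the CLI.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def claude_auth_method(env):
    """Return "oauth", "api-key", or None for a parsed .env mapping."""
    # Assumption: the OAuth token wins when both fields are filled in.
    if env.get("CLAUDE_CODE_OAUTH_TOKEN"):
        return "oauth"
    if env.get("ANTHROPIC_API_KEY"):
        return "api-key"
    return None
```

A blank field left over from `.env.sample` parses as an empty string and therefore counts as not configured.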
An input is a directory under inputs/ with:
- `prompt.md`: the user-facing prompt, identical for both models (required)
- `initial_workspace/`: seed files copied into `workspace/<agent>/` before the run starts. Use this for fixtures, partially-written code, or existing projects that the agents should modify (optional)
Seeding is additive (rsync without --delete) — run ./arena reset first
if you want a clean workspace.
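
The additive seeding described above can be illustrated in a few lines. This is a sketch that approximates the behavior with `shutil.copytree` rather than rsync; `seed_workspace` is a hypothetical name and is not how the arena CLI is implemented.

```python
import shutil
from pathlib import Path


def seed_workspace(input_dir, workspace_dir):
    """Copy initial_workspace/ into workspace_dir without deleting anything.

    Approximates rsync without --delete: seed files overwrite same-named
    files, and unrelated files already in the workspace are left in place.
    Returns the sorted list of files present afterwards.
    """
    input_dir, workspace_dir = Path(input_dir), Path(workspace_dir)
    if not (input_dir / "prompt.md").is_file():
        raise FileNotFoundError(f"{input_dir} has no prompt.md (required)")
    workspace_dir.mkdir(parents=True, exist_ok=True)
    seed = input_dir / "initial_workspace"
    if seed.is_dir():  # optional: inputs without a seed start from an empty workspace
        shutil.copytree(seed, workspace_dir, dirs_exist_ok=True)
    return sorted(p.relative_to(workspace_dir).as_posix()
                  for p in workspace_dir.rglob("*") if p.is_file())
```

A file left over from an earlier run survives the call, which is why a reset is needed when a clean workspace matters.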
Caveat — CLAUDE.md and AGENTS.md inside initial_workspace/.
Claude Code reads CLAUDE.md and Codex reads AGENTS.md as
agent-specific instructions. If the two files contain different content the
agents are no longer solving the same problem, and the comparison loses its
meaning. Keep them identical (or symlinked / cross-referenced, as in
inputs/refactor-fastapi/) unless you deliberately want to probe per-agent
instruction handling.
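
A quick check can catch accidental drift between the two instruction files. The sketch below uses a hypothetical helper, `instruction_files_match`, that compares the files byte for byte. Symlinked files compare equal because the link is followed, but a file that merely cross-references the other will be reported as different, so treat a `False` result as a prompt to look rather than as proof of a problem.

```python
from pathlib import Path


def instruction_files_match(initial_workspace):
    """True when CLAUDE.md and AGENTS.md give both agents the same text.

    Both files missing counts as a match; exactly one missing does not.
    """
    root = Path(initial_workspace)
    claude, agents = root / "CLAUDE.md", root / "AGENTS.md"
    if not claude.exists() and not agents.exists():
        return True
    if claude.exists() != agents.exists():
        return False
    return claude.read_bytes() == agents.read_bytes()
```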
./arena run # both models, default input, 2h each
./arena run --only claude # one model only
./arena run url-shortener # explicit input by directory name
./arena run refactor-fastapi # input with initial_workspace seed

By default each agent uses whatever its CLI defaults to. Override per agent:

| Flag | Accepts | Passed to |
|---|---|---|
| `--claude-model <alias\|id>` | `opus`, `sonnet`, or a full id like `claude-opus-4-7` | `claude --model` |
| `--claude-effort <level>` | `low`, `medium`, `high`, `xhigh`, `max` | `claude --effort` |
| `--codex-model <id>` | e.g. `gpt-5-codex` | `codex exec --model` |
| `--codex-effort <level>` | `minimal`, `low`, `medium`, `high` | `codex exec -c model_reasoning_effort=...` |

Example:
./arena run refactor-fastapi \
--claude-model opus --claude-effort high \
--codex-model gpt-5-codex --codex-effort high

Selected values are recorded in results/run-N/meta.txt alongside runtime and exit code for each agent.
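
The mapping from arena overrides to each agent's own CLI arguments can be sketched as follows. This illustrates the flag table only and is not the arena script's actual code: `agent_model_args` is a hypothetical helper, and the allowed effort levels are copied from the table rather than queried from either CLI.

```python
# Effort levels copied from the flag table; anything else is rejected early.
CLAUDE_EFFORTS = ("low", "medium", "high", "xhigh", "max")
CODEX_EFFORTS = ("minimal", "low", "medium", "high")


def agent_model_args(agent, model=None, effort=None):
    """Translate arena-style overrides into extra CLI arguments for one agent.

    Returns an empty list when nothing is overridden, so the agent CLI
    falls back to its own defaults.
    """
    if agent == "claude":
        allowed = CLAUDE_EFFORTS
        effort_args = lambda e: ["--effort", e]
    elif agent == "codex":
        allowed = CODEX_EFFORTS
        effort_args = lambda e: ["-c", f"model_reasoning_effort={e}"]
    else:
        raise ValueError(f"unknown agent: {agent}")
    args = []
    if model:
        args += ["--model", model]
    if effort:
        if effort not in allowed:
            raise ValueError(f"{agent} effort must be one of {', '.join(allowed)}")
        args += effort_args(effort)
    return args
```

Validating the level before launching anything means a typo fails immediately instead of partway into a multi-hour run.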
Headless logs land in claude-run.log and codex-run.log; live in-workspace
activity is in workspace/<model>/session.log.
For interactive debugging, attach to a container and drive the CLI yourself:
docker exec -it arena-claude bash
docker exec -it arena-codex bash

Inside, run with full auto-approve against the same prompt:
claude --dangerously-skip-permissions -p "$(cat /inputs/url-shortener/prompt.md)"
codex --dangerously-bypass-approvals-and-sandbox "$(cat /inputs/url-shortener/prompt.md)"

Long runs (tool loops, multi-hour tasks) benefit from a health check without opening two docker exec shells. ./arena status parses the session JSONL that each agent writes inside its container and prints a one-line summary:
./arena status
# claude elapsed=12m03s events=47 tools=11 chars=3842 tokens=in:82.1k/out:4.2k last=tool:Edit (3s ago)
# codex elapsed=12m03s events=119 tools=23 chars=5117 tokens=in:194.0k/out:8.8k last=tool:function_call (1s ago)
./arena status --watch # refresh every 3s, Ctrl-C to exit
./arena status --watch 10 # refresh every 10s
./arena status --tail # last 10 assistant text + tool calls, both agents
./arena status --tail claude # one agent only
./arena status --tail --tail-n 30

The reader is read-only and touches only the JSONL files; it has no effect on the running agents.
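
For a sense of what a read-only JSONL summary involves, here is a minimal sketch. It is not the logic of `scripts/arena_status.py`: the `type` field and the shape of each record are assumptions, since the two agents' session formats are not described here, and `summarize_jsonl` is a hypothetical name.

```python
import json
from collections import Counter


def summarize_jsonl(lines, type_key="type"):
    """Count parseable events and tally them by an assumed type field."""
    events, skipped = 0, 0
    by_type = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # e.g. a half-written last line while the agent is running
            continue
        events += 1
        if isinstance(record, dict):
            by_type[str(record.get(type_key, "unknown"))] += 1
    return {"events": events, "skipped": skipped, "by_type": dict(by_type)}
```

Tolerating an unparseable trailing line matters when the file is read while the agent is still appending to it.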
./arena archive # snapshot the latest run to results/run-N/
./arena report # print results/run-N/report.md (latest run by default)
./arena report 03 # print a specific run (accepts "03", "run-03", or a path)
./arena reset # wipe workspace + in-container session history

results/run-N/ contains:
- `claude/workspace/`: files produced by Claude Code
- `codex/workspace/`: files produced by Codex
- `claude/session.jsonl`, `codex/session.jsonl`: raw conversation logs
- `meta.txt`: input name, per-agent model/effort/runtime/exit, CLI versions, archive + start timestamps
- `report.md`: markdown summary generated at archive time, covering per-agent model, runtime, tool/token counts, and the workspace file tree. `./arena report` just `cat`s this file.
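
The three accepted forms of a run reference ("03", "run-03", or a path) could be normalized along these lines. This is a sketch with a hypothetical helper, `resolve_run`; the two-digit zero padding is inferred from the `run-01` and `03` examples and is an assumption.

```python
from pathlib import Path


def resolve_run(ref, results_dir="results"):
    """Map "03", "run-03", or a path to a run directory path.

    The returned path is not checked for existence.
    """
    text = str(ref)
    if text.isdigit():
        # Bare number: pad to two digits (assumed naming scheme).
        return Path(results_dir) / f"run-{int(text):02d}"
    if text.startswith("run-") and "/" not in text:
        return Path(results_dir) / text
    return Path(text)  # anything else is treated as an explicit path
```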
Claude Code
| Flag | Purpose |
|---|---|
| `--dangerously-skip-permissions` | Bypass all tool-use approval prompts (safe inside the container) |
| `CLAUDE_CODE_DISABLE_AUTO_MEMORY=1` | Disable auto memory (set in Dockerfile for evaluation isolation) |

Codex
| Flag | Purpose |
|---|---|
| `--dangerously-bypass-approvals-and-sandbox` | Bypass sandbox and approvals (the sandbox blocks network access by default, so this is required for web access) |
| `--skip-git-repo-check` | Allow starting in a non-git directory |

- Log in one at a time. `./arena login` (no arg) handles this automatically by running claude first, then codex.
- `network_mode: host`: both OAuth callbacks and external web requests go through the host network stack.
- Codex ChatGPT Memory is a server-side feature and is not isolated by the container. If evaluation purity matters, clear ChatGPT Memory in account settings before each run.
- Playwright is pre-installed with `--with-deps` so system library dependencies are satisfied inside the container.
./arena down # stop and remove containers (keep images)
docker compose down --rmi all # remove images as well
./arena reset # wipe workspace (keep auth tokens)
# Full reset — requires re-login
rm -rf auth/claude/* auth/codex/* workspace/claude/* workspace/codex/*

The CI suite runs on every push via GitHub Actions. It can also be run locally:

bash tests/run-ci.sh

Covers:
- `arena` bash syntax + error-path smoke tests
- `inputs/` directory structure validation
- `docker compose config` (compose YAML parse, no daemon build needed)
- pytest unit tests for `arena_status.py` and `arena_report.py` parsers
Verifies the complete pipeline end-to-end: build → up → verify → run → archive → artifact check.
./arena test

This runs a trivial smoke prompt (inputs/_smoke/prompt.md) against both agents with a 10-minute timeout and checks that workspaces, session logs, and a report are all produced. Run once after first-time setup and after any structural changes to the harness.

Note: `./arena test` leaves containers running and creates a `results/run-N/` entry. Run `./arena down` and `./arena reset` afterwards if you want a clean state.
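
The artifact check at the end of the pipeline amounts to confirming that an archived run contains everything it should. A minimal sketch, assuming the layout of `results/run-N/` described earlier; `missing_artifacts` is a hypothetical helper and only tests for existence, not for non-empty or well-formed files.

```python
from pathlib import Path

# Paths an archived run is expected to contain, relative to results/run-N/.
EXPECTED = (
    "claude/workspace",
    "codex/workspace",
    "claude/session.jsonl",
    "codex/session.jsonl",
    "meta.txt",
    "report.md",
)


def missing_artifacts(run_dir):
    """Return the expected paths that do not exist under run_dir, in order."""
    run_dir = Path(run_dir)
    return [rel for rel in EXPECTED if not (run_dir / rel).exists()]
```

An empty list means the run has the expected shape; any entries name what to investigate first.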