Multi-agent build (codex|claude) + auto-resume + agent-config sync by nprodromou · Pull Request #3 · nprodromou/codex-shell

nprodromou · 2026-05-07T05:39:56Z

Summary

Genericizes `codex-shell` into a multi-agent shell image. One Dockerfile, build matrix on `AGENT=codex|claude`, two image tags published per main push.

Dockerfile

`ARG AGENT` validated at build time (codex or claude)
Per-agent npm install — `@openai/codex` vs `@anthropic-ai/claude-code`
Non-root user named after AGENT (uid/gid 1000), `HOME=/home/${AGENT}`
`npm install -g npm@latest` after Node install — NodeSource lags upstream
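
The build-time selection above can be modeled in a few lines. This is a minimal sketch in Python, not the Dockerfile itself: the function name `resolve_agent` and the returned field names are hypothetical, while the package names, uid/gid, and home path are taken from the list above.

```python
# Illustrative model of the per-agent build parameters; not the actual Dockerfile.
AGENT_PACKAGES = {
    "codex": "@openai/codex",
    "claude": "@anthropic-ai/claude-code",
}


def resolve_agent(agent):
    """Validate AGENT and derive the values the image build depends on."""
    if agent not in AGENT_PACKAGES:
        # Mirrors the build-time validation: any other value fails the build.
        raise ValueError(
            f"AGENT must be one of {sorted(AGENT_PACKAGES)}, got {agent!r}"
        )
    return {
        "npm_package": AGENT_PACKAGES[agent],  # per-agent npm install
        "user": agent,                         # non-root user named after AGENT
        "uid": 1000,
        "gid": 1000,
        "home": f"/home/{agent}",
    }
```

Keeping the mapping in one table means a third agent would be a one-line addition, which is presumably the point of validating `AGENT` against a closed set.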

Entrypoint

Per-agent env-var contract: codex needs `GH_TOKEN` + optional `CODEX_SESSION`; claude needs only `GH_TOKEN` (interactive `/login` on first connect, persists on PVC)
Auto-resume on connect: `codex resume --last` / `claude --continue`. Falls through to fresh agent if no session exists, then to bash if the agent exits.
Pulls `nprodromou/agent-config` on every boot and symlinks `instructions/CLAUDE.md` into the agent's expected path (`~/.codex/AGENTS.md` or `~/.claude/CLAUDE.md`), so instruction updates reach the pod on restart with no image rebuild.
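
The connect-time fallthrough described above (resume, then fresh agent, then bash) can be sketched as follows. This is a Python model rather than the real `entrypoint.sh`. The resume commands are quoted from this PR; the bare `codex` / `claude` commands for a fresh run, the helper name `run_session`, and the idea that a nonzero exit from the resume command signals "no session" are all assumptions made for illustration.

```python
import subprocess

# Resume commands come from the PR text; the "fresh" commands are assumed.
RESUME_CMD = {
    "codex": ["codex", "resume", "--last"],
    "claude": ["claude", "--continue"],
}
FRESH_CMD = {"codex": ["codex"], "claude": ["claude"]}


def run_session(agent, run=subprocess.call):
    """Run the connect-time chain and return the commands that were executed.

    `run` takes an argv list and returns an exit code; it is injectable so the
    chain can be exercised without the real CLIs installed.
    """
    executed = [RESUME_CMD[agent]]
    if run(RESUME_CMD[agent]) != 0:
        # Assumption: a failed resume means there is no session to continue,
        # so fall through to a fresh agent run.
        executed.append(FRESH_CMD[agent])
        run(FRESH_CMD[agent])
    # Whatever the agent did, the user ends up in a shell afterwards.
    executed.append(["bash"])
    run(["bash"])
    return executed
```

With a stub runner that fails only the resume command, `run_session("codex", run=stub)` walks all three steps; with a runner that always succeeds, the fresh run is skipped.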

Workflow

Build matrix `[codex, claude]`
Per-agent tags: `codex-latest`, `claude-latest`, `sha-XXXXX-codex`, `sha-XXXXX-claude`, etc. Fresh tag namespace, sidesteps the kubelet cache wedge that froze `:latest` on the prior image.
Independent GHA cache scopes per agent.
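
To make the tag scheme concrete, here is a small sketch of how the per-agent tags could be derived for each matrix entry. The helper name `image_tags` and the short-SHA length of 7 are assumptions (the PR only shows `sha-XXXXX-<agent>`), and the "etc." in the list above means the real workflow may emit additional tags.

```python
MATRIX = ["codex", "claude"]  # build matrix from the workflow


def image_tags(agent, git_sha, short_len=7):
    """Return the per-agent tags for one matrix entry.

    short_len is an assumed value; the PR does not state the short-SHA length.
    """
    short = git_sha[:short_len]
    # No shared ":latest" tag is produced; every tag carries the agent name.
    return [f"{agent}-latest", f"sha-{short}-{agent}"]


def all_tags(git_sha):
    """Expand the matrix into a mapping of agent -> tags."""
    return {agent: image_tags(agent, git_sha) for agent in MATRIX}
```

Because every tag embeds the agent name, the two variants never compete for the same tag.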

Follow-up (not in this PR)

apk8s codex-cli HelmRelease tag → `codex-latest`
New apk8s claude-cli app pointing at `claude-latest`, ExternalSecret pulling `claude-github-pat` from the Kubernetes vault, HTTPRoute for `claude.prodromou.com` on envoy-internal

🤖 Generated with Claude Code

…ig sync

Genericizes the image so the same Dockerfile produces both the codex variant and a new claude variant, each with its own GHCR tag.

Dockerfile
- ARG AGENT=codex|claude validated at build time
- Per-agent npm install (@openai/codex vs @anthropic-ai/claude-code)
- Non-root user named after AGENT (uid/gid 1000), HOME=/home/${AGENT}
- npm upgraded to latest after Node install (NodeSource lags)

Entrypoint
- Per-agent env-var contract: codex requires GH_TOKEN + optional CODEX_SESSION (seeds auth.json); claude requires only GH_TOKEN (interactive /login on first connect, persists on PVC)
- Auto-resume on connect: `codex resume --last` / `claude --continue`, falling through to a fresh agent run if no session exists, then to bash if the agent exits
- Pulls nprodromou/agent-config and symlinks instructions/CLAUDE.md into the agent's expected path: ~/.codex/AGENTS.md or ~/.claude/CLAUDE.md. Pulls fresh on every restart, no image rebuild needed for instruction updates.

Workflow
- Build matrix on agent: [codex, claude]
- Per-agent tags only (no shared :latest): codex-latest, claude-latest, sha-XXXXX-codex, sha-XXXXX-claude, etc. Fresh tag namespace avoids the kubelet/containerd cache wedge that froze :latest on the older bubblewrap-less image.
- Independent gha cache scopes per agent (cache-from/cache-to)

Apk8s manifests still need updating to point at codex-latest (instead of :latest) and to add a parallel claude-cli app; that is a separate follow-up PR.
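
The instruction-sync step in the commit message above could look roughly like this. It is a sketch under stated assumptions: the helper name `link_instructions` and the `config_checkout` parameter are hypothetical, the git pull of `nprodromou/agent-config` is omitted, and only the source file and the two target paths come from the PR.

```python
from pathlib import Path

# Target paths relative to the agent's home directory, as listed in the PR.
INSTRUCTION_TARGETS = {
    "codex": Path(".codex") / "AGENTS.md",
    "claude": Path(".claude") / "CLAUDE.md",
}


def link_instructions(agent, config_checkout, home):
    """Symlink instructions/CLAUDE.md from the config checkout into place.

    Safe to call on every boot: an existing link or file at the target is
    replaced, so a fresh pull is picked up without rebuilding the image.
    """
    source = Path(config_checkout) / "instructions" / "CLAUDE.md"
    target = Path(home) / INSTRUCTION_TARGETS[agent]
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    target.symlink_to(source)
    return target
```

A symlink, rather than a copy, means the agent always reads whatever the checkout currently contains.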

…#21)

* codex-shell: AGENT_MODE=smoke-test for slot startup probe (WOVED-147)

Second slice of WOVED-147 after the uid pin (#20). The slot model assumes auth credentials remain usable across image rotations, but four failure modes can break that silently. The uid pin defends one (#3); this script defends the other three at first-boot:
- #1 refresh token expired on the wall clock
- #2 CLI auth format changed incompatibly
- #4 stricter cred-format check on a newer CLI version

bin/smoke_test.py:
- Verifies the agent's CLI binary loads (`<binary> --version` exits 0 within 10s); catches image regressions at the binary layer.
- Verifies the credentials file exists at the expected path, is non-empty, and parses as JSON; cheapest "format sanity" check that catches #2 and #4 without making any network call.
- Exits with structured codes: 0 (ready), 64 (CLI broken), 65 (creds missing, slot needs init), 66 (creds invalid, slot needs re-auth). Manager-side dispatch keys off these values; do not renumber without bumping Manager in lockstep.
- Stdlib only, same constraint as worker.py + auth_init.py. The slot pod's startup probe runs early in boot, before any pip would have a chance to land.

bin/entrypoint.sh:
- Adds `smoke-test)` case to the AGENT_MODE dispatch.
- Documents required env (WOVED_TASK_AGENT) + the four exit codes in the case body so an operator reading the entrypoint sees the contract without grepping for smoke_test.py.

Dockerfile:
- COPY the new script to /usr/local/bin alongside auth_init.py.

Manifest-level wiring (kubernetes startupProbe on slot worker pods) lands in nprodromou/woved alongside WOVED-152 (worker-job slot mount), so the chart template has somewhere to attach the probe. Without WOVED-152 there are no slot worker pods to probe.

Local end-to-end exercise on the dev machine confirmed all five exit-code paths (missing env / unmapped agent / creds missing / creds invalid / OK); the OK path even picked up the real claude CLI's version string in the structured output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* smoke-test: claude detection matches auth-init snapshot model

Codex caught (codex-shell#21 review) that pinning the claude credential path to ~/.claude/credentials.json would false-fail healthy slots whose CLI wrote to e.g. ~/.claude/.credentials/session.json. auth_init.py deliberately does NOT pin a filename for exactly this reason: it uses snapshot-diff over the entire ~/.claude/ tree to detect "auth happened" robustly across CLI version changes.

Smoke test now matches that model for claude:
- Walk ~/.claude/ for any non-symlink regular file outside the entrypoint's pre-populated baseline (CLAUDE.md, config.toml, settings.json; names empirically copied in by entrypoint.sh BEFORE auth-init runs).
- Any candidate file → creds-ok (exit 0).
- No candidates → creds-missing (exit 65).
- Walk failure (permission denied, etc.) → creds-invalid (exit 66, same shape as Codex stat() failure path).

Codex CLI side stays pinned (~/.codex/auth.json): the Codex CLI contract is stable on that path AND the entrypoint writes there from CODEX_SESSION at first boot. Codex's review specifically flagged only the claude side.

No JSON parse for claude. auth_init.py doesn't parse either, because the format may differ across CLI versions and a parse-failure on a real-but-unfamiliar artifact would be a worse failure than a false-pass on a corrupt one (which the next real task would catch immediately). Codex JSON parse stays because the codex CLI contract IS stable.

Local exercise of all six cases (missing env / no ~/.claude / empty .claude / baseline-only / .credentials/session.json / legacy credentials.json) confirmed correct exit codes. The exact repro from Codex's review (a file at .credentials/session.json) now exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
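
The checks described in the #21 commit message can be sketched as below. This is a stdlib-only illustration, not the contents of `bin/smoke_test.py`. The function names are hypothetical. Three mappings are assumptions: an empty `auth.json` is reported as 66, the claude baseline names are treated as top-level files only, and a missing `~/.claude` is reported as 65. The exit codes, the 10s timeout, the pinned codex path, and the baseline file names come from the commit message.

```python
import json
import os
import subprocess
from pathlib import Path

EXIT_READY = 0
EXIT_CLI_BROKEN = 64
EXIT_CREDS_MISSING = 65   # slot needs init
EXIT_CREDS_INVALID = 66   # slot needs re-auth

# Files the entrypoint places in ~/.claude before any login happens.
CLAUDE_BASELINE = {"CLAUDE.md", "config.toml", "settings.json"}


def check_cli(binary):
    """`<binary> --version` must exit 0 within 10 seconds."""
    try:
        result = subprocess.run([binary, "--version"],
                                capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return EXIT_CLI_BROKEN
    return EXIT_READY if result.returncode == 0 else EXIT_CLI_BROKEN


def check_codex_creds(home):
    """Pinned path: ~/.codex/auth.json must exist, be non-empty, parse as JSON."""
    path = Path(home) / ".codex" / "auth.json"
    try:
        size = path.stat().st_size
    except FileNotFoundError:
        return EXIT_CREDS_MISSING
    except OSError:
        return EXIT_CREDS_INVALID  # stat() failed for another reason
    if size == 0:
        return EXIT_CREDS_INVALID  # assumption: empty file counts as invalid
    try:
        json.loads(path.read_text())
    except (OSError, ValueError):
        return EXIT_CREDS_INVALID
    return EXIT_READY


def check_claude_creds(home):
    """Any regular, non-symlink file outside the baseline counts as credentials."""
    root = Path(home) / ".claude"
    if not root.is_dir():
        return EXIT_CREDS_MISSING

    def abort(err):
        raise err  # turn a walk error into an exception instead of skipping it

    try:
        for dirpath, _dirs, files in os.walk(root, onerror=abort):
            for name in files:
                full = Path(dirpath) / name
                if full.is_symlink() or not full.is_file():
                    continue
                if Path(dirpath) == root and name in CLAUDE_BASELINE:
                    continue
                return EXIT_READY
    except OSError:
        return EXIT_CREDS_INVALID
    return EXIT_CREDS_MISSING
```

Note that the claude check never opens a file; it only asks whether something beyond the baseline exists, which is what keeps it robust to credential-format changes across CLI versions.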

nprodromou added 3 commits May 6, 2026 22:39

README: document multi-agent build, runtime contract, connect flow

dfa92d1

Dockerfile: revert npm self-upgrade — triggers module-resolution bug

0a55ebd

nprodromou merged commit 612447c into main May 7, 2026
2 checks passed

nprodromou deleted the refactor/multi-agent-build branch May 7, 2026 05:45

claude-prodromou mentioned this pull request May 9, 2026

codex-shell: AGENT_MODE=smoke-test for slot startup probe (WOVED-147) #21

Merged

2 tasks
