
Multi-agent build (codex|claude) + auto-resume + agent-config sync #3

Merged
nprodromou merged 3 commits into main from refactor/multi-agent-build
May 7, 2026

Conversation

@nprodromou

Owner

Summary

Genericizes `codex-shell` into a multi-agent shell image. One Dockerfile, build matrix on `AGENT=codex|claude`, two image tags published per main push.

Dockerfile

  • `ARG AGENT` validated at build time (codex or claude)
  • Per-agent npm install — `@openai/codex` vs `@anthropic-ai/claude-code`
  • Non-root user named after AGENT (uid/gid 1000), `HOME=/home/${AGENT}`
  • `npm install -g npm@latest` after Node install — NodeSource lags upstream

Entrypoint

  • Per-agent env-var contract: codex needs `GH_TOKEN` + optional `CODEX_SESSION`; claude needs only `GH_TOKEN` (interactive `/login` on first connect, persists on PVC)
  • Auto-resume on connect: `codex resume --last` / `claude --continue`. Falls through to fresh agent if no session exists, then to bash if the agent exits.
  • Pulls `nprodromou/agent-config` on every boot and symlinks `instructions/CLAUDE.md` into the agent's expected path (`~/.codex/AGENTS.md` or `~/.claude/CLAUDE.md`), so instruction updates reach the pod on restart with no image rebuild.
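
The auto-resume fall-through can be sketched roughly as follows. This is an illustration in Python, not the entrypoint's actual shell code. The resume commands are the ones quoted above. The bare `codex` / `claude` invocations for a fresh run, the use of a non-zero exit status as the "no session exists" signal, and the names `connect` and `run_cmd` are assumptions.

```python
import subprocess
from typing import Callable, Sequence

# Resume commands are quoted from the PR description. The "fresh" commands
# and the exit-status convention below are assumptions for illustration.
RESUME = {"codex": ["codex", "resume", "--last"], "claude": ["claude", "--continue"]}
FRESH = {"codex": ["codex"], "claude": ["claude"]}

Runner = Callable[[Sequence[str]], int]


def run_cmd(cmd: Sequence[str]) -> int:
    """Run a command attached to the current terminal; return its exit status."""
    return subprocess.run(list(cmd)).returncode


def connect(agent: str, run: Runner = run_cmd) -> list[list[str]]:
    """Try resume, then a fresh agent, then bash; return the commands attempted."""
    if agent not in RESUME:
        raise ValueError(f"AGENT must be codex or claude, got {agent!r}")
    attempted = [RESUME[agent]]
    if run(RESUME[agent]) != 0:      # assumed signal for "no session to resume"
        attempted.append(FRESH[agent])
        run(FRESH[agent])
    attempted.append(["bash"])       # the agent exited: leave the user in a shell
    run(["bash"])
    return attempted
```

Passing the runner as a parameter keeps the ordering testable without either CLI installed.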

Workflow

  • Build matrix `[codex, claude]`
  • Per-agent tags: `codex-latest`, `claude-latest`, `sha-XXXXX-codex`, `sha-XXXXX-claude`, etc. Fresh tag namespace, sidesteps the kubelet cache wedge that froze `:latest` on the prior image.
  • Independent GHA cache scopes per agent.

Follow-up (not in this PR)

  • apk8s codex-cli HelmRelease tag → `codex-latest`
  • New apk8s claude-cli app pointing at `claude-latest`, ExternalSecret pulling `claude-github-pat` from the Kubernetes vault, HTTPRoute for `claude.prodromou.com` on envoy-internal

🤖 Generated with Claude Code

nprodromou added 3 commits May 6, 2026 22:39
…ig sync

Genericizes the image so the same Dockerfile produces both the
codex variant and a new claude variant, each with its own GHCR tag.

Dockerfile
- ARG AGENT=codex|claude validated at build time
- Per-agent npm install (@openai/codex vs @anthropic-ai/claude-code)
- Non-root user named after AGENT (uid/gid 1000), HOME=/home/${AGENT}
- npm upgraded to latest after Node install — NodeSource lags

Entrypoint
- Per-agent env-var contract: codex requires GH_TOKEN + optional
  CODEX_SESSION (seeds auth.json); claude requires only GH_TOKEN
  (interactive /login on first connect, persists on PVC)
- Auto-resume on connect: `codex resume --last` / `claude --continue`,
  falling through to a fresh agent run if no session exists, then to
  bash if the agent exits
- Pulls nprodromou/agent-config and symlinks instructions/CLAUDE.md
  into the agent's expected path: ~/.codex/AGENTS.md or
  ~/.claude/CLAUDE.md. Pulls fresh on every restart, no image rebuild
  needed for instruction updates.
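
A minimal Python sketch of the symlink step, assuming the agent-config checkout already exists on disk. The function name `link_instructions` and its `home` / `checkout` parameters are hypothetical, and the real entrypoint is a shell script. The source and target paths are the ones named above.

```python
from pathlib import Path

# Per-agent target path, relative to the agent's home directory.
TARGETS = {
    "codex": Path(".codex") / "AGENTS.md",
    "claude": Path(".claude") / "CLAUDE.md",
}


def link_instructions(agent: str, home: Path, checkout: Path) -> Path:
    """Point the agent's instruction file at instructions/CLAUDE.md in the checkout."""
    source = checkout / "instructions" / "CLAUDE.md"
    target = home / TARGETS[agent]
    target.parent.mkdir(parents=True, exist_ok=True)
    # Replace any previous link or file so repeated boots stay idempotent.
    if target.is_symlink() or target.exists():
        target.unlink()
    target.symlink_to(source)
    return target
```

Because the target is a symlink rather than a copy, a fresh pull of the checkout changes what the agent reads without touching the image.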

Workflow
- Build matrix on agent: [codex, claude]
- Per-agent tags only (no shared :latest): codex-latest, claude-latest,
  sha-XXXXX-codex, sha-XXXXX-claude, etc. Fresh tag namespace avoids
  the kubelet/containerd cache wedge that froze :latest on the older
  bubblewrap-less image.
- Independent gha cache scopes per agent (cache-from/cache-to)

Apk8s manifests still need updating to point at codex-latest
(instead of :latest) and to add a parallel claude-cli app — separate
follow-up PR.
@nprodromou nprodromou merged commit 612447c into main May 7, 2026
2 checks passed
@nprodromou nprodromou deleted the refactor/multi-agent-build branch May 7, 2026 05:45
claude-prodromou added a commit that referenced this pull request May 11, 2026
…#21)

* codex-shell: AGENT_MODE=smoke-test for slot startup probe (WOVED-147)

Second slice of WOVED-147 after the uid pin (#20). The slot model
assumes auth credentials remain usable across image rotations — but
four failure modes can break that silently. The uid pin defends one
(#3); this script defends the other three at first-boot:

  #1 refresh token expired on the wall clock
  #2 CLI auth format changed incompatibly
  #4 stricter cred-format check on a newer CLI version

bin/smoke_test.py:
  - Verifies the agent's CLI binary loads (`<binary> --version` exits
    0 within 10s) — catches image regressions at the binary layer.
  - Verifies the credentials file exists at the expected path,
    is non-empty, and parses as JSON — cheapest "format sanity"
    check that catches #2 and #4 without making any network call.
  - Exits with structured codes: 0 (ready), 64 (CLI broken),
    65 (creds missing — slot needs init), 66 (creds invalid —
    slot needs re-auth). Manager-side dispatch keys off these
    values; do not renumber without bumping Manager in lockstep.
  - Stdlib only — same constraint as worker.py + auth_init.py.
    The slot pod's startup probe runs early in boot, before any
    pip would have a chance to land.
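
A stdlib-only sketch of the checks listed above, for illustration rather than as the contents of bin/smoke_test.py. The exit codes and the 10s `--version` timeout come from this message. The function names, and the choice to report an empty or unreadable file as 66 rather than 65, are assumptions.

```python
import json
import subprocess
from pathlib import Path

# Exit codes as listed in the commit message.
EXIT_READY, EXIT_CLI_BROKEN, EXIT_CREDS_MISSING, EXIT_CREDS_INVALID = 0, 64, 65, 66


def cli_loads(binary: str, timeout: float = 10.0) -> bool:
    """True if `<binary> --version` exits 0 within the timeout."""
    try:
        result = subprocess.run(
            [binary, "--version"], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # binary missing, not executable, or hung
    return result.returncode == 0


def check_json_creds(path: Path) -> int:
    """Classify a credentials file without making any network call."""
    try:
        raw = path.read_bytes()
    except FileNotFoundError:
        return EXIT_CREDS_MISSING
    except OSError:
        return EXIT_CREDS_INVALID   # assumption: unreadable counts as invalid
    if not raw.strip():
        return EXIT_CREDS_INVALID   # assumption: empty counts as invalid
    try:
        json.loads(raw)
    except ValueError:              # covers JSONDecodeError and bad encodings
        return EXIT_CREDS_INVALID
    return EXIT_READY


def smoke_test(binary: str, creds_path: Path) -> int:
    """Check the CLI first, then the credentials file."""
    if not cli_loads(binary):
        return EXIT_CLI_BROKEN
    return check_json_creds(creds_path)
```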

bin/entrypoint.sh:
  - Adds `smoke-test)` case to the AGENT_MODE dispatch.
  - Documents required env (WOVED_TASK_AGENT) + the four exit
    codes in the case body so an operator reading the entrypoint
    sees the contract without grepping for smoke_test.py.

Dockerfile:
  - COPY the new script to /usr/local/bin alongside auth_init.py.

Manifest-level wiring (kubernetes startupProbe on slot worker pods)
lands in nprodromou/woved alongside WOVED-152 (worker-job slot
mount), so the chart template has somewhere to attach the probe.
Without WOVED-152 there are no slot worker pods to probe.

Local end-to-end exercise on the dev machine confirmed all five
exit-code paths (missing env / unmapped agent / creds missing /
creds invalid / OK) — the OK path even picked up the real
claude CLI's version string in the structured output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* smoke-test: claude detection matches auth-init snapshot model

Codex caught (codex-shell#21 review) that pinning the claude
credential path to ~/.claude/credentials.json would false-fail
healthy slots whose CLI wrote to e.g. ~/.claude/.credentials/session.json.
auth_init.py deliberately does NOT pin a filename for exactly this
reason — it uses snapshot-diff over the entire ~/.claude/ tree to
detect "auth happened" robustly across CLI version changes.

Smoke test now matches that model for claude:

  - Walk ~/.claude/ for any non-symlink regular file outside the
    entrypoint's pre-populated baseline (CLAUDE.md, config.toml,
    settings.json — names empirically copied in by entrypoint.sh
    BEFORE auth-init runs).
  - Any candidate file → creds-ok (exit 0).
  - No candidates → creds-missing (exit 65).
  - Walk failure (permission denied, etc.) → creds-invalid (exit
    66, same shape as Codex stat() failure path).
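
The walk above might look roughly like this in stdlib Python. This is a sketch, not the merged code. The baseline names and exit codes are taken from this message. Matching the baseline only at the top level of the tree, treating a missing directory as creds-missing, and the name `check_claude_creds` are assumptions.

```python
import os
from pathlib import Path

EXIT_READY, EXIT_CREDS_MISSING, EXIT_CREDS_INVALID = 0, 65, 66

# Names the entrypoint pre-populates before auth-init runs (from the message).
BASELINE = {"CLAUDE.md", "config.toml", "settings.json"}


def _raise(err: OSError) -> None:
    """Make os.walk surface directory-read errors instead of skipping them."""
    raise err


def check_claude_creds(root: Path) -> int:
    """Return 0 if any non-baseline regular file exists under root."""
    if not root.is_dir():
        return EXIT_CREDS_MISSING          # assumption: no directory, no creds
    try:
        for dirpath, _dirs, files in os.walk(root, onerror=_raise):
            for name in files:
                full = Path(dirpath) / name
                if full.is_symlink() or not full.is_file():
                    continue               # only non-symlink regular files count
                if Path(dirpath) == root and name in BASELINE:
                    continue               # assumption: baseline is top-level only
                return EXIT_READY
    except OSError:
        return EXIT_CREDS_INVALID          # walk failure, e.g. permission denied
    return EXIT_CREDS_MISSING
```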

Codex CLI side stays pinned (~/.codex/auth.json) — Codex CLI
contract is stable on that path AND the entrypoint writes there
from CODEX_SESSION at first boot. Codex's review specifically
flagged only the claude side.

No JSON parse for claude — auth_init.py doesn't parse either,
because the format may differ across CLI versions and a
parse-failure on a real-but-unfamiliar artifact would be a worse
failure than a false-pass on a corrupt one (which the next real
task would catch immediately). Codex JSON parse stays because the
codex CLI contract IS stable.

Local exercise of all six cases (missing env / no ~/.claude /
empty .claude / baseline-only / .credentials/session.json /
legacy credentials.json) confirmed correct exit codes. The exact
repro from Codex's review (a file at .credentials/session.json)
now exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>