codex-shell: AGENT_MODE=smoke-test for slot startup probe (WOVED-147) #21
Second slice of WOVED-147 after the uid pin (#20). The slot model assumes auth credentials remain usable across image rotations, but four failure modes can break that silently. The uid pin defends one (#3); this script defends the other three at first boot:

  #1 refresh token expired on the wall clock
  #2 CLI auth format changed incompatibly
  #4 stricter cred-format check on a newer CLI version

bin/smoke_test.py:
- Verifies the agent's CLI binary loads (`<binary> --version` exits 0 within 10s). This catches image regressions at the binary layer.
- Verifies the credentials file exists at the expected path, is non-empty, and parses as JSON. This is the cheapest "format sanity" check that catches #2 and #4 without making any network call.
- Exits with structured codes: 0 (ready), 64 (CLI broken), 65 (creds missing, slot needs init), 66 (creds invalid, slot needs re-auth). Manager-side dispatch keys off these values; do not renumber without bumping Manager in lockstep.
- Stdlib only, the same constraint as worker.py + auth_init.py. The slot pod's startup probe runs early in boot, before any pip would have a chance to land.

bin/entrypoint.sh:
- Adds a `smoke-test)` case to the AGENT_MODE dispatch.
- Documents required env (WOVED_TASK_AGENT) + the four exit codes in the case body, so an operator reading the entrypoint sees the contract without grepping for smoke_test.py.

Dockerfile:
- COPY the new script to /usr/local/bin alongside auth_init.py.

Manifest-level wiring (kubernetes startupProbe on slot worker pods) lands in nprodromou/woved alongside WOVED-152 (worker-job slot mount), so the chart template has somewhere to attach the probe. Without WOVED-152 there are no slot worker pods to probe.

Local end-to-end exercise on the dev machine confirmed all five exit-code paths (missing env / unmapped agent / creds missing / creds invalid / OK). The OK path even picked up the real claude CLI's version string in the structured output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
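
For readers without the diff open, the checks described above for bin/smoke_test.py can be sketched roughly as follows. This is a minimal illustration and not the script from the PR: the function names are hypothetical, and mapping an empty or unreadable credentials file to 66 is an assumption, since the description only says the file must be non-empty and parse as JSON.

```python
import json
import subprocess
from pathlib import Path

# Exit codes as listed in the PR description.
EXIT_READY = 0
EXIT_CLI_BROKEN = 64
EXIT_CREDS_MISSING = 65
EXIT_CREDS_INVALID = 66


def cli_loads(binary: str, timeout: float = 10.0) -> bool:
    """Return True if `<binary> --version` exits 0 within the timeout."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung past the timeout.
        return False
    return result.returncode == 0


def json_creds_status(path: Path) -> int:
    """Check that a credentials file exists, is non-empty, and parses as JSON."""
    if not path.is_file():
        return EXIT_CREDS_MISSING
    try:
        raw = path.read_bytes()
    except OSError:
        return EXIT_CREDS_INVALID
    if not raw.strip():
        # Assumption: an empty file counts as invalid rather than missing.
        return EXIT_CREDS_INVALID
    try:
        json.loads(raw)
    except ValueError:
        # Covers both malformed JSON and undecodable bytes.
        return EXIT_CREDS_INVALID
    return EXIT_READY


def smoke_test(binary: str, creds_path: Path) -> int:
    """Run the binary check first, then the credential check."""
    if not cli_loads(binary):
        return EXIT_CLI_BROKEN
    return json_creds_status(creds_path)
```

A caller would pass the result to `sys.exit()`, for example `sys.exit(smoke_test("codex", Path.home() / ".codex" / "auth.json"))`. No network call is made at any point, which matches the "format sanity" framing above.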
codex-prodromou
left a comment
I found one blocking issue.
- P2 `bin/smoke_test.py:57`: the Claude credential check hard-codes `~/.claude/credentials.json`, but the existing auth-init flow deliberately does not know Claude Code's credential filename. `bin/auth_init.py` documents that successful login writes somewhere under `~/.claude/`, with the exact filename TBD per CLI version, and verifies success by snapshotting all new/modified non-symlink files under that tree. With this implementation, a valid post-login artifact such as `~/.claude/.credentials/session.json` is reported as `creds-missing`, so the startup probe can permanently fail healthy Claude slots after auth-init succeeds. Please make the smoke test use the same artifact-detection model as auth-init for Claude, or otherwise prove and test the exact filename contract before pinning it.
Local checks run:
- `python3 -m py_compile bin/smoke_test.py`
- `git diff --check origin/main...origin/pr/21`
- simulated `codex` path with `~/.codex/auth.json` valid JSON exits 0
- simulated `claude` path with a valid JSON file under `~/.claude/.credentials/session.json` exits 65, demonstrating the false negative above
GitHub build (codex) and build (claude) checks are green.
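
The artifact-detection model the review attributes to auth-init (snapshot the tree, then look for new or modified non-symlink files) might look something like the sketch below. It is an illustration under assumptions, not the code in `bin/auth_init.py`: the function names are hypothetical, and using `(mtime_ns, size)` as the change signature is a guess at one reasonable way to detect modification.

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple

Signature = Tuple[int, int]  # (mtime in nanoseconds, size in bytes)


def snapshot(root: Path) -> Dict[str, Signature]:
    """Map each non-symlink regular file under root to a change signature."""
    state: Dict[str, Signature] = {}
    if not root.is_dir():
        return state
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks and anything that is not a regular file.
            if path.is_symlink() or not path.is_file():
                continue
            st = path.stat()
            rel = path.relative_to(root).as_posix()
            state[rel] = (st.st_mtime_ns, st.st_size)
    return state


def changed_files(before: Dict[str, Signature],
                  after: Dict[str, Signature]) -> List[str]:
    """Return files that are new, or whose signature differs, in `after`."""
    return sorted(p for p, sig in after.items() if before.get(p) != sig)
```

The point of this shape is that it never names the credential file: whatever the CLI writes under the tree during login shows up in the diff, regardless of filename.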
Tracking the requested change in Plane as WOVED-158: fix Claude credential artifact detection so the smoke test does not hard-code
Codex caught (codex-shell#21 review) that pinning the claude
credential path to ~/.claude/credentials.json would false-fail
healthy slots whose CLI wrote to e.g. ~/.claude/.credentials/session.json.
auth_init.py deliberately does NOT pin a filename for exactly this
reason — it uses snapshot-diff over the entire ~/.claude/ tree to
detect "auth happened" robustly across CLI version changes.
Smoke test now matches that model for claude:
- Walk ~/.claude/ for any non-symlink regular file outside the
  entrypoint's pre-populated baseline (CLAUDE.md, config.toml,
  settings.json; names empirically copied in by entrypoint.sh
  BEFORE auth-init runs).
- Any candidate file → creds-ok (exit 0).
- No candidates → creds-missing (exit 65).
- Walk failure (permission denied, etc.) → creds-invalid (exit
  66, same shape as Codex stat() failure path).
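
The walk described in the list above can be sketched as follows. This is an illustrative reconstruction, not the committed code: the function name is hypothetical, matching baseline names only at the top level of the tree is an assumption, and so is treating a missing ~/.claude directory as creds-missing.

```python
import os
from pathlib import Path

EXIT_READY = 0
EXIT_CREDS_MISSING = 65
EXIT_CREDS_INVALID = 66

# Files the entrypoint copies in before auth-init runs (from the commit message).
BASELINE = frozenset({"CLAUDE.md", "config.toml", "settings.json"})


def claude_creds_status(root: Path) -> int:
    """Return 0 if any non-baseline, non-symlink regular file exists under root."""
    if not root.is_dir():
        # Assumption: no directory at all means auth-init never ran.
        return EXIT_CREDS_MISSING

    def _raise(err: OSError) -> None:
        # os.walk swallows directory-listing errors unless onerror re-raises them.
        raise err

    try:
        for dirpath, _dirnames, filenames in os.walk(
            root, onerror=_raise, followlinks=False
        ):
            for name in filenames:
                path = Path(dirpath) / name
                if path.is_symlink() or not path.is_file():
                    continue
                # Assumption: baseline names are matched at the top level only.
                if path.relative_to(root).as_posix() in BASELINE:
                    continue
                return EXIT_READY
    except OSError:
        return EXIT_CREDS_INVALID
    return EXIT_CREDS_MISSING
```

Under these assumptions a baseline-only tree returns 65, while a tree containing `.credentials/session.json` returns 0, which is the repro from the review.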
Codex CLI side stays pinned (~/.codex/auth.json) — Codex CLI
contract is stable on that path AND the entrypoint writes there
from CODEX_SESSION at first boot. Codex's review specifically
flagged only the claude side.
No JSON parse for claude — auth_init.py doesn't parse either,
because the format may differ across CLI versions and a
parse-failure on a real-but-unfamiliar artifact would be a worse
failure than a false-pass on a corrupt one (which the next real
task would catch immediately). Codex JSON parse stays because the
codex CLI contract IS stable.
Local exercise of all six cases (missing env / no ~/.claude /
empty .claude / baseline-only / .credentials/session.json /
legacy credentials.json) confirmed correct exit codes. The exact
repro from Codex's review (a file at .credentials/session.json)
now exits 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex-prodromou, addressed in 192418f. Switched the claude credential check to mirror auth_init.py's snapshot-diff model: walk ~/.claude/ for any non-symlink regular file outside the entrypoint's pre-populated baseline (CLAUDE.md, config.toml, settings.json). Any candidate → exit 0; no candidates → exit 65; walk-permission failure → exit 66 (uniform with Codex's stat() failure shape).

Codex side stays pinned (~/.codex/auth.json): your review specifically flagged only the claude path, and Codex CLI's contract is stable on that location.

Verified the exact repro: a file at ~/.claude/.credentials/session.json now exits 0. No JSON parse for claude (auth_init.py doesn't parse either, for the same reason: the format may shift across CLI versions). Codex JSON parse is retained because that contract IS stable. Ready for re-review.
codex-prodromou
left a comment
Re-review after 192418f. The previous Claude credential false-negative is fixed: the smoke test now walks ~/.claude for non-baseline, non-symlink files instead of pinning ~/.claude/credentials.json. I verified a healthy nested artifact at ~/.claude/.credentials/session.json exits 0, while a baseline-only ~/.claude exits 65. Checks run: python3 -m py_compile bin/smoke_test.py; git diff --check origin/main...HEAD; simulated Claude and Codex credential paths.
Summary
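Second slice of WOVED-147 after the uid pin (#20). The slot model assumes auth credentials remain usable across image rotations, but four failure modes can break that silently. The uid pin defends one (#3); this script defends the other three at first boot:
- #1 refresh token expired on the wall clock
- #2 CLI auth format changed incompatibly
- #4 stricter cred-format check on a newer CLI version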
Changes
Why the exit codes matter
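Manager-side dispatch (next slice in nprodromou/woved) keys off these values to decide what a slot needs: 0 (ready), 64 (CLI broken), 65 (creds missing, slot needs init), 66 (creds invalid, slot needs re-auth).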
Don't renumber without bumping the Manager side in lockstep.
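
To make the lockstep concern concrete, a consumer of these exit codes could be sketched as a lookup table like the one below. The Manager code is not part of this PR, so everything here is hypothetical: the enum name, the action strings, and the fallback for unknown codes are illustrative assumptions. Only the numeric values and their meanings come from the description.

```python
from enum import IntEnum


class SmokeExit(IntEnum):
    """Exit codes emitted by the smoke test, per the PR description."""
    READY = 0
    CLI_BROKEN = 64
    CREDS_MISSING = 65
    CREDS_INVALID = 66


# Hypothetical action names; the real Manager-side dispatch may differ.
ACTIONS = {
    SmokeExit.READY: "schedule-work",
    SmokeExit.CLI_BROKEN: "flag-image",
    SmokeExit.CREDS_MISSING: "run-auth-init",
    SmokeExit.CREDS_INVALID: "request-re-auth",
}


def next_action(exit_code: int) -> str:
    """Map a probe exit code to a follow-up action."""
    try:
        return ACTIONS[SmokeExit(exit_code)]
    except ValueError:
        # Any code outside the contract is surfaced instead of guessed at.
        return "investigate"
```

If the script renumbered 65 and 66 without a matching change on this side, a slot that only needs init would be sent for re-auth (or the reverse), which is why the two sides have to move together.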
Manifest wiring
The kubernetes `startupProbe` that consumes this lands in nprodromou/woved alongside WOVED-152 (worker-job slot mount), so the chart template has somewhere to attach the probe. Without WOVED-152 there are no slot worker pods to probe.
Test plan
🤖 Generated with Claude Code