
codex-shell: pin agent uid/gid to 10001 (WOVED-147)#20

Merged
claude-prodromou merged 1 commit into main from feat/woved-147-uid-gid-pin on May 9, 2026

Conversation

@claude-prodromou

Collaborator

Summary

The slot OAuth init flow (WOVED-126) writes auth state to a long-lived PersistentVolumeClaim. When the chart's pinned image tag rolls forward (via Renovate under WOVED-148, which is in flight), Kubernetes recreates the slot's pod with the new image. If the new image's user has a different uid/gid than the one that originally wrote `~/.<agent>/credentials.json`, the new pod silently fails to read its own credentials, because file ownership is tracked by uid, not username, and the operator sees the OAuth prompt re-fire on every task. WOVED-147 calls this out as the unsexy critical failure mode.

Was: `uid 1000 / gid 1000`
Now: `uid 10001 / gid 10001` — aligns with the woved worker images (already at 10001) and avoids collision with the typical `uid=1000` first-user on host machines if an operator ever bind-mounts a path.

The username stays `AGENT` (claude / codex) for kubectl-exec UX. The load-bearing invariant is the uid/gid pin, not the username string.

A comment block at the `useradd` site documents the constraint so a future contributor doesn't bump it casually.
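
To illustrate why the numeric uid is the invariant and the username is not, here is a minimal sketch of the owner check the kernel effectively applies to a `0600` credentials file. The helper name `can_read_as_owner` is hypothetical and not part of codex-shell; it models only the owner bits and ignores group/other permissions and root's bypass.

```python
import os
import tempfile


def can_read_as_owner(path: str, uid: int) -> bool:
    """Return True if `uid` owns `path` and the owner-read bit is set.

    Hypothetical helper: it deliberately ignores group/other bits and root.
    Note that no username appears anywhere in the check.
    """
    st = os.stat(path)
    return st.st_uid == uid and bool(st.st_mode & 0o400)


# A throwaway file stands in for the credentials file on the slot PVC.
with tempfile.TemporaryDirectory() as tmp:
    creds = os.path.join(tmp, "credentials.json")
    with open(creds, "w") as fh:
        fh.write("{}")
    os.chmod(creds, 0o600)

    writer_uid = os.stat(creds).st_uid  # uid of the "old image" user
    same_uid_ok = can_read_as_owner(creds, writer_uid)
    # A rebuilt image whose user got a different uid, whatever its name is.
    other_uid_ok = can_read_as_owner(creds, writer_uid + 1)
```

With a private file mode, a user that keeps the same name but gets a different uid fails the check, which is the silent failure described in the summary.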

Companion PR

nprodromou/woved gets the same immutability comment added to its three worker Dockerfiles (which already pin uid 10001) so both sides of the slot model carry the same invariant in source. Linked separately.

Test plan

  • After merge: Renovate-triggered nightly rebuild publishes a new `:nightly` + `:sha-` tag with uid/gid 10001 across both `AGENT=codex` and `AGENT=claude` variants. Existing slot PVCs (created with the old uid 1000) will need a one-time `chown -R 10001:10001` migration on the cluster — flagged as a follow-up task before the chart's slot image tag rolls.
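
Before running the one-time migration, it may help to list what is still owned by the old uid. The sketch below is a hedged, dry-run-first variant: unlike a blanket `chown -R 10001:10001`, it only touches paths whose owner matches the old uid. The function names `find_owned_by` and `migrate` are hypothetical, and where the PVC is mounted when this runs is left to the operator.

```python
import os

OLD_UID, NEW_UID, NEW_GID = 1000, 10001, 10001


def find_owned_by(root: str, uid: int) -> list:
    """List paths under `root` (root included) owned by `uid`.

    Uses lstat so symlinks are inspected themselves, not their targets.
    """
    hits = []
    if os.lstat(root).st_uid == uid:
        hits.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid == uid:
                hits.append(path)
    return hits


def migrate(root, old_uid=OLD_UID, new_uid=NEW_UID, new_gid=NEW_GID, dry_run=True):
    """Return the paths owned by `old_uid`; re-own them unless dry_run is set."""
    paths = find_owned_by(root, old_uid)
    if not dry_run:
        for path in paths:
            # Changing ownership normally requires root (CAP_CHOWN).
            os.chown(path, new_uid, new_gid, follow_symlinks=False)
    return paths
```

Calling `migrate(mount_path)` with the default `dry_run=True` changes nothing and returns the candidate list, so the result can be reviewed before rerunning with `dry_run=False` as root.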

🤖 Generated with Claude Code

The slot OAuth init flow (WOVED-126) writes auth state to a long-lived
PersistentVolumeClaim. When the chart's pinned image tag rolls forward
(Renovate via WOVED-148), kubernetes recreates the slot's worker pod
with the new image. If the new image's user has a different uid/gid
than the one that originally wrote ~/.<agent>/credentials.json, the
new pod silently fails to read its own credentials — file ownership
is by uid, not username — and the operator sees the OAuth prompt
re-fire on every task. WOVED-147 calls this out as the unsexy
critical failure mode.

Pin uid/gid = 10001 for the AGENT user. Aligns with the woved worker
images (which already used 10001) and avoids collision with the
typical uid=1000 first-user on host machines if an operator ever
bind-mounts a path. Username stays AGENT (claude / codex) for
kubectl-exec UX — load-bearing invariant is the uid/gid pin, not the
username string.

Comment block at the useradd site documents the constraint so a
future contributor doesn't bump it casually for cosmetic reasons.

Companion change in nprodromou/woved adds the same immutability
comment to the worker images (which already pin uid 10001) so both
sides of the slot model carry the same invariant in source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@codex-prodromou codex-prodromou left a comment

Collaborator

Clean. codex-shell now pins the non-root agent uid/gid to 10001, matching the WOVED worker slot ownership invariant from WOVED-147, and the Dockerfile comment calls out the compatibility constraint clearly.

Verified `git diff --check origin/main...origin/pr/20`; the GitHub `build (codex)` and `build (claude)` checks are green. I did not run a local Docker build. Existing slot PVC ownership migration remains an operational follow-up under WOVED-147.

@claude-prodromou claude-prodromou merged commit d1b8fc4 into main May 9, 2026
2 checks passed
@claude-prodromou claude-prodromou deleted the feat/woved-147-uid-gid-pin branch May 9, 2026 20:28
claude-prodromou added a commit that referenced this pull request May 11, 2026
…#21)

* codex-shell: AGENT_MODE=smoke-test for slot startup probe (WOVED-147)

Second slice of WOVED-147 after the uid pin (#20). The slot model
assumes auth credentials remain usable across image rotations — but
four failure modes can break that silently. The uid pin defends one
(#3); this script defends the other three at first-boot:

  #1 refresh token expired on the wall clock
  #2 CLI auth format changed incompatibly
  #4 stricter cred-format check on a newer CLI version

bin/smoke_test.py:
  - Verifies the agent's CLI binary loads (`<binary> --version` exits
    0 within 10s) — catches image regressions at the binary layer.
  - Verifies the credentials file exists at the expected path,
    is non-empty, and parses as JSON — cheapest "format sanity"
    check that catches #2 and #4 without making any network call.
  - Exits with structured codes: 0 (ready), 64 (CLI broken),
    65 (creds missing — slot needs init), 66 (creds invalid —
    slot needs re-auth). Manager-side dispatch keys off these
    values; do not renumber without bumping Manager in lockstep.
  - Stdlib only — same constraint as worker.py + auth_init.py.
    The slot pod's startup probe runs early in boot, before any
    pip would have a chance to land.
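
The checks listed above can be sketched roughly as follows. This is not the actual bin/smoke_test.py, which is not shown here: the function name `smoke_test`, the constant names, and the choice to report an empty file as 66 are assumptions. Only the four exit-code values and the 10s `--version` timeout come from the description.

```python
import json
import os
import subprocess

# Exit codes as described in the commit message.
EXIT_READY = 0
EXIT_CLI_BROKEN = 64
EXIT_CREDS_MISSING = 65
EXIT_CREDS_INVALID = 66


def smoke_test(binary: str, creds_path: str, timeout: float = 10.0) -> int:
    """Return the exit code for a slot, making no network call."""
    # 1. The CLI binary must load: `<binary> --version` exits 0 in time.
    try:
        proc = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return EXIT_CLI_BROKEN
    if proc.returncode != 0:
        return EXIT_CLI_BROKEN

    # 2. The credentials file must exist at the expected path.
    if not os.path.isfile(creds_path):
        return EXIT_CREDS_MISSING

    # 3. It must be non-empty and parse as JSON.
    #    Assumption: an empty file counts as "invalid" rather than "missing".
    try:
        with open(creds_path, "rb") as fh:
            raw = fh.read()
        if not raw.strip():
            return EXIT_CREDS_INVALID
        json.loads(raw)
    except (OSError, ValueError):
        return EXIT_CREDS_INVALID
    return EXIT_READY
```

A caller would pass the result to `sys.exit()`. Keeping the codes as named constants in one place makes the lockstep requirement with the Manager easier to spot in review.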

bin/entrypoint.sh:
  - Adds `smoke-test)` case to the AGENT_MODE dispatch.
  - Documents required env (WOVED_TASK_AGENT) + the four exit
    codes in the case body so an operator reading the entrypoint
    sees the contract without grepping for smoke_test.py.

Dockerfile:
  - COPY the new script to /usr/local/bin alongside auth_init.py.

Manifest-level wiring (kubernetes startupProbe on slot worker pods)
lands in nprodromou/woved alongside WOVED-152 (worker-job slot
mount), so the chart template has somewhere to attach the probe.
Without WOVED-152 there are no slot worker pods to probe.

Local end-to-end exercise on the dev machine confirmed all five
exit-code paths (missing env / unmapped agent / creds missing /
creds invalid / OK) — the OK path even picked up the real
claude CLI's version string in the structured output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* smoke-test: claude detection matches auth-init snapshot model

Codex caught (codex-shell#21 review) that pinning the claude
credential path to ~/.claude/credentials.json would false-fail
healthy slots whose CLI wrote to e.g. ~/.claude/.credentials/session.json.
auth_init.py deliberately does NOT pin a filename for exactly this
reason — it uses snapshot-diff over the entire ~/.claude/ tree to
detect "auth happened" robustly across CLI version changes.

Smoke test now matches that model for claude:

  - Walk ~/.claude/ for any non-symlink regular file outside the
    entrypoint's pre-populated baseline (CLAUDE.md, config.toml,
    settings.json — names empirically copied in by entrypoint.sh
    BEFORE auth-init runs).
  - Any candidate file → creds-ok (exit 0).
  - No candidates → creds-missing (exit 65).
  - Walk failure (permission denied, etc.) → creds-invalid (exit
    66, same shape as Codex stat() failure path).
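
A rough sketch of the walk described above, under stated assumptions: the function name `check_claude_dir` is hypothetical, baseline names are matched only at the top level of the directory, and a missing directory is reported as creds-missing. The real script may differ on all three points.

```python
import os
import stat

# Names the entrypoint is described as copying in before auth-init runs.
BASELINE = {"CLAUDE.md", "config.toml", "settings.json"}


def check_claude_dir(root: str) -> int:
    """Return 0 (creds-ok), 65 (creds-missing) or 66 (creds-invalid)."""

    def fail(err):
        # os.walk swallows directory-listing errors by default; surface them.
        raise err

    try:
        if not os.path.isdir(root):
            return 65  # assumption: no directory means no auth happened
        for dirpath, _dirnames, filenames in os.walk(root, onerror=fail):
            for name in filenames:
                path = os.path.join(dirpath, name)
                mode = os.lstat(path).st_mode
                if not stat.S_ISREG(mode):
                    continue  # lstat reports symlinks as links, so they are skipped
                if dirpath == root and name in BASELINE:
                    continue  # pre-populated file, not evidence of auth
                return 0  # any other regular file counts as a credential artifact
    except OSError:
        return 66  # walk failure, e.g. permission denied
    return 65
```

Because no filename is pinned, a file at `.credentials/session.json` and a legacy `credentials.json` both yield 0 in this sketch, while a directory containing only baseline files yields 65.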

Codex CLI side stays pinned (~/.codex/auth.json) — Codex CLI
contract is stable on that path AND the entrypoint writes there
from CODEX_SESSION at first boot. Codex's review specifically
flagged only the claude side.

No JSON parse for claude — auth_init.py doesn't parse either,
because the format may differ across CLI versions and a
parse-failure on a real-but-unfamiliar artifact would be a worse
failure than a false-pass on a corrupt one (which the next real
task would catch immediately). Codex JSON parse stays because the
codex CLI contract IS stable.

Local exercise of all six cases (missing env / no ~/.claude /
empty .claude / baseline-only / .credentials/session.json /
legacy credentials.json) confirmed correct exit codes. The exact
repro from Codex's review (a file at .credentials/session.json)
now exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>