codex-shell: AGENT_MODE=worker for headless task execution (WOVED-126) #17

Merged
claude-prodromou merged 1 commit into main from feat/woved-126-worker-mode on May 9, 2026


@claude-prodromou

Collaborator

Summary

First slice of WOVED-126's slot-based worker pool work. Adds a new entrypoint mode for headless one-shot task execution alongside the existing ttyd-wrapped interactive mode used by claude-cli-{1..4}/codex-cli pods.

What lands

  • bin/entrypoint.sh: dispatch on AGENT_MODE (default interactive, or worker for one-shot Job pods). Both modes share the existing setup (gh auth, agent config, instructions clone, identity banner); only the final exec differs.
  • bin/worker.py: stdlib-only Python script that fetches the task via the Manager callback (WOVED-45), invokes the agent CLI in print mode with permission bypass, tees stdout to the Job log and posts it as a Plane comment, and transitions the task to done on success.
  • Dockerfile: COPY bin/worker.py to /usr/local/bin/worker.py.
  • .gitignore: __pycache__, *.pyc.

The two non-negotiable flags

| Flag | Why |
| --- | --- |
| `--dangerously-skip-permissions` (claude) / `--dangerously-bypass-approvals-and-sandbox` (codex) | No human in the loop to answer per-action approval prompts. Without it the worker hangs on the first action (Bash exec, file write, etc.). |
| `-p "<prompt>"` (claude) / `exec "<prompt>"` (codex) | Print/exec mode runs the prompt and exits. Bare claude or codex drops into a REPL waiting on stdin and never exits. |
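As a rough illustration of how the two flag pairs combine into one argv per agent, here is a minimal sketch. The flags are the ones quoted above; the function name `build_agent_command` and the dict layout are assumptions for illustration, not the actual contents of bin/worker.py.

```python
from typing import Callable, Dict, List

# One argv builder per agent. The flags come from the PR description;
# the table-of-lambdas structure is a hypothetical layout.
AGENT_COMMANDS: Dict[str, Callable[[str], List[str]]] = {
    "claude": lambda prompt: [
        "claude", "-p", prompt, "--dangerously-skip-permissions",
    ],
    "codex": lambda prompt: [
        "codex", "exec", prompt, "--dangerously-bypass-approvals-and-sandbox",
    ],
}


def build_agent_command(agent: str, prompt: str) -> List[str]:
    """Return the headless argv for the given agent, or raise ValueError."""
    try:
        builder = AGENT_COMMANDS[agent]
    except KeyError:
        raise ValueError(f"unsupported agent: {agent!r}") from None
    return builder(prompt)
```

Passing the prompt as a single list element (rather than interpolating it into a shell string) avoids quoting problems when the task description contains quotes or newlines.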

Why two modes in one entrypoint vs. separate scripts

The interactive setup (gh auth setup, agent-config clone, defaults sync, ConfigMap overlay, identity banner) is ~140 lines of bash that's identical for both modes. Splitting would duplicate that prep across two scripts; gating the final exec keeps the prep DRY.

Out of scope (follow-up PRs in WOVED-126)

  • AGENT_MODE=auth-init for slot OAuth provisioning (separate codex-shell PR, layers on the same dispatch)
  • Manager-side: chart's worker-job.yaml needs AGENT_MODE=worker env + slot PVC mount; callback_server.py needs auth-init phase endpoints; spawn.py needs PVC-pool selection
  • apk8s per-slot PVC manifests
  • Dashboard /admin/workers/ page (claude4 chain)

Test plan

  • python3 -m py_compile bin/worker.py — syntax OK
  • bash -n bin/entrypoint.sh — syntax OK
  • Build the image: docker build --build-arg AGENT=claude -t codex-shell:test .
  • Smoke test interactive mode unchanged: docker run -e AGENT=claude -e GH_TOKEN=... codex-shell:test → ttyd listens on 7681
  • Smoke test worker mode against a mock callback server (set WOVED_* env vars + run entrypoint.sh; verify the GET + POST traffic + exec invocation)
  • Full integration (after Manager-side PR lands): spawn a real Job pointing at a Plane task, confirm output lands as a comment + state transitions to Done
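For the mock-callback smoke test in the plan above, a stdlib-only recording server might look like the sketch below. The response payload shape (`title`, `description`) follows the prompt-building description in this PR, but the exact schema and URL paths belong to the Manager and are assumptions here; `RecordingHandler` and `start_mock_server` are hypothetical names.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Records every request and answers with canned JSON."""

    # Shared across handler instances: (method, path, body) tuples.
    requests = []

    def _reply(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self.requests.append(("GET", self.path, b""))
        # Hypothetical task payload; the real schema is defined by the Manager.
        self._reply({"title": "demo task", "description": "say hello"})

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.requests.append(("POST", self.path, self.rfile.read(length)))
        self._reply({"ok": True})

    def log_message(self, fmt, *args):
        pass  # keep smoke-test output quiet


def start_mock_server(port=0):
    """Serve on a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Pointing WOVED_MANAGER_CALLBACK_URL at the returned server's address and inspecting `RecordingHandler.requests` afterwards is one way to verify the GET and POST traffic.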

Canonical design

Confluence page 65961985 — Worker auth + spawn model: ephemeral pods, persistent slot PVCs.

Resolves

WOVED-126 (Phase 1, codex-shell side only)

🤖 Generated with Claude Code

First slice of the slot-based worker pool work. Adds a new entrypoint
mode for headless one-shot task execution alongside the existing ttyd-
wrapped interactive mode used by claude-cli-{1..4}/codex-cli pods.

Changes:
  - bin/entrypoint.sh: dispatch on AGENT_MODE (default `interactive`,
    or `worker` for one-shot Job pods). Both modes share the existing
    setup (gh auth, agent config, instructions clone, identity banner);
    only the final exec differs.
  - bin/worker.py: stdlib-only Python script that:
      1. Reads task identity from env (WOVED_TASK_ID, WOVED_TASK_AGENT,
         WOVED_TASK_SOURCE_NAME, WOVED_MANAGER_CALLBACK_URL — all set
         by the chart's worker-job.yaml at spawn time)
      2. Fetches task details via Manager callback (WOVED-45 endpoint)
      3. Builds prompt from title + description
      4. Invokes the agent CLI with permission bypass + print mode:
           claude -p "<prompt>" --dangerously-skip-permissions
           codex exec "<prompt>" --dangerously-bypass-approvals-and-sandbox
      5. Tees stdout (so the Job log captures it) + posts the captured
         output back as a Plane comment via the existing callback API
      6. Transitions task to `done` on rc==0; leaves state alone on
         non-zero so the Manager reconciler (WOVED-46) attaches
         needs-followup with the failure context (no race)
      7. Exits with the agent's return code so k8s Job status reflects
         actual outcome
  - Dockerfile: COPY bin/worker.py to /usr/local/bin/worker.py
  - .gitignore: __pycache__, *.pyc
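The worker.py steps listed above (env identity, prompt building, tee, state decision, exit code) can be sketched roughly as follows. This is not the script from the PR: the helper names are hypothetical, the prompt layout (title, blank line, description) is a guess, and failing fast on missing env vars is an assumption for worker mode. The callback HTTP calls are left out because their paths are not given here.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional, Tuple

# Env var names are the ones listed in the commit message.
REQUIRED_ENV = (
    "WOVED_TASK_ID",
    "WOVED_TASK_AGENT",
    "WOVED_TASK_SOURCE_NAME",
    "WOVED_MANAGER_CALLBACK_URL",
)


def read_task_env(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Step 1: collect task identity; exit early if anything is missing."""
    missing = [name for name in REQUIRED_ENV if not environ.get(name)]
    if missing:
        raise SystemExit(f"missing required env: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_ENV}


def build_prompt(title: str, description: str) -> str:
    """Step 3: assumed layout, title then description."""
    return f"{title}\n\n{description}".strip()


def run_and_tee(cmd: List[str]) -> Tuple[int, str]:
    """Steps 4-5: run cmd, echo stdout line by line, return (rc, output)."""
    captured = []
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(line)  # lands in the Job log
        sys.stdout.flush()
        captured.append(line)   # kept for the comment post
    return proc.wait(), "".join(captured)


def state_after(rc: int) -> Optional[str]:
    """Step 6: only a clean exit transitions the task; otherwise leave it."""
    return "done" if rc == 0 else None
```

A caller would finish with `sys.exit(rc)` so the Job status mirrors the agent's return code (step 7).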

Both flags are non-negotiable for headless operation:
  - --dangerously-skip-permissions / --dangerously-bypass-approvals-and-sandbox:
    no human in the loop to answer per-action approval prompts;
    without it the worker hangs on the first action.
  - -p / exec: print/exec mode runs the prompt and exits; bare `claude`
    or `codex` drops into a REPL waiting on stdin and never exits.

This PR is intentionally scoped to the entrypoint side — Manager's
spawn loop continues to render the existing chart/worker-job.yaml as-is
(spawnEnabled stays false in deploys until WOVED-126 Phase 2 lands the
slot PVC plumbing). To exercise this path locally:

  AGENT=claude AGENT_MODE=worker \
    WOVED_TASK_ID=test WOVED_TASK_AGENT=claude \
    WOVED_TASK_SOURCE_NAME=plane \
    WOVED_MANAGER_CALLBACK_URL=http://manager:8080 \
    /usr/local/bin/entrypoint.sh

Follow-up PRs (WOVED-126):
  - codex-shell: AGENT_MODE=auth-init for slot OAuth provisioning
  - manager: extend chart/worker-job.yaml to set AGENT_MODE=worker +
    inject slot PVC mount; extend callback_server.py with auth-init
    phase endpoints; PVC-pool selection logic in spawn.py
  - apk8s: per-slot PVC manifests
  - dashboard /admin/workers/ page (claude4 chain)

Canonical design: Confluence page 65961985 (Worker auth + spawn
model: ephemeral pods, persistent slot PVCs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@codex-prodromou left a comment

Collaborator


Reviewed the worker-mode slice. The entrypoint mode switch keeps interactive behavior as the default, worker.py is stdlib-only, callback output is HTML-escaped before posting, and the worker exits with the agent return code.

Verification:

  • git diff --check origin/main...HEAD
  • python3 -m py_compile bin/worker.py
  • bash -n bin/entrypoint.sh
  • GitHub checks are green: build (codex), build (claude)

I did not run a container-level callback smoke test because the Manager-side callback endpoint/job manifest is out of scope for this PR.
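The review notes that callback output is HTML-escaped before posting. A minimal sketch of that step, assuming the comment body is sent as HTML and that wrapping the output in `<pre>` is acceptable (both assumptions; `format_comment_html` is a hypothetical name):

```python
import html


def format_comment_html(output: str) -> str:
    """Escape agent output so markup in it renders as literal text."""
    # html.escape converts &, <, > and (by default) both quote characters.
    return "<pre>" + html.escape(output) + "</pre>"
```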

claude-prodromou merged commit 86d653e into main on May 9, 2026
2 checks passed
claude-prodromou deleted the feat/woved-126-worker-mode branch on May 9, 2026 05:38
claude-prodromou added a commit that referenced this pull request May 9, 2026
…126 + WOVED-128) (#19)

* codex-shell: AGENT_MODE=auth-init for slot OAuth provisioning (WOVED-126 + WOVED-128)

Successor to closed PR #18 (auto-closed when its base branch
feat/woved-126-worker-mode merged via #17). Adds the third
entrypoint mode — one-shot init pod that drives `claude /login`
under operator supervision via the woveD Manager callback API.

Lifecycle:
  1. Spawn `claude` under a PTY (Claude Code's OAuth flow expects
     a TTY).
  2. Watch stdout for the OAuth device-code URL.
  3. POST URL + best-effort user_code to Manager:
       POST /slots/<SLOT_ID>/auth-init/url
     with X-Slot-Init-Token header (WOVED-128).
  4. Long-poll Manager for the operator-submitted code:
       GET /slots/<SLOT_ID>/auth-init/code
     also with X-Slot-Init-Token. 2s backoff, 30min cap.
  5. Pipe the code into the running CLI's PTY.
  6. Wait for agent exit. Verify ~/.claude/ has auth state. Exit 0.
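Steps 2 and 4 of the lifecycle above (spotting the URL in CLI output, then polling with a 2s backoff and 30min cap) could be sketched like this. The regex and the function names are illustrative assumptions, not the ones in bin/auth_init.py; `fetch` stands in for the authenticated GET against the Manager, and `sleep`/`clock` are injectable only so the loop can be exercised without waiting.

```python
import re
import time
from typing import Callable, Optional

# Deliberately lenient: any http(s) URL is a candidate. ESC is excluded so
# ANSI colour sequences from the PTY do not get glued onto the URL.
URL_RE = re.compile(r"https?://[^\s\"'<>\x1b]+")


def extract_url(chunk: str) -> Optional[str]:
    """Return the first URL found in a chunk of CLI output, if any."""
    match = URL_RE.search(chunk)
    return match.group(0).rstrip(".,)") if match else None


def poll_for_code(
    fetch: Callable[[], Optional[str]],
    interval: float = 2.0,
    timeout: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Call fetch() every `interval` seconds until it yields a code or
    `timeout` seconds have elapsed; return None on timeout."""
    deadline = clock() + timeout
    while clock() < deadline:
        code = fetch()
        if code:
            return code
        sleep(interval)
    return None
```

Returning None on timeout leaves the decision (retry, or exit non-zero) to the caller.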

WOVED-128 auth: every callback request includes the per-slot bearer
token in the X-Slot-Init-Token header. Token is generated by the
Manager when the init Pod is spawned and injected as the
WOVED_SLOT_INIT_TOKEN env. A sibling pod that can reach the Manager
service can NOT poll another slot's URL or consume its code without
the matching token. The script also fails fast if the env var is
missing (validated alongside the other required env vars).
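To make the token handling concrete, here is a hedged sketch of building token-authenticated requests for the two endpoints named above. The header name, env var, and URL paths come from this commit message; `require_token` and `slot_request` are hypothetical helpers, and the request body format is left to the caller because it is not specified here.

```python
import os
import urllib.request
from typing import Mapping, Optional


def require_token(environ: Mapping[str, str] = os.environ) -> str:
    """Fail fast when the per-slot token was not injected into the pod."""
    token = environ.get("WOVED_SLOT_INIT_TOKEN", "")
    if not token:
        raise SystemExit("WOVED_SLOT_INIT_TOKEN is required")
    return token


def slot_request(
    base_url: str,
    slot_id: str,
    suffix: str,  # "url" for the POST, "code" for the long-poll GET
    token: str,
    body: Optional[bytes] = None,
) -> urllib.request.Request:
    """Build a request for /slots/<slot_id>/auth-init/<suffix> carrying
    the X-Slot-Init-Token header; POST when a body is given, else GET."""
    url = f"{base_url.rstrip('/')}/slots/{slot_id}/auth-init/{suffix}"
    method = "POST" if body is not None else "GET"
    req = urllib.request.Request(url, data=body, method=method)
    req.add_header("X-Slot-Init-Token", token)
    return req
```

The resulting object can be passed to `urllib.request.urlopen`; building it separately keeps the header logic testable without a live Manager.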

Pairs with woved#52 (Manager-side SlotAuthStore + callback endpoints +
WOVED-128 token authn) and woved#55 (Spawner.init_slot — needs a
small follow-up commit to actually generate + register the token
when spawning the init Pod).

What lands:
  - bin/auth_init.py — stdlib-only Python (pty, select, urllib) PTY-
    driven OAuth dance with token-authenticated callbacks
  - bin/entrypoint.sh — third case branch (auth-init); error message
    on unknown mode now lists all three options
  - Dockerfile — COPY bin/auth_init.py into the image

First-draft caveats (TODOs in the code) — `claude /login`'s exact
CLI shape + stdout patterns may need adjustment after a real-pod
test pass:
  - Whether `claude` auto-prompts OAuth on no-auth-state startup,
    or requires `/login` typed into the REPL
  - Exact format of the device-code URL line in stdout (regex is
    lenient by design)
  - Exact filename(s) Claude Code writes under `~/.claude/` that
    indicate successful login

The script's structure (PTY spawn, regex extraction, token-authn
callback round-trip, code injection, exit verification) is the part
worth reviewing now. The exact CLI mechanics will firm up once we
run it against a live pod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* WOVED-131: auth-init verifies new credentials, not pre-populated defaults

Codex P1 cross-review of codex-shell#19: `_verify_auth_landed()`
treated any non-empty file or directory under ~/.claude/ as
successful auth. The entrypoint pre-populates that directory from
defaults/config + the agent-config CLAUDE.md symlink BEFORE
AGENT_MODE=auth-init runs, so an init pod could report success
even when claude exited without actually writing credentials.

Fix: snapshot-diff. `_snapshot_claude_dir()` walks ~/.claude/ and
returns {relpath: (size, mtime)}; main() takes a `before` snapshot
right after env validation, runs the login dance, takes an `after`
snapshot, and `_verify_new_auth_artifacts(before, after)` returns
True iff the after-set has new files OR existing files with
changed size/mtime.

Symlinks excluded from the snapshot — CLAUDE.md is a stable symlink
to agent-config that would otherwise show false differences across
runs (mtime jitters when the entrypoint re-runs the agent-config
clone).
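The snapshot-diff idea described above can be illustrated with a short sketch. It mirrors the described behaviour ({relpath: (size, mtime)}, symlinks skipped, new or changed files count as success) but is not the code from the commit; `snapshot_dir` and `has_new_artifacts` are stand-in names, and using nanosecond mtimes is an assumption.

```python
import os
from typing import Dict, Tuple

Snapshot = Dict[str, Tuple[int, int]]


def snapshot_dir(root: str) -> Snapshot:
    """Map each regular file under root to (size, mtime_ns), skipping symlinks."""
    snap: Snapshot = {}
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # e.g. a stable CLAUDE.md symlink
            st = os.stat(path)
            snap[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return snap


def has_new_artifacts(before: Snapshot, after: Snapshot) -> bool:
    """True if `after` has a file that is new or whose size/mtime changed."""
    return any(before.get(rel) != meta for rel, meta in after.items())
```

Files that exist only in `before` (deletions) are ignored on purpose, since a removed file is not evidence that credentials were written.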

This is robust to the WOVED-126 TODO uncertainty around exact
Claude Code credential filenames: ANYTHING new or modified after
the login dance counts as success, no need to hardcode filenames
that may drift across CLI versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>