Pin agent CLI versions; harden managed-config sync (OPS-409, OPS-406) #10

Open
nprodromou wants to merge 1 commit into main from ops/pin-cli-versions-and-fix-config-copy

@nprodromou

Summary

  • OPS-409: Pin @openai/codex (0.129.0) and @anthropic-ai/claude-code (2.1.132) via Dockerfile ARGs, and record both versions in image LABELs and ENV, so that rebuilding the same commit produces the same agent CLI behavior.
  • OPS-406: Replace cp -fL ... 2>/dev/null || true with cp -afL, and exit with a FATAL message on failure. Add a smoke check that detects the non-empty-source / empty-destination case and fails the pod start instead of running with stale config.

Both originated as Codex P2 findings on PRs #2-#8.

Test plan

  • Image builds for both AGENT=codex and AGENT=claude
  • docker inspect <image> --format '{{ index .Config.Labels "com.prodromou.codex-shell.codex-cli-version" }}' returns the pinned version
  • Pod startup banner shows correct CLI version
  • Negative test: simulate empty AGENT_CONFIG_DIR after copy → entrypoint exits 1 with FATAL message
  • Positive test: deploy to apk8s agents namespace, confirm ConfigMap content (including subdirs) lands in /home/<agent>/.<agent>/

OPS-409: Pin @openai/codex and @anthropic-ai/claude-code to explicit
versions via ARGs (CODEX_CLI_VERSION, CLAUDE_CLI_VERSION). Versions
also recorded in image LABELs and exported as ENV so the runtime
banner can confirm what shipped. Rebuilding the same commit no longer
silently picks up a newer agent CLI.
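A minimal sketch of how a startup banner could report the pinned version from the exported ENV. The function name `print_cli_banner` and the banner format are hypothetical, and the sketch assumes the ENV variables reuse the ARG names (CODEX_CLI_VERSION, CLAUDE_CLI_VERSION) and that AGENT is also visible at runtime; the diff itself is not shown here, so none of this is confirmed.

```shell
#!/bin/sh
# Hypothetical banner helper; the real entrypoint's banner is not shown in this PR.
# Assumes the image exports the pinned versions under the same names as the ARGs.
print_cli_banner() {
    agent="${AGENT:-codex}"
    case "$agent" in
        codex)  pinned="${CODEX_CLI_VERSION:-unknown}" ;;
        claude) pinned="${CLAUDE_CLI_VERSION:-unknown}" ;;
        *)      pinned="unknown" ;;
    esac
    # One greppable line, so the pod log shows what actually shipped.
    echo "agent=${agent} pinned-cli-version=${pinned}"
}
```

Printing "unknown" rather than failing keeps the banner purely informational; whether a missing version should instead abort startup is a separate design choice.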

OPS-406: Replace `cp -fL ... 2>/dev/null || true` with `cp -afL` (-a
recurses + preserves attrs, -L dereferences ConfigMap symlinks).
Failures now exit with a clear FATAL message instead of being
masked. Adds a smoke check: if the ConfigMap mount is non-empty but
the destination ends up empty, fail loudly so stale managed config
can no longer ride a successful pod start.
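The copy-then-verify pattern described above can be sketched roughly as follows. The names `sync_config` and `dir_has_entries`, the argument layout, and the message wording are assumptions for illustration, not the entrypoint's actual code. The function returns non-zero rather than exiting so that it can be exercised in isolation; the caller decides to exit.

```shell
#!/bin/sh
# Sketch only: function names and message text are hypothetical.

# Succeeds when the directory contains at least one entry, dotfiles included.
dir_has_entries() {
    [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Copy the contents of src into dst; return non-zero on any failure.
sync_config() {
    src="$1"
    dst="$2"

    # A missing source is treated as "nothing to sync" (an assumption).
    [ -d "$src" ] || return 0

    mkdir -p "$dst" || { echo "FATAL: cannot create $dst" >&2; return 1; }

    # -a recurses and preserves attributes, -L dereferences ConfigMap symlinks.
    # stderr stays visible so the cause of a failure reaches the pod log.
    if ! cp -afL "$src"/. "$dst"/; then
        echo "FATAL: copying $src to $dst failed" >&2
        return 1
    fi

    # Smoke check: a non-empty source must not leave an empty destination.
    if dir_has_entries "$src" && ! dir_has_entries "$dst"; then
        echo "FATAL: $src is non-empty but $dst is empty after copy" >&2
        return 1
    fi
}
```

An entrypoint might call it as `sync_config "$CONFIGMAP_MOUNT" "$AGENT_CONFIG_DIR" || exit 1`, where CONFIGMAP_MOUNT is a placeholder for wherever the ConfigMap is mounted.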

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codex-prodromou left a comment

No blocking findings. This addresses the two Codex findings tracked as OPS-406 and OPS-409: the agent CLI npm packages are pinned, image metadata records the versions, the ConfigMap copy is recursive and dereferences symlinks, and failures are no longer hidden. The build matrix passes for both the codex and claude images.

claude-prodromou added a commit to claude-prodromou/codex-shell that referenced this pull request May 7, 2026
Codex review (CHANGES_REQUESTED): cp -fL without -R skipped
subdirectories, so the baked defaults wouldn't actually land in the
runtime config dir. Same root cause as OPS-406 (codex-shell#10).

Applies the hardened pattern from nprodromou#10 to BOTH config-copy layers:

  Layer 1 — image defaults (/etc/<agent>-defaults/):
    cp -afL with FATAL exit on failure + smoke check that catches
    silent permission/path failures.

  Layer 2 — ConfigMap overlay (/etc/<agent>-config/):
    Same pattern. Will rebase cleanly on top of nprodromou#10 (or vice versa)
    since the changes are textually identical.

Both layers now fail loudly instead of silently masking missing
config — same defense-in-depth as the OPS-406 fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-prodromou added a commit that referenced this pull request May 8, 2026

* OPS-405: bake codex runtime defaults into image, layered with ConfigMap

Adds defaults/codex-config.toml carrying a sensible baseline for the
codex variant:

  sandbox_mode    = "danger-full-access"   # pod is the security boundary
  approval_policy = "on-failure"           # no per-command prompts
  [projects."/home/codex/workspace"] trust_level = "trusted"

Why these values: the apk8s pod itself is the security sandbox (non-root
user, restricted RBAC, PVC isolation). Codex's internal bubblewrap layer
is redundant in this deployment AND was failing on
`bwrap: No permissions to create new namespace` because most hardened
k8s clusters block unprivileged user-namespace cloning. Disabling Codex's
inner sandbox eliminates the per-command escalation that OPS-405 calls
out as noisy.

Dockerfile copies defaults/<agent>-config.toml into /etc/<agent>-defaults/
during build (only the file matching the AGENT arg gets installed).

Entrypoint now layers two sources into ${AGENT_CONFIG_DIR}:
  Layer 1: /etc/<agent>-defaults/  — image baseline (this commit)
  Layer 2: /etc/<agent>-config/    — apk8s ConfigMap (existing, wins)

Per-deployment tweaks still go in the apk8s ConfigMap; this baseline
just means a fresh pod without a ConfigMap is still functional.
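As an illustration of the layering order, the sketch below copies each layer into the destination in turn, so a file in a later layer overwrites the same file from an earlier one. `apply_layers` is a hypothetical name, and the recursive `cp -afL` form is assumed here; the entrypoint's actual code may differ.

```shell
#!/bin/sh
# Sketch only: apply_layers is a hypothetical name, not the entrypoint's function.
# Layers are applied in argument order, so the last layer wins on conflicts.
apply_layers() {
    dest="$1"
    shift
    mkdir -p "$dest" || return 1
    for layer in "$@"; do
        # A missing layer is skipped, e.g. a pod started without a ConfigMap.
        [ -d "$layer" ] || continue
        cp -afL "$layer"/. "$dest"/ || {
            echo "FATAL: copying $layer to $dest failed" >&2
            return 1
        }
    done
}
```

With the paths named above, a call could look like `apply_layers "$AGENT_CONFIG_DIR" "/etc/${AGENT}-defaults" "/etc/${AGENT}-config"`. The override in this sketch is per file, not per key: a ConfigMap that ships its own copy of a file replaces the baked default file wholesale.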

Note on full fix scope: OPS-405's apk8s ConfigMap update remains a
separate Nate-action — the live pods today have a ConfigMap mounted,
which means this image-default doesn't reach them until either the
pods are rebuilt without their ConfigMap or the ConfigMap content is
updated to match this baseline. The values here can be copied into the
apk8s ConfigMap directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* OPS-405: harden config-copy on both layers (cp -afL + smoke check)

Codex review (CHANGES_REQUESTED): cp -fL without -R skipped
subdirectories, so the baked defaults wouldn't actually land in the
runtime config dir. Same root cause as OPS-406 (codex-shell#10).

Applies the hardened pattern from #10 to BOTH config-copy layers:

  Layer 1 — image defaults (/etc/<agent>-defaults/):
    cp -afL with FATAL exit on failure + smoke check that catches
    silent permission/path failures.

  Layer 2 — ConfigMap overlay (/etc/<agent>-config/):
    Same pattern. Will rebase cleanly on top of #10 (or vice versa)
    since the changes are textually identical.

Both layers now fail loudly instead of silently masking missing
config — same defense-in-depth as the OPS-406 fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>