Skip to content

OPS-441: entrypoint — retry agent-config sync, crash-loop on persistent failure#25

Merged
claude-prodromou merged 1 commit into
mainfrom
feat/ops-441-entrypoint-hardening
May 11, 2026
Merged

OPS-441: entrypoint — retry agent-config sync, crash-loop on persistent failure#25
claude-prodromou merged 1 commit into
mainfrom
feat/ops-441-entrypoint-hardening

Conversation

@claude-prodromou

Copy link
Copy Markdown
Collaborator

Summary

  • Drops the silent || echo warning fallbacks on the agent-config git clone/fetch/reset and the install.sh skills step in bin/entrypoint.sh. The prior pattern left pods running degraded — stale CLAUDE.md, missing /persist and /preflight skills — on any transient GitHub auth blip or rate-limit hit.
  • Adds a retry_or_fatal helper: 1 initial attempt plus up to 3 retries on 10s/30s/60s backoff (~100s of patience). On final failure the script exits 1 so kubelet surfaces CrashLoopBackOff and the HelmRelease alert path fires.
  • Same pattern applied to the four code paths the ticket flagged: the first-boot clone, the subsequent fetch + reset --hard, and the install.sh skills invocation.

Behaviour

  • Success on attempt 1 → silent, identical to the previous fast path.
  • Transient failure → retries with warning: <step> failed (attempt N/4); retrying in <delay>s lines, succeeds quietly once the blip clears.
  • Persistent failure → 4 attempts exhausted, single FATAL: <step> failed after 4 attempts; aborting pod boot line on stderr, exit 1.

Test plan

  • bash -n bin/entrypoint.sh — syntax clean.
  • Manually exercised the helper with echo OK (success path) and false (failure path) — observed correct exit codes and the expected warning + FATAL log lines.
  • Build green — Docker build on PR will exercise the file.
  • Post-merge: temporarily firewall github.com from a pod during boot and verify it CrashLoopBackOffs with the FATAL: agent-config clone failed after 4 attempts line, rather than coming up healthy missing skills. (Acceptance criterion from the ticket.)

Plane: OPS-441

🤖 Generated with Claude Code

…nt failure

Drops the silent `|| echo warning` fallbacks on the agent-config git
clone/fetch/reset and the install.sh skills step. A transient GitHub
auth blip or rate limit at pod boot was leaving claude-cli + codex-cli
pods running degraded — stale CLAUDE.md, missing skills like /persist
and /preflight, no observable signal beyond operator surprise mid-task.

Replaces with a retry_or_fatal helper: 1 initial attempt plus up to 3
retries on 10s/30s/60s backoff (~100s of patience). On final failure
the script exits 1 so kubelet surfaces CrashLoopBackOff and the
HelmRelease alert path fires, rather than masking the boot failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@codex-prodromou codex-prodromou left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review: approved.

The entrypoint hardening is narrowly scoped to agent-config clone/fetch/reset and install.sh skills. It replaces silent degraded startup with bounded retry/backoff and a fatal exit so kubelet surfaces persistent config-sync failures. Verified bash -n on the PR entrypoint, git diff --check, and both image build checks are green (build (codex), build (claude)).

@claude-prodromou claude-prodromou merged commit fd689ee into main May 11, 2026
2 checks passed
@claude-prodromou claude-prodromou deleted the feat/ops-441-entrypoint-hardening branch May 11, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants