OPS-441: entrypoint — retry agent-config sync, crash-loop on persistent failure#25
Merged
Merged
Conversation
…nt failure Drops the silent `|| echo warning` fallbacks on the agent-config git clone/fetch/reset and the install.sh skills step. A transient GitHub auth blip or rate limit at pod boot was leaving claude-cli + codex-cli pods running degraded — stale CLAUDE.md, missing skills like /persist and /preflight, no observable signal beyond operator surprise mid-task. Replaces with a retry_or_fatal helper: 1 initial attempt plus up to 3 retries on 10s/30s/60s backoff (~100s of patience). On final failure the script exits 1 so kubelet surfaces CrashLoopBackOff and the HelmRelease alert path fires, rather than masking the boot failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codex-prodromou
approved these changes
May 11, 2026
codex-prodromou
left a comment
Collaborator
There was a problem hiding this comment.
Codex review: approved.
The entrypoint hardening is narrowly scoped to agent-config clone/fetch/reset and install.sh skills. It replaces silent degraded startup with bounded retry/backoff and a fatal exit so kubelet surfaces persistent config-sync failures. Verified bash -n on the PR entrypoint, git diff --check, and both image build checks are green (build (codex), build (claude)).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
|| echo warningfallbacks on the agent-config git clone/fetch/reset and theinstall.sh skillsstep inbin/entrypoint.sh. The prior pattern left pods running degraded — staleCLAUDE.md, missing/persistand/preflightskills — on any transient GitHub auth blip or rate-limit hit.retry_or_fatalhelper: 1 initial attempt plus up to 3 retries on 10s/30s/60s backoff (~100s of patience). On final failure the script exits 1 so kubelet surfacesCrashLoopBackOffand the HelmRelease alert path fires.install.sh skillsinvocation.Behaviour
warning: <step> failed (attempt N/4); retrying in <delay>slines, succeeds quietly once the blip clears.FATAL: <step> failed after 4 attempts; aborting pod bootline on stderr,exit 1.Test plan
bash -n bin/entrypoint.sh— syntax clean.echo OK(success path) andfalse(failure path) — observed correct exit codes and the expected warning + FATAL log lines.github.comfrom a pod during boot and verify itCrashLoopBackOffs with theFATAL: agent-config clone failed after 4 attemptsline, rather than coming up healthy missing skills. (Acceptance criterion from the ticket.)Plane: OPS-441
🤖 Generated with Claude Code