nodediscovery: make CiliumNode update retry budget configurable by tanle-ux · Pull Request #1 · tanle-ux/cilium

tanle-ux · 2026-06-09T22:45:45Z

What

cilium-agent is hard-coupled to the kube-apiserver on two paths, both of which exit the agent during a control-plane / API server outage, producing a cluster-wide crash-loop. This PR makes the timeout/retry budget on both paths configurable.

| Path | Before | New flag(s) |
|---|---|---|
| Running agent: `nodediscovery` `CiliumNode` create/update | fixed 10× / 500ms, then `logging.Fatal` | `--cilium-node-update-max-retries`, `--cilium-node-update-retry-backoff` |
| Cold start: k8s client initial connection (`waitForConn`) | hardcoded 1 min (`connTimeout`), then hive start fails → exit | `--k8s-client-connection-retry-timeout` |

Non-positive values fall back to the built-in defaults.
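
The fallback rule and the retry budget can be illustrated with a minimal Go sketch. This is not the code in `pkg/nodediscovery`: the names `effectiveRetries`, `effectiveBackoff`, `updateWithRetry` and `errUnavailable` are hypothetical, and the sketch returns an error where the agent, per the table above, exits via `logging.Fatal`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Built-in defaults, taken from the "Before" column of the PR description.
const (
	defaultMaxRetries   = 10
	defaultRetryBackoff = 500 * time.Millisecond
)

// errUnavailable is a hypothetical stand-in for any failed apiserver write.
var errUnavailable = errors.New("apiserver unavailable")

// effectiveRetries returns the configured attempt count,
// or the built-in default when the value is non-positive.
func effectiveRetries(configured int) int {
	if configured <= 0 {
		return defaultMaxRetries
	}
	return configured
}

// effectiveBackoff applies the same non-positive fallback to the backoff.
func effectiveBackoff(configured time.Duration) time.Duration {
	if configured <= 0 {
		return defaultRetryBackoff
	}
	return configured
}

// updateWithRetry calls update until it succeeds or the budget is spent.
// It returns the wrapped last error once all attempts have failed.
func updateWithRetry(maxRetries int, backoff time.Duration, update func() error) error {
	retries := effectiveRetries(maxRetries)
	wait := effectiveBackoff(backoff)
	var err error
	for attempt := 1; attempt <= retries; attempt++ {
		if err = update(); err == nil {
			return nil
		}
		if attempt < retries {
			time.Sleep(wait) // no sleep after the final attempt
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", retries, err)
}

func main() {
	calls := 0
	err := updateWithRetry(50, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errUnavailable // simulate two failed writes
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Returning an error instead of exiting keeps the sketch testable; whether the agent sleeps after its final attempt is not stated in this PR description, so that detail is an assumption.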

How to configure (flags / Helm `extraArgs` / `cilium-config` ConfigMap)

--cilium-node-update-max-retries=50
--cilium-node-update-retry-backoff=2s
--k8s-client-connection-retry-timeout=10m
--hive-start-timeout=10m            # see note below

Running-agent give-up ≈ max-retries × (per-request time + backoff).
Cold-start wait = min(--k8s-client-connection-retry-timeout, --hive-start-timeout).
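
As a worked example of the two formulas above, the following Go sketch plugs in the flag values from this section. The 10s per-request time is purely an assumption for illustration (the real figure depends on client and apiserver timeouts), and `giveUpAfter` and `coldStartWait` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

// giveUpAfter approximates how long a running agent keeps retrying:
// max-retries × (per-request time + backoff).
func giveUpAfter(maxRetries int, perRequest, backoff time.Duration) time.Duration {
	return time.Duration(maxRetries) * (perRequest + backoff)
}

// coldStartWait returns the smaller of the two cold-start timeouts.
func coldStartWait(connRetryTimeout, hiveStartTimeout time.Duration) time.Duration {
	if connRetryTimeout < hiveStartTimeout {
		return connRetryTimeout
	}
	return hiveStartTimeout
}

func main() {
	perRequest := 10 * time.Second // assumed time for one failed request

	// Previous fixed budget: 10 × (10s + 500ms).
	fmt.Println(giveUpAfter(10, perRequest, 500*time.Millisecond)) // 1m45s

	// Example flags from this PR: 50 × (10s + 2s).
	fmt.Println(giveUpAfter(50, perRequest, 2*time.Second)) // 10m0s

	// A 10m connection timeout is still capped by a 5m hive start timeout.
	fmt.Println(coldStartWait(10*time.Minute, 5*time.Minute)) // 5m0s
}
```

The last line shows why `--hive-start-timeout` appears in the flag list: raising only the connection retry timeout has no effect beyond the start-hook timeout.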

⚠️ Temporary testing defaults

For control-plane-outage testing the defaults are set very large so the agent keeps retrying instead of exiting:

cilium-node-update-max-retries = 1000000 (was 10)
k8s-client-connection-retry-timeout = 168h (was 1m)
HiveStartTimeout (pkg/defaults) = 168h (was 5m) — raised so the cold-start wait is not capped by the start-hook timeout.

All three must be reverted (10 / 1m / 5m) before merging upstream.

Changes

pkg/nodediscovery/{cell,nodediscovery}.go + test: config-driven retry count + backoff, <=0 fallback.
pkg/k8s/client/{config,cell}.go: K8sClientConnectionRetryTimeout field + flag; waitForConn uses it.
pkg/defaults/defaults.go: raise HiveStartTimeout default (testing).
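
To illustrate how the cold-start wait interacts with the hive start timeout, here is a rough Go sketch using nested context deadlines. It is not the actual `waitForConn` implementation: `waitForAPIServer`, `errRefused`, the polling interval and the probe callback are all hypothetical, and the outer context only stands in for the start-hook context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// defaultConnTimeout mirrors the previously hardcoded 1 minute connTimeout.
const defaultConnTimeout = time.Minute

// errRefused is a hypothetical stand-in for a failed connection attempt.
var errRefused = errors.New("connection refused")

// waitForAPIServer polls probe until it succeeds or a deadline expires.
// hiveStartTimeout models the start-hook deadline; connTimeout models
// --k8s-client-connection-retry-timeout with its non-positive fallback.
func waitForAPIServer(hiveStartTimeout, connTimeout, interval time.Duration, probe func() error) error {
	if connTimeout <= 0 {
		connTimeout = defaultConnTimeout
	}
	// Outer deadline: stand-in for the hive start-hook context.
	hookCtx, cancelHook := context.WithTimeout(context.Background(), hiveStartTimeout)
	defer cancelHook()
	// Inner deadline inherits the outer one, so the earlier deadline wins.
	ctx, cancel := context.WithTimeout(hookCtx, connTimeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := probe()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("apiserver unreachable: %w (last error: %v)", ctx.Err(), err)
		case <-ticker.C:
			// Next polling round.
		}
	}
}

func main() {
	start := time.Now()
	// The 10m connection timeout is capped by the 50ms outer deadline.
	err := waitForAPIServer(50*time.Millisecond, 10*time.Minute, 5*time.Millisecond,
		func() error { return errRefused })
	fmt.Println(err != nil, time.Since(start) < time.Second) // prints "true true"
}
```

Because a derived context can never outlive its parent, the effective wait is the minimum of the two timeouts, which matches the reason given above for raising `HiveStartTimeout` alongside the connection retry timeout.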

Why not just crash and let kubelet restart?

The k8s client self-heals a live connection (30s heartbeat → closeAllConns → apiserver rotation; informers re-watch with backoff), so a surviving agent reconnects on its own when the apiserver returns and keeps its eBPF datapath and informer caches. Crashing throws those away, and during a long outage the restarted agent cannot even complete cold start, because the initial-connection wait expires again before the apiserver is back.

Remaining before upstream

Revert the three temporary large defaults.
Regenerate cmdref docs (make -C Documentation update-cmdref).

Make the cilium-agent's API-server coupling timeouts configurable via
--cilium-node-update-max-retries, --cilium-node-update-retry-backoff and
--k8s-client-connection-retry-timeout, so the agent can tolerate longer API
server outages before exiting. Once the temporary testing defaults are
reverted, the defaults preserve existing behavior.

🤖 Generated with Claude Code

…able

cilium-agent is hard-coupled to the kube-apiserver on two paths, both of which exit the agent during a control-plane / API server outage and drive a cluster-wide crash-loop:

1. Running agent (nodediscovery): updateCiliumNodeResource retries a fixed 10x / 500ms and then logging.Fatal-exits when it cannot write its CiliumNode resource.
2. Cold start (k8s client): waitForConn retries the initial connection for a hardcoded 1 minute (connTimeout) and then fails the hive start hook, exiting the agent before the datapath comes up.

Make both budgets configurable so operators can tune how long the agent tolerates an API server outage before giving up:

--cilium-node-update-max-retries (CiliumNode create/update attempts)
--cilium-node-update-retry-backoff (backoff between attempts)
--k8s-client-connection-retry-timeout (cold-start initial-connection wait)

Non-positive values fall back to the built-in defaults.

NOTE: for control-plane-outage testing the defaults are temporarily set very large so the agent keeps retrying instead of exiting:

- cilium-node-update-max-retries: 1000000 (was 10)
- k8s-client-connection-retry-timeout: 168h (was 1m)
- HiveStartTimeout: 168h (was 5m, raised so the cold-start wait is not capped by the start-hook timeout)

Revert all three to their previous values before merging upstream.

Adds TestUpdateCiliumNodeResourceConfigurableRetries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: tanle-ux <tanle@roblox.com>

tanle-ux force-pushed the make-ciliumnode-update-retries-configurable branch 3 times, most recently from 13fcbaa to 0eb4b72 Compare June 10, 2026 00:06

tanle-ux force-pushed the make-ciliumnode-update-retries-configurable branch from 0eb4b72 to 649a087 Compare June 10, 2026 13:06

tanle-ux wants to merge 1 commit into main from make-ciliumnode-update-retries-configurable
