nodediscovery: make CiliumNode update retry budget configurable#1

Open
tanle-ux wants to merge 1 commit into main from make-ciliumnode-update-retries-configurable

tanle-ux commented Jun 9, 2026

What

cilium-agent is hard-coupled to the kube-apiserver on two paths, both of which exit the agent during a control-plane / API server outage, producing a cluster-wide crash-loop. This PR makes the timeout/retry budget on both paths configurable.

| Path | Before | New flag |
| --- | --- | --- |
| Running agent: nodediscovery CiliumNode create/update | fixed 10× / 500ms, then logging.Fatal | --cilium-node-update-max-retries, --cilium-node-update-retry-backoff |
| Cold start: k8s client initial connection (waitForConn) | hardcoded 1 min (connTimeout), then hive start fails → exit | --k8s-client-connection-retry-timeout |

Non-positive values fall back to the built-in defaults.

How to configure (flags / Helm extraArgs / cilium-config ConfigMap)

--cilium-node-update-max-retries=50
--cilium-node-update-retry-backoff=2s
--k8s-client-connection-retry-timeout=10m
--hive-start-timeout=10m            # see note below
  • Running-agent give-up ≈ max-retries × (per-request time + backoff).
  • Cold-start wait = min(--k8s-client-connection-retry-timeout, --hive-start-timeout).

⚠️ Temporary testing defaults

For control-plane-outage testing the defaults are set very large so the agent keeps retrying instead of exiting:

  • cilium-node-update-max-retries = 1000000 (was 10)
  • k8s-client-connection-retry-timeout = 168h (was 1m)
  • HiveStartTimeout (pkg/defaults) = 168h (was 5m) — raised so the cold-start wait is not capped by the start-hook timeout.

All three must be reverted (10 / 1m / 5m) before merging upstream.

Changes

  • pkg/nodediscovery/{cell,nodediscovery}.go + test: config-driven retry count + backoff, <=0 fallback.
  • pkg/k8s/client/{config,cell}.go: K8sClientConnectionRetryTimeout field + flag; waitForConn uses it.
  • pkg/defaults/defaults.go: raise HiveStartTimeout default (testing).

Why not just crash and let kubelet restart?

The k8s client self-heals a live connection (30s heartbeat → closeAllConns → apiserver rotation; informers re-watch with backoff), so a surviving agent reconnects on its own when the apiserver returns — keeping its eBPF datapath and informer caches. Crashing throws those away and, during a long outage, can't even complete cold start.

Remaining before upstream

  • Revert the three temporary large defaults.
  • Regenerate cmdref docs (make -C Documentation update-cmdref).

Make the cilium-agent's API-server coupling timeouts configurable via
--cilium-node-update-max-retries, --cilium-node-update-retry-backoff and
--k8s-client-connection-retry-timeout, so the agent can tolerate longer API
server outages before exiting. Once the temporary testing defaults are
reverted, the defaults preserve existing behavior.

🤖 Generated with Claude Code

tanle-ux force-pushed the make-ciliumnode-update-retries-configurable branch 3 times, most recently from 13fcbaa to 0eb4b72 on June 10, 2026 00:06
…able

cilium-agent is hard-coupled to the kube-apiserver on two paths, both of
which exit the agent during a control-plane / API server outage and drive a
cluster-wide crash-loop:

1. Running agent (nodediscovery): updateCiliumNodeResource retries a fixed
   10x / 500ms and then logging.Fatal-exits when it cannot write its
   CiliumNode resource.
2. Cold start (k8s client): waitForConn retries the initial connection for a
   hardcoded 1 minute (connTimeout) and then fails the hive start hook,
   exiting the agent before the datapath comes up.

Make both budgets configurable so operators can tune how long the agent
tolerates an API server outage before giving up:

  --cilium-node-update-max-retries      (CiliumNode create/update attempts)
  --cilium-node-update-retry-backoff    (backoff between attempts)
  --k8s-client-connection-retry-timeout (cold-start initial-connection wait)

Non-positive values fall back to the built-in defaults.

NOTE: for control-plane-outage testing the defaults are temporarily set very
large so the agent keeps retrying instead of exiting:
  - cilium-node-update-max-retries:      1000000   (was 10)
  - k8s-client-connection-retry-timeout: 168h      (was 1m)
  - HiveStartTimeout:                    168h      (was 5m, raised so the
    cold-start wait is not capped by the start-hook timeout)
Revert all three to their previous values before merging upstream.

Adds TestUpdateCiliumNodeResourceConfigurableRetries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: tanle-ux <tanle@roblox.com>
tanle-ux force-pushed the make-ciliumnode-update-retries-configurable branch from 0eb4b72 to 649a087 on June 10, 2026 13:06