nodediscovery: make CiliumNode update retry budget configurable#1
Open
tanle-ux wants to merge 1 commit into
cilium-agent is hard-coupled to the kube-apiserver on two paths, both of
which exit the agent during a control-plane / API server outage and drive a
cluster-wide crash-loop:
1. Running agent (nodediscovery): updateCiliumNodeResource retries a fixed
   10 times with a 500ms backoff and then exits via logging.Fatal when it
   cannot write its CiliumNode resource.
2. Cold start (k8s client): waitForConn retries the initial connection for a
   hardcoded 1 minute (connTimeout) and then fails the hive start hook,
   exiting the agent before the datapath comes up.
Make both budgets configurable so operators can tune how long the agent
tolerates an API server outage before giving up:
--cilium-node-update-max-retries (CiliumNode create/update attempts)
--cilium-node-update-retry-backoff (backoff between attempts)
--k8s-client-connection-retry-timeout (cold-start initial-connection wait)
Non-positive values fall back to the built-in defaults.
NOTE: for control-plane-outage testing the defaults are temporarily set very
large so the agent keeps retrying instead of exiting:
- cilium-node-update-max-retries: 1000000 (was 10)
- k8s-client-connection-retry-timeout: 168h (was 1m)
- HiveStartTimeout: 168h (was 5m; raised so the cold-start wait is not
  capped by the start-hook timeout)
Revert all three to their previous values before merging upstream.
Adds TestUpdateCiliumNodeResourceConfigurableRetries.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: tanle-ux <tanle@roblox.com>
What
cilium-agent is hard-coupled to the kube-apiserver on two paths, both of which exit the agent during a control-plane / API server outage, producing a cluster-wide crash-loop. This PR makes the timeout/retry budget on both paths configurable.
- Running agent (nodediscovery, CiliumNode create/update): retries a fixed 10x / 500ms, then exits via logging.Fatal. New flags: --cilium-node-update-max-retries, --cilium-node-update-retry-backoff.
- Cold start (k8s client, waitForConn): retries for a hardcoded 1 minute (connTimeout), then hive start fails → exit. New flag: --k8s-client-connection-retry-timeout.
Non-positive values fall back to the built-in defaults.
How to configure (flags / Helm extraArgs / cilium-config ConfigMap)
- Running-agent budget: approximately max-retries × (per-request time + backoff).
- Cold-start budget: min(--k8s-client-connection-retry-timeout, --hive-start-timeout).
For control-plane-outage testing the defaults are set very large so the agent keeps retrying instead of exiting:
- cilium-node-update-max-retries = 1000000 (was 10)
- k8s-client-connection-retry-timeout = 168h (was 1m)
- HiveStartTimeout (pkg/defaults) = 168h (was 5m), raised so the cold-start wait is not capped by the start-hook timeout.
All three must be reverted (10 / 1m / 5m) before merging upstream.
Changes
- pkg/nodediscovery/{cell,nodediscovery}.go + test: config-driven retry count + backoff, <=0 fallback.
- pkg/k8s/client/{config,cell}.go: K8sClientConnectionRetryTimeout field + flag; waitForConn uses it.
- pkg/defaults/defaults.go: raise HiveStartTimeout default (testing).
Why not just crash and let kubelet restart?
The k8s client self-heals a live connection (30s heartbeat → closeAllConns → apiserver rotation; informers re-watch with backoff), so a surviving agent reconnects on its own when the apiserver returns, keeping its eBPF datapath and informer caches. Crashing throws those away and, during a long outage, the restarted agent cannot even complete cold start.
Remaining before upstream
- Regenerate the cmdref documentation (make -C Documentation update-cmdref).
🤖 Generated with Claude Code