first test of degraded OnStart #27
This re-fires the heartbeat and the k8s version check.
| connTimeout = time.Minute | ||
| connRetryInterval = 5 * time.Second | ||
| k8sHeartbeatControllerGroup = controller.NewGroup("k8s-heartbeat") | ||
| k8sConnRecoveryControllerGroup = controller.NewGroup("k8s-conn-recovery") |
This var block is not gofmt-aligned — gofmt -l flags the file. Run gofmt -w (the = columns need re-aligning after adding the longer name), otherwise the make checkpatch/gofmt CI gate will fail.
| logfields.Error, err, | ||
| ) | ||
| c.startConnRecovery() | ||
| return nil |
In the normal path k8sversion.Update() (l.259) populates the global version + capabilities before onStart returns. The degraded path returns before that runs, so the agent operates with default/empty capabilities until recovery succeeds. Is downstream code that reads k8sversion.Capabilities() safe with unset values during the degraded window?
This I should dig into more. I wonder what the default capabilities are, and what would be missing if the version check has not succeeded. I'll check it out.
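One way to reason about the question: if the cached capabilities are a plain struct that only the update call fills in, then every boolean field reads as `false` until recovery succeeds. The sketch below assumes that shape. `serverCapabilities`, `cached`, `capabilities`, and `update` are hypothetical stand-ins, not the real `k8sversion` package, and only `MinimalVersionMet` is taken from the diff.

```go
package main

import "fmt"

// serverCapabilities is a hypothetical stand-in for the value returned by
// k8sversion.Capabilities(). Only MinimalVersionMet appears in the diff;
// ExampleFeature is purely illustrative.
type serverCapabilities struct {
	MinimalVersionMet bool
	ExampleFeature    bool
}

// cached mimics a package-level cache that is only populated by update.
// Assumption: the real cache also starts out zero-valued.
var cached serverCapabilities

// capabilities returns whatever is cached, set or not.
func capabilities() serverCapabilities { return cached }

// update stores freshly discovered capabilities.
func update(c serverCapabilities) { cached = c }

func main() {
	// Degraded window: update has not run, so every field is false.
	fmt.Printf("before update: %+v\n", capabilities())

	// After recovery the version check fills in the real values.
	update(serverCapabilities{MinimalVersionMet: true, ExampleFeature: true})
	fmt.Printf("after update:  %+v\n", capabilities())
}
```

Under this assumption, callers cannot tell "feature unsupported" from "not discovered yet", so any code path that branches on a capability would take its "unsupported" branch during the degraded window.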
| c.logger.Warn("Unable to connect to k8s API server on startup; continuing in degraded state", | ||
| logfields.Error, err, | ||
| ) | ||
| c.startConnRecovery() |
This degraded-start feature doesn't extend to the operator, which embeds the same client.Cell and flag. When this path returns nil with the version unset, runOperator (operator/cmd/root.go:525-530) sees MinimalVersionMet == false and calls logging.Fatal, so the operator exits during the outage instead of running degraded. Not a regression (the operator already fails to start when the apiserver is unreachable), but if degraded-start is meant to cover the operator, that version gate needs the same relaxation.
This is good to know but it does not impact operator.
ah ya. I didn't even think of the operator
| logfields.IPAddr, c.restConfigManager.getConfig().Host, | ||
| logfields.Error, err, | ||
| ) | ||
| return nil |
Returning nil while the apiserver is still unreachable makes the controller record a success every interval (growing successCount, lastError=nil), so cilium status --all-controllers and the controller metrics show zero failures and operators can't see the node is stuck degraded. Prefer return err here (and optionally attach a Health reporter) so status/metrics reflect the degraded state — note this also engages the error-backoff cadence, so pair it with MaxRetryInterval if you want to keep a steady retry rate.
| c.logger.Info("Re-established connection to API server. Exiting degraded state", | ||
| logfields.IPAddr, c.restConfigManager.getConfig().Host, | ||
| ) | ||
| // start the heartbeat as this was previously skipped | ||
| c.startHeartbeat() | ||
| // do the k8s version check. might remove | ||
| if err := k8sversion.Update(c.logger, c, c.config.EnableK8sAPIDiscovery); err != nil { | ||
| c.logger.Warn("k8s version check failed after reconnect", logfields.Error, err) | ||
| } else if !k8sversion.Capabilities().MinimalVersionMet { | ||
| c.logger.Warn("k8s version does not meet minimal standardc", | ||
| "version", k8sversion.Version(), | ||
| "minVersion", k8sversion.MinimalVersionConstraint, | ||
| ) | ||
| } | ||
| c.controller.RemoveController(controllerName) | ||
| return nil |
RemoveController runs unconditionally here, even when k8sversion.Update() only logged a Warn. A successful isConnReady (a kube-system GET) does not guarantee ServerVersion() succeeds, so the controller can be destroyed with the cached version still zero-valued for the process lifetime — CEP sync (endpointsynchronizer.go:~120) then stays permanently broken while the agent looks healthy.
Reorder so the version gate runs first and the controller is only torn down once the version is confirmed; otherwise return err to keep retrying. This also resolves the fatal-vs-warn asymmetry (degraded now stays degraded and retries instead of silently running on an unsupported version), starts the heartbeat only after the connection is confirmed (avoiding a premature/duplicate heartbeat), drops the leftover // might remove note, and fixes the standardc typo.
Note: return err switches the controller to its error-backoff path (errorRetries * 1s, uncapped). If you want to keep the steady 5s cadence, also set MaxRetryInterval: connRetryInterval (or an ErrorRetryBaseDuration) on the ControllerParams.
| c.logger.Info("Re-established connection to API server. Exiting degraded state", | |
| logfields.IPAddr, c.restConfigManager.getConfig().Host, | |
| ) | |
| // start the heartbeat as this was previously skipped | |
| c.startHeartbeat() | |
| // do the k8s version check. might remove | |
| if err := k8sversion.Update(c.logger, c, c.config.EnableK8sAPIDiscovery); err != nil { | |
| c.logger.Warn("k8s version check failed after reconnect", logfields.Error, err) | |
| } else if !k8sversion.Capabilities().MinimalVersionMet { | |
| c.logger.Warn("k8s version does not meet minimal standardc", | |
| "version", k8sversion.Version(), | |
| "minVersion", k8sversion.MinimalVersionConstraint, | |
| ) | |
| } | |
| c.controller.RemoveController(controllerName) | |
| return nil | |
| // A successful isConnReady (kube-system GET) does not guarantee the | |
| // version discovery call below succeeds. Only exit degraded state once | |
| // the version is confirmed; until then stay degraded and let the | |
| // controller retry rather than tearing it down with an unset version. | |
| if err := k8sversion.Update(c.logger, c, c.config.EnableK8sAPIDiscovery); err != nil { | |
| c.logger.Warn("k8s version check failed after reconnect; staying degraded", logfields.Error, err) | |
| return err | |
| } | |
| if !k8sversion.Capabilities().MinimalVersionMet { | |
| return fmt.Errorf("k8s version (%v) does not meet minimal requirement (%v); staying degraded", | |
| k8sversion.Version(), k8sversion.MinimalVersionConstraint) | |
| } | |
| c.logger.Info("Re-established connection to API server. Exiting degraded state", | |
| logfields.IPAddr, c.restConfigManager.getConfig().Host, | |
| ) | |
| // Start the heartbeat (skipped during degraded onStart) now that the | |
| // connection is confirmed usable, then stop retrying. | |
| c.startHeartbeat() | |
| c.controller.RemoveController(controllerName) | |
| return nil |
| flags.Duration(option.K8sClientConnectionKeepAlive, def.K8sClientConnectionKeepAlive, "Configures the keep alive duration of K8s client connections. K8 client is disabled if the value is set to 0") | ||
| flags.Duration(option.K8sHeartbeatTimeout, def.K8sHeartbeatTimeout, "Configures the timeout for api-server heartbeat, set to 0 to disable") | ||
| flags.Bool(option.K8sEnableAPIDiscovery, def.EnableK8sAPIDiscovery, "Enable discovery of Kubernetes API groups and resources with the discovery API") | ||
| flags.Bool(option.IgnoreApiserverFailOnStart, def.IgnoreApiserverFailOnStart, "When true, failure to connect to the k8s API server on startup is non-fatal; the agent starts in a degraded state") |
New agent flag — regenerate the cmdref docs (make -C Documentation update-cmdref) or the Documentation/cmdref CI check will fail.