Skip to content

first test of degraded OnStart#27

Open
lynchc wants to merge 2 commits into
mainfrom
cl/degradedstart
Open

first test of degraded OnStart#27
lynchc wants to merge 2 commits into
mainfrom
cl/degradedstart

Conversation

@lynchc

@lynchc lynchc commented Jun 10, 2026

Copy link
Copy Markdown

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Fixes: #issue-number

<!-- Enter the release note text here if needed or remove this section! -->

Comment thread pkg/k8s/client/cell.go
connTimeout = time.Minute
connRetryInterval = 5 * time.Second
k8sHeartbeatControllerGroup = controller.NewGroup("k8s-heartbeat")
k8sConnRecoveryControllerGroup = controller.NewGroup("k8s-conn-recovery")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This var block is not gofmt-aligned — gofmt -l flags the file. Run gofmt -w (the = columns need re-aligning after adding the longer name), otherwise the make checkpatch/gofmt CI gate will fail.

Comment thread pkg/k8s/client/cell.go
logfields.Error, err,
)
c.startConnRecovery()
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the normal path k8sversion.Update() (l.259) populates the global version + capabilities before onStart returns. The degraded path returns before that runs, so the agent operates with default/empty capabilities until recovery succeeds. Is downstream code that reads k8sversion.Capabilities() safe with unset values during the degraded window?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this I should dig into more. I wonder what the default capabilities are? what would be missing without the version check succeeding. I'll check it out

Comment thread pkg/k8s/client/cell.go
c.logger.Warn("Unable to connect to k8s API server on startup; continuing in degraded state",
logfields.Error, err,
)
c.startConnRecovery()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This degraded-start feature doesn't extend to the operator, which embeds the same client.Cell and flag. When this path returns nil with the version unset, runOperator (operator/cmd/root.go:525-530) sees MinimalVersionMet == false and calls logging.Fatal, so the operator exits during the outage instead of running degraded. Not a regression (the operator already fails to start when the apiserver is unreachable), but if degraded-start is meant to cover the operator, that version gate needs the same relaxation.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to know but it does not impact operator.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ya. I didn't even think of the operator

Comment thread pkg/k8s/client/cell.go
logfields.IPAddr, c.restConfigManager.getConfig().Host,
logfields.Error, err,
)
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning nil while the apiserver is still unreachable makes the controller record a success every interval (growing successCount, lastError=nil), so cilium status --all-controllers and the controller metrics show zero failures and operators can't see the node is stuck degraded. Prefer return err here (and optionally attach a Health reporter) so status/metrics reflect the degraded state — note this also engages the error-backoff cadence, so pair it with MaxRetryInterval if you want to keep a steady retry rate.

Comment thread pkg/k8s/client/cell.go
Comment on lines +337 to +354
c.logger.Info("Re-established connection to API server. Exiting degraded state",
logfields.IPAddr, c.restConfigManager.getConfig().Host,
)
// start the heartbeat as this was previously skipped
c.startHeartbeat()

// do the k8s version check. might remove
if err := k8sversion.Update(c.logger, c, c.config.EnableK8sAPIDiscovery); err != nil {
c.logger.Warn("k8s version check failed after reconnect", logfields.Error, err)
} else if !k8sversion.Capabilities().MinimalVersionMet {
c.logger.Warn("k8s version does not meet minimal standardc",
"version", k8sversion.Version(),
"minVersion", k8sversion.MinimalVersionConstraint,
)
}

c.controller.RemoveController(controllerName)
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RemoveController runs unconditionally here, even when k8sversion.Update() only logged a Warn. A successful isConnReady (a kube-system GET) does not guarantee ServerVersion() succeeds, so the controller can be destroyed with the cached version still zero-valued for the process lifetime — CEP sync (endpointsynchronizer.go:~120) then stays permanently broken while the agent looks healthy.

Reorder so the version gate runs first and the controller is only torn down once the version is confirmed; otherwise return err to keep retrying. This also resolves the fatal-vs-warn asymmetry (degraded now stays degraded and retries instead of silently running on an unsupported version), starts the heartbeat only after the connection is confirmed (avoiding a premature/duplicate heartbeat), drops the leftover // might remove note, and fixes the standardc typo.

Note: return err switches the controller to its error-backoff path (errorRetries * 1s, uncapped). If you want to keep the steady 5s cadence, also set MaxRetryInterval: connRetryInterval (or an ErrorRetryBaseDuration) on the ControllerParams.

Suggested change
c.logger.Info("Re-established connection to API server. Exiting degraded state",
logfields.IPAddr, c.restConfigManager.getConfig().Host,
)
// start the heartbeat as this was previously skipped
c.startHeartbeat()
// do the k8s version check. might remove
if err := k8sversion.Update(c.logger, c, c.config.EnableK8sAPIDiscovery); err != nil {
c.logger.Warn("k8s version check failed after reconnect", logfields.Error, err)
} else if !k8sversion.Capabilities().MinimalVersionMet {
c.logger.Warn("k8s version does not meet minimal standardc",
"version", k8sversion.Version(),
"minVersion", k8sversion.MinimalVersionConstraint,
)
}
c.controller.RemoveController(controllerName)
return nil
// A successful isConnReady (kube-system GET) does not guarantee the
// version discovery call below succeeds. Only exit degraded state once
// the version is confirmed; until then stay degraded and let the
// controller retry rather than tearing it down with an unset version.
if err := k8sversion.Update(c.logger, c, c.config.EnableK8sAPIDiscovery); err != nil {
c.logger.Warn("k8s version check failed after reconnect; staying degraded", logfields.Error, err)
return err
}
if !k8sversion.Capabilities().MinimalVersionMet {
return fmt.Errorf("k8s version (%v) does not meet minimal requirement (%v); staying degraded",
k8sversion.Version(), k8sversion.MinimalVersionConstraint)
}
c.logger.Info("Re-established connection to API server. Exiting degraded state",
logfields.IPAddr, c.restConfigManager.getConfig().Host,
)
// Start the heartbeat (skipped during degraded onStart) now that the
// connection is confirmed usable, then stop retrying.
c.startHeartbeat()
c.controller.RemoveController(controllerName)
return nil

Comment thread pkg/k8s/client/config.go
flags.Duration(option.K8sClientConnectionKeepAlive, def.K8sClientConnectionKeepAlive, "Configures the keep alive duration of K8s client connections. K8 client is disabled if the value is set to 0")
flags.Duration(option.K8sHeartbeatTimeout, def.K8sHeartbeatTimeout, "Configures the timeout for api-server heartbeat, set to 0 to disable")
flags.Bool(option.K8sEnableAPIDiscovery, def.EnableK8sAPIDiscovery, "Enable discovery of Kubernetes API groups and resources with the discovery API")
flags.Bool(option.IgnoreApiserverFailOnStart, def.IgnoreApiserverFailOnStart, "When true, failure to connect to the k8s API server on startup is non-fatal; the agent starts in a degraded state")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New agent flag — regenerate the cmdref docs (make -C Documentation update-cmdref) or the Documentation/cmdref CI check will fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants