k8s: Add degraded start to survive apiserver outages on restart #31

Draft
jiashengz wants to merge 2 commits into v1.18 from jiashengzhu/degrade-start-v1.18

Conversation

@jiashengz

When the Kubernetes apiserver is unreachable, the agent currently fails to start: the k8s clientset connection/version checks are fatal, and local node initialization blocks forever waiting for the node object from the informer. This means that if cilium-agent is killed or crashes during an apiserver outage, it cannot come back up, which takes down the datapath and the BGP control plane (gobgp).

Add an opt-in --k8s-degraded-start flag (default off). When enabled:

  • The k8s clientset connection and version checks become non-fatal. On a connection failure the agent continues booting and the heartbeat controller re-establishes the connection in the background. The apiserver version is persisted on healthy starts and restored from disk during a degraded start so server capabilities stay consistent.

  • Local node initialization is bounded by a timeout; if the apiserver is unreachable, the local node is restored from an on-disk snapshot written by a prior healthy run instead of blocking. The snapshot is refreshed on every local node change and reconciled once the apiserver is reachable again.

Snapshots are stored under the runtime state directory, which is a hostPath in the DaemonSet and therefore survives agent restarts. Default behavior is unchanged unless the flag is set.

jiashengz and others added 2 commits June 16, 2026 11:29
Don't emit (and wake up observers) when a mutator produces no change to the
local node. This prevents redundant downstream work -- most importantly the
nodediscovery CiliumNode resource writes that are triggered on every local
node emission. During a Kubernetes apiserver outage those writes fail and the
retry path ends in logging.Fatal, crashing the agent. Skipping no-op updates
keeps the agent from re-arming that fatal path when nothing actually changed.

This is a backport of the dedup behavior from upstream's StateDB-based
LocalNodeStore, adapted to the 1.18 stream-based store.

Upstream-reference: cilium#41294
Signed-off-by: Jiasheng Zhu <jiashengzhu@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 397a641)
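
The dedup behavior described in the commit above can be sketched as a compare-before-emit update. This is a minimal illustration, not the backported code: `Store`, `LocalNode`, `Update`, and the `emit` callback are hypothetical names, and `reflect.DeepEqual` is an assumed equality check.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// LocalNode is a hypothetical, trimmed-down stand-in for the local node.
type LocalNode struct {
	Name   string
	Labels map[string]string
}

// deepCopy clones the node so a mutator cannot alias the stored value.
// A nil map stays nil so that a no-op mutation still compares equal.
func (n LocalNode) deepCopy() LocalNode {
	c := n
	if n.Labels != nil {
		c.Labels = make(map[string]string, len(n.Labels))
		for k, v := range n.Labels {
			c.Labels[k] = v
		}
	}
	return c
}

// Store holds the current node and notifies observers through emit.
type Store struct {
	mu    sync.Mutex
	value LocalNode
	emit  func(LocalNode)
}

// Update applies mutate to a copy and emits only if the result differs from
// the stored value. It reports whether an emission happened.
func (s *Store) Update(mutate func(*LocalNode)) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	next := s.value.deepCopy()
	mutate(&next)
	if reflect.DeepEqual(s.value, next) {
		return false // no-op: observers are not woken up
	}
	s.value = next
	s.emit(next.deepCopy())
	return true
}

func main() {
	emits := 0
	s := &Store{emit: func(LocalNode) { emits++ }}

	s.Update(func(n *LocalNode) { n.Name = "node-1" }) // changes the node
	s.Update(func(n *LocalNode) { n.Name = "node-1" }) // no change, no emit
	fmt.Println("emissions:", emits)
}
```

In this sketch `emit` runs while the lock is held, which keeps emissions ordered but means the callback must not call back into `Update`.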

When the Kubernetes apiserver is unreachable, the agent currently fails to
start: the k8s clientset connection/version checks are fatal, and local node
initialization blocks forever waiting for the node object from the informer.
This means that if cilium-agent is killed or crashes during an apiserver
outage, it cannot come back up, which takes down the datapath and the BGP
control plane (gobgp).

Add an opt-in --k8s-degraded-start flag (default off). When enabled:

  - The k8s clientset connection and version checks become non-fatal. On a
    connection failure the agent continues booting and the heartbeat
    controller re-establishes the connection in the background. The apiserver
    version is persisted on healthy starts and restored from disk during a
    degraded start so server capabilities stay consistent.

  - Local node initialization is bounded by a timeout; if the apiserver is
    unreachable, the local node is restored from an on-disk snapshot written
    by a prior healthy run instead of blocking. The snapshot is refreshed on
    every local node change and reconciled once the apiserver is reachable
    again.

Snapshots are stored under the runtime state directory, which is a hostPath
in the DaemonSet and therefore survives agent restarts. Default behavior is
unchanged unless the flag is set.

Signed-off-by: Jiasheng Zhu <jiashengzhu@roblox.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

jiashengz force-pushed the jiashengzhu/degrade-start-v1.18 branch from 7ff071b to 6c2f8da on June 16, 2026 18:30