Conversation

dkoshkin
Contributor

What problem does this PR solve?:
Make the migration process from kube-proxy to Cilium's kube-proxy replacement more resilient.
Setting kubeProxyReplacement: true alone is not enough for the Cilium operator to restart the DaemonSet Pods and pick up the new configuration. Instead of relying on a change to k8sServiceHost to trigger a rollout, this change explicitly forces a rollout during the migration process.
It also fixes a potential race where the wait on the Cilium DaemonSet returned early and kube-proxy was deleted before all the Pods had restarted.

Another fix is that the migration now only runs when kube-proxy is actually installed, making the whole process safer and ensuring it happens at most once (see the sketch below).
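
To illustrate the intended flow, here is a minimal client-go sketch of the migration logic described above. It is not the handler's actual code: the namespace, DaemonSet names, annotation key, and timeouts are illustrative assumptions.

```go
// Package migration sketches the kube-proxy -> Cilium kube-proxy replacement flow.
package migration

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// migrateFromKubeProxy forces a Cilium DaemonSet rollout, waits for it to
// finish, and only then removes kube-proxy. Names (namespace, DaemonSet
// names, annotation key) are illustrative assumptions, not CAREN's code.
func migrateFromKubeProxy(ctx context.Context, client kubernetes.Interface) error {
	// Only run the migration when kube-proxy is actually installed.
	_, err := client.AppsV1().DaemonSets("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil
	}
	if err != nil {
		return err
	}

	// Setting kubeProxyReplacement=true alone does not restart the Cilium Pods,
	// so force a rollout by patching the Pod template with a restart annotation
	// (the same effect as `kubectl rollout restart daemonset/cilium`).
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"caren.nutanix.com/restartedAt":%q}}}}}`,
		time.Now().UTC().Format(time.RFC3339),
	)
	if _, err := client.AppsV1().DaemonSets("kube-system").Patch(
		ctx, "cilium", types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{},
	); err != nil {
		return err
	}

	// Wait until the rollout has fully completed, avoiding the race where
	// kube-proxy is deleted before every Cilium Pod has picked up the new config.
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			ds, err := client.AppsV1().DaemonSets("kube-system").Get(ctx, "cilium", metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			done := ds.Status.ObservedGeneration >= ds.Generation &&
				ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
				ds.Status.NumberAvailable == ds.Status.DesiredNumberScheduled
			return done, nil
		})
	if err != nil {
		return err
	}

	// Only now is it safe to remove kube-proxy.
	return client.AppsV1().DaemonSets("kube-system").Delete(ctx, "kube-proxy", metav1.DeleteOptions{})
}
```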

Pulled out from #1295

Which issue(s) this PR fixes:
Fixes #

How Has This Been Tested?:

Special notes for your reviewer:

Using the built-in topology templating results in a deadlock because .ControlPlane is not updated until after the lifecycle hook returns.
@supershal
Contributor

supershal commented Sep 22, 2025

Should we also remove setting cni.chainingMode=portmap when kubeProxyReplacement is set to true? https://docs.cilium.io/en/latest/installation/cni-chaining-portmap/
I came across this when reading the docs for the EKS chainingMode requirement.

@jimmidyson
Member

> Should we also remove setting cni.chainingMode=portmap when kubeProxyReplacement is set to true? https://docs.cilium.io/en/latest/installation/cni-chaining-portmap/ I came across this when reading the docs for the EKS chainingMode requirement.

Yes I think this should be removed. Please see https://github.com/nutanix-cloud-native/cluster-api-runtime-extensions-nutanix/pull/1255/files#diff-366132233cc4268f6ebd520d434ccb6de1a5f7b204b5ffbe6988806345af9e4eR1-R50 for my original attempt at updated values for this use case.

@dkoshkin
Contributor Author

> Should we also remove setting cni.chainingMode=portmap when kubeProxyReplacement is set to true? https://docs.cilium.io/en/latest/installation/cni-chaining-portmap/ I came across this when reading the docs for the EKS chainingMode requirement.

> Yes I think this should be removed. Please see https://github.com/nutanix-cloud-native/cluster-api-runtime-extensions-nutanix/pull/1255/files#diff-366132233cc4268f6ebd520d434ccb6de1a5f7b204b5ffbe6988806345af9e4eR1-R50 for my original attempt at updated values for this use case.

@jimmidyson the link just brings me to the PR, mind sharing which file it's pointing to?

@dkoshkin
Contributor Author

> Should we also remove setting cni.chainingMode=portmap when kubeProxyReplacement is set to true? https://docs.cilium.io/en/latest/installation/cni-chaining-portmap/ I came across this when reading the docs for the EKS chainingMode requirement.

Thanks! Going to open a stacked PR since it's not really related to this change.

@dkoshkin dkoshkin requested a review from jimmidyson September 23, 2025 15:40
@dkoshkin dkoshkin assigned supershal and unassigned supershal Sep 23, 2025
@dkoshkin dkoshkin requested a review from supershal September 23, 2025 16:12
@jimmidyson
Member

> @jimmidyson the link just brings me to the PR, mind sharing which file it's pointing to?

Hmm, the link works fine for me. Let's try again: https://github.com/nutanix-cloud-native/cluster-api-runtime-extensions-nutanix/pull/1255/files#diff-366132233cc4268f6ebd520d434ccb6de1a5f7b204b5ffbe6988806345af9e4eR1-R50

It points to the file hack/examples/bases/eks/cluster/cluster.yaml, lines 1-50.

Member

@jimmidyson jimmidyson left a comment

Looks good to me! Thanks!

Contributor

@supershal supershal left a comment

LGTM. Thanks!

@dkoshkin dkoshkin merged commit 6b71df2 into main Sep 25, 2025
67 of 70 checks passed
@dkoshkin dkoshkin deleted the dkoshkin/fix-kube-proxy-replacement-wait branch September 25, 2025 13:33
supershal added a commit that referenced this pull request Sep 29, 2025
**What problem does this PR solve?**:
- Stacked on #1307 to reuse some functions and reduce merge conflicts.
- Sets the Cilium default configuration for EKS to enable `eni` mode.

**Which issue(s) this PR fixes**:
Fixes #

**How Has This Been Tested?**:
- unit test for Cilium template rendering

Tested manually for now:
- sample EKS cluster manifest
```
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    preflight.cluster.caren.nutanix.com/skip: all
  labels:
    cluster.x-k8s.io/provider: eks
  name: shalin-eks
spec:
  topology:
    class: eks-quick-start
    controlPlane:
      metadata:
        annotations:
          controlplane.cluster.x-k8s.io/skip-kube-proxy: ""
    variables:
    - name: clusterConfig
      value:
        addons:
          clusterAutoscaler: {}
          cni:
            provider: Cilium
          csi:
            defaultStorage:
              provider: aws-ebs
              storageClassConfig: default
            providers:
              aws-ebs:
                storageClassConfigs:
                  default: {}
            snapshotController: {}
          nfd: {}
        eks:
          region: us-west-2
    version: v1.32.9
    workers:
      machineDeployments:
      - class: default-worker
        metadata:
          annotations:
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "2"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
        name: md-0
        variables:
          overrides:
          - name: workerConfig
            value:
              eks:
                instanceType: m5.2xlarge
```

- Cilium HCP
```
apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmChartProxy
...<redacted>
...
  valuesTemplate: |-
    cni:
      chainingMode: portmap
      exclusive: false
    hubble:
      enabled: true
      tls:
        auto:
          enabled: true               # enable automatic TLS certificate generation
          method: cronJob             # auto generate certificates using cronJob method
          certValidityDuration: 60    # certificates validity duration in days (default 2 months)
          schedule: "0 0 1 * *"       # schedule on the 1st day regeneration of each month
      relay:
        enabled: true
        tls:
          server:
            enabled: true
            mtls: true
        image:
          useDigest: false
        priorityClassName: system-cluster-critical
    ipam:
      mode: eni
    image:
      useDigest: false
    operator:
      image:
        useDigest: false
    certgen:
      image:
        useDigest: false
    socketLB:
      hostNamespaceOnly: true
    envoy:
      image:
        useDigest: false
    kubeProxyReplacement: true
    k8sServiceHost: "A535486E46D73CBF3C959CAE8F6831A4.gr7.us-west-2.eks.amazonaws.com"
    k8sServicePort: "443"
    enableIPv4Masquerade: false
    eni:
      enabled: true
      awsReleaseExcessIPs: true
    routingMode: native
    endpointRoutes:
      enabled: true
  version: 1.17.4
......<redacted>
...
  matchingClusters:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: shalin-eks
    namespace: default
  observedGeneration: 3
```
- CNI and all pods running on the cluster
```
❯ kubectl get pods -A --kubeconfig shalin-eks.conf
NAMESPACE                NAME                                                              READY   STATUS              RESTARTS   AGE
default                  cluster-autoscaler-01997478-76f0-799f-b9ed-ddbff8eab94f-5ffdxsm   0/1     ContainerCreating   0          154m
kube-system              cilium-envoy-2cxcb                                                1/1     Running             0          108m
kube-system              cilium-envoy-nfp4b                                                1/1     Running             0          92m
kube-system              cilium-operator-84796b9ccf-h4lgf                                  1/1     Running             0          108m
kube-system              cilium-operator-84796b9ccf-v5hq4                                  1/1     Running             0          96m
kube-system              cilium-t4grn                                                      1/1     Running             0          92m
kube-system              cilium-zf2zm                                                      1/1     Running             0          108m
kube-system              coredns-5449774944-4hjxt                                          1/1     Running             0          155m
kube-system              coredns-5449774944-78897                                          1/1     Running             0          155m
kube-system              ebs-csi-controller-cb84bcd9-7qtqz                                 6/6     Running             0          154m
kube-system              ebs-csi-controller-cb84bcd9-dn6fg                                 6/6     Running             0          154m
kube-system              ebs-csi-node-m57pl                                                3/3     Running             0          153m
kube-system              ebs-csi-node-nnpcn                                                3/3     Running             0          92m
kube-system              hubble-relay-6b586bc6d-wd7c7                                      1/1     Running             0          108m
kube-system              snapshot-controller-6b6bf6cb95-xrm8q                              1/1     Running             0          154m
node-feature-discovery   node-feature-discovery-gc-6489bd687c-k2psp                        1/1     Running             0          154m
node-feature-discovery   node-feature-discovery-master-6fc5c44fb9-2bddp                    1/1     Running             0          154m
node-feature-discovery   node-feature-discovery-worker-tdv67                               1/1     Running             0          92m
node-feature-discovery   node-feature-discovery-worker-wwr2x                               1/1     Running             0          153m
```

- `cilium-config` ConfigMap on the EKS cluster updated to reflect `eni` IPAM.

**Special notes for your reviewer**:

---------

Co-authored-by: Dimitri Koshkin <[email protected]>