fix: Cilium's kube-proxy replacement rollout and wait #1307
Conversation
Using the builtin topology templating results in a deadlock because `.ControlPlane` is not updated until after the lifecycle hook returns.
charts/cluster-api-runtime-extensions-nutanix/addons/cni/cilium/values-template.yaml
Should we also remove setting
Yes, I think this should be removed. Please see https://github.com/nutanix-cloud-native/cluster-api-runtime-extensions-nutanix/pull/1255/files#diff-366132233cc4268f6ebd520d434ccb6de1a5f7b204b5ffbe6988806345af9e4eR1-R50 for my original attempt at updated values for this use case.
@jimmidyson the link just brings me to the PR, mind sharing which file it's pointing to?
Thanks! Going to open a stacked PR since it's not really related to this change.
Hmm, the link works fine for me. Let's try again - https://github.com/nutanix-cloud-native/cluster-api-runtime-extensions-nutanix/pull/1255/files#diff-366132233cc4268f6ebd520d434ccb6de1a5f7b204b5ffbe6988806345af9e4eR1-R50 Looking at file
Looks good to me! Thanks!
LGTM. thanks
**What problem does this PR solve?**:
- Stacked on #1307 to reuse some functions and reduce merge conflicts.
- Sets Cilium default configuration for EKS to enable `eni` mode.

**Which issue(s) this PR fixes**:
Fixes #

**How Has This Been Tested?**:
- unit test for cilium template rendering

Tested manually for now.

- sample EKS cluster manifest
```
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    preflight.cluster.caren.nutanix.com/skip: all
  labels:
    cluster.x-k8s.io/provider: eks
  name: shalin-eks
spec:
  topology:
    class: eks-quick-start
    controlPlane:
      metadata:
        annotations:
          controlplane.cluster.x-k8s.io/skip-kube-proxy: ""
    variables:
      - name: clusterConfig
        value:
          addons:
            clusterAutoscaler: {}
            cni:
              provider: Cilium
            csi:
              defaultStorage:
                provider: aws-ebs
                storageClassConfig: default
              providers:
                aws-ebs:
                  storageClassConfigs:
                    default: {}
              snapshotController: {}
            nfd: {}
          eks:
            region: us-west-2
    version: v1.32.9
    workers:
      machineDeployments:
        - class: default-worker
          metadata:
            annotations:
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "2"
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
          name: md-0
          variables:
            overrides:
              - name: workerConfig
                value:
                  eks:
                    instanceType: m5.2xlarge
```
- Cilium HCP
```
apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmChartProxy
...<redacted>...
  valuesTemplate: |-
    cni:
      chainingMode: portmap
      exclusive: false
    hubble:
      enabled: true
      tls:
        auto:
          enabled: true # enable automatic TLS certificate generation
          method: cronJob # auto generate certificates using cronJob method
          certValidityDuration: 60 # certificates validity duration in days (default 2 months)
          schedule: "0 0 1 * *" # schedule regeneration on the 1st day of each month
      relay:
        enabled: true
        tls:
          server:
            enabled: true
            mtls: true
        image:
          useDigest: false
        priorityClassName: system-cluster-critical
    ipam:
      mode: eni
    image:
      useDigest: false
    operator:
      image:
        useDigest: false
    certgen:
      image:
        useDigest: false
    socketLB:
      hostNamespaceOnly: true
    envoy:
      image:
        useDigest: false
    kubeProxyReplacement: true
    k8sServiceHost: "A535486E46D73CBF3C959CAE8F6831A4.gr7.us-west-2.eks.amazonaws.com"
    k8sServicePort: "443"
    enableIPv4Masquerade: false
    eni:
      enabled: true
      awsReleaseExcessIPs: true
    routingMode: native
    endpointRoutes:
      enabled: true
  version: 1.17.4
...<redacted>...
  matchingClusters:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: shalin-eks
    namespace: default
  observedGeneration: 3
```
- CNI and all pods running on the cluster
```
❯ kubectl get pods -A --kubeconfig shalin-eks.conf
NAMESPACE                NAME                                                               READY   STATUS              RESTARTS   AGE
default                  cluster-autoscaler-01997478-76f0-799f-b9ed-ddbff8eab94f-5ffdxsm   0/1     ContainerCreating   0          154m
kube-system              cilium-envoy-2cxcb                                                 1/1     Running             0          108m
kube-system              cilium-envoy-nfp4b                                                 1/1     Running             0          92m
kube-system              cilium-operator-84796b9ccf-h4lgf                                   1/1     Running             0          108m
kube-system              cilium-operator-84796b9ccf-v5hq4                                   1/1     Running             0          96m
kube-system              cilium-t4grn                                                       1/1     Running             0          92m
kube-system              cilium-zf2zm                                                       1/1     Running             0          108m
kube-system              coredns-5449774944-4hjxt                                           1/1     Running             0          155m
kube-system              coredns-5449774944-78897                                           1/1     Running             0          155m
kube-system              ebs-csi-controller-cb84bcd9-7qtqz                                  6/6     Running             0          154m
kube-system              ebs-csi-controller-cb84bcd9-dn6fg                                  6/6     Running             0          154m
kube-system              ebs-csi-node-m57pl                                                 3/3     Running             0          153m
kube-system              ebs-csi-node-nnpcn                                                 3/3     Running             0          92m
kube-system              hubble-relay-6b586bc6d-wd7c7                                       1/1     Running             0          108m
kube-system              snapshot-controller-6b6bf6cb95-xrm8q                               1/1     Running             0          154m
node-feature-discovery   node-feature-discovery-gc-6489bd687c-k2psp                         1/1     Running             0          154m
node-feature-discovery   node-feature-discovery-master-6fc5c44fb9-2bddp                     1/1     Running             0          154m
node-feature-discovery   node-feature-discovery-worker-tdv67                                1/1     Running             0          92m
node-feature-discovery   node-feature-discovery-worker-wwr2x                                1/1     Running             0          153m
```
- `cilium-config` configmap on EKS cluster updated to reflect eni ipam.

**Special notes for your reviewer**:

---------

Co-authored-by: Dimitri Koshkin <[email protected]>
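For anyone reproducing the test above, a quick way to confirm that the rendered `cilium-config` ConfigMap picked up the ENI IPAM settings is a grep against it. This is just an illustrative spot-check; the kubeconfig path mirrors the one used in the test output above.
```
❯ kubectl -n kube-system get configmap cilium-config -o yaml \
    --kubeconfig shalin-eks.conf | grep -E 'ipam|eni'
```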
**What problem does this PR solve?**:
Make the migration process from kube-proxy to Cilium's kube-proxy replacement more resilient.
Just setting `kubeProxyReplacement: true` is not enough for the Cilium operator to restart the DaemonSet Pods and pick up the new configuration. Instead of relying on `k8sServiceHost` to cause a rollout, this change forces a rollout during the migration process.

This also fixes a potential race where the Cilium DaemonSet wait returned early and kube-proxy was deleted before all the Pods had been restarted.

Another fix here is that the whole migration process is now safer and is only performed once, when kube-proxy is installed.
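For illustration only, the following is roughly the CLI equivalent of the ordering the migration now enforces, assuming Cilium runs as the `cilium` DaemonSet in `kube-system` (the actual change does this inside the handler via the Kubernetes client, not kubectl):
```
# Only run the migration when kube-proxy is actually installed.
if kubectl -n kube-system get daemonset kube-proxy >/dev/null 2>&1; then
  # Force the Cilium DaemonSet Pods to restart so they pick up
  # kubeProxyReplacement: true from the updated configuration.
  kubectl -n kube-system rollout restart daemonset/cilium

  # Wait for the rollout to fully complete before touching kube-proxy,
  # avoiding the race where kube-proxy was deleted too early.
  kubectl -n kube-system rollout status daemonset/cilium --timeout=10m

  # Only now is it safe to remove kube-proxy.
  kubectl -n kube-system delete daemonset kube-proxy
  kubectl -n kube-system delete configmap kube-proxy
fi
```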
Pulled out from #1295
**Which issue(s) this PR fixes**:
Fixes #

**How Has This Been Tested?**:

**Special notes for your reviewer**: