
Can not properly delete NodeClass #6462

Open
muckelba opened this issue Jul 8, 2024 · 6 comments
Labels: bug (Something isn't working), triage/needs-investigation (Issues that need to be investigated before triaging)

Comments


muckelba commented Jul 8, 2024

Description

Observed Behavior:
When deleting a NodeClass, Karpenter wants to delete the NodeClaims (Waiting on NodeClaim termination for common-xdbj9, common-vvclr, common-2ppgb), but they suddenly can't find their NodePool anymore (Cannot disrupt NodeClaim: Owning nodepool "common" not found).
As soon as the deletion is issued, Karpenter just logs resolving node class, ec2nodeclasses.karpenter.sh "default" is terminating, treating as not found.

Expected Behavior:
The NodeClaims should delete themselves first, and then the NodeClass.

Reproduction Steps (Please include YAML):
kubectl delete ec2nodeclasses.karpenter.k8s.aws default

Versions:

  • Chart Version: 0.36.0
  • Kubernetes Version (kubectl version): v1.29.4-eks-036c24b
@muckelba muckelba added bug Something isn't working needs-triage Issues that need to be triaged labels Jul 8, 2024
jmdeal (Contributor) commented Jul 15, 2024

Disruption refers to voluntary disruption modes: e.g. Drift, Expiration, and Consolidation. None of these can take place when the NodePool or NodeClass does not exist, hence why Karpenter can't disrupt the NodeClaim. That doesn't mean Karpenter can't terminate the NodeClaim. Deleting the NodeClass should result in Karpenter setting a deletion timestamp on each NodeClaim associated with that NodeClass, and those NodeClaims will gracefully terminate. Graceful termination isn't bounded; blocking PDBs can prevent a NodeClaim from terminating indefinitely.

If you're able to share Karpenter logs and the NodeClaim resources we should be able to determine if Karpenter is operating correctly. If it is and you want to be able to set an upper bound on termination time, you'll probably be interested in kubernetes-sigs/karpenter#916 which just merged in the upstream repo.
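To check whether Karpenter is behaving as described above, a few diagnostic commands can help (a sketch only; the NodeClaim name is taken from the original report, and output depends on your cluster):

```shell
# Check whether the NodeClaims actually received a deletion timestamp
# (the expected behavior once the NodeClass is deleted).
kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionTimestamp}{"\n"}{end}'

# Inspect a stuck NodeClaim's conditions and events in detail.
kubectl describe nodeclaim common-xdbj9

# List all PodDisruptionBudgets; a blocking PDB can hold graceful
# termination open indefinitely.
kubectl get pdb --all-namespaces
```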

muckelba (Author) commented
Hey, thank you for the explanation. I just did some more testing: even without any PDBs in the cluster (except for Karpenter's, but that's running on Fargate), the nodes won't terminate.

  • Karpenter itself just logs {"level":"ERROR","time":"2024-07-24T11:23:01.154Z","logger":"controller.disruption","message":"listing instance types for common, resolving node class, ec2nodeclasses.karpenter.sh \"default-ec2nodeclass\" is terminating, treating as not found","commit":"6b868db"} every 10 seconds or so.
  • The ec2nodeclass states:
    Type    Reason                         Age                   From       Message
    ----    ------                         ----                  ----       -------
    Normal  WaitingOnNodeClaimTermination  14m (x14 over 175m)   karpenter  Waiting on NodeClaim termination for common-zvvgb, common-n7mqv
    
  • The NodeClaims states:
    Type    Reason             Age                    From       Message
    ----    ------             ----                   ----       -------
    Normal  DisruptionBlocked  3m31s (x88 over 177m)  karpenter  Cannot disrupt NodeClaim: Owning nodepool "common" not found
    
  • The NodePool states:
    Type     Reason  Age                    From       Message
    ----     ------  ----                   ----       -------
    Warning          4m40s (x633 over 21h)  karpenter  Failed resolving NodeClass
    
  • The Node states:
    Type    Reason             Age                  From       Message
    ----    ------             ----                 ----       -------
    Normal  DisruptionBlocked  70s (x90 over 179m)  karpenter  Cannot disrupt Node: Owning nodepool "common" not found
    

That's everything I can find relating to the deletion.


How does Karpenter's release process work? A change is merged in kubernetes-sigs/karpenter first, and then the cloud-specific providers (AWS in this case) have to implement and release it too?


NicoForce commented Sep 26, 2024

Chiming in here with the same issue. The message has changed a little; I'm currently on Karpenter 1.0.0.

Looking at kubectl get events:

karpenter Cannot disrupt NodeClaim: NodePool "default" not found

And this happens when the following command is executed; it then gets permanently stuck:

kubectl delete ec2nodeclass default

It's not clear to me yet whether my Terraform code tried to delete the NodePool first or the EC2NodeClass first, but in any case, deletion is stuck and nothing happens.

Can anyone clarify the correct process for removing the NodePool and the respective NodeClasses? Should the NodeClasses be deleted first?

EDIT: I checked and there was no reason why the NodePool would have changed, so this issue is triggered by deleting the EC2NodeClass.
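For what it's worth, when the delete command hangs like this, the following can show what is holding the object open (a sketch; resource names follow the example above):

```shell
# The ec2nodeclass usually lingers because of its finalizer, which
# Karpenter only removes once all associated NodeClaims are gone.
kubectl get ec2nodeclass default -o jsonpath='{.metadata.finalizers}'

# Surface the DisruptionBlocked events across all namespaces.
kubectl get events --all-namespaces --field-selector reason=DisruptionBlocked
```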

GerBriones commented
In my case, I saw several related elements that "block" the deletion:

1.- NodeClaim termination

Events:
  Type    Reason                         Age                From       Message
  ----    ------                         ----               ----       -------
  Normal  WaitingOnNodeClaimTermination  39m (x4 over 69m)  karpenter  Waiting on NodeClaim termination for default-4jf2r

2.- NodeClaim (and Node) can't be deleted because of a Pod Disruption Budget
(PDB) covering some pods running in the kube-system namespace with this spec:

spec:
  maxUnavailable: 1

3.- Finalizer

Steps to remove the EC2NodeClass:

1.- Edit the Deployments or pods covered by a PDB and change maxUnavailable from 1 to a higher number (e.g. 10)
2.- Delete the NodeClaim: kubectl delete nodeclaims default-4jf2r --force
3.- If the EC2NodeClass is still not deleted, edit it and remove the finalizer lines
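Roughly, the steps above as shell commands (a sketch only; the PDB name is a placeholder, the NodeClaim name is the example from this thread, and removing a finalizer by hand is a last resort that skips Karpenter's cleanup):

```shell
# 1. Relax the blocking PDB so its pods can be evicted.
kubectl patch pdb <pdb-name> -n kube-system --type merge -p '{"spec":{"maxUnavailable":10}}'

# 2. Delete the stuck NodeClaim.
kubectl delete nodeclaim default-4jf2r

# 3. If the ec2nodeclass still won't go away, strip its finalizers.
#    WARNING: this bypasses Karpenter's teardown and may orphan EC2 instances.
kubectl patch ec2nodeclass default --type merge -p '{"metadata":{"finalizers":null}}'
```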

DanielCastronovo commented

Same issue as @GerBriones; it would be nice to be able to force the deletion.

@jonathan-innis jonathan-innis added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Dec 13, 2024
jonathan-innis (Contributor) commented

At a minimum, I think there's some work we can do here to make sure we are less noisy in our logs and have better error messaging. cc: @jmdeal
