
Can not properly delete NodeClass #6462

Open
muckelba opened this issue Jul 8, 2024 · 6 comments
Labels: bug (Something isn't working), triage/needs-investigation (Issues that need to be investigated before triaging)

Comments


muckelba commented Jul 8, 2024

Description

Observed Behavior:
When deleting a NodeClass, Karpenter wants to delete the NodeClaims (Waiting on NodeClaim termination for common-xdbj9, common-vvclr, common-2ppgb), but they suddenly can't find their NodePool anymore (Cannot disrupt NodeClaim: Owning nodepool "common" not found).
As soon as the deletion is issued, Karpenter just logs resolving node class, ec2nodeclasses.karpenter.sh "default" is terminating, treating as not found.

Expected Behavior:
The NodeClaims should delete themselves first, and then the NodeClass.

Reproduction Steps (Please include YAML):
kubectl delete ec2nodeclasses.karpenter.k8s.aws default

Versions:

  • Chart Version: 0.36.0
  • Kubernetes Version (kubectl version): v1.29.4-eks-036c24b
@muckelba muckelba added bug Something isn't working needs-triage Issues that need to be triaged labels Jul 8, 2024
jmdeal (Contributor) commented Jul 15, 2024

Disruption refers to voluntary disruption modes: e.g. Drift, Expiration, and Consolidation. None of these can take place when the NodePool or NodeClass does not exist, hence why Karpenter can't disrupt the NodeClaim. That doesn't mean Karpenter can't terminate the NodeClaim. Deleting the NodeClass should result in Karpenter setting a deletion timestamp on each NodeClaim associated with that NodeClass, and those NodeClaims will gracefully terminate. Graceful termination isn't bounded; blocking PDBs can prevent a NodeClaim from terminating indefinitely.

If you're able to share Karpenter logs and the NodeClaim resources we should be able to determine if Karpenter is operating correctly. If it is and you want to be able to set an upper bound on termination time, you'll probably be interested in kubernetes-sigs/karpenter#916 which just merged in the upstream repo.
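To check whether Karpenter is behaving as described above, a few diagnostic commands can help (a sketch only; the NodeClaim name is taken from the original report, and output depends on your cluster):

```shell
# Check whether the NodeClaims actually received a deletion timestamp
# (the expected behavior once the NodeClass is deleted).
kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionTimestamp}{"\n"}{end}'

# Inspect a stuck NodeClaim's conditions and events in detail.
kubectl describe nodeclaim common-xdbj9

# List all PodDisruptionBudgets; a blocking PDB can hold graceful
# termination open indefinitely.
kubectl get pdb --all-namespaces
```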

muckelba (Author) commented
Hey, thank you for the explanation. I just did some more testing: even without any PDBs in the cluster (except for Karpenter's, but that's running on Fargate), the nodes won't terminate.

  • Karpenter itself just logs {"level":"ERROR","time":"2024-07-24T11:23:01.154Z","logger":"controller.disruption","message":"listing instance types for common, resolving node class, ec2nodeclasses.karpenter.sh \"default-ec2nodeclass\" is terminating, treating as not found","commit":"6b868db"} every 10 seconds or so.
  • The ec2nodeclass states:
    Type    Reason                         Age                   From       Message
    ----    ------                         ----                  ----       -------
    Normal  WaitingOnNodeClaimTermination  14m (x14 over 175m)   karpenter  Waiting on NodeClaim termination for common-zvvgb, common-n7mqv
    
  • The NodeClaims states:
    Type    Reason             Age                    From       Message
    ----    ------             ----                   ----       -------
    Normal  DisruptionBlocked  3m31s (x88 over 177m)  karpenter  Cannot disrupt NodeClaim: Owning nodepool "common" not found
    
  • The NodePool states:
    Type     Reason  Age                    From       Message
    ----     ------  ----                   ----       -------
    Warning          4m40s (x633 over 21h)  karpenter  Failed resolving NodeClass
    
  • The Node states:
    Type    Reason             Age                  From       Message
    ----    ------             ----                 ----       -------
    Normal  DisruptionBlocked  70s (x90 over 179m)  karpenter  Cannot disrupt Node: Owning nodepool "common" not found
    

That's everything I can find relating to the deletion.


How does Karpenter's release process work? A change is merged in kubernetes-sigs/karpenter first, and then the cloud-specific providers (AWS in this case) have to implement and release it too?


NicoForce commented Sep 26, 2024

Chiming in here with the same issue. The message has changed a little; I'm currently on Karpenter 1.0.0.

Looking at kubectl get events:

karpenter Cannot disrupt NodeClaim: NodePool "default" not found

And this happens when the following command is executed; it then gets permanently stuck:

kubectl delete ec2nodeclass default

It's not clear to me yet whether my Terraform code tried to delete the NodePool first or the EC2NodeClass first, but in any case, deletion is stuck and nothing happens.

Can anyone clarify the correct process for removing the NodePool and the respective NodeClasses? Should the NodeClasses be deleted first?

EDIT: I checked and there was no reason why the NodePool would have changed, so this issue is triggered by deleting the EC2NodeClass.
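For what it's worth, when the delete command hangs like this, the following can show what is holding the object open (a sketch; resource names follow the example above):

```shell
# The ec2nodeclass usually lingers because of its finalizer, which
# Karpenter only removes once all associated NodeClaims are gone.
kubectl get ec2nodeclass default -o jsonpath='{.metadata.finalizers}'

# Surface the DisruptionBlocked events across all namespaces.
kubectl get events --all-namespaces --field-selector reason=DisruptionBlocked
```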

GerBriones commented
In my case, I saw several related elements that "block" the deletion:

1.- NodeClaim termination

Events:
  Type    Reason                         Age                From       Message
  ----    ------                         ----               ----       -------
  Normal  WaitingOnNodeClaimTermination  39m (x4 over 69m)  karpenter  Waiting on NodeClaim termination for default-4jf2r

2.- NodeClaim (and Node) can't be deleted because of a Pod Disruption Budget
(PDB) covering some pods running in the kube-system namespace with this spec:

spec:
  maxUnavailable: 1

3.- Finalizer

Steps to remove the EC2NodeClass:

1.- Edit the Deployments or pods covered by a PDB and change maxUnavailable from 1 to a higher number (e.g. 10)
2.- Delete the NodeClaim: kubectl delete nodeclaims default-4jf2r --force
3.- If the EC2NodeClass is still not deleted, edit it and remove the finalizer lines
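Roughly, the steps above as shell commands (a sketch only; the PDB name is a placeholder, the NodeClaim name is the example from this thread, and removing a finalizer by hand is a last resort that skips Karpenter's cleanup):

```shell
# 1. Relax the blocking PDB so its pods can be evicted.
kubectl patch pdb <pdb-name> -n kube-system --type merge -p '{"spec":{"maxUnavailable":10}}'

# 2. Delete the stuck NodeClaim.
kubectl delete nodeclaim default-4jf2r

# 3. If the ec2nodeclass still won't go away, strip its finalizers.
#    WARNING: this bypasses Karpenter's teardown and may orphan EC2 instances.
kubectl patch ec2nodeclass default --type merge -p '{"metadata":{"finalizers":null}}'
```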

DanielCastronovo commented

Same issue as @GerBriones; it would be nice to be able to force the deletion.

@jonathan-innis jonathan-innis added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Dec 13, 2024
jonathan-innis (Contributor) commented

At a minimum, I think there's some work we can do here to make sure we are less noisy in our logs and have better error messaging. cc: @jmdeal
