cluster autoscaler not scaling up the autoscaling group when already downscaled to 0 #4893
Comments
I'm seeing this too and am a bit puzzled as to the solution |
Any updates on this? Or is there any workaround you would suggest? |
I'm experiencing the same thing. Did you ever find an answer to this? It can scale up from 0 as expected only after I've scaled the group up at least once manually while cluster autoscaler is running. So I assume it caches node info somehow, somewhere, and relates it to the ASG ("Ooooh, this node has a gpu! Ok"). However, I HAVE the "node-template" label and resource tags I'm supposed to have for it to scale from 0 on its own from the ASG. And yet I still have to scale up once manually before it can scale up from 0 by itself. |
I am facing the same issue, and it looks like random behavior: sometimes it works, and sometimes it doesn't until I scale the node group once manually. |
We are also experiencing the same issue, and I did dig a little bit: the autoscaler did try to calculate which NodeGroup it should be triggering a scale up for. |
The official documentation already covers this. When scaling up from capacity 0 there is no live node to inspect, so this is not possible by default in the Cluster Autoscaler; to make it work, the documentation indicates that we need to manually add an ASG tag carrying the corresponding node label, i.e. a tag of the form k8s.io/cluster-autoscaler/node-template/label/<label-name> whose value is the label value.
I tested it and it works. |
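For reference, a minimal sketch of what that tagging could look like with boto3; the ASG name and the label key/value below are made-up placeholders, not values from this thread:

```python
# Minimal sketch, assuming a hypothetical ASG "my-gpu-asg" and a made-up
# "workload-type=gpu" node label; adjust to your own names.
import boto3

ASG_NAME = "my-gpu-asg"       # placeholder ASG name
LABEL_KEY = "workload-type"   # placeholder node label key
LABEL_VALUE = "gpu"           # placeholder node label value

autoscaling = boto3.client("autoscaling")
autoscaling.create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            # Cluster Autoscaler reads node-template label tags to predict
            # what labels a node from this group would have while the group
            # is still at 0.
            "Key": f"k8s.io/cluster-autoscaler/node-template/label/{LABEL_KEY}",
            "Value": LABEL_VALUE,
            "PropagateAtLaunch": True,
        }
    ]
)
```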
I appreciate the response and recognize you're right. In my case, however, we are tagging the ASGs and they're still not coming up properly.
The ASGs are, however, tagged with the node-template label tags; I'm not sure what the problem can be, then. |
What is the label in your case? I see that you want to use a non-label tag as a label for the node selector. As they indicate, the ASG tag should be of the form k8s.io/cluster-autoscaler/node-template/label/<label-name> with the label value as the tag value,
and the node label (and the pod's nodeSelector) should then use that same <label-name> and value. |
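To make the correspondence concrete, a rough sketch of the pod side using the kubernetes Python client; the "workload-type=gpu" label is a placeholder and must match both the label the kubelet actually applies at boot and the node-template tag on the ASG:

```python
# Minimal sketch, assuming the placeholder "workload-type=gpu" label from
# the earlier tagging example.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="scale-from-zero-test"),
    spec=client.V1PodSpec(
        # The nodeSelector Cluster Autoscaler has to satisfy when the group
        # is at 0, using only the labels advertised via node-template tags.
        node_selector={"workload-type": "gpu"},
        containers=[
            client.V1Container(
                name="app",
                image="busybox",
                command=["sleep", "3600"],
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```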
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
Experiencing the same issue. From the log below, I think cluster-autoscaler remembers in memory all of the taints that were on the last node of the ASG before it scaled down to zero, including taints that were added automatically by another service and do not exist in the ASG's tags. So if new pods do not have a toleration for that extra taint, cluster-autoscaler decides the pods cannot tolerate the node group.
The workaround is either manually scaling the ASG from 0 to 1 to refresh the ASG's taints in memory, or restarting the cluster-autoscaler pods to refresh everything. |
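If the taint is intentional and stable, another option is to advertise it on the ASG itself so Cluster Autoscaler doesn't depend on taints cached from the last live node. A boto3 sketch with placeholder names; the "<value>:<effect>" tag-value format is my reading of the AWS provider docs, so verify it against the docs for your CA version:

```python
# Sketch only: expose a placeholder "dedicated=gpu:NoSchedule" taint as an
# ASG tag so Cluster Autoscaler knows about it even at size 0. The
# "<value>:<effect>" value format is an assumption -- check the AWS
# cloudprovider docs before relying on it.
import boto3

ASG_NAME = "my-tainted-asg"   # placeholder ASG name

boto3.client("autoscaling").create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/node-template/taint/dedicated",
            "Value": "gpu:NoSchedule",
            "PropagateAtLaunch": True,
        }
    ]
)
```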
any update on this i am facing the same issue. |
Same issue here: we taint nodes when draining them, before shutting them down. New nodes can't be started then, because cluster-autoscaler thinks all of them have these taints. |
Same issue. The autoscaler doesn't work if we set the node count to 0, or after the nodes scale down to 0. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
I just experienced the same issue as described by the OP. Is this being looked at? I'm running CA on EKS 1.28 with the image registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.5 |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
This happens when the autoscaling group is downscaled to 0, i.e. the desired capacity is set to 0. If I then start the cluster autoscaler and create a pod which requires a node from this autoscaling group, somehow autoscaling does not happen.
I have defined node affinity towards this autoscaling group.
Below is the event log from pod describe.
Normal NotTriggerScaleUp 4m15s (x121 over 24m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
But it works when I manually set the desired capacity to 1 (while the cluster autoscaler is already running), set the desired capacity back to 0, and then make a new pod deployment.
It looks like the cluster autoscaler does not get the node details associated with the autoscaling group at startup when the desired capacity is set to 0.
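A quick way to sanity-check that case is to look at which tags the autoscaler can actually see on the group while the desired capacity is 0. A rough boto3 sketch (the ASG name is a placeholder):

```python
# Sketch only: print the tags visible on the placeholder ASG while it sits
# at 0, to confirm the node-template label/taint tags that scale-from-zero
# depends on are actually present.
import boto3

ASG_NAME = "my-gpu-asg"   # placeholder ASG name

resp = boto3.client("autoscaling").describe_tags(
    Filters=[{"Name": "auto-scaling-group", "Values": [ASG_NAME]}]
)
for tag in resp["Tags"]:
    if tag["Key"].startswith("k8s.io/cluster-autoscaler/node-template/"):
        print(tag["Key"], "=", tag["Value"])
```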