Cluster Autoscaler gets OOMKilled instead of scaling down #7873

Open
MenD32 opened this issue Feb 26, 2025 · 2 comments · May be fixed by #8057
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@MenD32
Contributor

MenD32 commented Feb 26, 2025

Which component are you using?:
/area cluster-autoscaler

What version of the component are you using?:

Component version: v1.26.2

What k8s version are you using (kubectl version)?:
1.30.0

What environment is this in?:
AWS

What did you expect to happen?:
Cluster autoscaler to scale down nodes when no longer needed

What happened instead?:
Cluster Autoscaler was getting OOMKilled (we were running ~340 nodes with a 0.6 GiB memory limit).
When a new Cluster Autoscaler pod is created, the node cooldown before scale-down resets to 0s. Cluster Autoscaler was therefore continuously getting OOMKilled after one loop and never managed to scale down nodes.

How to reproduce it (as minimally and precisely as possible):
Install Cluster Autoscaler with a memory limit, then scale up nodes within the existing node groups until the autoscaler gets OOMKilled.
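As a sketch, the reproduction setup corresponds to a Deployment resource block like the following (the limit value is taken from this report; the `requests` value is illustrative, not from the original configuration):

```yaml
# Illustrative resources block for the cluster-autoscaler container.
# A ~600 MB memory limit at ~340 nodes reproduces the OOMKill described above.
resources:
  limits:
    memory: 600Mi
  requests:
    memory: 600Mi
```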

Anything else we need to know?:

Getting OOMKilled at ~340 nodes with a 0.6 GiB RAM limit is very surprising, but what made this bug truly devastating is that the autoscaler was able to run one loop, which could have scaled down the nodes, but the cooldown timer had restarted. This makes me question the high availability of Cluster Autoscaler. I'd like to submit a fix where this data is added as an annotation on the node itself, making the deployment stateless in that regard.

The relevant configs: the memory limit was 600 MB, and the scale-down cooldown was 10m.

@MenD32 MenD32 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 26, 2025
@MenD32 MenD32 linked a pull request Apr 24, 2025 that will close this issue
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2025
@MenD32
Contributor Author

MenD32 commented May 28, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 28, 2025