[Cluster][Autoscaler-v2] Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates them after idle timeout #47578
Labels
- bug: Something that is supposed to be working, but isn't
- core: Issues that should be addressed in Ray Core
- core-autoscaler: autoscaler-related issues
- P1: Issue that should be fixed within a few weeks
What happened + What you expected to happen
The bug:
When the Ray cluster is idle, autoscaler v2 repeatedly terminates worker nodes, ignoring the configured minimum worker count. This causes the Ray worker node count to drop below the minimum set in the KubeRay chart.
Example: consider a Ray cluster provisioned in Kubernetes using the KubeRay chart 1.1.0, with autoscaler v2 enabled and a minimum worker count of 3. When the cluster is idle, the autoscaler terminates worker nodes after the idle timeout, so the active worker count briefly falls below 3 (to 2, 1, or sometimes 0). A sketch of the corresponding worker group spec is shown below.
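For context, here is roughly what the relevant part of the rendered RayCluster resource looks like in this scenario. This is a minimal, illustrative fragment, not a complete manifest: the group name and image are placeholders, and required pieces such as the head group spec are omitted.

```yaml
# Illustrative RayCluster fragment; groupName and image are placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 60        # workers idle longer than this become scale-down candidates
  workerGroupSpecs:
    - groupName: workergroup
      replicas: 3                 # desired worker count
      minReplicas: 3              # the floor the autoscaler is expected to honor
      maxReplicas: 10
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.34.0
```

With a spec like this, autoscaler v1 keeps at least 3 workers alive, while under autoscaler v2 the observed count dips below minReplicas after each idle timeout.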
Due to this constant terminate-and-recreate cycle, we sometimes see Actors failing as follows:
Expected behavior:
The autoscaler should honor the configured minimum worker count and should not terminate nodes when doing so would bring the worker count below that value. This worked correctly in autoscaler v1; the issue was introduced after upgrading to autoscaler v2.
Autoscaler logs:
Versions / Dependencies
kuberay: v1.1.0
ray-cluster helm chart: v1.1.0
ray version: 2.34.0
Reproduction script
You can use the base KubeRay ray-cluster Helm chart to deploy a Ray cluster with the following changes; a sketch of such a values override is shown below.
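The issue does not include the exact override, but a minimal sketch of the values enabling this setup might look like the following. The key names follow the ray-cluster chart layout as best I recall it (`head.enableInTreeAutoscaling`, `head.containerEnv`, and the `worker` replica counts); verify them against your chart version's values.yaml. The `RAY_enable_autoscaler_v2` environment variable is what opts the head pod into autoscaler v2.

```yaml
# values-override.yaml -- illustrative sketch; verify key names against
# the ray-cluster chart's values.yaml for your chart version.
image:
  repository: rayproject/ray
  tag: "2.34.0"

head:
  enableInTreeAutoscaling: true        # run the autoscaler sidecar on the head pod
  containerEnv:
    - name: RAY_enable_autoscaler_v2   # opt in to autoscaler v2
      value: "1"

worker:
  replicas: 3
  minReplicas: 3    # the count the autoscaler should never scale below
  maxReplicas: 10
```

Deploy with `helm install raycluster kuberay/ray-cluster -f values-override.yaml`, leave the cluster idle past the idle timeout, and watch the worker pod count drop below 3.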
Issue Severity
High: It blocks me from completing my task.