[Cluster][Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout #47578

Closed · vbalumuri opened this issue Sep 10, 2024 · 2 comments · Fixed by #48623

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), P1 (Issue that should be fixed within a few weeks)

vbalumuri commented Sep 10, 2024

What happened + What you expected to happen

The bug:
When the Ray cluster is idle, autoscaler v2 keeps terminating worker nodes after the idle timeout, ignoring the minimum worker count configured in the KubeRay chart. As a result, the number of Ray worker nodes drops below the configured minimum.
Example: consider a Ray cluster provisioned on Kubernetes with the KubeRay chart 1.1.0, autoscaler v2 enabled, and a minimum of 3 worker nodes. When the cluster is idle, the autoscaler terminates workers after idleTimeoutSeconds, so the active worker count briefly falls below 3 (to 2, 1, or sometimes 0).

Because of this constant terminate-and-recreate cycle, we sometimes see actors failing as follows:

ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: DB
    actor_id: c78dd85202773ed8f736b21c72000000
    namespace: 81a40ef8-077d-4f65-be29-4d4077c23a85
The actor died because its node has died. Node Id: xxxxxx 
   the actor's node was terminated: Termination of node that's idle for 244.98 seconds.
The actor never ran - it was cancelled before it started running.
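
As a stopgap while this bug is open, one workaround we are considering is letting Ray restart the affected actors instead of failing the task outright. This is only a hedged sketch, not a fix for the autoscaler itself: the DB actor below is a stand-in for our real actor, and it assumes the actor can safely re-run its constructor after a restart (any in-memory state is lost).

import ray

ray.init(address="auto")  # connect to the existing KubeRay cluster

# max_restarts lets Ray recreate the actor on another node if its node is
# terminated; max_task_retries re-submits the tasks that were in flight.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class DB:
    def __init__(self):
        # Assumption: construction is idempotent, so a restart is safe.
        self.store = {}

    def put(self, key, value):
        self.store[key] = value
        return True

db = DB.remote()
ray.get(db.put.remote("k", "v"))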

Expected behavior:
The autoscaler should honor the minimum worker count and should not terminate a node when doing so would bring the worker count below that value. Autoscaler v1 handled this correctly; the regression appeared after upgrading to autoscaler v2.
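
For illustration, the gating we expect from the scaler looks roughly like the sketch below. This is not the actual Ray v2 scheduler code; the function and parameter names are made up for the example, and it only expresses the invariant we expect the autoscaler to enforce.

def select_idle_nodes_to_terminate(idle_nodes, total_workers, min_workers):
    """Pick idle workers to stop without dropping below the configured minimum.

    idle_nodes: ids of workers whose idle time already exceeded idleTimeoutSeconds.
    total_workers: number of currently running workers in this worker group.
    min_workers: minReplicas of the worker group (3 in the repro below).
    """
    # Only terminate as many nodes as we can spare above the minimum.
    budget = max(0, total_workers - min_workers)
    return idle_nodes[:budget]

# With the settings from this report (3 running workers, minReplicas=3) the
# budget is 0, so no idle node should be stopped -- yet the logs below show
# two workers being drained and then re-created.
assert select_idle_nodes_to_terminate(["a", "b"], total_workers=3, min_workers=3) == []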

Autoscaler logs:

2024-09-09 09:11:41,933	INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:41,946	INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198599914.
2024-09-09 09:11:41,949	INFO event_logger.py:76 -- Removing 2 nodes of type workergroup (idle).
2024-09-09 09:11:41,950	INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): draining ray: idle for 242.003 secs > timeout=240.0 secs
2024-09-09 09:11:41,950	INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): draining ray: idle for 242.986 secs > timeout=240.0 secs
2024-09-09 09:11:41,955	INFO ray_stopper.py:116 -- Drained ray on 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2(success=True, msg=)
2024-09-09 09:11:41,959	INFO ray_stopper.py:116 -- Drained ray on 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914(success=True, msg=)
2024-09-09 09:11:46,986	INFO config.py:182 -- Calculating hashes for file mounts and ray commands.
2024-09-09 09:11:47,015	INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:47,028	INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198600006.
2024-09-09 09:11:47,030	INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): ray node 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2 is DEAD
2024-09-09 09:11:47,030	INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): ray node 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914 is DEAD
2024-09-09 09:11:47,031	INFO scheduler.py:1148 -- Adding 2 nodes to satisfy min count for node type: workergroup.
2024-09-09 09:11:47,032	INFO event_logger.py:56 -- Adding 2 node(s) of type workergroup.
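
To quantify the churn, the remove/add events in the autoscaler log can be tallied with a small script. This is only a debugging sketch; it assumes the log was saved locally (for example via kubectl logs on the autoscaler container) and that the message formats match the excerpts above.

import re
from collections import Counter

# Assumption: the autoscaler output was saved to autoscaler.log, e.g. with
#   kubectl logs <head-pod> -c autoscaler > autoscaler.log
events = Counter()
with open("autoscaler.log") as f:
    for line in f:
        if m := re.search(r"Removing (\d+) nodes of type ([\w-]+) \(idle\)", line):
            events[("removed_idle", m.group(2))] += int(m.group(1))
        elif m := re.search(r"Adding (\d+) nodes to satisfy min count for node type: ([\w-]+)", line):
            events[("added_for_min", m.group(2))] += int(m.group(1))

# When the cluster sits idle at minReplicas, both counters should stay at 0;
# in our cluster they keep growing together, which is the terminate/recreate loop.
for (kind, group), count in sorted(events.items()):
    print(f"{kind:15s} {group}: {count}")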

Versions / Dependencies

kuberay: v1.1.0
ray-cluster helm chart: v1.1.0
ray version: 2.34.0

Reproduction script

You can deploy a Ray cluster with the base ray-cluster Helm chart and the following value overrides:

head:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Conservative
    idleTimeoutSeconds: 240
  containerEnv:
    - name: RAY_enable_autoscaler_v2
      value: "1"

worker:
  replicas: 3
  minReplicas: 3
  maxReplicas: 10
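
In addition to the Helm values, the dip below minReplicas can be observed with a small driver script. This is a hedged sketch: it assumes it runs inside the cluster (address="auto"), and the constants mirror the minReplicas and idleTimeoutSeconds values above.

import time
import ray

MIN_WORKERS = 3        # worker.minReplicas from the values above
IDLE_TIMEOUT_S = 240   # autoscalerOptions.idleTimeoutSeconds from the values above

ray.init(address="auto")

def alive_nodes():
    # ray.nodes() includes the head node, so an idle but healthy cluster
    # should always report at least MIN_WORKERS + 1 alive nodes.
    return sum(1 for n in ray.nodes() if n["Alive"])

# Stay idle past the timeout and watch whether the cluster dips below the minimum.
deadline = time.time() + IDLE_TIMEOUT_S + 120
while time.time() < deadline:
    count = alive_nodes()
    print(f"alive nodes (head + workers): {count}")
    if count < MIN_WORKERS + 1:
        print("Reproduced: node count fell below minReplicas + head")
    time.sleep(15)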

Issue Severity

High: It blocks me from completing my task.

@vbalumuri vbalumuri added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 10, 2024
@vbalumuri vbalumuri changed the title [Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout [Cluster][Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout Sep 10, 2024
@anyscalesam anyscalesam added core-autoscaler autoscaler related issues P0.5 core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 11, 2024
anyscalesam (Contributor) commented:

@jjyao I'm putting this as p0.5 since it's on KubeRay and ASv2 is supposed to work... we could downgrade since v2 isn't quite GA yet...

vbalumuri (Author) commented Sep 13, 2024

> @jjyao I'm putting this as p0.5 since it's on KubeRay and ASv2 is supposed to work... we could downgrade since v2 isn't quite GA yet...

@anyscalesam Wouldn't this be related to Ray itself, since the autoscaler container is just the Ray image?
Yes, we could downgrade the autoscaler, but the previous version has its own issues and terminates Ray nodes even while tasks are still running.
