[Cluster][Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout #47578

Closed · vbalumuri opened this issue Sep 10, 2024 · 2 comments · Fixed by #48623

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), P1 (Issue that should be fixed within a few weeks)

vbalumuri commented Sep 10, 2024

What happened + What you expected to happen

The bug:
When the Ray cluster is idle, autoscaler v2 keeps terminating worker nodes after the idle timeout, ignoring the minimum worker count configured in the KubeRay chart. As a result, the number of Ray worker nodes drops below the configured minimum.
Example: consider a Ray cluster provisioned on Kubernetes with the KubeRay chart 1.1.0, autoscaler v2 enabled, and a minimum of 3 worker nodes. When the cluster is idle, the autoscaler terminates workers after idleTimeoutSeconds, so the active worker count briefly falls below 3 (to 2, 1, or sometimes 0).

Because of this constant terminate-and-recreate cycle, we sometimes see actors failing as follows:

ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: DB
    actor_id: c78dd85202773ed8f736b21c72000000
    namespace: 81a40ef8-077d-4f65-be29-4d4077c23a85
The actor died because its node has died. Node Id: xxxxxx 
   the actor's node was terminated: Termination of node that's idle for 244.98 seconds.
The actor never ran - it was cancelled before it started running.
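
As a stopgap while this bug is open, one workaround we are considering is letting Ray restart the affected actors instead of failing the task outright. This is only a hedged sketch, not a fix for the autoscaler itself: the DB actor below is a stand-in for our real actor, and it assumes the actor can safely re-run its constructor after a restart (any in-memory state is lost).

import ray

ray.init(address="auto")  # connect to the existing KubeRay cluster

# max_restarts lets Ray recreate the actor on another node if its node is
# terminated; max_task_retries re-submits the tasks that were in flight.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class DB:
    def __init__(self):
        # Assumption: construction is idempotent, so a restart is safe.
        self.store = {}

    def put(self, key, value):
        self.store[key] = value
        return True

db = DB.remote()
ray.get(db.put.remote("k", "v"))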

Expected behavior:
The autoscaler should honor the minimum worker count and should not terminate a node when doing so would bring the worker count below that value. Autoscaler v1 handled this correctly; the regression appeared after upgrading to autoscaler v2.
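
For illustration, the gating we expect from the scaler looks roughly like the sketch below. This is not the actual Ray v2 scheduler code; the function and parameter names are made up for the example, and it only expresses the invariant we expect the autoscaler to enforce.

def select_idle_nodes_to_terminate(idle_nodes, total_workers, min_workers):
    """Pick idle workers to stop without dropping below the configured minimum.

    idle_nodes: ids of workers whose idle time already exceeded idleTimeoutSeconds.
    total_workers: number of currently running workers in this worker group.
    min_workers: minReplicas of the worker group (3 in the repro below).
    """
    # Only terminate as many nodes as we can spare above the minimum.
    budget = max(0, total_workers - min_workers)
    return idle_nodes[:budget]

# With the settings from this report (3 running workers, minReplicas=3) the
# budget is 0, so no idle node should be stopped -- yet the logs below show
# two workers being drained and then re-created.
assert select_idle_nodes_to_terminate(["a", "b"], total_workers=3, min_workers=3) == []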

Autoscaler logs:

2024-09-09 09:11:41,933	INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:41,946	INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198599914.
2024-09-09 09:11:41,949	INFO event_logger.py:76 -- Removing 2 nodes of type workergroup (idle).
2024-09-09 09:11:41,950	INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): draining ray: idle for 242.003 secs > timeout=240.0 secs
2024-09-09 09:11:41,950	INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): draining ray: idle for 242.986 secs > timeout=240.0 secs
2024-09-09 09:11:41,955	INFO ray_stopper.py:116 -- Drained ray on 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2(success=True, msg=)
2024-09-09 09:11:41,959	INFO ray_stopper.py:116 -- Drained ray on 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914(success=True, msg=)
2024-09-09 09:11:46,986	INFO config.py:182 -- Calculating hashes for file mounts and ray commands.
2024-09-09 09:11:47,015	INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:47,028	INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198600006.
2024-09-09 09:11:47,030	INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): ray node 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2 is DEAD
2024-09-09 09:11:47,030	INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): ray node 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914 is DEAD
2024-09-09 09:11:47,031	INFO scheduler.py:1148 -- Adding 2 nodes to satisfy min count for node type: workergroup.
2024-09-09 09:11:47,032	INFO event_logger.py:56 -- Adding 2 node(s) of type workergroup.
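
To quantify the churn, the remove/add events in the autoscaler log can be tallied with a small script. This is only a debugging sketch; it assumes the log was saved locally (for example via kubectl logs on the autoscaler container) and that the message formats match the excerpts above.

import re
from collections import Counter

# Assumption: the autoscaler output was saved to autoscaler.log, e.g. with
#   kubectl logs <head-pod> -c autoscaler > autoscaler.log
events = Counter()
with open("autoscaler.log") as f:
    for line in f:
        if m := re.search(r"Removing (\d+) nodes of type ([\w-]+) \(idle\)", line):
            events[("removed_idle", m.group(2))] += int(m.group(1))
        elif m := re.search(r"Adding (\d+) nodes to satisfy min count for node type: ([\w-]+)", line):
            events[("added_for_min", m.group(2))] += int(m.group(1))

# When the cluster sits idle at minReplicas, both counters should stay at 0;
# in our cluster they keep growing together, which is the terminate/recreate loop.
for (kind, group), count in sorted(events.items()):
    print(f"{kind:15s} {group}: {count}")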

Versions / Dependencies

kuberay: v1.1.0
ray-cluster helm chart: v1.1.0
ray version: 2.34.0

Reproduction script

You can deploy a Ray cluster with the base ray-cluster Helm chart and the following value overrides:

head:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Conservative
    idleTimeoutSeconds: 240
  containerEnv:
    - name: RAY_enable_autoscaler_v2
      value: "1"

worker:
  replicas: 3
  minReplicas: 3
  maxReplicas: 10
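
In addition to the Helm values, the dip below minReplicas can be observed with a small driver script. This is a hedged sketch: it assumes it runs inside the cluster (address="auto"), and the constants mirror the minReplicas and idleTimeoutSeconds values above.

import time
import ray

MIN_WORKERS = 3        # worker.minReplicas from the values above
IDLE_TIMEOUT_S = 240   # autoscalerOptions.idleTimeoutSeconds from the values above

ray.init(address="auto")

def alive_nodes():
    # ray.nodes() includes the head node, so an idle but healthy cluster
    # should always report at least MIN_WORKERS + 1 alive nodes.
    return sum(1 for n in ray.nodes() if n["Alive"])

# Stay idle past the timeout and watch whether the cluster dips below the minimum.
deadline = time.time() + IDLE_TIMEOUT_S + 120
while time.time() < deadline:
    count = alive_nodes()
    print(f"alive nodes (head + workers): {count}")
    if count < MIN_WORKERS + 1:
        print("Reproduced: node count fell below minReplicas + head")
    time.sleep(15)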

Issue Severity

High: It blocks me from completing my task.

@vbalumuri vbalumuri added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 10, 2024
@vbalumuri vbalumuri changed the title [Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout [Cluster][Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout Sep 10, 2024
@anyscalesam anyscalesam added core-autoscaler autoscaler related issues P0.5 core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 11, 2024
anyscalesam (Contributor) commented:

@jjyao I'm putting this as p0.5 since it's on KubeRay and ASv2 is supposed to work... we could downgrade since v2 isn't quite GA yet...

vbalumuri (Author) commented Sep 13, 2024

> @jjyao I'm putting this as p0.5 since it's on KubeRay and ASv2 is supposed to work... we could downgrade since v2 isn't quite GA yet...

@anyscalesam Wouldn't this be related to Ray itself, since the autoscaler container is just the Ray image?
Yes, we could downgrade the autoscaler, but the previous version has its own issues and terminates Ray nodes even while tasks are still running.
