Description
Support elastic training on Ray Cluster.
Motivation/Background
Training can tolerate node failures.
The number of worker nodes can expand as the size of the cluster grows.
Detailed Proposal
Based on the current implementation, there are two major steps for this feature:
- Support expanding the placement groups for command actors on the fly (ray: support expanding placement group on the fly, support torch elastic #559); see the sketch after this list.
- Support fault tolerance, which depends on the fault-tolerance behavior of Ray itself.
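For the first step, here is a minimal sketch under today's Ray API, assuming a hypothetical `CommandActor` wrapper and an `expand_workers` helper (both illustrative, not part of Ray): since an existing placement group cannot be resized yet, the driver creates one additional placement group per new worker and schedules a new command actor onto it.

```python
# Minimal sketch; assumes ray.init() has already been called by the driver.
# CommandActor and expand_workers are hypothetical names for illustration.
import subprocess

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
class CommandActor:
    """Hypothetical wrapper that runs one elastic training worker process."""

    def run(self, cmd: str) -> int:
        return subprocess.call(cmd, shell=True)


def expand_workers(num_new_workers: int, cpus_per_worker: int = 1):
    """Create one new placement group per new worker and start a command actor on it."""
    actors = []
    for _ in range(num_new_workers):
        # Reserve resources for one worker; Ray cannot grow an existing
        # placement group today, so we add a new group instead.
        pg = placement_group([{"CPU": cpus_per_worker}], strategy="STRICT_PACK")
        ray.get(pg.ready())  # block until the bundle is reserved
        actor = CommandActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=0
            )
        ).remote()
        actors.append((pg, actor))
    return actors
```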
Ray Placement Group already supports fault tolerance: when a node dies, the GCS reschedules the placement groups on that node to other nodes. This introduces a problem: how do we know when a node is dead and which placement groups are being recreated? We must restart the command actor on every placement group that has been rescheduled, because those placement groups are not removed until training ends and they reserve resources that cannot be used by others. Currently there are two possible ways to achieve this (a polling sketch follows this list):
- Disable the fault tolerance feature of Ray Placement Group; then we need to find a way to monitor the living placement groups ourselves.
- Let the Ray GCS notify the main process when placement groups are being rescheduled, so that we can restart the command actors on those placement groups once they have been rescheduled.
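A minimal sketch of the monitoring side, assuming the driver polls `ray.util.placement_group_table()` instead of receiving GCS callbacks; the `RESCHEDULING` state check and the `restart_command_actor` hook are assumptions for illustration and may differ across Ray versions:

```python
# Minimal polling sketch; assumes ray.init() has already been called.
# restart_command_actor is a hypothetical hook supplied by the training driver.
import time

import ray


def monitor_placement_groups(tracked_pg_ids, restart_command_actor, interval_s: float = 5.0):
    """Poll placement group states and restart command actors whose groups were rescheduled."""
    while True:
        table = ray.util.placement_group_table()
        for pg_id in tracked_pg_ids:
            info = table.get(pg_id)
            if info is None:
                continue
            # A RESCHEDULING state indicates the GCS is moving the bundles to
            # another node after a failure; the old command actor is gone and
            # must be restarted on the recreated placement group.
            if info.get("state") == "RESCHEDULING":
                restart_command_actor(pg_id)
        time.sleep(interval_s)
```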
Additional context/links
Ray Placement Group
Support expanding the placement groups for command actors on the fly
Enable Notification on Node failure