Description
Support elastic training on Ray Cluster.
Motivation/Background
Training can tolerate node failures.
The number of worker nodes can expand as the size of the cluster grows.
Detailed Proposal
Based on the current implementation, there are two major steps for this feature:
- Support expanding the placement groups for command actors on the fly (ray: support expanding placement group on the fly, support torch elastic #559); see the sketch after this list.
- Support fault tolerance, which depends on the fault-tolerance behavior of Ray itself.
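For the first step, here is a minimal sketch under today's Ray API, assuming a hypothetical `CommandActor` wrapper and an `expand_workers` helper (both illustrative, not part of Ray): since an existing placement group cannot be resized yet, the driver creates one additional placement group per new worker and schedules a new command actor onto it.

```python
# Minimal sketch; assumes ray.init() has already been called by the driver.
# CommandActor and expand_workers are hypothetical names for illustration.
import subprocess

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
class CommandActor:
    """Hypothetical wrapper that runs one elastic training worker process."""

    def run(self, cmd: str) -> int:
        return subprocess.call(cmd, shell=True)


def expand_workers(num_new_workers: int, cpus_per_worker: int = 1):
    """Create one new placement group per new worker and start a command actor on it."""
    actors = []
    for _ in range(num_new_workers):
        # Reserve resources for one worker; Ray cannot grow an existing
        # placement group today, so we add a new group instead.
        pg = placement_group([{"CPU": cpus_per_worker}], strategy="STRICT_PACK")
        ray.get(pg.ready())  # block until the bundle is reserved
        actor = CommandActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=0
            )
        ).remote()
        actors.append((pg, actor))
    return actors
```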
Ray Placement Group already supports fault tolerance: when a node dies, the GCS reschedules the placement groups on that node to other nodes. This introduces a problem: how do we know when a node is dead and which placement groups are being recreated? We must restart the command actor on every placement group that has been rescheduled, because those placement groups are not removed until training ends and they reserve resources that cannot be used by others. Currently there are two possible ways to achieve this (a polling sketch follows this list):
- Disable the fault tolerance feature of Ray Placement Group; then we need to find a way to monitor the living placement groups ourselves.
- Let the Ray GCS notify the main process when placement groups are being rescheduled, so that we can restart the command actors on those placement groups once they have been rescheduled.
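A minimal sketch of the monitoring side, assuming the driver polls `ray.util.placement_group_table()` instead of receiving GCS callbacks; the `RESCHEDULING` state check and the `restart_command_actor` hook are assumptions for illustration and may differ across Ray versions:

```python
# Minimal polling sketch; assumes ray.init() has already been called.
# restart_command_actor is a hypothetical hook supplied by the training driver.
import time

import ray


def monitor_placement_groups(tracked_pg_ids, restart_command_actor, interval_s: float = 5.0):
    """Poll placement group states and restart command actors whose groups were rescheduled."""
    while True:
        table = ray.util.placement_group_table()
        for pg_id in tracked_pg_ids:
            info = table.get(pg_id)
            if info is None:
                continue
            # A RESCHEDULING state indicates the GCS is moving the bundles to
            # another node after a failure; the old command actor is gone and
            # must be restarted on the recreated placement group.
            if info.get("state") == "RESCHEDULING":
                restart_command_actor(pg_id)
        time.sleep(interval_s)
```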
Additional context/links
Ray Placement Group
Support expanding the placement groups for command actors on the fly
Enable Notification on Node failure