zozhang/dgxc executor data mover #206
Conversation
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
def create_distributed_job(
    self, token: str, project_id: str, cluster_id: str, name: str, cmd: list[str]
):

def copy_directory_data_command(self, local_dir_path: str, dest_path: str) -> str:
Is it possible to use rsync here?
From my research, rsync does not seem to have support for data movement to a kubernetes pod.
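For context on why a tar-based copy works where rsync does not: `kubectl cp` itself has no special transport either; it tars the directory locally and untars it inside the pod over an exec stream. A minimal sketch of that mechanism is below. The helper name `build_pod_copy_command` is hypothetical and is not the PR's `copy_directory_data_command`; it only illustrates the tar-over-exec pattern.

```python
import shlex


def build_pod_copy_command(
    local_dir_path: str, pod: str, namespace: str, dest_path: str
) -> str:
    """Build a kubectl-cp-style shell command (hypothetical helper).

    Tars the local directory to stdout and untars it inside the pod via
    `kubectl exec -i` -- the same mechanism `kubectl cp` uses, since
    rsync has no native transport into a Kubernetes pod.
    """
    return (
        f"tar cf - -C {shlex.quote(local_dir_path)} . | "
        f"kubectl exec -i -n {shlex.quote(namespace)} {shlex.quote(pod)} "
        f"-- tar xf - -C {shlex.quote(dest_path)}"
    )
```

The same stream could also be driven from the Kubernetes Python SDK instead of shelling out to `kubectl`, at the cost of more plumbing.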
| ) | ||
| return response | ||
|
|
||
def move_data(self, token: str, project_id: str, cluster_id: str, sleep: float = 10) -> None:
Is it possible to keep a single data mover workload alive? Creating and deleting for each job will result in longer submission times.
I think that would be a good next step!
Our biggest concern was that the data mover workload takes up unnecessary space on a GPU node if it stays open, since the run:ai scheduler does not prevent CPU workloads from being scheduled onto GPU nodes. We would also have to build Kubernetes Python SDK support to copy data into the data mover workload.
In testing, the data mover workload is created and deleted fairly quickly, so for now I think this is the quickest way to build this support in.
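The create-copy-delete flow discussed above can be sketched generically. This is not the PR's `move_data` implementation; it is a hypothetical skeleton with the cluster calls injected as callables, showing the polling loop and the cleanup-on-failure behavior one would want:

```python
import time
from typing import Callable


def move_data_lifecycle(
    create_workload: Callable[[], str],
    get_status: Callable[[str], str],
    copy_data: Callable[[str], None],
    delete_workload: Callable[[str], None],
    sleep: float = 10,
    timeout: float = 600,
) -> None:
    """Hypothetical sketch of the data mover lifecycle described above.

    Spins up a CPU-only data mover workload, polls until it reports
    Running, copies the job directory in, then deletes the workload --
    even if the copy fails -- so nothing lingers on a node.
    """
    workload_id = create_workload()
    try:
        deadline = time.monotonic() + timeout
        while get_status(workload_id) != "Running":
            if time.monotonic() > deadline:
                raise TimeoutError("data mover workload never became ready")
            time.sleep(sleep)
        copy_data(workload_id)
    finally:
        # Delete the workload unconditionally for cleanliness.
        delete_workload(workload_id)
```

Keeping a long-lived mover (the suggestion above) would replace the create/delete calls with a lookup, trading submission latency for node occupancy.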
This PR creates a `move_data` function that moves the local job directory into a PVC before launching a job. This removes the need to launch jobs from inside the DGXC cluster.

- Creates a CPU-only workload, zips and moves the data, and then deletes the workload for cleanliness
- Updates the `status` function to use the workloads API instead of the distributed-training-specific API path
- Uses `/nemo_run/configs` as the path, similar to Skyhook and Slurm
- Moves `launch_script` into the `launch` function before data movement
- Adds `pvc_nemo_run_dir` and `pvc_job_dir` variables
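The PVC path variables mentioned in the description might look like the following. This is an assumed layout, reconstructed only from the stated `/nemo_run/configs` base path; the actual variable shapes in the PR may differ:

```python
from pathlib import PurePosixPath

# Assumed base path on the PVC, per the description's /nemo_run/configs
# convention shared with the Skyhook and Slurm executors.
PVC_NEMO_RUN_DIR = PurePosixPath("/nemo_run")


def pvc_job_dir(job_name: str) -> PurePosixPath:
    """Per-job directory under the PVC's configs path (assumed layout)."""
    return PVC_NEMO_RUN_DIR / "configs" / job_name
```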