
zozhang/dgxc executor data mover #206

Merged
hemildesai merged 15 commits intoNVIDIA-NeMo:mainfrom
zoeyz101:zozhang/dgxc-copy-data
Apr 18, 2025

Conversation

Contributor

@zoeyz101 commented Apr 9, 2025

This PR creates a move_data function that moves the local job directory into a PVC before launching a job. This removes the need to launch jobs from inside the DGXC Cluster.

  • adds a data movement step that moves data to an indicated PVC path before launching jobs with the DGXC Executor
    - this creates a CPU-only workload, zips and moves the data, and then deletes the workload for cleanliness
  • modifies the status function to use the workloads API instead of the distributed-training-specific API path
  • modifies the package configs to use /nemo_run/configs as the path, similar to Skyhook and Slurm
  • moves the creation of the launch_script into the launch function, before data movement
  • adds pvc_nemo_run_dir and pvc_job_dir variables
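The zip step in the first bullet could be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation; the helper name make_job_archive is invented here. It packages the local job directory into an in-memory tar.gz archive that could then be streamed into the CPU-only data-mover workload and unpacked at the PVC path:

```python
import io
import tarfile


def make_job_archive(local_dir_path: str) -> bytes:
    """Package a local job directory into an in-memory tar.gz archive.

    The returned bytes can be streamed into a CPU-only data-mover pod,
    which unpacks them at the target PVC path before the pod is deleted.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # arcname="." keeps paths relative, so the archive can be
        # extracted directly into the destination PVC directory.
        tar.add(local_dir_path, arcname=".")
    return buf.getvalue()
```

Keeping the archive in memory avoids leaving temporary files behind on the submitting machine; for very large job directories a spooled temporary file would be the safer choice.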

@roclark (Contributor) left a comment


Great stuff, @zoeyz101, thanks for putting this together!

zoeyz101 added 9 commits April 9, 2025 22:59
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…de to all use status to get workload status

@zoeyz101 zoeyz101 force-pushed the zozhang/dgxc-copy-data branch from f5bc150 to d5f3d04 Compare April 10, 2025 06:00
zoeyz101 added 2 commits April 9, 2025 23:10
@zoeyz101 zoeyz101 marked this pull request as ready for review April 10, 2025 17:56
roclark previously approved these changes Apr 10, 2025
def create_distributed_job(
self, token: str, project_id: str, cluster_id: str, name: str, cmd: list[str]
):
def copy_directory_data_command(self, local_dir_path: str, dest_path: str) -> str:
Contributor


Is it possible to use rsync here?

Contributor Author


From my research, rsync does not seem to have support for data movement to a kubernetes pod.
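(For context, the standard mechanism for copying into a pod is tar piped over an exec session, which is what kubectl cp does under the hood and why rsync has no direct pod support. A hypothetical sketch of a command builder in that style — the signature mirrors the copy_directory_data_command shown above, but the body is illustrative, not the PR's code:

```python
import shlex


def build_pod_copy_command(local_dir_path: str, dest_path: str,
                           pod: str, namespace: str = "default") -> str:
    """Build a shell command that streams a local directory into a pod.

    Uses the tar-over-exec pattern (the same mechanism as `kubectl cp`):
    tar the source directory to stdout, pipe it into `kubectl exec`, and
    untar inside the pod at the destination PVC path.
    """
    src = shlex.quote(local_dir_path)
    dst = shlex.quote(dest_path)
    target = shlex.quote(pod)
    ns = shlex.quote(namespace)
    return (
        f"tar cf - -C {src} . | "
        f"kubectl exec -i -n {ns} {target} -- tar xf - -C {dst}"
    )
```

shlex.quote guards against paths with spaces or shell metacharacters; the -i flag keeps kubectl's stdin open so the piped tar stream reaches the pod.)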

)
return response

def move_data(self, token: str, project_id: str, cluster_id: str, sleep: float = 10) -> None:
Contributor


Is it possible to keep a single data mover workload alive? Creating and deleting for each job will result in longer submission times.

Contributor Author


I think that would be a good next step!

Our biggest concern was that the data mover workload takes up unnecessary space on a GPU node if it stays open, since the run:ai scheduler does not prevent CPU workloads from being scheduled onto GPU nodes. We would also have to build Kubernetes Python SDK support to copy data into the data mover workload.
In testing, the data mover workload seems to be created and deleted pretty quickly, so for now I think this is the quickest way to build this support in.

hemildesai previously approved these changes Apr 14, 2025
)
self.experiment_id = exp_id

def get_launcher_prefix(self) -> Optional[list[str]]:

Check notice — Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns: mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
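(The CodeQL note above is about functions where some paths return a value and others fall off the end. A minimal illustration — the function body here is invented, not the code under review:

```python
from typing import Optional


def get_launcher_prefix(enabled: bool) -> Optional[list[str]]:
    """Return a launcher prefix when enabled, else None."""
    if enabled:
        return ["torchrun"]
    # Without this line the function would fall through and implicitly
    # return None, which is what triggers CodeQL's mixed-return notice.
    return None
```

Making the final return explicit silences the warning and documents that returning None is intentional rather than an overlooked code path.)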
@hemildesai hemildesai merged commit 0d271a9 into NVIDIA-NeMo:main Apr 18, 2025
18 of 20 checks passed