
zozhang/dgxc executor data mover #206

Merged
hemildesai merged 15 commits intoNVIDIA-NeMo:mainfrom
zoeyz101:zozhang/dgxc-copy-data
Apr 18, 2025

Conversation

Contributor

@zoeyz101 commented Apr 9, 2025

This PR creates a move_data function that moves the local job directory into a PVC before launching a job. This removes the need to launch jobs from inside the DGXC Cluster.

  • adds a data movement step that moves data to an indicated PVC path before launching jobs with the DGXC Executor
    - this creates a CPU-only workload, zips and moves the data, and then deletes the workload for cleanliness
  • modifies the status function to use the workloads API instead of the distributed-training-specific API path
  • modifies the package configs to use /nemo_run/configs as the path, similar to Skyhook and Slurm
  • moves the creation of the launch_script into the launch function, before data movement
  • adds pvc_nemo_run_dir and pvc_job_dir variables
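The zip step in the first bullet could be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation; the helper name make_job_archive is invented here. It packages the local job directory into an in-memory tar.gz archive that could then be streamed into the CPU-only data-mover workload and unpacked at the PVC path:

```python
import io
import tarfile


def make_job_archive(local_dir_path: str) -> bytes:
    """Package a local job directory into an in-memory tar.gz archive.

    The returned bytes can be streamed into a CPU-only data-mover pod,
    which unpacks them at the target PVC path before the pod is deleted.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # arcname="." keeps paths relative, so the archive can be
        # extracted directly into the destination PVC directory.
        tar.add(local_dir_path, arcname=".")
    return buf.getvalue()
```

Keeping the archive in memory avoids leaving temporary files behind on the submitting machine; for very large job directories a spooled temporary file would be the safer choice.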

@roclark (Contributor) left a comment


Great stuff, @zoeyz101, thanks for putting this together!

zoeyz101 added 9 commits April 9, 2025 22:59
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…de to all use status to get workload status

@zoeyz101 zoeyz101 force-pushed the zozhang/dgxc-copy-data branch from f5bc150 to d5f3d04 Compare April 10, 2025 06:00
zoeyz101 added 2 commits April 9, 2025 23:10
@zoeyz101 zoeyz101 marked this pull request as ready for review April 10, 2025 17:56
roclark previously approved these changes Apr 10, 2025
def create_distributed_job(
self, token: str, project_id: str, cluster_id: str, name: str, cmd: list[str]
):
def copy_directory_data_command(self, local_dir_path: str, dest_path: str) -> str:
Contributor


Is it possible to use rsync here?

Contributor Author


From my research, rsync does not seem to have support for data movement to a kubernetes pod.
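(For context, the standard mechanism for copying into a pod is tar piped over an exec session, which is what kubectl cp does under the hood and why rsync has no direct pod support. A hypothetical sketch of a command builder in that style — the signature mirrors the copy_directory_data_command shown above, but the body is illustrative, not the PR's code:

```python
import shlex


def build_pod_copy_command(local_dir_path: str, dest_path: str,
                           pod: str, namespace: str = "default") -> str:
    """Build a shell command that streams a local directory into a pod.

    Uses the tar-over-exec pattern (the same mechanism as `kubectl cp`):
    tar the source directory to stdout, pipe it into `kubectl exec`, and
    untar inside the pod at the destination PVC path.
    """
    src = shlex.quote(local_dir_path)
    dst = shlex.quote(dest_path)
    target = shlex.quote(pod)
    ns = shlex.quote(namespace)
    return (
        f"tar cf - -C {src} . | "
        f"kubectl exec -i -n {ns} {target} -- tar xf - -C {dst}"
    )
```

shlex.quote guards against paths with spaces or shell metacharacters; the -i flag keeps kubectl's stdin open so the piped tar stream reaches the pod.)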

)
return response

def move_data(self, token: str, project_id: str, cluster_id: str, sleep: float = 10) -> None:
Contributor


Is it possible to keep a single data mover workload alive? Creating and deleting for each job will result in longer submission times.

Contributor Author


I think that would be a good next step!

Our biggest concern was that the data mover workload takes up unnecessary space on a GPU node if it stays open, since the run:ai scheduler does not prevent CPU workloads from being scheduled onto GPU nodes. We would also have to build Kubernetes Python SDK support to copy data into the data mover workload.
In testing, the data mover workload seems to be created and deleted pretty quickly, so for now I think this is the quickest way to build this support in.

hemildesai previously approved these changes Apr 14, 2025
)
self.experiment_id = exp_id

def get_launcher_prefix(self) -> Optional[list[str]]:

Check notice — Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns: mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
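(The CodeQL note above is about functions where some paths return a value and others fall off the end. A minimal illustration — the function body here is invented, not the code under review:

```python
from typing import Optional


def get_launcher_prefix(enabled: bool) -> Optional[list[str]]:
    """Return a launcher prefix when enabled, else None."""
    if enabled:
        return ["torchrun"]
    # Without this line the function would fall through and implicitly
    # return None, which is what triggers CodeQL's mixed-return notice.
    return None
```

Making the final return explicit silences the warning and documents that returning None is intentional rather than an overlooked code path.)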
@hemildesai hemildesai merged commit 0d271a9 into NVIDIA-NeMo:main Apr 18, 2025
18 of 20 checks passed