
Add LeptonExecutor support#224

Merged
roclark merged 5 commits into NVIDIA-NeMo:main from roclark:roclark/leptonai
May 14, 2025

Conversation

@roclark
Contributor

@roclark roclark commented May 8, 2025

The LeptonExecutor adds support for launching jobs on NVIDIA DGX Cloud Lepton using the Lepton Python SDK. The Lepton SDK must be installed with pip install nemo-run[lepton] or pip install leptonai, and users must authenticate with the cluster using the command shown in the DGX Cloud Lepton UI on the Settings > Tokens page.
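For reference, setup might look like the following sketch. The exact login command shown in the Lepton UI should be preferred; the workspace and token values below are placeholders, not real credentials:

```shell
# Install NeMo-Run with the Lepton extra (or the Lepton SDK directly)
pip install "nemo-run[lepton]"   # or: pip install leptonai

# Authenticate against the cluster. Copy the exact command from the
# DGX Cloud Lepton UI under Settings > Tokens; it resembles:
lep login -c <workspace-id>:<token>
```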

The LeptonExecutor can be defined using the following example:

import nemo_run as run

def your_lepton_executor(nodes: int, gpus_per_node: int, container_image: str):
    # Ensure these are set correctly for your DGX Cloud environment.
    # You might fetch these from environment variables or a config file.
    resource_shape = "gpu.8xh100-80gb"  # Replace with your desired resource shape; it defines the GPUs in a pod
    node_group = "my-node-group"  # The node group to run the job in
    nemo_run_dir = "/nemo-workspace/nemo-run"  # The NeMo-Run directory where experiments are saved
    # Define the remote storage directory that will be mounted in the job pods.
    # Ensure the path specified here contains your NEMORUN_HOME.
    storage_path = "/nemo-workspace"  # The remote storage directory to mount in jobs
    mount_path = "/nemo-workspace"  # Where the remote storage directory is mounted inside the container

    executor = run.LeptonExecutor(
        resource_shape=resource_shape,
        node_group=node_group,
        container_image=container_image,
        nodes=nodes,
        nemo_run_dir=nemo_run_dir,
        gpus_per_node=gpus_per_node,
        mounts=[{"path": storage_path, "mount_path": mount_path}],
        # Optional: add custom environment variables if needed.
        # common_envs() stands in for a user-defined helper returning a dict.
        env_vars=common_envs(),
        # packager=run.GitArchivePackager()  # Choose an appropriate packager
    )
    return executor

# Example usage:
executor = your_lepton_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image")

Launching jobs with the LeptonExecutor will queue a Batch Job on the DGX Cloud Lepton cluster in the specified node group.
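As a sketch of end-to-end usage (assuming a typical nemo-run experiment workflow; the experiment name and inline task here are illustrative, not from the PR):

```python
import nemo_run as run

# Build the executor using the helper defined above.
executor = your_lepton_executor(
    nodes=4, gpus_per_node=8, container_image="your-nemo-image"
)

# Wrap a task in an experiment and submit it to the Lepton cluster.
with run.Experiment("lepton-demo") as exp:
    # run.Script is one simple task type; any nemo-run task works here.
    exp.add(run.Script(inline="nvidia-smi"), executor=executor)
    exp.run(detach=True)  # queue the Batch Job and return without blocking
```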

@roclark roclark requested review from ericharper and hemildesai May 8, 2025 17:46
@roclark roclark added the enhancement New feature or request label May 8, 2025
Code scanning (CodeQL) posted three notices on the diff:

- On the stub def cleanup(self, handle: str): ... — "Statement has no effect."
- In get_launcher_prefix(self) -> Optional[list[str]] — "Explicit returns mixed with implicit (fall through) returns." Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
- On the stub def list(self) -> list[ListAppResponse]: ... — "Statement has no effect."
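The mixed-returns notice is straightforward to address. A minimal illustration (using a hypothetical nsys_profile flag, not the PR's actual code) of making every path return explicitly:

```python
from typing import Optional

def get_launcher_prefix(nsys_profile: bool) -> Optional[list[str]]:
    """Return a command prefix, making the None path explicit.

    CodeQL flags functions that mix an explicit return on one path with
    falling off the end on another; ending with `return None` makes the
    intent unambiguous.
    """
    if nsys_profile:
        return ["nsys", "profile"]
    return None  # explicit, rather than an implicit fall-through
```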
Contributor

@hemildesai hemildesai left a comment


Mostly LGTM, can you make sure the format, lint and test checks pass?

@hemildesai
Contributor

Also, can you add some basic tests or create an issue to add tests for the lepton executor later?

@roclark
Contributor Author

roclark commented May 9, 2025

Thanks for the review, @hemildesai! I will clear up the failing tests here and add some new tests with the PR early next week. I changed up some more functionality today based on feedback I got offline on the storage API as well.

roclark added 4 commits May 12, 2025 07:33
NVIDIA DGX Cloud Lepton is another platform available for launching
distributed jobs using NeMo-Run. The new LeptonExecutor leverages the
Lepton Python SDK to authenticate with a DGX Cloud Lepton cluster and
launch jobs on available resources.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Allow users to specify custom mounts using Lepton's Filesystem
functionality.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Handle more possible failure scenarios for the LeptonExecutor where the
code could run into a bad state and the user should be alerted with
helpful debug info.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Running a low-resource pod to copy experiment data to the Lepton cluster
is more reliable and broadly compatible with various cluster types
versus the storage API.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
@roclark
Contributor Author

roclark commented May 12, 2025

@hemildesai, added some tests and cleared up everything else that was failing in the PR. Let me know if there's anything else I need to change, thanks! :)

Contributor

@hemildesai hemildesai left a comment


LGTM, just one small comment.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
@roclark
Contributor Author

roclark commented May 12, 2025

Alright, should be resolved now.

Contributor

@hemildesai hemildesai left a comment


Looks great, thanks 🎉

@roclark roclark merged commit 78f54ee into NVIDIA-NeMo:main May 14, 2025
18 of 20 checks passed
@roclark roclark deleted the roclark/leptonai branch May 14, 2025 15:09
