
Add LeptonExecutor support#224

Merged
roclark merged 5 commits into NVIDIA-NeMo:main from roclark:roclark/leptonai
May 14, 2025

Conversation

@roclark
Contributor

@roclark roclark commented May 8, 2025

The LeptonExecutor adds support for launching jobs on NVIDIA DGX Cloud Lepton using the Lepton Python SDK. The Lepton SDK must be installed with pip install nemo-run[lepton] or pip install leptonai, and users must authenticate with the cluster using the command shown in the DGX Cloud Lepton UI on the Settings > Tokens page.
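For reference, setup might look like the following sketch. The exact login command shown in the Lepton UI should be preferred; the workspace and token values below are placeholders, not real credentials:

```shell
# Install NeMo-Run with the Lepton extra (or the Lepton SDK directly)
pip install "nemo-run[lepton]"   # or: pip install leptonai

# Authenticate against the cluster. Copy the exact command from the
# DGX Cloud Lepton UI under Settings > Tokens; it resembles:
lep login -c <workspace-id>:<token>
```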

The LeptonExecutor can be defined using the following example:

import nemo_run as run

def your_lepton_executor(nodes: int, gpus_per_node: int, container_image: str):
    # Ensure these are set correctly for your DGX Cloud environment.
    # You might fetch these from environment variables or a config file.
    resource_shape = "gpu.8xh100-80gb"  # Replace with your desired resource shape; it defines the GPUs in a pod
    node_group = "my-node-group"  # The node group to run the job in
    nemo_run_dir = "/nemo-workspace/nemo-run"  # The NeMo-Run directory where experiments are saved
    # Define the remote storage directory that will be mounted in the job pods.
    # Ensure the path specified here contains your NEMORUN_HOME.
    storage_path = "/nemo-workspace"  # The remote storage directory to mount in jobs
    mount_path = "/nemo-workspace"  # Where the remote storage directory is mounted inside the container

    executor = run.LeptonExecutor(
        resource_shape=resource_shape,
        node_group=node_group,
        container_image=container_image,
        nodes=nodes,
        nemo_run_dir=nemo_run_dir,
        gpus_per_node=gpus_per_node,
        mounts=[{"path": storage_path, "mount_path": mount_path}],
        # Optional: add custom environment variables if needed.
        # common_envs() stands in for a user-defined helper returning a dict.
        env_vars=common_envs(),
        # packager=run.GitArchivePackager()  # Choose an appropriate packager
    )
    return executor

# Example usage:
executor = your_lepton_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image")

Launching jobs with the LeptonExecutor will queue a Batch Job on the DGX Cloud Lepton cluster in the specified node group.
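As a sketch of end-to-end usage (assuming a typical nemo-run experiment workflow; the experiment name and inline task here are illustrative, not from the PR):

```python
import nemo_run as run

# Build the executor using the helper defined above.
executor = your_lepton_executor(
    nodes=4, gpus_per_node=8, container_image="your-nemo-image"
)

# Wrap a task in an experiment and submit it to the Lepton cluster.
with run.Experiment("lepton-demo") as exp:
    # run.Script is one simple task type; any nemo-run task works here.
    exp.add(run.Script(inline="nvidia-smi"), executor=executor)
    exp.run(detach=True)  # queue the Batch Job and return without blocking
```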

@roclark roclark requested review from ericharper and hemildesai May 8, 2025 17:46
@roclark roclark added the enhancement New feature or request label May 8, 2025
Code scanning (CodeQL) posted three notices on the diff:

- On the stub def cleanup(self, handle: str): ... — "Statement has no effect."
- In get_launcher_prefix(self) -> Optional[list[str]] — "Explicit returns mixed with implicit (fall through) returns." Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
- On the stub def list(self) -> list[ListAppResponse]: ... — "Statement has no effect."
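The mixed-returns notice is straightforward to address. A minimal illustration (using a hypothetical nsys_profile flag, not the PR's actual code) of making every path return explicitly:

```python
from typing import Optional

def get_launcher_prefix(nsys_profile: bool) -> Optional[list[str]]:
    """Return a command prefix, making the None path explicit.

    CodeQL flags functions that mix an explicit return on one path with
    falling off the end on another; ending with `return None` makes the
    intent unambiguous.
    """
    if nsys_profile:
        return ["nsys", "profile"]
    return None  # explicit, rather than an implicit fall-through
```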
Contributor

@hemildesai hemildesai left a comment


Mostly LGTM, can you make sure the format, lint and test checks pass?

@hemildesai
Contributor

Also, can you add some basic tests or create an issue to add tests for the lepton executor later?

@roclark
Contributor Author

roclark commented May 9, 2025

Thanks for the review, @hemildesai! I will clear up the failing tests here and add some new tests with the PR early next week. I changed up some more functionality today based on feedback I got offline on the storage API as well.

roclark added 4 commits May 12, 2025 07:33
NVIDIA DGX Cloud Lepton is another platform available for launching
distributed jobs using NeMo-Run. The new LeptonExecutor leverages the
Lepton Python SDK to authenticate with a DGX Cloud Lepton cluster and
launch jobs on available resources.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Allow users to specify custom mounts using Lepton's Filesystem
functionality.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Handle more possible failure scenarios for the LeptonExecutor where the
code could run into a bad state and the user should be alerted with
helpful debug info.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Running a low-resource pod to copy experiment data to the Lepton cluster
is more reliable and broadly compatible with various cluster types
versus the storage API.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
@roclark
Contributor Author

roclark commented May 12, 2025

@hemildesai, added some tests and cleared up everything else that was failing in the PR. Let me know if there's anything else I need to change, thanks! :)

Contributor

@hemildesai hemildesai left a comment


LGTM, just one small comment.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
@roclark
Contributor Author

roclark commented May 12, 2025

Alright, should be resolved now.

Contributor

@hemildesai hemildesai left a comment


Looks great, thanks 🎉

@roclark roclark merged commit 78f54ee into NVIDIA-NeMo:main May 14, 2025
18 of 20 checks passed
@roclark roclark deleted the roclark/leptonai branch May 14, 2025 15:09
