for line in logs:
    print(line)

def cleanup(self, handle: str): ...
Check notice
Code scanning / CodeQL
Statement has no effect Note
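The note above is raised because a bare `...` is an expression statement that does nothing. One way to keep an intentional no-op while making that intent explicit is a docstring plus an explicit return. This is a minimal sketch only: the class name `ExampleExecutor` and the docstring wording are hypothetical and not taken from the PR.

```python
class ExampleExecutor:
    """Hypothetical stand-in for an executor class; not the PR's LeptonExecutor."""

    def cleanup(self, handle: str) -> None:
        """Intentionally a no-op: this sketch assumes nothing needs releasing."""
        # An explicit return documents that doing nothing is deliberate.
        return None


# Calling the method is safe and returns None.
result = ExampleExecutor().cleanup("job-handle")
```

Whether a no-op, a docstring-only body, or `raise NotImplementedError` is the right choice depends on whether callers are expected to invoke the method.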
)
self.experiment_id = exp_id

def get_launcher_prefix(self) -> Optional[list[str]]:
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns Note
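The note above refers to functions where some paths return a value and others fall off the end, implicitly returning `None`. A minimal sketch of the usual fix is to return explicitly on every path. The function below is a hypothetical stand-in: its parameter and the `"torchrun"` value are illustrative assumptions, not the PR's implementation.

```python
from typing import Optional


def get_launcher_prefix(launcher: Optional[str]) -> Optional[list[str]]:
    """Hypothetical example: map a launcher name to a command prefix."""
    if launcher == "torchrun":
        return ["torchrun"]
    # Explicit return on the fall-through path, so every branch returns visibly.
    return None
```

With the final `return None` written out, a reader (and the analyzer) can see that returning `None` for unknown launchers is intended rather than an oversight.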
return None
executor.cancel(job_id)

def list(self) -> list[ListAppResponse]: ...
Check notice
Code scanning / CodeQL
Statement has no effect Note
hemildesai
left a comment
Mostly LGTM, can you make sure the format, lint and test checks pass?
Also, can you add some basic tests or create an issue to add tests for the Lepton executor later?

Thanks for the review, @hemildesai! I will clear up the failing tests here and add some new tests with the PR early next week. I changed up some more functionality today based on feedback I got offline on the storage API as well.
NVIDIA DGX Cloud Lepton is another platform available for launching distributed jobs using NeMo-Run. The new LeptonExecutor leverages the Lepton Python SDK to authenticate with a DGX Cloud Lepton cluster and launch jobs on available resources. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Allow users to specify custom mounts using Lepton's Filesystem functionality. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Handle more possible failure scenarios for the LeptonExecutor where the code could run into a bad state and the user should be alerted with helpful debug info. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Running a low-resource pod to copy experiment data to the Lepton cluster is more reliable and more broadly compatible across cluster types than using the storage API. Signed-Off-By: Robert Clark <roclark@nvidia.com>

@hemildesai, added some tests and cleared up everything else that was failing in the PR. Let me know if there's anything else I need to change, thanks! :)
hemildesai
left a comment
LGTM, just one small comment.
Signed-Off-By: Robert Clark <roclark@nvidia.com>

Alright, should be resolved now.
hemildesai
left a comment
Looks great, thanks 🎉
The LeptonExecutor adds support for launching jobs on NVIDIA DGX Cloud Lepton using the Lepton Python SDK. Install the Lepton SDK with `pip install nemo-run[lepton]` or `pip install leptonai`, then authenticate with the cluster using the command available in the DGX Cloud Lepton UI on the Settings > Tokens page.
The LeptonExecutor can be defined using the following example:
Launching jobs with the LeptonExecutor will queue a Batch Job on the DGX Cloud Lepton cluster in the specified node group.