Adding Secret support to Lepton Executor #382
Closed
zoeyz101 wants to merge 64 commits into NVIDIA-NeMo:main from
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Add logs dir to container mount for ray slurm
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix tests
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
---------
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…DIA-NeMo#286)
* finetune on dgxcloud with nemo-run and deploy on bedrock example
  Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* removing trailing slash
  Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* reformatting notebook
  Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* adding EOF
  Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
---------
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Roee Landesman <roeeland@cisco.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…#285)
* fix docs tutorial links and add intro to guides/index.md
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* Adding project.json/versions1.json, and update conf.py
  Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
* fixes
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
---------
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
Co-authored-by: Andrew Schilling <aschilling@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Andrew Schilling <aschilling@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…es (NVIDIA-NeMo#295)
* Add thread pool to get status of jobs inside experiment
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* Add thread pools to experiment run
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fixes
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
---------
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…NeMo#251)
- do prepare stage only from single process or rank
- for --node-rank, also look for SLURM_NODEID
Signed-off-by: Pramod Kumbhar <prkumbhar@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Upgrade skypilot to v0.10.0, introduce network_tier
  Signed-off-by: Roee Landesman <roeeland@cisco.com>
* add unit tests
  Signed-off-by: Roee Landesman <roeeland@cisco.com>
---------
Signed-off-by: Roee Landesman <roeeland@cisco.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* ci(fix): Use GITHUB_TOKEN for community bot
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* f
  Signed-off-by: oliver könig <okoenig@nvidia.com>
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* remove breaking torchrun config for single-node runs
  Signed-off-by: Roee Landesman <roeeland@cisco.com>
* fix lint
  Signed-off-by: Roee Landesman <roeeland@cisco.com>
---------
Signed-off-by: Roee Landesman <roeeland@cisco.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* update lepton executor to include custom prelaunch commands section
  Signed-off-by: ansjindal <ansjindal@nvidia.com>
* add test for prelaunch section
  Signed-off-by: ansjindal <ansjindal@nvidia.com>
* add more tests for checking the pre-launch-commands section
  Signed-off-by: ansjindal <ansjindal@nvidia.com>
* update lepton executor tests
  Signed-off-by: ansjindal <ansjindal@nvidia.com>
---------
Signed-off-by: ansjindal <ansjindal@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Create CHANGELOG.md
  Signed-off-by: Pablo Garay <palenq@gmail.com>
* Add entries to CHANGELOG.md
  Signed-off-by: Pablo Garay <palenq@gmail.com>
* Update CHANGELOG.md
  Signed-off-by: Pablo Garay <palenq@gmail.com>
* Update CHANGELOG.md
  Co-authored-by: oliver könig <okoenig@nvidia.com>
  Signed-off-by: Pablo Garay <palenq@gmail.com>
* add links
---------
Signed-off-by: Pablo Garay <palenq@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Correctly append tar files for packaging
  Signed-off-by: Sahil Modi <samodi@nvidia.com>
* tests
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
---------
Signed-off-by: Sahil Modi <samodi@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…IA-NeMo#319) Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…VIDIA-NeMo#320)
* Specify nodes for gpu metrics collection and split data to each rank
  Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>
* Fix unit test
  Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>
---------
Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Fixing documentation layout
  Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
* documentation.md
  Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
* Removing live-server
  Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
* Correctin .vscode
  Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
---------
Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Andrew Schilling <aschilling@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* fix: Emit exit-code of docker runs
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix test
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* fixes
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* refactor
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* add scheduler test
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* more scheduler tests
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* test executor
  Signed-off-by: oliver könig <okoenig@nvidia.com>
* formatting
  Signed-off-by: oliver könig <okoenig@nvidia.com>
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* [🤖]: Howdy folks, let's bump NeMo Run to `0.8.0rc0.dev0` !
  Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* lintfix double quotes
  Signed-off-by: Pablo Garay <pagaray@nvidia.com>
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* feat: add copyright check
* feat: add copyright check
* feat: add copyright check
* feat: add copyright check
* feat: add copyright check
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Add port parameter to SSHTunnel
  Signed-off-by: Igor Gitman <igitman@nvidia.com>
* Fix ci
  Signed-off-by: Igor Gitman <igitman@nvidia.com>
* Fix copyright
  Signed-off-by: Igor Gitman <igitman@nvidia.com>
* Fix copyright
  Signed-off-by: Igor Gitman <igitman@nvidia.com>
* Fix copyright
  Signed-off-by: Igor Gitman <igitman@nvidia.com>
---------
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com> Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Update ray template
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* add ray enroot exec template
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
* fix
  Signed-off-by: Hemil Desai <hemild@nvidia.com>
---------
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
…Mo#380) Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
90478e0 to 9cf1d77
No description provided.