Nemo2.0 with Recipe for Complex CLI Features (Plan B)#466
Merged
srivatsankrishnan merged 19 commits into NVIDIA:main (Apr 14, 2025)
Conversation
amaslenn reviewed (Apr 10, 2025)
`tests/slurm_command_gen_strategy/test_nemo_run_slurm_command_gen_strategy.py`
TaekyungHeo previously approved these changes (Apr 10, 2025)
Contributor (Author)
[clarification]: Removed incorrect label assignment. This is not a CloudAI bug. There is a bug in Nemo-Run. This is a feature that allows us to support it without having to encounter that Nemo-Run bug.
amaslenn reviewed (Apr 11, 2025)
amaslenn approved these changes (Apr 14, 2025)
TaekyungHeo approved these changes (Apr 14, 2025)
Summary
There is a bug in the Nemo-Run CLI that prevents passing complex forwardRefs. This includes passing `NullTokenizer` or `HFTokenizer`. There could be many features in the future where we would have to rely on advanced Nemo-Run CLI support that might or might not exist. Since the Nemo-Run CLI seems to lag in parity between what is possible and what is not, we have come up with a new way to construct a new recipe for each model we support. Since in Nemo each model has a fixed hierarchy of `data`, `log`, `trainer`, `plugins`, `callbacks`, etc., we can construct a new custom recipe (`cloudai_<nemo_recipe_name>`). We still need to pass the parallelization knobs using the Nemo-Run CLI (but the good news is that passing these is supported).

Trade-offs with this approach (No Free Lunch)
Pros
Cons
- The `cloudai_llama3_8b` method we see is identical to `llama3_8b`. To converge performance, it might need additional tuning.

This way, we don't have to rely on Nemo-Run CLI support for passing complex methods/objects configuration. We can construct the Python methods and objects ourselves (we have to do this anyway, even when passing these factory functions as strings on the command line) and pass the methods as arguments in the recipe object. This is very close to the performance scripts maintained in the Nemo2.0 repo, except that we rely on our own executor and command generation strategy to support Plugins.
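To make the pattern concrete, here is a minimal pure-Python sketch of the idea. It does not use the actual nemo_run APIs; `Recipe`, the field values, and the factory bodies are illustrative stand-ins that only mirror the description above (a fixed `data`/`trainer`/`plugins` hierarchy, with complex objects attached in Python rather than passed as CLI strings):

```python
# Illustrative sketch only -- not CloudAI's or nemo_run's actual code.
from dataclasses import dataclass, field

@dataclass
class NullTokenizer:
    """Stand-in for the tokenizer object the Nemo-Run CLI cannot pass."""
    vocab_size: int = 256  # illustrative value

@dataclass
class Recipe:
    """Mimics the fixed hierarchy (data, trainer, plugins, ...) of a Nemo recipe."""
    data: dict = field(default_factory=dict)
    trainer: dict = field(default_factory=dict)
    plugins: list = field(default_factory=list)

def llama3_8b() -> Recipe:
    """Stand-in for the upstream base recipe factory."""
    return Recipe(data={"seq_length": 8192}, trainer={"max_steps": 100})

def cloudai_llama3_8b() -> Recipe:
    """Custom CloudAI recipe: start from the base recipe and attach complex
    objects directly in Python, bypassing the Nemo-Run CLI forwardRef bug."""
    recipe = llama3_8b()
    recipe.data["tokenizer"] = NullTokenizer()  # object, not a CLI string
    return recipe
```

Parallelization knobs would still be overridden through the Nemo-Run CLI as noted above; only the object-valued fields are wired up in the custom factory.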
Supported Recipes (Prioritized to support DSE)
The base recipe is BF16. FP8 is enabled by passing additional command-line arguments. Please take a look at the test files, which have the commands for passing `fp8` arguments.

Container Version Dependent Features
Currently the recipe supports H100. While much of the recipe holds true, there are some specialized configurations enabled for B200/GB200. For H100, we have these defined in the factory functions `llama3_70b_bf16_tp_overlap_config` and `llama3_70b_fp8_tp_overlap_config`. For GB200/B200, there are some `tp_overlap_config` user buffers that need to be set correctly. Depending on the container version, some flags are supported while others are not (e.g., 25.04 supports this; 24.12 does not have this feature).

Environment Variables
To run a recipe, you need to set the following environment variables:
- `CLOUDAI_NEMO_TASK`: The task mode ("pretrain" or "finetune")
- `CLOUDAI_NEMO_RECIPE`: The recipe name (one of the supported recipes listed above)

The test toml today has the `recipe` field, which will be used to define the environment variable. This file will be run inside the container needed for configuration.

Note on PerfEnvPlugin
PerfEnvPlugin is a plugin feature in Nemo2.0. It provides a single swiss-army-knife interface to do these things:

- `vboost` (requires `sudo` permission to use the `nvidia-smi` command).
- `nccl_pp_comm_chunksize` maps to the same environment variable we set: `NCCL_P2P_NET_CHUNKSIZE`. It has to be set conditionally if `pipeline_parallel` > 1.
- `layernorm_sm_margin` maps to the same environment variables we set: `NVTE_FWD_LAYERNORM_SM_MARGIN` and `NVTE_BWD_LAYERNORM_SM_MARGIN`.
- `gpu_sm100_or_newer` maps to the same environment variable we set: `CUDA_DEVICE_MAX_CONNECTIONS`.

In cloudAI, we will generate these our way. The enable-vboost command translates to an independent `srun`. We will handle this via the environment variable `ENABLE_VBOOST`, which will result in generation of the `srun` command. In the test toml, we will define it this way:
Generated output
For the others, we have `pipeline_parallel` in `cmd_args`. We will use it to conditionally generate these environment variables.

Test Plan
CI/CD
Dry Run (For CloudAI integration related)
Generated Output
Tested on Llama3_8b recipe on real cluster.
Tested on Nemotron15b recipe on real cluster.
ToDo: Performance is off by 4%. Investigating why.
[Edit]:
"This resulted in higher GPU operating frequency within the same power budget and improved end-to-end performance by 4%. The boost slider can be set through the command [`nvidia-smi boost-slider --vboost <value>`](https://man.archlinux.org/man/nvidia-smi.1.en#nvidia~87). For more information about this command, including how to get all possible values, run [`nvidia-smi boost-slider --help`](https://man.archlinux.org/man/nvidia-smi.1.en#nvidia~87)." (source: vboost)
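Tying this back to the PerfEnvPlugin note above, a rough sketch of how CloudAI-side generation could look is below. Only the environment variable names come from this PR's description; the function names, `srun` flags, and the specific values are illustrative assumptions, not CloudAI's actual command generation code:

```python
# Illustrative sketch of the generation logic described in the PerfEnvPlugin
# note. Env var names are from the PR description; everything else (function
# names, srun flags, numeric values) is assumed for illustration.
import os

def perf_env_vars(pipeline_parallel: int) -> dict:
    """Mirror PerfEnvPlugin's knobs on the CloudAI side."""
    env = {
        "NVTE_FWD_LAYERNORM_SM_MARGIN": "16",   # illustrative value
        "NVTE_BWD_LAYERNORM_SM_MARGIN": "16",   # illustrative value
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",     # illustrative value
    }
    if pipeline_parallel > 1:  # set conditionally, per the description
        env["NCCL_P2P_NET_CHUNKSIZE"] = "2097152"  # illustrative value
    return env

def vboost_srun(enabled: bool) -> list:
    """ENABLE_VBOOST translates to an independent srun invocation."""
    if not enabled:
        return []
    return ["srun", "--ntasks-per-node=1",
            "bash", "-c", "sudo nvidia-smi boost-slider --vboost 1"]

env = perf_env_vars(pipeline_parallel=2)
cmd = vboost_srun(os.environ.get("ENABLE_VBOOST") == "1")
```

In the real flow, `pipeline_parallel` comes from `cmd_args` in the test toml, and the extra `srun` line would be emitted into the generated sbatch script.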
CloudAI Integration Testing (Benchmarking Case Testing)
2-Node Reproducer for CloudAI Recipe Validation (Real system with Grid Search DSE)
Some of these models are large and would require lots of GPUs. The above models were scaled down to fit in two nodes. The Nemo-Run dry run isn't completely foolproof for configuration checks (i.e., even though the dry run would pass, it could still fail at runtime due to misconfiguration of recipe parameters). Also, debugging these models at large scale would be infeasible. So for the purpose of validating these `cloudai_recipes_xxx` configurations, we use the scaled-down models.

Additional Notes
For the DSE to optimize correctly, the reward needs to be passed to the agent. There is a bug in the CloudAI metric reporter that prevents this from happening. So the PR with that fix needs to be merged to make DSE work: #473
More context on this reporter bug: