
Nemo2.0 with Recipe for Complex CLI Features (Plan B)#466

Merged
srivatsankrishnan merged 19 commits intoNVIDIA:mainfrom
srivatsankrishnan:main
Apr 14, 2025
Conversation

@srivatsankrishnan (Contributor) commented Apr 8, 2025

Summary

There is a bug in the NeMo-Run CLI that prevents passing complex forward references, including NullTokenizer and HFTokenizer. Future features may similarly depend on advanced NeMo-Run CLI capabilities that may or may not exist. Since the NeMo-Run CLI lags in parity with what is expressible in Python, we construct a custom recipe (cloudai_<nemo_recipe_name>) for each model we support; this is feasible because in NeMo every model has a fixed hierarchy of data, log, trainer, plugins, callbacks, etc. We still pass the parallelization knobs through the NeMo-Run CLI (the good news is that those are supported).

Trade-offs with this approach (No Free Lunch)

Pros

  • No reliance on the NeMo-Run CLI; all features are expressed natively in Python and passed in as recipe objects.
  • Closest match to the performance scripts in NeMo 2.0.

Cons

  • We are essentially defining our own recipes that mirror the existing NeMo 2.0 recipes for performance. E.g., the cloudai_llama3_8b method is identical to llama3_8b; additional tuning may be needed to converge performance.
  • Verbose. In my opinion this is not a real problem and can be fixed.

With this approach we do not depend on NeMo-Run CLI support for passing complex methods/objects as configuration. We construct the Python methods and objects ourselves (we would have to do this anyway, even to pass these factory functions as strings on the command line) and pass them as arguments in the recipe object. This closely follows the performance scripts maintained in the NeMo 2.0 repo; we simply rely on our own executor and command-generation strategy to support plugins.
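The recipe-object pattern can be illustrated with plain dataclasses. This is a minimal sketch, not the actual implementation: the class names below (`TokenizerCfg`, `DataCfg`, `Recipe`) are stand-ins, and the real `cloudai_*` factories build NeMo-Run config objects over the full data/log/trainer/plugins hierarchy.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the NeMo 2.0 recipe hierarchy; the real
# factories return nemo_run config objects, not these dataclasses.
@dataclass
class TokenizerCfg:
    name: str = "NullTokenizer"
    vocab_size: int = 256000  # illustrative value

@dataclass
class DataCfg:
    micro_batch_size: int = 1
    global_batch_size: int = 128
    tokenizer: TokenizerCfg = field(default_factory=TokenizerCfg)

@dataclass
class Recipe:
    name: str
    data: DataCfg = field(default_factory=DataCfg)

def cloudai_llama3_8b_recipe() -> Recipe:
    """Mirror of the upstream llama3_8b recipe, built directly in Python
    so complex objects (e.g. the tokenizer) never go through the CLI."""
    recipe = Recipe(name="llama3_8b")
    # The whole point: assign the tokenizer object in Python instead of
    # passing a forwardRef string through the NeMo-Run CLI.
    recipe.data.tokenizer = TokenizerCfg(name="NullTokenizer")
    return recipe
```

The key property is that anything the CLI cannot express (nested objects, forward references) is just an ordinary attribute assignment here.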

Supported Recipes (Prioritized to support DSE)

| Model Name | Model Size | Comments | CloudAI Recipe Name (for advanced NeMo-CLI features) |
| --- | --- | --- | --- |
| Llama3-8B | 8B | BF16/FP8 | cloudai_llama3_8b_recipe |
| Llama3-70B | 70B | BF16/FP8 | cloudai_llama3_70b_recipe |
| Llama3-405B | 405B | BF16/FP8 | cloudai_llama3_405b_recipe |
| Nemotron3-8B | 8B | BF16/FP8 | cloudai_nemotron3_8b_recipe |
| Nemotron4-15B | 15B | BF16/FP8 | cloudai_nemotron4_15b_recipe |
| Nemotron4-340B | 340B | BF16/FP8 | cloudai_nemotron4_340b_recipe |

The base recipe is BF16. FP8 is enabled by passing additional command-line arguments; please take a look at the test files, which contain the commands for passing the FP8 arguments.

Container Version Dependent Features

Currently the recipe supports H100. While much of the recipe carries over, some specialized configuration is enabled for B200/GB200. For H100, we define these in the factory functions llama3_70b_bf16_tp_overlap_config and llama3_70b_fp8_tp_overlap_config. For GB200/B200, some tp_overlap_config user buffers need to be set correctly. Depending on the container version, some flags are supported while others are not.

  • [ ❌ ] GB200/B200-specific userbuffers. We should pivot to the latest NeMo containers (25.04).
  • [ ❌ ] FLOPsMeasurementCallback depends on the latest container; the older 24.12 version does not have this feature.

Environment Variables

To run a recipe, you need to set the following environment variables:

  • CLOUDAI_NEMO_TASK: the task mode ("pretrain" or "finetune"). This is set automatically by the CloudAI command generator from the corresponding field in the test TOML; the script running inside the container reads it for configuration.
  • CLOUDAI_NEMO_RECIPE: the recipe name (one of the supported recipes listed above). This is likewise derived from the recipe field in the test TOML and read inside the container.
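The script inside the container can validate these variables before building anything. This is a hedged sketch; the function name `resolve_run_settings` is ours, not part of cloudai_nemorun.py:

```python
import os

# Recipe names from the supported-recipes table above.
SUPPORTED_RECIPES = {
    "cloudai_llama3_8b_recipe",
    "cloudai_llama3_70b_recipe",
    "cloudai_llama3_405b_recipe",
    "cloudai_nemotron3_8b_recipe",
    "cloudai_nemotron4_15b_recipe",
    "cloudai_nemotron4_340b_recipe",
}

def resolve_run_settings(environ=os.environ):
    """Read the task mode and recipe name that the CloudAI command
    generator exported, validating them before recipe construction."""
    task = environ.get("CLOUDAI_NEMO_TASK", "pretrain")
    if task not in ("pretrain", "finetune"):
        raise ValueError(f"unknown task mode: {task}")
    recipe = environ["CLOUDAI_NEMO_RECIPE"]  # required, no default
    if recipe not in SUPPORTED_RECIPES:
        raise ValueError(f"unsupported recipe: {recipe}")
    return task, recipe
```

Failing fast here surfaces TOML mistakes before a Slurm allocation is spent.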

Note on PerfEnvPlugin

PerfEnvPlugin is a plugin in NeMo 2.0. It provides a single swiss-army-knife interface for these things:

  • Enabling vboost (needs sudo permission to run the nvidia-smi command).
  • nccl_pp_comm_chunksize sets the NCCL_P2P_NET_CHUNKSIZE environment variable; it must be set conditionally when pipeline_parallel > 1.
  • layernorm_sm_margin sets the NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN environment variables.
  • gpu_sm100_or_newer controls the CUDA_DEVICE_MAX_CONNECTIONS environment variable.

In CloudAI, we will generate these our own way. Enabling vboost translates to an independent srun step. We will handle this via the ENABLE_VBOOST environment variable, which triggers generation of the srun command.

In Test Toml, we will define it this way

...
...
[extra_env_vars]
ENABLE_VBOOST = "1" # can be "0" or not defined at all.

Generated output

ENABLE_VBOOST = "1" 
...
...
srun  --output=vboost.out --error=vboost.err bash -c 'sudo nvidia-smi boost-slider --vboost 1'
...
...

For the others, we have pipeline_parallel in cmd_args and will use it to conditionally generate these environment variables.
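The conditional generation described above can be sketched as follows. This is an illustration, not CloudAI's actual generator: the function name `perf_env_lines` and the 2 MiB chunk-size value are our assumptions.

```python
def perf_env_lines(extra_env_vars: dict, pipeline_parallel: int) -> list:
    """Sketch of emitting the PerfEnvPlugin-equivalent settings in the
    generated sbatch script (helper name and defaults are illustrative)."""
    lines = []
    # vboost becomes its own srun step instead of an in-process call,
    # gated on the ENABLE_VBOOST extra_env_var from the test TOML.
    if extra_env_vars.get("ENABLE_VBOOST") == "1":
        lines.append(
            "srun --output=vboost.out --error=vboost.err "
            "bash -c 'sudo nvidia-smi boost-slider --vboost 1'"
        )
    # NCCL_P2P_NET_CHUNKSIZE only matters with pipeline parallelism;
    # the 2 MiB value here is illustrative, not a verified default.
    if pipeline_parallel > 1:
        lines.append("export NCCL_P2P_NET_CHUNKSIZE=2097152")
    return lines
```

Keeping the condition on `pipeline_parallel` in the generator means a DSE sweep over parallelism settings gets the right environment without touching the TOML.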

Test Plan

CI/CD

Dry Run (For CloudAI integration related)

cloudai dry-run --system-config ../cloudaix/conf/common/system/xxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: dse_nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b
[INFO] Submitted slurm job: 0
[INFO] Job completed: nemo_run_nemotron_15b (iteration 1 of 1)
[INFO] All test scenario results stored at: results/nemo_run_llama3_8b_2025-04-09_23-30-05
[INFO] All jobs are complete.

Generated Output

export SLURM_JOB_MASTER_NODE=$(scontrol show hostname $SLURM_JOB_NODELIST | head -n 1)
export CLOUDAI_NEMO_RECIPE=llama3_8b
export CLOUDAI_NEMO_TASK=pretrain
...
...
srun --export=ALL --mpi=pmix --container-image=nvcr.io_nvidia__nemo__24.12.rc3.sqsh --container-mounts=xxxx python /cloudai_install/cloudai_nemorun.py --factory cloudai_llama3_8b_recipe -y trainer.max_steps=xxx trainer.val_check_interval=xxx trainer.num_nodes=xxx trainer.strategy.tensor_model_parallel_size=xxx trainer.strategy.pipeline_model_parallel_size=xx trainer.strategy.context_parallel_size=xx log.ckpt.save_on_train_epoch_end=xx log.ckpt.save_last=False data.micro_batch_size=xx data.global_batch_size=xxx
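The `-y key=value` overrides in the srun command above map onto the recipe object by attribute traversal. A minimal sketch of that mapping (simplified: the real parser also handles type coercion and list indices, and `apply_override` is our name for it):

```python
from types import SimpleNamespace

def apply_override(obj, dotted_key: str, value):
    """Apply one '-y key=value' style override, e.g.
    'trainer.strategy.tensor_model_parallel_size=4', by walking the
    attribute path and setting the leaf attribute."""
    *path, leaf = dotted_key.split(".")
    for name in path:
        obj = getattr(obj, name)
    setattr(obj, leaf, value)

# Demo on a namespace standing in for a recipe object.
cfg = SimpleNamespace(
    trainer=SimpleNamespace(strategy=SimpleNamespace(tensor_model_parallel_size=1))
)
apply_override(cfg, "trainer.strategy.tensor_model_parallel_size", 4)
```

This is why the parallelization knobs can still go through the CLI: they are plain scalars at known paths, unlike the tokenizer objects that motivated the custom recipes.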

Tested on Llama3_8b recipe on real cluster.

Training epoch 1, iteration 19/99 | lr: 1.469e-05 | global_batch_size: 128 | global_step: 97 | reduced_train_loss: 11.8 | train_step_timing in s: xxx | consumed_samples: 12544 | val_loss: 11.85
Training epoch 1, iteration 20/99 | lr: 1.484e-05 | global_batch_size: 128 | global_step: 98 | reduced_train_loss: 11.79 | train_step_timing in s: xxx | consumed_samples: 12672 | val_loss: 11.85
Training epoch 1, iteration 21/99 | lr: 1.499e-05 | global_batch_size: 128 | global_step: 99 | reduced_train_loss: 11.79 | train_step_timing in s: xxx | consumed_samples: 12800 | val_loss: 11.85

Tested on Nemotron15b recipe on real cluster

Training epoch 0, iteration 5/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 5 | reduced_train_loss: 12.52 | consumed_samples: 6144
Training epoch 0, iteration 6/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 6 | reduced_train_loss: 12.53 | consumed_samples: 7168
Training epoch 0, iteration 7/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 7 | reduced_train_loss: 12.54 | consumed_samples: 8192
Training epoch 0, iteration 8/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 8 | reduced_train_loss: 12.55 | consumed_samples: 9216

ToDo: Performance is off by 4%. Investigating why.

[Edit]: This resulted in higher GPU operating frequency within the same power budget and improved end-to-end performance by 4%. The boost slider can be set through the command [nvidia-smi boost-slider --vboost <value>](https://man.archlinux.org/man/nvidia-smi.1.en#nvidia~87). For more information about this command, including how to get all possible values, run [nvidia-smi boost-slider --help](https://man.archlinux.org/man/nvidia-smi.1.en#nvidia~87).

source: vboost

CloudAI Integration Testing (Benchmarking Case Testing)

$ cloudai run --system-config ../cloudaix/conf/common/system/xxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b
[INFO] Submitted slurm job: 2316948
[INFO] All test scenario results stored at: results/nemo_run_llama3_8b_2025-04-10_01-47-08
[INFO] Generated scenario report at results/nemo_run_llama3_8b_2025-04-10_01-47-08/nemo_run_llama3_8b.html
[INFO] All jobs are complete.

2-Node Reproducer for CloudAI Recipe Validation (Real system with Grid Search DSE)

Some of these models are large and would require many GPUs. The models above were scaled down to fit on two nodes. The NeMo-Run dry run is not completely foolproof for configuration checks (even when the dry run passes, the job can still fail at runtime due to misconfigured recipe parameters), and debugging these models at large scale would be infeasible. So, to validate these cloudai_recipes_xxx configurations, we use the scaled-down models.

Additional Notes

For the DSE to optimize correctly, the reward needs to be passed to the agent. There is a bug in the CloudAI metric reporter that prevents this from happening, so the PR with the fix needs to be merged for DSE to work: #473

More context on this reporter bug:

@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review April 10, 2025 09:25
@TaekyungHeo TaekyungHeo added bug Something isn't working enhancement New feature or request and removed feature labels Apr 10, 2025
@srivatsankrishnan srivatsankrishnan removed the bug Something isn't working label Apr 10, 2025
TaekyungHeo
TaekyungHeo previously approved these changes Apr 10, 2025
@srivatsankrishnan
Copy link
Contributor Author

[clarification]: Removed the incorrect label assignment. This is not a CloudAI bug; there is a bug in NeMo-Run. This PR is a feature that allows us to support the use case without encountering that NeMo-Run bug.

@srivatsankrishnan srivatsankrishnan merged commit 70e1ef3 into NVIDIA:main Apr 14, 2025
2 checks passed

Labels: enhancement, feature
