
Nemo2.0 with Recipe for Complex CLI Features (Plan B)#466

Merged
srivatsankrishnan merged 19 commits intoNVIDIA:mainfrom
srivatsankrishnan:main
Apr 14, 2025
Conversation

@srivatsankrishnan (Contributor) commented Apr 8, 2025

Summary

There is a bug in the NeMo-Run CLI that prevents passing complex forward references, including NullTokenizer and HFTokenizer. Future features may similarly depend on advanced NeMo-Run CLI capabilities that may or may not exist. Since the NeMo-Run CLI lags in parity with what is expressible in Python, we construct a custom recipe (cloudai_<nemo_recipe_name>) for each model we support; this is feasible because in NeMo every model has a fixed hierarchy of data, log, trainer, plugins, callbacks, etc. We still pass the parallelization knobs through the NeMo-Run CLI (the good news is that those are supported).

Trade-offs with this approach (No Free Lunch)

Pros

  • No reliance on the NeMo-Run CLI; all features are expressed natively in Python and passed in as recipe objects.
  • Closest match to the performance scripts in NeMo 2.0.

Cons

  • We are essentially defining our own recipes that mirror the existing NeMo 2.0 recipes for performance. E.g., the cloudai_llama3_8b method is identical to llama3_8b; additional tuning may be needed to converge performance.
  • Verbose. In my opinion this is not a real problem and can be fixed.

With this approach we do not depend on NeMo-Run CLI support for passing complex methods/objects as configuration. We construct the Python methods and objects ourselves (we would have to do this anyway, even to pass these factory functions as strings on the command line) and pass them as arguments in the recipe object. This closely follows the performance scripts maintained in the NeMo 2.0 repo; we simply rely on our own executor and command-generation strategy to support plugins.
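The recipe-object pattern can be illustrated with plain dataclasses. This is a minimal sketch, not the actual implementation: the class names below (`TokenizerCfg`, `DataCfg`, `Recipe`) are stand-ins, and the real `cloudai_*` factories build NeMo-Run config objects over the full data/log/trainer/plugins hierarchy.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the NeMo 2.0 recipe hierarchy; the real
# factories return nemo_run config objects, not these dataclasses.
@dataclass
class TokenizerCfg:
    name: str = "NullTokenizer"
    vocab_size: int = 256000  # illustrative value

@dataclass
class DataCfg:
    micro_batch_size: int = 1
    global_batch_size: int = 128
    tokenizer: TokenizerCfg = field(default_factory=TokenizerCfg)

@dataclass
class Recipe:
    name: str
    data: DataCfg = field(default_factory=DataCfg)

def cloudai_llama3_8b_recipe() -> Recipe:
    """Mirror of the upstream llama3_8b recipe, built directly in Python
    so complex objects (e.g. the tokenizer) never go through the CLI."""
    recipe = Recipe(name="llama3_8b")
    # The whole point: assign the tokenizer object in Python instead of
    # passing a forwardRef string through the NeMo-Run CLI.
    recipe.data.tokenizer = TokenizerCfg(name="NullTokenizer")
    return recipe
```

The key property is that anything the CLI cannot express (nested objects, forward references) is just an ordinary attribute assignment here.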

Supported Recipes (Prioritized to support DSE)

| Model Name | Model Size | Comments | CloudAI Recipe Name (for advanced NeMo-CLI features) |
| --- | --- | --- | --- |
| Llama3-8B | 8B | BF16/FP8 | cloudai_llama3_8b_recipe |
| Llama3-70B | 70B | BF16/FP8 | cloudai_llama3_70b_recipe |
| Llama3-405B | 405B | BF16/FP8 | cloudai_llama3_405b_recipe |
| Nemotron3-8B | 8B | BF16/FP8 | cloudai_nemotron3_8b_recipe |
| Nemotron4-15B | 15B | BF16/FP8 | cloudai_nemotron4_15b_recipe |
| Nemotron4-340B | 340B | BF16/FP8 | cloudai_nemotron4_340b_recipe |

The base recipe is BF16. FP8 is enabled by passing additional command-line arguments; please take a look at the test files, which contain the commands for passing the FP8 arguments.

Container Version Dependent Features

Currently the recipe supports H100. While much of the recipe carries over, some specialized configuration is enabled for B200/GB200. For H100, we define these in the factory functions llama3_70b_bf16_tp_overlap_config and llama3_70b_fp8_tp_overlap_config. For GB200/B200, some tp_overlap_config user buffers need to be set correctly. Depending on the container version, some flags are supported while others are not.

  • [ ❌ ] GB200/B200-specific userbuffers. We should pivot to the latest NeMo containers (25.04).
  • [ ❌ ] FLOPsMeasurementCallback depends on the latest container; the older 24.12 version does not have this feature.

Environment Variables

To run a recipe, you need to set the following environment variables:

  • CLOUDAI_NEMO_TASK: the task mode ("pretrain" or "finetune"). This is set automatically by the CloudAI command generator from the corresponding field in the test TOML; the script running inside the container reads it for configuration.
  • CLOUDAI_NEMO_RECIPE: the recipe name (one of the supported recipes listed above). This is likewise derived from the recipe field in the test TOML and read inside the container.
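The script inside the container can validate these variables before building anything. This is a hedged sketch; the function name `resolve_run_settings` is ours, not part of cloudai_nemorun.py:

```python
import os

# Recipe names from the supported-recipes table above.
SUPPORTED_RECIPES = {
    "cloudai_llama3_8b_recipe",
    "cloudai_llama3_70b_recipe",
    "cloudai_llama3_405b_recipe",
    "cloudai_nemotron3_8b_recipe",
    "cloudai_nemotron4_15b_recipe",
    "cloudai_nemotron4_340b_recipe",
}

def resolve_run_settings(environ=os.environ):
    """Read the task mode and recipe name that the CloudAI command
    generator exported, validating them before recipe construction."""
    task = environ.get("CLOUDAI_NEMO_TASK", "pretrain")
    if task not in ("pretrain", "finetune"):
        raise ValueError(f"unknown task mode: {task}")
    recipe = environ["CLOUDAI_NEMO_RECIPE"]  # required, no default
    if recipe not in SUPPORTED_RECIPES:
        raise ValueError(f"unsupported recipe: {recipe}")
    return task, recipe
```

Failing fast here surfaces TOML mistakes before a Slurm allocation is spent.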

Note on PerfEnvPlugin

PerfEnvPlugin is a plugin in NeMo 2.0. It provides a single swiss-army-knife interface for these things:

  • Enabling vboost (needs sudo permission to run the nvidia-smi command).
  • nccl_pp_comm_chunksize sets the NCCL_P2P_NET_CHUNKSIZE environment variable; it must be set conditionally when pipeline_parallel > 1.
  • layernorm_sm_margin sets the NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN environment variables.
  • gpu_sm100_or_newer controls the CUDA_DEVICE_MAX_CONNECTIONS environment variable.

In CloudAI, we will generate these our own way. Enabling vboost translates to an independent srun step. We will handle this via the ENABLE_VBOOST environment variable, which triggers generation of the srun command.

In Test Toml, we will define it this way

...
...
[extra_env_vars]
ENABLE_VBOOST = "1" # can be "0" or not defined at all.

Generated output

ENABLE_VBOOST = "1" 
...
...
srun  --output=vboost.out --error=vboost.err bash -c 'sudo nvidia-smi boost-slider --vboost 1'
...
...

For the others, we have pipeline_parallel in cmd_args and will use it to conditionally generate these environment variables.
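The conditional generation described above can be sketched as follows. This is an illustration, not CloudAI's actual generator: the function name `perf_env_lines` and the 2 MiB chunk-size value are our assumptions.

```python
def perf_env_lines(extra_env_vars: dict, pipeline_parallel: int) -> list:
    """Sketch of emitting the PerfEnvPlugin-equivalent settings in the
    generated sbatch script (helper name and defaults are illustrative)."""
    lines = []
    # vboost becomes its own srun step instead of an in-process call,
    # gated on the ENABLE_VBOOST extra_env_var from the test TOML.
    if extra_env_vars.get("ENABLE_VBOOST") == "1":
        lines.append(
            "srun --output=vboost.out --error=vboost.err "
            "bash -c 'sudo nvidia-smi boost-slider --vboost 1'"
        )
    # NCCL_P2P_NET_CHUNKSIZE only matters with pipeline parallelism;
    # the 2 MiB value here is illustrative, not a verified default.
    if pipeline_parallel > 1:
        lines.append("export NCCL_P2P_NET_CHUNKSIZE=2097152")
    return lines
```

Keeping the condition on `pipeline_parallel` in the generator means a DSE sweep over parallelism settings gets the right environment without touching the TOML.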

Test Plan

CI/CD

Dry Run (For CloudAI integration related)

cloudai dry-run --system-config ../cloudaix/conf/common/system/xxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: dse_nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b
[INFO] Submitted slurm job: 0
[INFO] Job completed: nemo_run_nemotron_15b (iteration 1 of 1)
[INFO] All test scenario results stored at: results/nemo_run_llama3_8b_2025-04-09_23-30-05
[INFO] All jobs are complete.

Generated Output

export SLURM_JOB_MASTER_NODE=$(scontrol show hostname $SLURM_JOB_NODELIST | head -n 1)
export CLOUDAI_NEMO_RECIPE=llama3_8b
export CLOUDAI_NEMO_TASK=pretrain
...
...
srun --export=ALL --mpi=pmix --container-image=nvcr.io_nvidia__nemo__24.12.rc3.sqsh --container-mounts=xxxx python /cloudai_install/cloudai_nemorun.py --factory cloudai_llama3_8b_recipe -y trainer.max_steps=xxx trainer.val_check_interval=xxx trainer.num_nodes=xxx trainer.strategy.tensor_model_parallel_size=xxx trainer.strategy.pipeline_model_parallel_size=xx trainer.strategy.context_parallel_size=xx log.ckpt.save_on_train_epoch_end=xx log.ckpt.save_last=False data.micro_batch_size=xx data.global_batch_size=xxx
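The `-y key=value` overrides in the srun command above map onto the recipe object by attribute traversal. A minimal sketch of that mapping (simplified: the real parser also handles type coercion and list indices, and `apply_override` is our name for it):

```python
from types import SimpleNamespace

def apply_override(obj, dotted_key: str, value):
    """Apply one '-y key=value' style override, e.g.
    'trainer.strategy.tensor_model_parallel_size=4', by walking the
    attribute path and setting the leaf attribute."""
    *path, leaf = dotted_key.split(".")
    for name in path:
        obj = getattr(obj, name)
    setattr(obj, leaf, value)

# Demo on a namespace standing in for a recipe object.
cfg = SimpleNamespace(
    trainer=SimpleNamespace(strategy=SimpleNamespace(tensor_model_parallel_size=1))
)
apply_override(cfg, "trainer.strategy.tensor_model_parallel_size", 4)
```

This is why the parallelization knobs can still go through the CLI: they are plain scalars at known paths, unlike the tokenizer objects that motivated the custom recipes.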

Tested on Llama3_8b recipe on real cluster.

Training epoch 1, iteration 19/99 | lr: 1.469e-05 | global_batch_size: 128 | global_step: 97 | reduced_train_loss: 11.8 | train_step_timing in s: xxx | consumed_samples: 12544 | val_loss: 11.85
Training epoch 1, iteration 20/99 | lr: 1.484e-05 | global_batch_size: 128 | global_step: 98 | reduced_train_loss: 11.79 | train_step_timing in s: xxx | consumed_samples: 12672 | val_loss: 11.85
Training epoch 1, iteration 21/99 | lr: 1.499e-05 | global_batch_size: 128 | global_step: 99 | reduced_train_loss: 11.79 | train_step_timing in s: xxx | consumed_samples: 12800 | val_loss: 11.85

Tested on Nemotron15b recipe on real cluster

Training epoch 0, iteration 5/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 5 | reduced_train_loss: 12.52 | consumed_samples: 6144
Training epoch 0, iteration 6/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 6 | reduced_train_loss: 12.53 | consumed_samples: 7168
Training epoch 0, iteration 7/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 7 | reduced_train_loss: 12.54 | consumed_samples: 8192
Training epoch 0, iteration 8/99 | lr: 0.0001 | global_batch_size: 1024 | global_step: 8 | reduced_train_loss: 12.55 | consumed_samples: 9216

ToDo: Performance is off by 4%. Investigating why.

[Edit]: This resulted in higher GPU operating frequency within the same power budget and improved end-to-end performance by 4%. The boost slider can be set through the command [nvidia-smi boost-slider --vboost <value>](https://man.archlinux.org/man/nvidia-smi.1.en#nvidia~87). For more information about this command, including how to get all possible values, run [nvidia-smi boost-slider --help](https://man.archlinux.org/man/nvidia-smi.1.en#nvidia~87).

source: vboost

CloudAI Integration Testing (Benchmarking Case Testing)

$ cloudai run --system-config ../cloudaix/conf/common/system/xxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b
[INFO] Submitted slurm job: 2316948
[INFO] All test scenario results stored at: results/nemo_run_llama3_8b_2025-04-10_01-47-08
[INFO] Generated scenario report at results/nemo_run_llama3_8b_2025-04-10_01-47-08/nemo_run_llama3_8b.html
[INFO] All jobs are complete.

2-Node Reproducer for CloudAI Recipe Validation (Real system with Grid Search DSE)

Some of these models are large and would require many GPUs. The models above were scaled down to fit on two nodes. The NeMo-Run dry run is not completely foolproof for configuration checks (even when the dry run passes, the job can still fail at runtime due to misconfigured recipe parameters), and debugging these models at large scale would be infeasible. So, to validate these cloudai_recipes_xxx configurations, we use the scaled-down models.

Additional Notes

For the DSE to optimize correctly, the reward needs to be passed to the agent. There is a bug in the CloudAI metric reporter that prevents this from happening, so the PR with the fix needs to be merged for DSE to work: #473

More context on this reporter bug:

@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review April 10, 2025 09:25
@TaekyungHeo TaekyungHeo added bug Something isn't working enhancement New feature or request and removed feature labels Apr 10, 2025
@srivatsankrishnan srivatsankrishnan removed the bug Something isn't working label Apr 10, 2025
TaekyungHeo
TaekyungHeo previously approved these changes Apr 10, 2025
@srivatsankrishnan
Copy link
Contributor Author

[clarification]: Removed the incorrect label assignment. This is not a CloudAI bug; there is a bug in NeMo-Run. This PR is a feature that allows us to support the use case without encountering that NeMo-Run bug.

@srivatsankrishnan srivatsankrishnan merged commit 70e1ef3 into NVIDIA:main Apr 14, 2025
2 checks passed

Labels: enhancement, feature
