Commit 74f6303

Authored by nv-mollys, scsudhakaran, rhmukundan, malay-nagda, and mollys
Llmb nemo r2.4.0 (#14634)
This squashed commit combines the following changes:

* Set attention backend to "auto" for Nemotron-H (#14042) — Sanju C Sudhakaran
* Add TFLOPS-per-GPU support for finetuning (#14048) — Raghav Hrishikeshan Mukundan
* Enable optimizations for Nemotron-H (#13915) — Sanju C Sudhakaran
* Disable checkpointing for Nemotron-H (#14001) — Sanju C Sudhakaran
* Cherry-pick ea4b47f (#13896): perf script updates (#13456) — GB200 recommended-configs CSV fixes, 495B H100 fix, GB200 79B bf16 20-layer recompute, 70B/340B without FSDP, DeepSeek-V3 perf mode and callbacks, CUDA graphs; import missing callbacks in the DeepSeek recipe — Malay Nagda
* Onboard Llama 4 Maverick finetuning (SFT) with SQuAD dataset download fix (#13926): separate flags and exp_name formats for the three SLURM jobs (checkpoint download, dataset download, finetune), TokenDropCallback and tp_comm_overlap enabled, peft_scheme handling fixed, NullTokenizer removed for compatibility, file renamed from finetune_ to sft_ — Raghav Hrishikeshan Mukundan
* Add profiling changes (#13484) — Aishwarya Bhandare
* Port Nemotron 25.04 patch to r3.2.0-based llmb-nemo; update experiment-name template (#13533) — Barys Dubauski
* Port run-ai patch to the llmb-nemo branch (#13573) — Barys Dubauski
* Add Grok recipe (#13586) — mollys
* Set transformers_offline=0 and profiling changes for Llama 3.1 405B; add NCCL (#13655) — Sebastian Alberdi
* Add perf recipe script for Nemotron-H-56B (#13691) — Sanju C Sudhakaran
* DeepSeek pretraining changes for LLMB: overridable profiling steps, NCCL trace support (#13752) — Aishwarya Bhandare
* Add FP8 default configs for Llama 4 Maverick (#13698) — Raghav Hrishikeshan Mukundan
* Change the tokenizer from Scout to Maverick in the Llama 4 pretrain recipe (#13664) — Raghav Hrishikeshan Mukundan
* Adapt the Llama 4 Maverick pretrain file to the user-configs parameter format (#13690) — Raghav Hrishikeshan Mukundan
* Grok NVBug 5311566 (#13765): remove unnecessary NeMo root check and unused packages — mollys
* Grok NCCL trace fix (#13769): transformers online, fix env vars — mollys
* Fix config params in pretrain llama4_e128 (#13764) — Raghav Hrishikeshan Mukundan
* Nsys tweaks to Llama 4 pretrain (#13778): remove hardcoded nsys profiling ranges, add NCCL trace support for the pretrain recipe — Raghav Hrishikeshan Mukundan
* Disable checkpointing for Nemotron-H; add NCCL trace support (#13786) — Sanju C Sudhakaran
* Llmb nemo r2.3.0 (#13806, #13807): set NCCL_NET_GDR_LEVEL=PHB for deepseekv3, grok1_314b, llama31_405b, llama4_e128, nemotron4_15b/340b, and nemotronh_56b; standardize exp_name for relevant workloads — Sebastian Alberdi
* Add all environment variables to the container environment (#13808) — mollys
* Fix numactl (#13809) — mollys
* Fix QA checkpoint bug for nemotron4 (#13843) — Sharada Shiddibhavi
* Add GPU metrics option (#13882) — Aishwarya Bhandare
* Llama 4 Maverick SFT recipe + SQuAD dataset download error fix — Raghav Hrishikeshan Mukundan; this was reverted (commit 755fd36) and then reapplied manually across recipes/__init__.py, deepseek_v3.py, llama4_e128.py, finetune_llama4_e128.py, pretrain_grok1_314b.py, pretrain_nemotron4_340b.py, helpers.py, and executors.py
* Fix in Nemotron-H script and perf script (#14251) — Sanju C Sudhakaran
* Update with double_buffer changes from NeMo main (#14305) — Sebastian Alberdi
* Remove cuDNN lines because of a regression with the cuDNN normalization kernel (#14360)
* Add conditional cuDNN to align with NeMo main; fix num-optimizer-instances bug (#14324) — Sebastian Alberdi
* Add pyxis container writable and no-mount-home flags (#14386) — Alex Filby
* Update DeepSeek-V3 perf scripts (#14377), including callback fixes (#14350) — Guyue Huang
* Fix Grok error "TypeError: '>' not supported between instances of 'str' and 'int'"; make the values configurable via CLI instead of hard-coded defaults (#14326) — rsalagame-nvidia
* Make VBoost activation conditional (#14453): refactor performance scripts to use a build_perf_env_plugin helper in helpers.py and control VBoost enablement via CLI — Barys Dubauski
* Turn off TP comm overlap for >128 GPUs on GB200 so jobs are functional (#14460) — Sebastian Alberdi
* Remove NCCL tracing option and clean up imports in performance scripts (#14467) — Barys Dubauski
* Disable tp_comm_overlap for 512 GPUs on GB200 to fix a functionality issue (#14474) — Sanju C Sudhakaran
* Workaround for an MXFP8 functionality issue (#14426) — Sanju C Sudhakaran
* Fix the previous (buggy) commit (#14477) — Sebastian Alberdi
* Checkpoint save/load functionality with HF token (#14538) — rsalagame-nvidia
* Add HF import for 15B/340B pretrain (#14565)
* Llmb nemo r2.4.0 (#14607): update mixed_precision.py and fix reuse_grad_buf_for_mxfp8_param_ag for MXFP8 — Guyue Huang
1 parent 6489229 commit 74f6303

40 files changed

Lines changed: 1672 additions & 337 deletions

nemo/collections/llm/recipes/llama4_e128.py

Lines changed: 3 additions & 8 deletions
@@ -319,15 +319,9 @@ def finetune_recipe(
         packed_sequence,
     )
     if peft_scheme is None or peft_scheme.lower() == 'none':
-        recipe.trainer.strategy.tensor_model_parallel_size = 4
-        recipe.trainer.strategy.expert_tensor_model_parallel_size = 4
-        recipe.trainer.strategy.expert_model_parallel_size = 32
+        recipe.trainer.strategy.tensor_model_parallel_size = 2
         recipe.optim.config.lr = 5e-6
     elif peft_scheme.lower() in ['lora', 'dora']:
-        recipe.trainer.strategy.sequence_parallel = True
-        recipe.trainer.strategy.tensor_model_parallel_size = 8
-        recipe.trainer.strategy.expert_tensor_model_parallel_size = 8
-        recipe.trainer.strategy.pipeline_model_parallel_size = 4
         recipe.peft = run.Config(PEFT_STR2CLS[peft_scheme.lower()])
         recipe.peft.dim = 8
         recipe.peft.alpha = 16
@@ -397,9 +391,10 @@ def finetune_performance_optimizations(
     recipe.trainer.callbacks.append(
         run.Config(
             MegatronCommOverlapCallback,
-            tp_comm_overlap=False,
+            tp_comm_overlap=True,
         )
     )
+    recipe.trainer.callbacks.append(run.Config(MegatronTokenDropCallback))
     recipe.trainer.callbacks.append(run.Config(TimingCallback))
     recipe.trainer.callbacks.append(
         run.Config(

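The first hunk above branches on the PEFT scheme: full SFT now uses tensor parallelism of 2 (instead of 4 with expert parallelism), while LoRA/DoRA keeps the strategy defaults instead of forcing TP=8 with pipeline parallelism. A minimal runnable sketch of that selection logic, using a hypothetical `Strategy` stand-in rather than the real NeMo-Run `run.Config` objects:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in: the real recipe mutates fields on a
# run.Config-wrapped MegatronStrategy, not a plain dataclass.
@dataclass
class Strategy:
    tensor_model_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1


def apply_peft_scheme(strategy: Strategy, peft_scheme: Optional[str]) -> Strategy:
    """Mirror the diff's branching on peft_scheme."""
    if peft_scheme is None or peft_scheme.lower() == "none":
        # Full SFT path: TP reduced to 2 per the patch.
        strategy.tensor_model_parallel_size = 2
    elif peft_scheme.lower() in ("lora", "dora"):
        # PEFT path: parallelism defaults retained; the PEFT config
        # itself (dim=8, alpha=16) is set elsewhere in the recipe.
        pass
    else:
        raise ValueError(f"Unrecognized peft_scheme: {peft_scheme}")
    return strategy


sft = apply_peft_scheme(Strategy(), None)
lora = apply_peft_scheme(Strategy(), "lora")
```

`Strategy` and `apply_peft_scheme` are illustrative names only; the recipe inlines this logic inside `finetune_recipe`.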
nemo/collections/llm/recipes/nemotronh_56b.py

Lines changed: 2 additions & 12 deletions
@@ -30,7 +30,6 @@
 from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing
 from nemo.collections.llm.recipes.precision.mixed_precision import nemotron_h_bf16_with_fp8_current_scaling_mixed
 from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
-from nemo.lightning.pytorch.callbacks import ModelCheckpoint
 from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
 from nemo.utils.exp_manager import TimingCallback

@@ -143,22 +142,13 @@ def trainer(
             DistributedDataParallelConfig,
             check_for_nan_in_grad=True,
             overlap_grad_reduce=True,
-            overlap_param_gather=False,  # Verify that this works
+            overlap_param_gather=True,  # Verify that this works
             grad_reduce_in_fp32=True,
         ),
     )

     callbacks = [
         run.Config(TimingCallback),
-        run.Config(
-            ModelCheckpoint,
-            every_n_train_steps=val_check_interval,
-            dirpath=dir,
-            save_top_k=save_top_k,
-            always_save_context=True,
-            save_optim_on_train_end=True,
-            save_context_on_train_end=True,
-        ),
     ]
     trainer = run.Config(
         nl.Trainer,
@@ -175,7 +165,7 @@ def trainer(
         use_distributed_sampler=False,
         plugins=[nemotron_h_bf16_with_fp8_current_scaling_mixed()],
         val_check_interval=val_check_interval,
-        enable_checkpointing=True,
+        enable_checkpointing=False,  # Fix this: disable checkpointing for now
     )
     return trainer

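The Nemotron-H change drops the `ModelCheckpoint` callback and sets `enable_checkpointing=False`, so perf runs skip checkpoint I/O entirely. A small sketch of the resulting callback construction, with plain-Python stand-ins (the real code builds `run.Config` wrappers around NeMo callback classes and passes them to `nl.Trainer`):

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the trainer config assembled by trainer().
@dataclass
class TrainerCfg:
    enable_checkpointing: bool
    callbacks: list = field(default_factory=list)


def build_trainer(enable_checkpointing: bool = False) -> TrainerCfg:
    callbacks = ["TimingCallback"]
    if enable_checkpointing:
        # The patch removed this path for the perf recipe: no ModelCheckpoint
        # means benchmark runs measure training throughput without save overhead.
        callbacks.append("ModelCheckpoint")
    return TrainerCfg(enable_checkpointing=enable_checkpointing, callbacks=callbacks)


perf_cfg = build_trainer()
```

`build_trainer` is an illustrative name; the recipe hard-codes the disabled path rather than exposing a flag.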
nemo/collections/llm/recipes/precision/mixed_precision.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -86,6 +86,7 @@ def fp16_with_fp8_mixed() -> run.Config[MegatronMixedPrecision]:
     cfg.fp8_amax_history_len = 1024
     cfg.fp8_amax_compute_algo = "max"
     cfg.fp8_param_gather = True
+    cfg.reuse_grad_buf_for_mxfp8_param_ag = True
     return cfg


@@ -99,6 +100,7 @@ def bf16_with_mxfp8_mixed() -> run.Config[MegatronMixedPrecision]:
     cfg.fp8 = 'hybrid'
     cfg.fp8_recipe = "mxfp8"
     cfg.fp8_param_gather = True
+    cfg.reuse_grad_buf_for_mxfp8_param_ag = True
     return cfg


@@ -112,6 +114,7 @@ def fp16_with_mxfp8_mixed() -> run.Config[MegatronMixedPrecision]:
     cfg.fp8 = 'hybrid'
     cfg.fp8_recipe = "mxfp8"
     cfg.fp8_param_gather = True
+    cfg.reuse_grad_buf_for_mxfp8_param_ag = True
     return cfg
```
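Each recipe helper above follows the same pattern: take a base precision config, flip the FP8-related knobs, and return it. A self-contained sketch of that pattern, with `PrecisionCfg` as a hypothetical stand-in for the `MegatronMixedPrecision` config:

```python
# PrecisionCfg is a simplified, hypothetical stand-in for the real
# MegatronMixedPrecision config object; only the fields touched here exist.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrecisionCfg:
    fp8: Optional[str] = None
    fp8_recipe: Optional[str] = None
    fp8_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


def bf16_with_mxfp8_mixed(cfg: PrecisionCfg) -> PrecisionCfg:
    """Mutate a base config into the mxfp8 mixed-precision recipe shape."""
    cfg.fp8 = "hybrid"
    cfg.fp8_recipe = "mxfp8"
    cfg.fp8_param_gather = True
    cfg.reuse_grad_buf_for_mxfp8_param_ag = True  # flag added by this PR
    return cfg


cfg = bf16_with_mxfp8_mixed(PrecisionCfg())
print(cfg.reuse_grad_buf_for_mxfp8_param_ag)  # → True
```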

nemo/lightning/fabric/plugins.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -60,6 +60,7 @@ def __init__(
         first_last_layers_bf16: bool = False,
         num_layers_at_start_in_bf16: int = 0,
         num_layers_at_end_in_bf16: int = 0,
+        reuse_grad_buf_for_mxfp8_param_ag: bool = False,
         fp8_margin: int = 0,
         fp8_amax_history_len: int = 1,
         fp8_amax_compute_algo: str = "most_recent",
@@ -104,6 +105,7 @@ def __init__(
             first_last_layers_bf16=first_last_layers_bf16,
             num_layers_at_start_in_bf16=num_layers_at_start_in_bf16,
             num_layers_at_end_in_bf16=num_layers_at_end_in_bf16,
+            reuse_grad_buf_for_mxfp8_param_ag=reuse_grad_buf_for_mxfp8_param_ag,
             fp8_margin=fp8_margin,
             fp8_amax_history_len=fp8_amax_history_len,
             fp8_amax_compute_algo=fp8_amax_compute_algo,
```

nemo/lightning/pytorch/plugins/mixed_precision.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -86,6 +86,7 @@ class DtypeConfig:
     hysteresis: float = (None,)
     num_layers_at_start_in_bf16: int = 0
     num_layers_at_end_in_bf16: int = 0
+    reuse_grad_buf_for_mxfp8_param_ag: bool = False


 class MegatronMixedPrecision(Precision):
@@ -122,6 +123,7 @@ def __init__(
         fp16_hysteresis: int = 2,
         num_layers_at_start_in_bf16: int = 0,
         num_layers_at_end_in_bf16: int = 0,
+        reuse_grad_buf_for_mxfp8_param_ag: bool = False,
     ) -> None:
         if fp8_params is not None:
             logging.warning(
@@ -161,6 +163,7 @@ def __init__(
             fp8_param_gather=fp8_param_gather,
             num_layers_at_start_in_bf16=num_layers_at_start_in_bf16,
             num_layers_at_end_in_bf16=num_layers_at_end_in_bf16,
+            reuse_grad_buf_for_mxfp8_param_ag=reuse_grad_buf_for_mxfp8_param_ag,
             # fp16 loss scale
             loss_scale=fp16_loss_scale,
             initial_loss_scale=fp16_initial_loss_scale,
```
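Note the plumbing pattern across the last three files: the new flag must be accepted as a constructor kwarg and forwarded into the inner dtype config at every layer, or it is silently dropped with its default. A minimal sketch with simplified, hypothetical stand-ins for the real classes:

```python
# Simplified stand-ins, not the NeMo classes: this just shows the
# kwarg-forwarding pattern the diff applies at each layer.
from dataclasses import dataclass


@dataclass
class DtypeConfig:
    num_layers_at_end_in_bf16: int = 0
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


class MixedPrecisionSketch:
    def __init__(
        self,
        num_layers_at_end_in_bf16: int = 0,
        reuse_grad_buf_for_mxfp8_param_ag: bool = False,
    ):
        # Forward every kwarg explicitly; a kwarg accepted here but not
        # passed through would silently fall back to DtypeConfig's default.
        self.dtype_config = DtypeConfig(
            num_layers_at_end_in_bf16=num_layers_at_end_in_bf16,
            reuse_grad_buf_for_mxfp8_param_ag=reuse_grad_buf_for_mxfp8_param_ag,
        )


plugin = MixedPrecisionSketch(reuse_grad_buf_for_mxfp8_param_ag=True)
print(plugin.dtype_config.reuse_grad_buf_for_mxfp8_param_ag)  # → True
```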

nemo/lightning/run/plugins.py

Lines changed: 17 additions & 0 deletions

```diff
@@ -158,7 +158,9 @@ class NsysPlugin(run.Plugin):
     end_step: int
     ranks: Optional[list[int]] = None
     nsys_trace: Optional[list[str]] = None
+    nsys_extra_args: Optional[list[str]] = None
     gen_shape: bool = False
+    nsys_gpu_metrics: bool = False

     def setup(self, task: run.Partial | run.Script, executor: run.Executor):
         """Set up the nsys profiling plugin."""
@@ -179,6 +181,21 @@ def setup(self, task: run.Partial | run.Script, executor: run.Executor):
         if isinstance(executor, run.SlurmExecutor):
             # NOTE: DO NOT change to f-string, `%q{}` is Slurm placeholder
             launcher.nsys_filename = "profile_%p_%q{SLURM_JOB_ID}_node%q{SLURM_NODEID}_rank%q{SLURM_PROCID}"
+            launcher.nsys_extra_args = self.nsys_extra_args or [
+                "--force-overwrite=true",
+                "--capture-range=cudaProfilerApi",
+                "--capture-range-end=stop",
+                "--cuda-graph-trace=node",
+                "--cuda-event-trace=false",
+                "--nvtx-domain-include=NCCL",
+            ]
+            if self.nsys_gpu_metrics:
+                if hasattr(launcher, "nsys_gpu_metrics"):
+                    launcher.nsys_gpu_metrics = self.nsys_gpu_metrics
+                else:
+                    logging.warning(
+                        "Unable to enable nsys gpu metrics collection. Please upgrade Nemo-Run to include commit 70a0df4."
+                    )


 @dataclass(kw_only=True)
```
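Two details of the `NsysPlugin` hunk are worth noting: user-supplied `nsys_extra_args` take precedence over the defaults via `or`, and the gpu-metrics flag is applied only when the launcher actually exposes it (a `hasattr` probe guards against an older Nemo-Run). A self-contained sketch of both, with hypothetical launcher classes:

```python
# Hypothetical launcher classes; only the fallback logic mirrors the diff.
DEFAULT_NSYS_ARGS = [
    "--force-overwrite=true",
    "--capture-range=cudaProfilerApi",
    "--capture-range-end=stop",
    "--cuda-graph-trace=node",
]


class OldLauncher:
    """Stand-in for a launcher built before gpu-metrics support existed."""


class NewLauncher:
    """Stand-in for a launcher that exposes the nsys_gpu_metrics attribute."""

    nsys_gpu_metrics = False


def configure_nsys(launcher, extra_args=None, gpu_metrics=False):
    # `or` falls back to the defaults when extra_args is None or empty.
    launcher.nsys_extra_args = extra_args or DEFAULT_NSYS_ARGS
    if gpu_metrics:
        if hasattr(launcher, "nsys_gpu_metrics"):
            launcher.nsys_gpu_metrics = True
        else:
            # Mirrors the logging.warning branch in the diff.
            print("warning: launcher lacks nsys_gpu_metrics; upgrade required")


launcher = OldLauncher()
configure_nsys(launcher, extra_args=["--capture-range=none"])
print(launcher.nsys_extra_args)  # → ['--capture-range=none']
```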

scripts/performance/argument_parser.py

Lines changed: 141 additions & 7 deletions

```diff
@@ -26,14 +26,19 @@ def parse_cli_args():
     """
     parser = argparse.ArgumentParser(description="NeMo2.0 Performance Pretraining and Fine-Tuning")

-    parser.add_argument(
+    subparsers = parser.add_subparsers(dest="cluster_type", help='Type of cluster: slurm or runai')
+
+    slurm_parser = subparsers.add_parser('slurm', help="define variables for slurm launcher")
+    runai_parser = subparsers.add_parser('runai', help="define variables for runai launcher")
+
+    slurm_parser.add_argument(
         "-a",
         "--account",
         type=str,
         help="Slurm account to use for experiment",
         required=True,
     )
-    parser.add_argument(
+    slurm_parser.add_argument(
         "-p",
         "--partition",
         type=str,
@@ -48,22 +53,58 @@ def parse_cli_args():
         help="Target gpu type.",
         required=True,
     )
-    parser.add_argument(
+    slurm_parser.add_argument(
         "-l",
         "--log_dir",
         type=str,
         help=f"Directory for logging experiment results. Defaults to {get_nemorun_home()}",
         required=False,
         default=get_nemorun_home(),
     )
-    parser.add_argument(
+    slurm_parser.add_argument(
         "-t",
         "--time_limit",
         type=str,
         help="Maximum time limit to run experiment for. Defaults to 30 minutes (format- 'HH:MM:SS')",
         required=False,
         default="00:30:00",
     )
+    runai_parser.add_argument(
+        "-b",
+        "--base_url",
+        help="NVIDIA Run:ai API url to use for experiment. Should look like https://<base-url>/api/v1",
+        type=str,
+        required=True,
+    )
+    runai_parser.add_argument(
+        "-id",
+        "--app_id",
+        help="Name of NVIDIA Run:ai Application",
+        type=str,
+        required=True,
+    )
+    runai_parser.add_argument(
+        "-s",
+        "--app_secret",
+        help="NVIDIA Run:ai Application secret",
+        type=str,
+        required=True,
+    )
+    runai_parser.add_argument(
+        "-p",
+        "--project_name",
+        help="NVIDIA Run:ai Project to run the experiment in",
+        type=str,
+        required=True,
+    )
+    runai_parser.add_argument(
+        "-pd",
+        "--pvc_nemo_run_dir",
+        help="Directory path of your nemo-run home in Run:ai PVC",
+        type=str,
+        required=True,
+    )
     container_img_msg = [
         "NeMo container to use for experiment. Defaults to latest dev container- 'nvcr.io/nvidia/nemo:dev'",
         "Make sure your NGC credentials are accessible in your environment.",
@@ -101,7 +142,7 @@ def parse_cli_args():
     parser.add_argument(
         "-en",
         "--enable_nsys",
-        help="Enable Nsys profiling. Diabled by default",
+        help="Enable Nsys profiling. Disabled by default",
         action="store_true",
     )
     parser.add_argument(
@@ -274,7 +315,7 @@ def parse_cli_args():
         type=int,
         help="Number of train steps. Defaults to 100",
         required=False,
-        default=100,
+        default=50,
     )

     def bool_arg(arg):
@@ -349,6 +390,52 @@ def bool_arg(arg):
         required=False,
         default=None,
     )
+    parser.add_argument(
+        "-nlay",
+        "--num_layers",
+        type=int,
+        help="Sets number of model layers.",
+        required=False,
+        default=None,
+    )
+    parser.add_argument(
+        "-hs",
+        "--hidden_size",
+        type=int,
+        help="Sets hidden model size",
+        required=False,
+        default=None,
+    )
+    parser.add_argument(
+        "-pss", "--profiling_start_step", type=int, help="Defines start step for profiling", required=False, default=46
+    )
+    parser.add_argument(
+        "-pso", "--profiling_stop_step", type=int, help="Defines stop step for profiling", required=False, default=50
+    )
+
+    parser.add_argument(
+        "-pgm",
+        "--profiling_gpu_metrics",
+        help="Enable nsys gpu metrics. Disabled by default.",
+        action="store_true",
+    )
+
+    parser.add_argument(
+        "-cps",
+        "--checkpoint_save",
+        type=bool_arg,
+        help="When enabled will trigger checkpoint save operation at the end of training",
+        required=False,
+        default=None,
+    )
+    parser.add_argument(
+        "-cpl",
+        "--checkpoint_load_path",
+        type=str,
+        help="Path to checkpoint to load prior to training start",
+        required=False,
+        default=None,
+    )

     def list_of_strings(arg):
         return arg.split(',')
@@ -368,7 +455,7 @@ def list_of_strings(arg):
         "-cm",
         "--custom_mounts",
         type=list_of_strings,
-        help="Comma separated string of mounts",
+        help="Comma separated string of mounts. For Run:ai, each mount must be in name:path:k8s-claimName format",
         required=False,
         default=[],
     )
@@ -386,4 +473,51 @@ def list_of_strings(arg):
         default=None,
     )

+    parser.add_argument(
+        "--skip_import_checkpoint",
+        help="Skips checkpoint import, finetuning job and only downloads the dataset.",
+        action="store_true",
+        required=False,
+    )
+
+    parser.add_argument(
+        "--skip_dataset_download",
+        help="Skips dataset download, finetuning job and only downloads the checkpoint.",
+        action="store_true",
+        required=False,
+    )
+
+    parser.add_argument(
+        "--skip_finetuning",
+        help="Skips finetuning and only downloads the checkpoint and dataset.",
+        action="store_true",
+        required=False,
+    )
+
+    parser.add_argument(
+        "-ev",
+        "--custom_env_vars",
+        type=str,
+        required=False,
+        default={},
+    )
+
+    parser.add_argument(
+        "-cpin",
+        "--cpu_pinning",
+        type=int,
+        help="Enable CPU pinning to improve performance on some clusters by setting numbers of CPUs per task. Disabled by default",
+        required=False,
+        default=0,
+    )
+
+    parser.add_argument(
+        "-vb",
+        "--enable_vboost",
+        help="Enable VBoost which steers more power towards tensor cores. Disabled by default",
+        type=bool_arg,
+        required=False,
+        default=None,
+    )
+
     return parser
```
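The structural change above is the move from one flat parser to `add_subparsers`, so slurm-specific and Run:ai-specific flags live under separate subcommands (which also lets `-p` mean `--partition` under `slurm` but `--project_name` under `runai`). A trimmed-down, runnable sketch of that layout, keeping only a few representative arguments:

```python
# Trimmed sketch of the new subcommand layout; only a subset of the real
# arguments is reproduced here.
import argparse


def parse_cli_args():
    parser = argparse.ArgumentParser(description="NeMo2.0 performance launcher (sketch)")
    subparsers = parser.add_subparsers(
        dest="cluster_type", required=True, help="Type of cluster: slurm or runai"
    )

    slurm_parser = subparsers.add_parser("slurm", help="define variables for slurm launcher")
    slurm_parser.add_argument("-a", "--account", type=str, required=True)
    slurm_parser.add_argument("-t", "--time_limit", type=str, default="00:30:00")

    runai_parser = subparsers.add_parser("runai", help="define variables for runai launcher")
    # "-p" is free to mean something different under each subcommand.
    runai_parser.add_argument("-p", "--project_name", type=str, required=True)

    return parser


args = parse_cli_args().parse_args(["slurm", "-a", "my_account"])
print(args.cluster_type, args.account, args.time_limit)  # → slurm my_account 00:30:00
```

`dest="cluster_type"` records which subcommand was chosen, so downstream code can branch on `args.cluster_type` to pick the executor.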
