
Commit

Merge branch 'main' into riva_decoder
artbataev authored Aug 10, 2024
2 parents 90c646e + 86715c1 commit 5d5abf8
Showing 167 changed files with 7,193 additions and 2,393 deletions.
1,478 changes: 721 additions & 757 deletions .github/workflows/cicd-main.yml

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions .github/workflows/import-test.yml
@@ -12,7 +12,7 @@ jobs:
test-asr-imports:
runs-on: ubuntu-latest
container:
image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
image: pytorch/pytorch:2.4.0-cuda11.8-cudnn9-runtime
steps:
- name: Checkout repo
uses: actions/checkout@v2
@@ -43,7 +43,7 @@ jobs:
test-tts-imports:
runs-on: ubuntu-latest
container:
image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
image: pytorch/pytorch:2.4.0-cuda11.8-cudnn9-runtime
steps:
- name: Checkout repo
uses: actions/checkout@v2
@@ -70,4 +70,4 @@ jobs:
# Run import checks
python tests/core_ptl/check_imports.py --domain "tts"
# Uninstall NeMo
pip uninstall -y nemo_toolkit
4 changes: 2 additions & 2 deletions Dockerfile.ci
@@ -33,8 +33,8 @@ WORKDIR /workspace

# Install NeMo requirements
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.13.0
ARG MCORE_TAG=2bbe55be32e2d478c4b2ce575af1cccb8fc3d9b9
ARG MODELOPT_VERSION=0.15.0
ARG MCORE_TAG=2fd6e2b74efca73a1f2d27b89bb5419384b4d3bf
ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
RUN \
--mount=type=bind,source=requirements,target=requirements \
116 changes: 76 additions & 40 deletions README.md
@@ -10,10 +10,38 @@
# **NVIDIA NeMo Framework**

## Latest News

<!-- markdownlint-disable -->
<details open>
<summary><b>Large Language Models and Multimodal</b></summary>
<summary><b>Large Language Models and Multimodal Models</b></summary>
<details>
<summary>
<a href="https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/llama/index.html#new-llama-3-1-support for more information/">
New Llama 3.1 Support
</a> (2024-07-23)
</summary>
The NeMo Framework now supports training and customizing the Llama 3.1 collection of LLMs from Meta.
<br><br>
</details>
<details>
<summary>
<a href="https://aws.amazon.com/blogs/machine-learning/accelerate-your-generative-ai-distributed-training-workloads-with-the-nvidia-nemo-framework-on-amazon-eks/">
Accelerate your Generative AI Distributed Training Workloads with the NVIDIA NeMo Framework on Amazon EKS
</a> (2024-07-16)
</summary>
NVIDIA NeMo Framework now runs distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. For step-by-step instructions on creating an EKS cluster and running distributed training workloads with NeMo, see the GitHub repository <a href="https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher/EKS/"> here.</a>
<br><br>
</details>
<details>
<summary>
<a href="https://developer.nvidia.com/blog/nvidia-nemo-accelerates-llm-innovation-with-hybrid-state-space-model-support/">
NVIDIA NeMo Accelerates LLM Innovation with Hybrid State Space Model Support
</a> (2024/06/17)
</summary>
NVIDIA NeMo and Megatron Core now support pre-training and fine-tuning of state space models (SSMs). NeMo also supports training models based on the Griffin architecture as described by Google DeepMind.
<br><br>
</details>
<details>
<summary>
<a href="https://huggingface.co/models?sort=trending&search=nvidia%2Fnemotron-4-340B">
NVIDIA releases 340B base, instruct, and reward models pretrained on a total of 9T tokens.
@@ -46,45 +74,6 @@
The walkthrough includes detailed instructions on how to set up a Google Cloud Project and pre-train a GPT model using the NeMo Framework.
<br><br>
</details>
<details>
<summary>
<a href="https://blogs.nvidia.com/blog/bria-builds-responsible-generative-ai-using-nemo-picasso/">
Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso
</a> (2024/03/06)
</summary>
Bria, a Tel Aviv startup at the forefront of visual generative AI for enterprises, now leverages the NVIDIA NeMo Framework.
The Bria.ai platform uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput and low-latency image generation.
Bria has also adopted NVIDIA Picasso, a foundry for visual generative AI models, to run inference.
<br><br>
</details>
<details>
<summary>
<a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/">
New NVIDIA NeMo Framework Features and NVIDIA H200
</a> (2023/12/06)
</summary>
NVIDIA NeMo Framework now includes several optimizations and enhancements,
including:
1) Fully Sharded Data Parallelism (FSDP) to improve the efficiency of training large-scale AI models,
2) Mixture of Experts (MoE)-based LLM architectures with expert parallelism for efficient LLM training at scale,
3) Reinforcement Learning from Human Feedback (RLHF) with TensorRT-LLM for inference stage acceleration, and
4) up to 4.2x speedups for Llama 2 pre-training on NVIDIA H200 Tensor Core GPUs.
<br><br>
<a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility">
<img src="https://github.com/sbhavani/TransformerEngine/blob/main/docs/examples/H200-NeMo-performance.png" alt="H200-NeMo-performance" style="width: 600px;"></a>
<br><br>
</details>
<details>
<summary>
<a href="https://blogs.nvidia.com/blog/nemo-amazon-titan/">
NVIDIA now powers training for Amazon Titan Foundation models
</a> (2023/11/28)
</summary>
NVIDIA NeMo Framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs).
The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock.
The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.
<br><br>
</details>
</details>

<details open>
@@ -604,6 +593,53 @@ to the `gh-pages-src` branch of this repository. For detailed
information, please consult the README located at the [gh-pages-src
branch](https://github.com/NVIDIA/NeMo/tree/gh-pages-src#readme).

## Blogs

<!-- markdownlint-disable -->
<details open>
<summary><b>Large Language Models and Multimodal Models</b></summary>
<details>
<summary>
<a href="https://blogs.nvidia.com/blog/bria-builds-responsible-generative-ai-using-nemo-picasso/">
Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso
</a> (2024/03/06)
</summary>
Bria, a Tel Aviv startup at the forefront of visual generative AI for enterprises, now leverages the NVIDIA NeMo Framework.
The Bria.ai platform uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput and low-latency image generation.
Bria has also adopted NVIDIA Picasso, a foundry for visual generative AI models, to run inference.
<br><br>
</details>
<details>
<summary>
<a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/">
New NVIDIA NeMo Framework Features and NVIDIA H200
</a> (2023/12/06)
</summary>
NVIDIA NeMo Framework now includes several optimizations and enhancements,
including:
1) Fully Sharded Data Parallelism (FSDP) to improve the efficiency of training large-scale AI models,
2) Mixture of Experts (MoE)-based LLM architectures with expert parallelism for efficient LLM training at scale,
3) Reinforcement Learning from Human Feedback (RLHF) with TensorRT-LLM for inference stage acceleration, and
4) up to 4.2x speedups for Llama 2 pre-training on NVIDIA H200 Tensor Core GPUs.
<br><br>
<a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility">
<img src="https://github.com/sbhavani/TransformerEngine/blob/main/docs/examples/H200-NeMo-performance.png" alt="H200-NeMo-performance" style="width: 600px;"></a>
<br><br>
</details>
<details>
<summary>
<a href="https://blogs.nvidia.com/blog/nemo-amazon-titan/">
NVIDIA now powers training for Amazon Titan Foundation models
</a> (2023/11/28)
</summary>
NVIDIA NeMo Framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs).
The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock.
The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.
<br><br>
</details>
</details>
<!-- markdownlint-enable -->

## Licenses

- [NeMo GitHub Apache 2.0
55 changes: 8 additions & 47 deletions docs/source/core/exp_manager.rst
@@ -248,48 +248,6 @@ You might also want to adjust the callback parameters:
Straggler detection might involve inter-rank synchronization, and should be invoked with reasonable frequency (e.g. every few minutes).

.. _exp_manager_straggler_det_support-label:

.. note::
Stragglers Detection feature is included in the optional NeMo resiliency package.

Distributed training can be affected by stragglers: abnormally slow workers that delay the overall training process.
NeMo provides a straggler detection feature that can identify these slower GPUs.

This feature is implemented in the ``StragglerDetectionCallback``, which is disabled by default.

The callback computes normalized GPU performance scores, which are scalar values ranging from 0.0 (worst) to 1.0 (best).
A performance score can be interpreted as the ratio of current performance to reference performance.

There are two types of performance scores provided by the callback:
- Relative GPU performance score: The best-performing GPU in the workload is used as a reference.
- Individual GPU performance score: The best historical performance of the GPU is used as a reference.

Examples:
- If the relative performance score is 0.5, the GPU is running at half the speed of the fastest GPU in the workload.
- If the individual performance score is 0.5, the GPU is running at half of its best observed performance.

If a GPU performance score drops below the specified threshold, it is identified as a straggler.

To enable straggler detection, add ``create_straggler_detection_callback: True`` under exp_manager in the config YAML file.
You might also want to adjust the callback parameters:

.. code-block:: yaml

   exp_manager:
     ...
     create_straggler_detection_callback: True
     straggler_detection_callback_params:
       report_time_interval: 300            # Interval [seconds] of the straggler check
       calc_relative_gpu_perf: True         # Calculate relative GPU performance
       calc_individual_gpu_perf: True       # Calculate individual GPU performance
       num_gpu_perf_scores_to_log: 5        # Log 5 best and 5 worst GPU performance scores, even if no stragglers are detected
       gpu_relative_perf_threshold: 0.7     # Threshold for relative GPU performance scores
       gpu_individual_perf_threshold: 0.7   # Threshold for individual GPU performance scores
       stop_if_detected: True               # Terminate the workload if stragglers are detected

Straggler detection might involve inter-rank synchronization, and should be invoked with reasonable frequency (e.g. every few minutes).

Fault Tolerance
---------------

@@ -334,9 +292,10 @@ Timeouts for fault detection need to be adjusted for a given workload:
checkpointing related operations should be taken into account.

If ``calculate_timeouts: True``, timeouts will be automatically estimated based on observed intervals.
Estimated timeouts take precedence over timeouts defined in the config file. **Timeouts are estimated after
checkpoint loading and saving was observed**. For example, in multi-part training started from scratch,
estimated timeouts won't be available during the first run. Estimated timeouts are stored in the checkpoint.
Estimated timeouts take precedence over timeouts defined in the config file. **Timeouts are estimated
at the end of a training run, when checkpoint loading and saving were observed**. Hence, in a multi-part
training started from scratch, estimated timeouts won't be available during the initial two runs.
Estimated timeouts are stored in a separate JSON file.
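
For illustration only, a minimal sketch of a fault tolerance configuration with automatic timeout calculation is shown
below. The callback flag and section names (``create_fault_tolerance_callback``, ``fault_tolerance``) are assumed here
by analogy with the straggler detection callback above and should be verified against the installed NeMo version:

.. code-block:: yaml

   exp_manager:
     ...
     create_fault_tolerance_callback: True   # assumed flag name; enables the FT callback
     fault_tolerance:                        # assumed section name for the FT parameters
       initial_rank_heartbeat_timeout: 3600  # [seconds] timeout for the first heartbeat from a rank
       rank_heartbeat_timeout: 2700          # [seconds] timeout for subsequent heartbeats
       calculate_timeouts: True              # estimate both timeouts from observed heartbeat intervals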

``max_subsequent_job_failures`` allows for the automatic continuation of training on a SLURM cluster.
This feature requires the SLURM job to be scheduled with ``NeMo-Framework-Launcher``. If ``max_subsequent_job_failures``
@@ -346,10 +305,12 @@ subsequent jobs failed (SLURM job exit code is `!= 0`) or the training is comple

All FT configuration items summary:
* ``workload_check_interval`` (float, default=5.0) Periodic workload check interval [seconds] in the workload monitor.
* ``initial_rank_heartbeat_timeout`` (Optional[float], default=60.0 * 60.0) Timeout for the first heartbeat from a rank.
* ``rank_heartbeat_timeout`` (Optional[float], default=45.0 * 60.0) Timeout for subsequent heartbeats from a rank.
* ``initial_rank_heartbeat_timeout`` (Optional[float], default=60.0 * 60.0) Timeout [seconds] for the first heartbeat from a rank.
* ``rank_heartbeat_timeout`` (Optional[float], default=45.0 * 60.0) Timeout [seconds] for subsequent heartbeats from a rank.
* ``calculate_timeouts`` (bool, default=True) Try to calculate ``rank_heartbeat_timeout`` and ``initial_rank_heartbeat_timeout``
based on the observed heartbeat intervals.
* ``safety_factor`` (float, default=5.0) When calculating the timeouts, multiply the maximum observed heartbeat interval
by this factor to obtain the timeout estimate. Can be made smaller for stable environments and larger for unstable ones.
* ``rank_termination_signal`` (signal.Signals, default=signal.SIGKILL) Signal used to terminate the rank when failure is detected.
* ``log_level`` (str, default='INFO') Log level for the FT client and server (rank monitor).
* ``max_rank_restarts`` (int, default=0) Used by FT launcher. Max number of restarts for a rank.
75 changes: 75 additions & 0 deletions docs/source/features/moe.rst
@@ -0,0 +1,75 @@
Mixture of Experts
==================

Overview
--------

NeMo Framework supports Mixture of Experts (MoE) in the feedforward block of the transformer layer.

MoE is a machine learning technique where multiple specialized models (experts,
usually multi-layer perceptrons) are combined to solve a complex task. Each expert
focuses on a specific subtask or domain, while a gating network dynamically activates
the most appropriate expert based on the current input.


Use MoE
-------

To use MoE in the NeMo Framework, adjust the ``num_moe_experts`` parameter in the model configuration:

1. Set ``num_moe_experts`` to `8` to leverage 8 experts in the MoE module.

.. code-block:: yaml

   num_moe_experts: 8  # Set MoE to use 8 experts

2. Set ``moe_router_topk`` to the number of experts you want activated. For example, if you want to process each input with two experts:

.. code-block:: yaml

   moe_router_topk: 2  # Processes each token using 2 experts.

Configure MoE-specific Loss Functions
-------------------------------------

In addition, NeMo provides options to configure MoE-specific loss functions.
To balance token distribution across experts:

1. Set ``moe_router_load_balancing_type`` to specify the load balancing method:

.. code-block:: yaml

   moe_router_load_balancing_type: aux_loss  # to use the auxiliary loss; other options include "sinkhorn".

2. Set ``moe_aux_loss_coeff`` to specify the weight of the auxiliary loss. The auxiliary loss is added to encourage distributing tokens equally among all experts. Values in the 1e-2 range are a good start, as follows:

.. code-block:: yaml

   moe_aux_loss_coeff: 1e-2  # set the aux-loss weight to 1e-2

3. Set ``moe_z_loss_coeff`` to specify the weight of the z-loss. A starting value of 1e-3 is recommended, as follows:

.. code-block:: yaml

   moe_z_loss_coeff: 1e-3

Other options include the following (a combined configuration sketch appears after the list):

1. ``moe_input_jitter_eps`` adds noise to the input tensor by applying jitter with a specified epsilon value.

2. ``moe_token_dropping`` enables selectively dropping and padding tokens for each expert to achieve
   a specified capacity, similar to GShard, Switch-Transformer, and DeepSpeed-MoE. Briefly, if the number
   of tokens routed to an expert exceeds its capacity, the excess tokens are dropped. Note that this option is
   currently unsupported, so it should remain False.

3. ``moe_token_dispatcher_type`` specifies the token dispatcher type; options include "allgather" and "alltoall".

4. ``moe_per_layer_logging`` enables per-layer logging for MoE; it currently supports aux-loss and z-loss.

5. ``moe_expert_capacity_factor`` determines the maximum number of tokens that can be routed to each expert in any MoE layer. None means no token is dropped. The default is None.

6. ``moe_pad_expert_input_to_capacity``, if True, pads the input for each expert to match the expert capacity length. It takes effect only after ``moe_expert_capacity_factor`` is set. The default is False.

7. ``moe_token_drop_policy`` is the policy used to drop tokens. It can be either "probs" or "position". If "probs", the tokens with the lowest probabilities are dropped. If "position", tokens at the end of each batch are dropped. The default is "probs".

8. ``moe_layer_recompute``, if True, checkpoints the MoE layer to save activation memory. The default is False.
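
For reference, a combined configuration sketch that puts the basic MoE parameters and the options above together is
shown below. The values are illustrative only, not recommendations, and the placement of these keys under the model
configuration is assumed to follow the standard NeMo GPT config layout:

.. code-block:: yaml

   model:
     num_moe_experts: 8                        # 8 experts per MoE layer
     moe_router_topk: 2                        # route each token to 2 experts
     moe_router_load_balancing_type: aux_loss
     moe_aux_loss_coeff: 1e-2
     moe_z_loss_coeff: 1e-3
     moe_token_dispatcher_type: alltoall       # or "allgather"
     moe_expert_capacity_factor: null          # null: no tokens are dropped
     moe_pad_expert_input_to_capacity: False   # effective only when a capacity factor is set
     moe_token_drop_policy: probs              # drop lowest-probability tokens first
     moe_per_layer_logging: True               # per-layer aux-loss and z-loss logging
     moe_layer_recompute: False                # checkpoint the MoE layer to save activation memory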
@@ -35,4 +35,18 @@ This is because the input sizes of softmax, dropout, and qkv dot-product attenti
However, their recomputation cost is relatively small compared to that of the other linear projection layers, whose cost scales with the square of the hidden size.

Self-attention recomputation is hard-enabled when using FlashAttention, which is supported in Transformer Engine.
Also, a user can use the self-attention recomputation without FlashAttention by setting ``activations_checkpoint_granularity=selective``.
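
As a minimal illustration (assuming the parameter lives under the ``model`` section of a standard NeMo GPT config, and
that ``activations_checkpoint_method`` and ``activations_checkpoint_num_layers`` control the full-granularity variants
discussed below), the setting might look like this:

.. code-block:: yaml

   model:
     activations_checkpoint_granularity: selective  # recompute only the self-attention core
     # For full granularity with the uniform/block methods shown in the figures below:
     # activations_checkpoint_granularity: full
     # activations_checkpoint_method: uniform       # or "block"
     # activations_checkpoint_num_layers: 1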

Scheme of full and selective checkpointing granularity:

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-activation-recomputation-exampe-2.jpg
   :align: center
   :alt: activation-recomputation-example-2
   :scale: 50%

Scheme of uniform and block checkpointing method (full checkpointing granularity):

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-activation-recomputation-exampe-1.jpg
   :align: center
   :alt: activation-recomputation-example-1
   :scale: 50%
