
Commit 5c032dc ("update")

1 parent bfff852

File tree

2 files changed: +12 −4 lines


public/content/pretrain-llm-with-nvfp4/pretrain-llms-with-fp4-content-zh.md

Lines changed: 6 additions & 2 deletions

@@ -59,9 +59,13 @@ NVFP4 uses two distinct scaling factors, which is its most critical feature. To understand

The two-level scaling works as follows:

- 1. **Coarse, per-tensor scaling (FP32):** First, a scaling factor is computed from the absolute maximum of the entire tensor (`max(abs(tensor))`). This factor, stored in high-precision **FP32**, performs a rough, global adjustment of all the values in the tensor, bringing them into a more manageable intermediate range without the scale factor itself becoming a source of error. Using the high-precision FP32 format is crucial, because an imprecise scale factor would incorrectly shrink every value in the tensor, stacking additional error on top of the final quantization step.
+ 1. **Coarse, per-tensor scaling (FP32):** First, a scaling factor is computed from the absolute maximum of the entire tensor (`max(abs(tensor))`). This factor, stored in high-precision **FP32**, performs a rough, global adjustment. The goal of this first step is to take the original tensor (e.g., in BF16 or FP32) and scale it so that its absolute maximum maps to the largest possible value a block can represent. This "largest possible block value" is the product of the maximum FP4 value and the maximum FP8 scale factor value.
+     - Maximum FP4 (E2M1) value: 6
+     - Maximum FP8 (E4M3) scale factor value: 448
+     - Therefore, the maximum combined value is 6 * 448 = 2688.
+ The global scaling brings the tensor into the combined representable range of the FP4 data type and its local FP8 scale factor, i.e. [-2688, 2688]. Using a high-precision format like FP32 is crucial, because an imprecise scale factor would inaccurately shrink every value in the tensor, adding error on top of the final quantization step.

- 2. **Fine-grained, per-block scaling (FP8):** After the first scaling is applied, the tensor is divided into thousands of small 16-element blocks. For **each** block, a second, more precise scaling factor is computed. This local factor, stored in **FP8**, makes a fine-tuned adjustment, mapping the 16 values perfectly onto the extremely limited range of FP4. The FP4 format used here (E2M1) can only represent ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6.
+ 2. **Fine-grained, per-block scaling (FP8):** After the first scaling is applied, the tensor is divided into thousands of small 16-element blocks. For **each** block, a second, more precise scaling factor is computed. This local factor, stored in **FP8**, makes a fine-tuned adjustment, mapping the 16 values perfectly onto the extremely limited range of FP4. The global scaling does not bring the entire tensor into the [-6, 6] range; that is the job of this second, local scaling step. The FP4 format used here (E2M1) can only represent ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6.

This dual approach is powerful because it allows for highly localized adaptation. A large outlier in one part of the tensor affects only the scaling of its own 16-element block, leaving the quantization of every other block completely untouched. This preserves significantly more information than a single scaling factor would.
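The arithmetic in the hunk above (6 * 448 = 2688) can be sanity-checked with a short sketch. This is a minimal NumPy illustration of the coarse per-tensor step only, not NVIDIA's kernel; the function name `global_scale` is hypothetical.

```python
import numpy as np

# Constants from the text: max FP4 (E2M1) magnitude and max FP8 (E4M3) scale.
MAX_FP4 = 6.0
MAX_FP8_SCALE = 448.0
MAX_COMBINED = MAX_FP4 * MAX_FP8_SCALE  # 6 * 448 = 2688

def global_scale(tensor: np.ndarray) -> float:
    """Coarse per-tensor FP32 scale: maps max(abs(tensor)) onto 2688."""
    amax = float(np.max(np.abs(tensor)))
    return MAX_COMBINED / amax if amax > 0 else 1.0

x = np.array([0.01, -3.5, 120.0, -0.25], dtype=np.float32)
scaled = x * global_scale(x)
# The scaled tensor's absolute maximum now sits at the top of the
# combined representable range [-2688, 2688].
```

Note that this step alone leaves most values far below 6; bringing each 16-element block into FP4's own range is the job of the second, per-block scale.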

public/content/pretrain-llm-with-nvfp4/pretrain-llms-with-fp4-content.md

Lines changed: 6 additions & 2 deletions

@@ -56,8 +56,12 @@ NVFP4 uses two distinct scaling factors, which is its most critical feature. To

* A **Block** is a small, fixed-size chunk of a tensor. In NVFP4, a block is a group of just 16 contiguous numbers. So, the enormous weight and activation tensors above would be partitioned into thousands or millions of these tiny blocks for quantization.

The two-level scaling works as follows:

- 1. **Coarse, Per-Tensor Scaling (FP32):** First, a single scaling factor is calculated for the *entire tensor* based on its absolute maximum value (`max(abs(tensor))`). This factor, stored in high-precision **FP32**, performs a rough, global adjustment, bringing all the values in the tensor into a more manageable intermediate range without the scale factor itself becoming a source of error. Using a high-precision format like FP32 is crucial because an imprecise scale factor would inaccurately shrink every value in the tensor, adding error on top of the final quantization step.
- 2. **Fine-Grained, Per-Block Scaling (FP8):** After the first scaling is applied, the tensor is divided into thousands of small 16-element blocks. For *each* of these blocks, a second, more precise scaling factor is calculated. This local factor, stored in **FP8**, makes a fine-tuned adjustment, perfectly mapping the 16 values to the extremely limited range of FP4. Using FP8 for the block-level scale provides enough precision for local adjustments while remaining efficient for the hardware to process. The FP4 format used here (E2M1) can only represent the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6.
+ 1. **Coarse, Per-Tensor Scaling (FP32):** First, a single scaling factor is calculated for the *entire tensor* based on its absolute maximum value (`max(abs(tensor))`). This factor, stored in high-precision **FP32**, performs a rough, global adjustment. The goal of this first step is to take the original tensor (e.g., in BF16 or FP32) and scale it so that its absolute maximum value is mapped to the largest possible value that a block can represent. This "largest possible block value" is the product of the maximum FP4 value and the maximum FP8 scale factor value.
+     - Maximum FP4 (E2M1) value: 6
+     - Maximum FP8 (E4M3) scale factor value: 448
+     - Therefore, the maximum combined value is 6 * 448 = 2688.
+ The global scaling brings the tensor into the combined representable range of the FP4 data type and its local FP8 scale factor, which is [-2688, 2688]. Using a high-precision format like FP32 is crucial because an imprecise scale factor would inaccurately shrink every value in the tensor, adding error on top of the final quantization step.
+ 2. **Fine-Grained, Per-Block Scaling (FP8):** After the first scaling is applied, the tensor is divided into thousands of small 16-element blocks. For *each* of these blocks, a second, more precise scaling factor is calculated. This local factor, stored in **FP8**, makes a fine-tuned adjustment, perfectly mapping the 16 values to the extremely limited range of FP4. The global scaling does not bring the entire tensor into the [-6, 6] range; that is the job of this second, local scaling step. The FP4 format used here (E2M1) can only represent the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6.

This dual approach is powerful because it allows for highly localized adaptation. A large outlier in one part of the tensor will only affect the scaling of its tiny 16-element block, leaving the quantization of all other blocks completely unaffected. This preserves significantly more information compared to a single scaling factor.

- **Reduced Block Size for Better Dynamic Range:** NVFP4 uses a smaller micro-block size of 16 elements. This is crucial for **capturing the local dynamic range**. In simpler terms, if a block of numbers contains one large outlier, only the other 15 numbers in that small block are affected during scaling. In a larger block (e.g., 32 elements), that same outlier would force a less precise scaling for all 31 other numbers, potentially causing more information loss. The smaller block size isolates the impact of outliers.
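The two steps described in the hunk above can be combined into a runnable sketch. This is an illustration under stated assumptions, not the hardware path: the per-block scale is kept as a Python float rather than a true FP8 E4M3 value, and the function name `quantize_dequantize_nvfp4` and its round-to-nearest behavior are hypothetical.

```python
import numpy as np

# Representable E2M1 magnitudes from the text: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_MAGNITUDES[:0:-1], FP4_MAGNITUDES])
MAX_COMBINED = 6.0 * 448.0  # max FP4 value times max FP8 scale = 2688

def quantize_dequantize_nvfp4(tensor, block_size=16):
    """Two-level quantize, then dequantize so the error can be inspected."""
    t = np.asarray(tensor, dtype=np.float64).ravel()
    # Level 1: coarse per-tensor scale (FP32 in real NVFP4) onto [-2688, 2688].
    g = MAX_COMBINED / np.max(np.abs(t))
    scaled = t * g
    out = np.empty_like(scaled)
    for i in range(0, len(scaled), block_size):
        blk = scaled[i:i + block_size]
        amax = np.max(np.abs(blk))
        # Level 2: per-block scale (stored as FP8 in real NVFP4) that maps
        # this block's own maximum onto the top FP4 value, 6.
        s = amax / 6.0 if amax > 0 else 1.0
        # Snap each value in the block to the nearest representable FP4 value.
        idx = np.abs(blk[:, None] / s - FP4_GRID[None, :]).argmin(axis=1)
        out[i:i + block_size] = FP4_GRID[idx] * s
    return out / g  # undo the global scale to compare against the input

# Values that land exactly on the FP4 grid after scaling survive unchanged.
x = np.array([6.0, 3.0, 1.5, -0.5])
assert np.allclose(quantize_dequantize_nvfp4(x), x)
```

The sketch also makes the outlier-isolation point concrete: because `s` is recomputed per 16-element block, a single huge value only coarsens the grid for its own block.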
