Commit 016f64c
[NVBUG: 5612606] Clear GPU cache during large-model layer quantization at export (#497)
## What does this PR do?
**Type of change:** Bug fix
**Overview:** For large models such as Llama 4 Maverick, converting the stacked weights to fp8 during export can run out of GPU memory (OOM). This change clears the GPU cache during layer quantization to avoid that.
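The diff body is not captured below, so the following is only a minimal sketch of the kind of fix the title and overview describe: releasing cached GPU allocations between per-layer fp8 conversions. The `export_layers_to_fp8` helper, the per-`nn.Linear` loop, and the `torch.float8_e4m3fn` target dtype are illustrative assumptions, not the repository's actual export path; `torch.cuda.empty_cache()` and `gc.collect()` are real PyTorch/Python calls.

```python
# Hedged sketch, not the actual diff: clear the GPU cache after each
# layer's weight conversion so freed-but-cached blocks are returned to
# the allocator before the next (large, stacked) weight is processed.
import gc

import torch
import torch.nn as nn


def export_layers_to_fp8(model: nn.Module) -> dict[str, torch.Tensor]:
    """Convert each Linear layer's weight to fp8, trimming GPU cache per layer.

    The helper name, loop, and fp8 dtype are assumptions for illustration.
    """
    fp8_state: dict[str, torch.Tensor] = {}
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        # Cast on-device, then move the result to CPU so the GPU copy
        # can be freed immediately.
        fp8_state[name] = module.weight.detach().to(torch.float8_e4m3fn).cpu()
        # Drop dangling Python references and return cached CUDA blocks,
        # preventing OOM when the next stacked weight is materialized.
        gc.collect()
        torch.cuda.empty_cache()
    return fp8_state
```

Without the `empty_cache()` call, PyTorch's caching allocator holds freed blocks for reuse, which inflates peak reserved memory over a long per-layer export loop.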
---------
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: mxin <[email protected]>

1 parent: 3ea6921
1 file changed: +3 additions, −0 deletions
Diff summary (file contents not shown): +1 line at new line 40; +2 lines at new lines 767–768.