The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.
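Combining streams with `low_cpu_mem_usage` might look like the sketch below (the CogVideoX checkpoint is only a stand-in; any model exposing [`~ModelMixin.enable_group_offload`] works the same way).

```py
import torch
from diffusers import CogVideoXTransformer3DModel

# Any ModelMixin-based model works; this checkpoint is only an example
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# use_stream=True overlaps transfers with compute; low_cpu_mem_usage=True
# pins tensors on the fly instead of pre-pinning them, saving CPU memory
# at some cost in execution time
transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    low_cpu_mem_usage=True,
)
```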
#### Offloading to disk
Group offloading can consume significant system memory depending on the model size. On systems with limited memory, try group offloading to disk as a secondary memory instead.
Set the `offload_to_disk_path` argument in either [`~ModelMixin.enable_group_offload`] or [`~hooks.apply_group_offloading`] to offload the model to the disk.
Refer to these [two](https://github.com/huggingface/diffusers/pull/11682#issue-3129365363) [tables](https://github.com/huggingface/diffusers/pull/11682#issuecomment-2955715126) to compare the speed and memory trade-offs.
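A minimal sketch of disk offloading is shown below; the checkpoint and the path are placeholders, not recommendations.

```py
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Offloaded parameter groups are written to this directory instead of
# being held entirely in system RAM; the path is only an example
transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    offload_to_disk_path="./group_offload",
)
```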
Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.
Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) flag `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.
In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations.
Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.
<hfoptions id="offloading">
<hfoption id="model CPU offloading">