Commit 1089c80

Commit message: feedback
1 parent 43b269d commit 1089c80

File tree

2 files changed: +18 -2 lines changed


docs/source/en/optimization/memory.md

Lines changed: 14 additions & 0 deletions
@@ -301,6 +301,20 @@ pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_d

The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.

+#### Offloading to disk
+
+Group offloading can consume significant system memory depending on the model size. On systems with limited memory, try group offloading to disk as secondary memory.
+
+Set the `offload_to_disk_path` argument in either [`~ModelMixin.enable_group_offload`] or [`~hooks.apply_group_offloading`] to offload the model to disk.
+
+```py
+pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", offload_to_disk_path="path/to/disk")
+
+apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2, offload_to_disk_path="path/to/disk")
+```
+
+Refer to these [two](https://github.com/huggingface/diffusers/pull/11682#issue-3129365363) [tables](https://github.com/huggingface/diffusers/pull/11682#issuecomment-2955715126) to compare the speed and memory trade-offs.
+
## Layerwise casting

> [!TIP]
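
For a fuller picture, here is a minimal, self-contained sketch of the disk-offloading setup added above. The checkpoint name, device choices, and the `"path/to/disk"` directory are placeholders for illustration, not part of this commit.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_group_offloading

# Placeholder pipeline and devices; substitute your own.
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Offloaded groups are written to disk instead of being held in CPU RAM.
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    offload_to_disk_path="path/to/disk",
)
apply_group_offloading(
    pipeline.text_encoder,
    onload_device=onload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
    offload_to_disk_path="path/to/disk",
)
```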

docs/source/en/optimization/speed-memory-optims.md

Lines changed: 4 additions & 2 deletions
@@ -37,7 +37,7 @@ pip install -U bitsandbytes

Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.

-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models with `fullgraph=True`.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

```py
import torch
@@ -72,7 +72,7 @@ pipeline("""

In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations.

-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

<hfoptions id="offloading">
<hfoption id="model CPU offloading">
@@ -85,6 +85,7 @@ from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.cache_size_limit = 1000
+torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
@@ -125,6 +126,7 @@ from diffusers.quantizers import PipelineQuantizationConfig
from transformers import UMT5EncoderModel

torch._dynamo.config.cache_size_limit = 1000
+torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
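
Taken together, the two Dynamo settings touched in this file are applied once near the top of a script before compiling a quantized, offloaded pipeline. Below is a minimal sketch under that assumption; the checkpoint, quantization arguments, and component names are illustrative placeholders rather than part of this commit.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Raise the recompilation cache limit for offloading and let Dynamo capture
# the dynamic output shapes produced by bitsandbytes kernels.
torch._dynamo.config.cache_size_limit = 1000
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# Placeholder checkpoint; quantize only the transformer with bitsandbytes 4-bit.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True},
    components_to_quantize=["transformer"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# Offload whole model components to the CPU between uses, then compile.
pipeline.enable_model_cpu_offload()
pipeline.transformer.compile()
```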
