The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.
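Combining streams with `low_cpu_mem_usage` might look like the sketch below (the CogVideoX checkpoint is only a stand-in; any model exposing [`~ModelMixin.enable_group_offload`] works the same way).

```py
import torch
from diffusers import CogVideoXTransformer3DModel

# Any ModelMixin-based model works; this checkpoint is only an example
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# use_stream=True overlaps transfers with compute; low_cpu_mem_usage=True
# pins tensors on the fly instead of pre-pinning them, saving CPU memory
# at some cost in execution time
transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    low_cpu_mem_usage=True,
)
```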
#### Offloading to disk
Group offloading can consume significant system memory depending on the model size. On systems with limited memory, try group offloading to disk as a secondary memory instead.
Set the `offload_to_disk_path` argument in either [`~ModelMixin.enable_group_offload`] or [`~hooks.apply_group_offloading`] to offload the model to the disk.
Refer to these [two](https://github.com/huggingface/diffusers/pull/11682#issue-3129365363) [tables](https://github.com/huggingface/diffusers/pull/11682#issuecomment-2955715126) to compare the speed and memory trade-offs.
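A minimal sketch of disk offloading is shown below; the checkpoint and the path are placeholders, not recommendations.

```py
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Offloaded parameter groups are written to this directory instead of
# being held entirely in system RAM; the path is only an example
transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    offload_to_disk_path="./group_offload",
)
```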
Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.
Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) flag `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.
In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations.
Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.
<hfoptions id="offloading">
<hfoption id="model CPU offloading">