Commit 926df99: feedback

1 parent: a0a16db
docs/source/en/optimization/speed-memory-optims.md

Lines changed: 4 additions & 8 deletions
@@ -12,11 +12,9 @@ specific language governing permissions and limitations under the License.
 
 # Compile and offloading quantized models
 
-When optimizing models, you often face trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it comes at the cost of increased memory consumption since it needs to store intermediate attention layer outputs.
+Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading).
 
-A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading). This approach not only accelerates inference but also helps lower memory-usage.
-
-For image generation, combining quantization and [model offloading](./memory#model-offloading) can often give the best trade-off between quality, speed, and memory. Group offloading is not as effective because it is usually not possible to *fully* overlap data transfer if the compute kernel finishes faster. This results in some communication overhead between the CPU and GPU.
+For image generation, combining quantization and [model offloading](./memory#model-offloading) can often give the best trade-off between quality, speed, and memory. Group offloading is not as effective for image generation because it is usually not possible to *fully* overlap data transfer if the compute kernel finishes faster. This results in some communication overhead between the CPU and GPU.
 
 For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound.
 
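The strategy described in this hunk boils down to three building blocks: quantize, offload, compile. As a minimal sketch of the first step (not part of this commit; the model id and NF4 settings are illustrative assumptions, and the benchmark below additionally quantizes the text encoder, omitted here for brevity):

```py
import torch
from diffusers import BitsAndBytesConfig, DiffusionPipeline, FluxTransformer2DModel

# illustrative 4-bit NF4 quantization of the Flux transformer only
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
```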

@@ -27,7 +25,7 @@ The table below provides a comparison of optimization strategy combinations and
 | quantization | 32.602 | 14.9453 |
 | quantization, torch.compile | 25.847 | 14.9448 |
 | quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
-<small>These results are benchmarked on Flux with a RTX 4090. The `transformer` and `text_encoder` components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>
+<small>These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>
 
 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.
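The linked gist is the authoritative benchmark. If you only want a quick local comparison between these combinations, a rough helper along these lines would do (a sketch for illustration, not the script that produced the table):

```py
import time
import torch

def benchmark(pipeline, prompt, n_warmup=1, n_runs=3):
    # warm-up calls so torch.compile's one-time compilation cost is excluded
    for _ in range(n_warmup):
        pipeline(prompt)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipeline(prompt)
    torch.cuda.synchronize()
    latency_s = (time.perf_counter() - start) / n_runs
    peak_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
    return latency_s, peak_vram_gb
```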

@@ -39,7 +37,7 @@ pip install -U bitsandbytes
 
 Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.
 
-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bnb models with `fullgraph=True`.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models with `fullgraph=True`.
 
 ```py
 import torch
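For reference, the flag lives on `torch._dynamo.config`. A minimal sketch of the configuration step this hunk describes (the rest of the doc's code block is not shown in the diff):

```py
import torch
import torch._dynamo

# allow Dynamo to capture ops with dynamic output shapes, which bitsandbytes
# models produce, so they can be compiled with fullgraph=True
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# later in the example the quantized transformer is compiled, roughly:
# pipeline.transformer.compile(fullgraph=True)
```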
@@ -104,7 +102,6 @@ pipeline = DiffusionPipeline.from_pretrained(
 pipeline.enable_model_cpu_offload()
 
 # compile
-pipeline.transformer.to(memory_format=torch.channels_last)
 pipeline.transformer.compile()
 pipeline(
     "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
@@ -172,7 +169,6 @@ apply_group_offloading(
 )
 
 # compile
-pipeline.transformer.to(memory_format=torch.channels_last)
 pipeline.transformer.compile()
 
 prompt = """
