diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 5492dff04cae..31a7c45035db 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -163,7 +163,7 @@
   title: Training
 - sections:
   - local: quantization/overview
-    title: Getting Started
+    title: Getting started
   - local: quantization/bitsandbytes
     title: bitsandbytes
   - local: quantization/gguf
diff --git a/docs/source/en/api/quantization.md b/docs/source/en/api/quantization.md
index e2ca990190e6..5846beac9e9e 100644
--- a/docs/source/en/api/quantization.md
+++ b/docs/source/en/api/quantization.md
@@ -27,19 +27,19 @@ Learn how to quantize models in the [Quantization](../quantization/overview) gui
 
 ## BitsAndBytesConfig
 
-[[autodoc]] BitsAndBytesConfig
+[[autodoc]] quantizers.quantization_config.BitsAndBytesConfig
 
 ## GGUFQuantizationConfig
 
-[[autodoc]] GGUFQuantizationConfig
+[[autodoc]] quantizers.quantization_config.GGUFQuantizationConfig
 
 ## QuantoConfig
 
-[[autodoc]] QuantoConfig
+[[autodoc]] quantizers.quantization_config.QuantoConfig
 
 ## TorchAoConfig
 
-[[autodoc]] TorchAoConfig
+[[autodoc]] quantizers.quantization_config.TorchAoConfig
 
 ## DiffusersQuantizer
 
diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md
index cc5a7e2891bb..103dcddb73e9 100644
--- a/docs/source/en/quantization/overview.md
+++ b/docs/source/en/quantization/overview.md
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License.
 
 -->
 
-# Quantization
+# Getting started
 
 Quantization focuses on representing data with fewer bits while also trying to preserve the precision of the original data. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits.
 
@@ -19,19 +19,25 @@ Diffusers supports multiple quantization backends to make large diffusion models
 
 ## Pipeline-level quantization
 
-There are two ways you can use [`~quantizers.PipelineQuantizationConfig`] depending on the level of control you want over the quantization specifications of each model in the pipeline.
+There are two ways to use [`~quantizers.PipelineQuantizationConfig`] depending on how much customization you want to apply to the quantization configuration.
 
-- for more basic and simple use cases, you only need to define the `quant_backend`, `quant_kwargs`, and `components_to_quantize`
-- for more granular quantization control, provide a `quant_mapping` that provides the quantization specifications for the individual model components
+- for basic use cases, define the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments
+- for granular quantization control, define a `quant_mapping` that provides the quantization configuration for individual model components
 
-### Simple quantization
+### Basic quantization
 
 Initialize [`~quantizers.PipelineQuantizationConfig`] with the following parameters.
 
 - `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
-- `quant_kwargs` contains the specific quantization arguments to use.
+- `quant_kwargs` specifies the quantization arguments to use.
+
+> [!TIP]
+> These `quant_kwargs` arguments are different for each backend. Refer to the [Quantization API](../api/quantization) docs to view the arguments for each backend.
+
 - `components_to_quantize` specifies which components of the pipeline to quantize. Typically, you should quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
 
+The example below loads the bitsandbytes backend with the following arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`], `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype`.
+
 ```py
 import torch
 from diffusers import DiffusionPipeline
@@ -56,13 +62,13 @@ pipe = DiffusionPipeline.from_pretrained(
 image = pipe("photo of a cute dog").images[0]
 ```
 
-### quant_mapping
+### Advanced quantization
 
-The `quant_mapping` argument provides more flexible options for how to quantize each individual component in a pipeline, like combining different quantization backends.
+The `quant_mapping` argument provides more options for how to quantize each individual component in a pipeline, like combining different quantization backends.
 
 Initialize [`~quantizers.PipelineQuantizationConfig`] and pass a `quant_mapping` to it. The `quant_mapping` allows you to specify the quantization options for each component in the pipeline such as the transformer and text encoder.
 
-The example below uses two quantization backends, [`~quantizers.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder.
+The example below uses two quantization backends, [`~quantizers.quantization_config.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder.
 
 ```py
 import torch