Add sd3 quant #976

Open · wants to merge 22 commits into base: main
Changes from 12 commits
Binary file modified imgs/nexfort_sd3_demo.png
63 changes: 50 additions & 13 deletions onediff_diffusers_extensions/examples/sd3/README.md
@@ -3,14 +3,15 @@
1. [Environment Setup](#environment-setup)
- [Set Up OneDiff](#set-up-onediff)
- [Set Up NexFort Backend](#set-up-nexfort-backend)
-  - [Set Up Diffusers Library](#set-up-diffusers-library)
+  - [Set Up Diffusers](#set-up-diffusers)
- [Download SD3 Model for Diffusers](#download-sd3-model-for-diffusers)
2. [Execution Instructions](#execution-instructions)
- [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
- [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SD3](#dynamic-shape-for-sd3)
-5. [Quality](#quality)
+5. [Quantization](#quantization)
+6. [Quality](#quality)

## Environment setup
### Set up onediff
@@ -25,12 +26,12 @@ https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/back
# Ensure diffusers includes the SD3 pipeline.
pip3 install --upgrade diffusers[torch]
```
-### Set up SD3
+### Download SD3 model for diffusers
Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

-## Run
+## Execution instructions

### Run 1024*1024 without compilation (the original PyTorch HF diffusers baseline)
```
@@ -42,22 +43,22 @@ python3 onediff_diffusers_extensions/examples/sd3/text_to_image_sd3.py \

```
python3 onediff_diffusers_extensions/examples/sd3/text_to_image_sd3.py \
-    --compiler-config '{"mode": "max-optimize:max-autotune:low-precision:cache-all:freezing:benchmark", "memory_format": "channels_last"}' \
+    --compiler-config '{"mode": "max-optimize:max-autotune:low-precision:cudagraphs:cache-all:freezing:benchmark", "memory_format": "channels_last"}' \
--saved-image sd3_compile.png
```

## Performance comparison

-Testing on H800-NVL-80GB, with image size of 1024*1024, iterating 28 steps:
+Testing on H800-NVL-80GB with torch 2.3.0, with image size of 1024*1024, iterating 28 steps:
| Metric | |
| ------------------------------------------------ | ----------------------------------- |
-| Data update date(yyyy-mm-dd) | 2024-06-24 |
-| PyTorch iteration speed | 15.56 it/s |
-| OneDiff iteration speed | 25.91 it/s (+66.5%) |
-| PyTorch E2E time | 1.96 s |
-| OneDiff E2E time | 1.15 s (-41.3%) |
-| PyTorch Max Mem Used | 18.784 GiB |
-| OneDiff Max Mem Used | 18.324 GiB |
+| Data update date(yyyy-mm-dd) | 2024-06-25 |
+| PyTorch iteration speed | 15.11 it/s |
+| OneDiff iteration speed | 25.14 it/s (+66.4%) |
+| PyTorch E2E time | 2.03 s |
+| OneDiff E2E time | 1.21 s (-40.1%) |
+| PyTorch Max Mem Used | 18.788 GiB |
+| OneDiff Max Mem Used | 17.926 GiB |
| PyTorch Warmup with Run time | 2.86 s |
| OneDiff Warmup with Compilation time<sup>1</sup> | 889.25 s |
| OneDiff Warmup with Cache time | 44.38 s |
@@ -95,6 +96,42 @@ python3 onediff_diffusers_extensions/examples/sd3/text_to_image_sd3.py \
--run_multiple_resolutions 1 \
--saved-image sd3_compile.png
```
## Quantization

Note: Quantization is a feature of OneDiff Enterprise.

### Run

Quantization can be applied selectively to the model's layers based on precision. Download `fp8_e4m3.json` or `fp8_e4m3_per_tensor.json` from https://huggingface.co/siliconflow/stable-diffusion-3-onediff-nexfort-fp8.

The `--quant-submodules-config-path` argument is optional. If left as `None`, all linear layers are quantized.

```
# Applies dynamic symmetric per-tensor activation and per-tensor weight quantization to all linear layers. Both activations and weights are quantized to e4m3 format.
python3 onediff_diffusers_extensions/examples/sd3/text_to_image_sd3.py \
--compiler-config '{"mode": "quant:max-optimize:max-autotune:low-precision:cudagraphs:freezing:benchmark", "memory_format": "channels_last"}' \
--quantize-config '{"quant_type": "fp8_e4m3_e4m3_dynamic_per_tensor"}' \
--quant-submodules-config-path /path/to/fp8_e4m3_per_tensor.json \
--saved-image sd3_fp8.png
```
or
```
# Applies dynamic symmetric per-token activation and per-channel weight quantization to all linear layers.
python3 onediff_diffusers_extensions/examples/sd3/text_to_image_sd3.py \
--compiler-config '{"mode": "quant:max-optimize:max-autotune:low-precision:cudagraphs:freezing:benchmark", "memory_format": "channels_last"}' \
--quantize-config '{"quant_type": "fp8_e4m3_e4m3_dynamic"}' \
--quant-submodules-config-path /path/to/fp8_e4m3.json \
--saved-image sd3_fp8.png
```
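
To see why per-channel/per-token scales (`fp8_e4m3_e4m3_dynamic`) usually preserve accuracy better than a single per-tensor scale (`fp8_e4m3_e4m3_dynamic_per_tensor`), here is a rough NumPy sketch. It is illustrative only: a uniform rounding grid stands in for the real fp8 e4m3 cast (whose largest finite value is 448), so it models the scaling granularity rather than fp8's non-uniform spacing.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value of the fp8 e4m3 format

def fake_quant_per_tensor(w):
    # One shared scale for the whole tensor.
    scale = np.abs(w).max() / E4M3_MAX
    q = np.clip(np.round(w / scale), -E4M3_MAX, E4M3_MAX)  # stand-in for the fp8 cast
    return q * scale  # dequantize back to float for comparison

def fake_quant_per_channel(w):
    # One scale per output channel (row): finer granularity.
    scale = np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX
    q = np.clip(np.round(w / scale), -E4M3_MAX, E4M3_MAX)
    return q * scale

# A weight with one tiny-magnitude row and one large-magnitude row:
w = np.array([[0.001, 0.002, -0.001],
              [100.0, -200.0, 50.0]])
err_tensor = np.mean((w - fake_quant_per_tensor(w)) ** 2)
err_channel = np.mean((w - fake_quant_per_channel(w)) ** 2)
```

With the shared scale (set by the large row), the small row rounds to zero, while per-channel scales keep it intact; this is why per-channel weight and per-token activation quantization is typically more accurate, at the cost of storing more scales.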

### Metric

The performance of the above quantization types on the H800-NVL-80GB is as follows:

| quant_type | E2E Inference Time | Iteration speed | Max Used CUDA Memory |
|----------------------------------|--------------------|--------------------|----------------------|
| fp8_e4m3_e4m3_dynamic | 1.15 s (-43.4%) | 26.30 it/s (+74.1%)| 16.933 GiB |
| fp8_e4m3_e4m3_dynamic_per_tensor | 1.09 s (-46.3%) | 27.75 it/s (+83.7%)| 17.098 GiB |

## Quality
When using nexfort as the backend for onediff compilation acceleration, the generated images are lossless.
17 changes: 14 additions & 3 deletions onediff_diffusers_extensions/examples/sd3/text_to_image_sd3.py
@@ -26,7 +26,7 @@ def parse_args():
parser.add_argument(
"--prompt",
type=str,
-        default="photo of a dog and a cat both standing on a red box, with a blue ball in the middle with a parrot standing on top of the ball. The box has the text 'onediff'",
+        default="photo of a dog and a cat both standing on a red box, with a blue ball in the middle with a parrot standing on top of the ball. The box has the text 'nexfort'",
help="Prompt for the image generation.",
)
parser.add_argument(
@@ -54,7 +54,7 @@ def parse_args():
help="Path to save the generated image.",
)
parser.add_argument(
-        "--seed", type=int, default=1, help="Seed for random number generation."
+        "--seed", type=int, default=2, help="Seed for random number generation."
)
parser.add_argument(
"--run_multiple_resolutions",
@@ -66,6 +66,7 @@
type=(lambda x: str(x).lower() in ["true", "1", "yes"]),
default=False,
)
parser.add_argument("--quant-submodules-config-path", type=str, default=None)
return parser.parse_args()


@@ -155,7 +156,17 @@ def compile_pipe(self, pipe, compiler_config):
return pipe

def quantize_pipe(self, pipe, quantize_config):
-        pipe = quantize_pipe(pipe, ignores=[], **quantize_config)
+        if args.quant_submodules_config_path:
+            # Quantization submodules configuration file download: https://huggingface.co/siliconflow/stable-diffusion-3-onediff-nexfort-fp8
+            pipe = quantize_pipe(
+                pipe,
+                quant_submodules_config_path=args.quant_submodules_config_path,
+                top_percentage=75,
+                ignores=[],
+                **quantize_config,
+            )
+        else:
+            pipe = quantize_pipe(pipe, ignores=[], **quantize_config)
return pipe
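
The call above passes `top_percentage=75`, which implies that only a portion of the candidate submodules listed in the config file actually gets quantized. The real selection logic lives inside OneDiff Enterprise and is not shown in this PR; the helper below is only a hypothetical sketch of a sensitivity-based cutoff (the function name, the `scores` mapping, and the module names are all invented for illustration):

```python
def select_quant_modules(scores, top_percentage=75):
    """Hypothetical selection: quantize the top_percentage of candidate
    modules with the LOWEST sensitivity score and keep the rest in high
    precision. `scores` maps module name -> sensitivity to quantization."""
    ranked = sorted(scores, key=scores.get)  # least sensitive first
    keep = max(1, round(len(ranked) * top_percentage / 100))
    return ranked[:keep]

# Example: with 4 candidate layers and top_percentage=75, 3 are quantized.
scores = {"blocks.0.ff": 0.10, "blocks.0.attn": 0.90,
          "blocks.1.ff": 0.20, "blocks.1.attn": 0.50}
quantized = select_quant_modules(scores, top_percentage=75)
```

The idea sketched here is that the most quantization-sensitive layers stay in high precision while most of the memory and speed benefit is preserved.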

