Commit
add ditfastattn in readme, and separate cogvideo and ditfastattn run scripts. (#298)
1 parent ae504d6 · commit 8affe0d
Showing 4 changed files with 116 additions and 63 deletions.
@@ -27,14 +27,17 @@ | |
- [Pixart](#perf_pixart) | ||
- [Latte](#perf_latte) | ||
- [🚀 QuickStart](#QuickStart) | ||
- [🖼️ ComfyUI with xDiT](#comfyui) | ||
- [✨ xDiT's Arsenal](#secrets) | ||
- [Parallel Methods](#parallel) | ||
- [1. PipeFusion](#PipeFusion) | ||
- [2. Unified Sequence Parallel](#USP) | ||
- [3. Hybrid Parallel](#hybrid_parallel) | ||
- [4. CFG Parallel](#cfg_parallel) | ||
- [5. Parallel VAE](#parallel_vae) | ||
- [Compilation Acceleration](#compilation) | ||
- [Single GPU Acceleration](#1gpuacc) | ||
- [Compilation Acceleration](#compilation) | ||
- [DiTFastAttn](#dittfastattn) | ||
- [📚 Develop Guide](#dev-guide) | ||
- [🚧 History and Looking for Contributions](#history) | ||
- [📝 Cite Us](#cite-us) | ||
|
@@ -46,14 +49,23 @@ Diffusion Transformers (DiTs) are driving advancements in high-quality image and | |
With the escalating input context length in DiTs, the computational demand of the Attention mechanism grows **quadratically**! | ||
Consequently, multi-GPU and multi-machine deployments are essential to meet the **real-time** requirements in online services. | ||
|
||
|
||
<h3 id="meet-xdit-parallel">Parallel Inference</h3> | ||
|
||
To meet the real-time demands of DiT applications, parallel inference is a must.
xDiT is an inference engine designed for the large-scale parallel deployment of DiTs.
xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as GPU kernel accelerations. | ||
xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as computation accelerations. | ||
|
||
The overview of xDiT is shown as follows. | ||
|
||
<picture> | ||
<img alt="xDiT" src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/xdit_overview.png"> | ||
</picture> | ||
|
||
|
||
1. Sequence Parallelism, [USP](https://arxiv.org/abs/2405.07719) is a unified sequence parallel approach combining DeepSpeed-Ulysses, Ring-Attention. | ||
1. Sequence Parallelism, [USP](https://arxiv.org/abs/2405.07719) is a unified sequence parallel approach combining DeepSpeed-Ulysses and Ring-Attention, proposed by us.
|
||
2. [PipeFusion](https://arxiv.org/abs/2405.14430), a patch level pipeline parallelism using displaced patch by taking advantage of the diffusion model characteristics. | ||
2. [PipeFusion](https://arxiv.org/abs/2405.14430), a sequence-level pipeline parallelism similar to [TeraPipe](https://arxiv.org/abs/2102.07988) that takes advantage of the input temporal redundancy characteristic of diffusion models.
|
||
3. Data Parallel: Processes multiple prompts, or generates multiple images from a single prompt, in parallel (a combined launch sketch follows this list).
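As a rough illustration of how these methods compose, the sketch below splits 8 GPUs across Ulysses, Ring, and PipeFusion parallelism. The script path and flag names (`--ulysses_degree`, `--ring_degree`, `--pipefusion_parallel_degree`) are assumptions modeled on the examples referenced later in this README, not a confirmed CLI; check `examples/run.sh` for the actual options.

```
# Hypothetical hybrid-parallel launch on 8 GPUs:
# 2-way Ulysses x 2-way Ring (sequence parallel) x 2-way PipeFusion = 8 GPUs total.
torchrun --nproc_per_node=8 examples/pixartalpha_example.py \
    --model PixArt-alpha/PixArt-XL-2-1024-MS \
    --ulysses_degree 2 \
    --ring_degree 2 \
    --pipefusion_parallel_degree 2 \
    --num_inference_steps 20 \
    --prompt "a photo of an astronaut riding a horse on mars"
```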
|
||
|
@@ -70,15 +82,13 @@ We also have implemented the following parallel strategies for reference:
2. [DistriFusion](https://arxiv.org/abs/2402.19481) | ||
|
||
|
||
Optimization orthogonal to parallelization focuses on accelerating single GPU performance. | ||
In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`. | ||
<h3 id="meet-xdit-perf">Computing Acceleration</h3> | ||
|
||
The overview of xDiT is shown as follows. | ||
Optimization orthogonal to parallelization focuses on accelerating single-GPU performance.
|
||
<picture> | ||
<img alt="xDiT" src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/xdit_overview.png"> | ||
</picture> | ||
First, xDiT employs a series of kernel acceleration methods. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`. | ||
|
||
Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https://github.com/thu-nics/DiTFastAttn), which exploits computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU. | ||
|
||
<h2 id="updates">📢 Updates</h2> | ||
|
||
|
@@ -262,14 +272,25 @@ We observed that a warmup of 0 had no effect on the PixArt model. | |
Users can tune this value according to their specific tasks. | ||
<h2 id="comfyui">🖼️ ComfyUI with xDiT</h2> | ||
### 4. Launch a Http Service | ||
### 1. Launch ComfyUI | ||
[Launching a Text-to-Image Http Service](./docs/developer/Http_Service.md) | ||
ComfyUI is currently the most popular way to use Diffusion Models. | ||
It provides users with a platform for image generation, supporting plugins such as LoRA, ControlNet, and IP-Adapter.
However, since ComfyUI was initially designed for personal computers with single-node, single-GPU capabilities, implementing native parallel acceleration still faces significant compatibility issues. To address this, we've used xDiT with the Ray framework to achieve seamless multi-GPU parallel adaptation on ComfyUI, significantly improving the generation speed of ComfyUI workflows. | ||
Below is an example of using xDiT to accelerate a Flux workflow with LoRA: | ||
 | ||
### 5. Launch ComfyUI | ||
Currently, if you need the xDiT parallel version for ComfyUI, please contact us via this [email]([email protected]). | ||
[Launching ComfyUI](./docs/developer/ComfyUI_xdit.md) | ||
### 2. Launch a Http Service | ||
You can also launch an HTTP service to generate images with xDiT.
[Launching a Text-to-Image Http Service](./docs/developer/Http_Service.md) | ||
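As a purely hypothetical sketch of what a request to such a service might look like (the port, endpoint, and JSON fields below are assumptions, not the documented API; see the linked doc for the real interface):

```
# Hypothetical request; the /generate endpoint and field names are assumed.
curl -X POST http://localhost:6000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "a cat holding a sign that says hello", "num_inference_steps": 20}'
```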
<h2 id="secrets">✨ The xDiT's Arsenal</h2> | ||
|
@@ -333,7 +354,10 @@ As we can see, PipeFusion and Sequence Parallel achieve lowest communication cos | |
[Patch Parallel VAE](./docs/methods/parallel_vae.md) | ||
<h3 id="compilation">Compilation Acceleration</h3> | ||
<h3 id="1gpuacc">Single GPU Acceleration</h3> | ||
<h4 id="compilation">Compilation Acceleration</h4> | ||
We utilize two compilation acceleration techniques, [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) and [onediff](https://github.com/siliconflow/onediff), to enhance runtime speed on GPUs. These compilation accelerations are used in conjunction with parallelization methods. | ||
|
@@ -347,6 +371,12 @@ pip install -U nexfort | |
For usage instructions, refer to the [example/run.sh](./examples/run.sh). Simply append `--use_torch_compile` or `--use_onediff` to your command. Note that these options are mutually exclusive, and their performance varies across different scenarios. | ||
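For instance, a command following that pattern might look like the sketch below; the script name and parallel flags are placeholders, and only the `--use_torch_compile` / `--use_onediff` switches are the point here.

```
# Append --use_torch_compile (or --use_onediff, but not both) to an existing example command.
torchrun --nproc_per_node=8 examples/pixartalpha_example.py \
    --ulysses_degree 2 \
    --pipefusion_parallel_degree 4 \
    --use_torch_compile
```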
<h4 id="dittfastattn">DiTFastAttn</h4> | ||
xDiT also provides DiTFastAttn for single-GPU acceleration. It reduces the computation cost of the attention layers by leveraging redundancies between different steps of the Diffusion Model.
[DiTFastAttn](./docs/methods/dittfastattn.md) | ||
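As a sketch only, enabling it on a single GPU might look like the command below; the `--use_fast_attn` flag name is an assumption rather than a confirmed option, and the method doc linked above (together with the `examples/run_fastditattn.sh` script added in this commit) shows the actual usage.

```
# Hypothetical single-GPU run with DiTFastAttn enabled; the flag name is assumed.
torchrun --nproc_per_node=1 examples/pixartalpha_example.py \
    --num_inference_steps 20 \
    --use_fast_attn
```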
<h2 id="dev-guide">📚 Develop Guide</h2> | ||
[The implementation and design of the xDiT framework](./docs/developer/The_implement_design_of_xdit_framework.md)
|
@@ -0,0 +1,35 @@
### DiTFastAttn

[DiTFastAttn](https://github.com/thu-nics/DiTFastAttn) is an acceleration solution for single-GPU DiT inference, utilizing Input Temporal Reduction to reduce computational complexity through the following three methods:

1. Window Attention with Residual Caching to reduce spatial redundancy.
2. Temporal Similarity Reduction to exploit the similarity between steps.
3. Conditional Redundancy Elimination to skip redundant computations during conditional generation.

Currently, DiTFastAttn can only be used with data parallelism or on a single GPU. It does not support other parallel methods such as USP and PipeFusion. We plan to implement a parallel version of DiTFastAttn in the future.

## Download COCO Dataset

```
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip annotations_trainval2014.zip
```

## Running

Modify the dataset path in the script, then run

```
bash examples/run_fastditattn.sh
```
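The exact variable to edit depends on the script's contents; the line below is only a hypothetical example of the kind of assignment to look for, pointed at one of the unzipped annotation files.

```
# Hypothetical line inside examples/run_fastditattn.sh; the variable name is assumed.
COCO_ANNOTATION_PATH="./annotations/captions_val2014.json"
```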
## Reference

```
@misc{yuan2024ditfastattn,
      title={DiTFastAttn: Attention Compression for Diffusion Transformer Models},
      author={Zhihang Yuan and Pu Lu and Hanling Zhang and Xuefei Ning and Linfeng Zhang and Tianchen Zhao and Shengen Yan and Guohao Dai and Yu Wang},
      year={2024},
      eprint={2406.08552},
      archivePrefix={arXiv},
}
```
@@ -0,0 +1,35 @@
### DiTFastAttn

[DiTFastAttn](https://github.com/thu-nics/DiTFastAttn) is an acceleration solution for single-GPU DiT inference that uses Input Temporal Reduction to reduce the amount of computation through the following three methods:

1. Window Attention with Residual Caching to reduce spatial redundancy.
2. Temporal Similarity Reduction to exploit the similarity between steps.
3. Conditional Redundancy Elimination to skip redundant computations during conditional generation.

Currently, DiTFastAttn can only run with data parallelism or on a single GPU. Other parallel methods such as USP and PipeFusion are not supported. We plan to implement a parallel version of DiTFastAttn in the future.

## Download the COCO Dataset

```
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip annotations_trainval2014.zip
```

## Running

Modify the dataset path in the script, then run

```
bash examples/run_fastditattn.sh
```

## Reference

```
@misc{yuan2024ditfastattn,
      title={DiTFastAttn: Attention Compression for Diffusion Transformer Models},
      author={Zhihang Yuan and Pu Lu and Hanling Zhang and Xuefei Ning and Linfeng Zhang and Tianchen Zhao and Shengen Yan and Guohao Dai and Yu Wang},
      year={2024},
      eprint={2406.08552},
      archivePrefix={arXiv},
}
```