Refactor cache directory and update readme #452

Merged · 2 commits · Feb 24, 2025
3 changes: 1 addition & 2 deletions .gitignore
@@ -9,5 +9,4 @@ profile/
.vscode/
xfuser.egg-info/
dist/*
-latte_output.mp4
-cache/
+*.mp4
29 changes: 8 additions & 21 deletions README.md
@@ -40,7 +40,7 @@
- [5. Parallel VAE](#parallel_vae)
- [Single GPU Acceleration](#1gpuacc)
- [Compilation Acceleration](#compilation)
-- [DiTFastAttn](#dittfastattn)
+- [Cache Acceleration](#cache_acceleration)
- [📚 Develop Guide](#dev-guide)
- [🚧 History and Looking for Contributions](#history)
- [📝 Cite Us](#cite-us)
@@ -86,26 +86,14 @@ We also have implemented the following parallel strategies for reference:

<h3 id="meet-xdit-cache">Cache Acceleration</h3>

-Cache method is inspired by work from [TeaCache](https://github.com/ali-vilab/TeaCache.git) and [ParaAttn](https://github.com/chengzeyi/ParaAttention.git); we adapted TeaCache and First-Block-Cache in xDiT.
-
-This method is not orthogonal to parallelism in xDiT. The cache function can be activated only with SP or with no parallelism.
-
-To use this functionality, activate it with `--use_teacache` or `--use_fbcache`, which enable TeaCache and First-Block-Cache respectively. Right now, this repo only supports the FLUX model.
-
-The performance shown below was tested on 4 H20 GPUs with SP=4:
-| Method | Latency |
-|----------------|--------|
-| Baseline | 2.02s |
-| use_teacache | 1.58s |
-| use_fbcache | 0.93s |
+Cache methods, including [TeaCache](https://github.com/ali-vilab/TeaCache.git), [First-Block-Cache](https://github.com/chengzeyi/ParaAttention.git), and [DiTFastAttn](https://github.com/thu-nics/DiTFastAttn), exploit computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU.
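
The added sentence summarizes the idea; as a minimal, framework-agnostic sketch of how first-block caching exploits that redundancy (the function and cache-key names and the threshold value here are illustrative, not xDiT's actual API):

```python
import torch

def forward_with_fb_cache(blocks, hidden_states, cache, rel_l1_thresh=0.12):
    """Run transformer blocks, but skip every block after the first one when
    the first block's output barely changed since the previous diffusion step."""
    first_out = blocks[0](hidden_states)
    if "prev_first_out" in cache:
        # Relative L1 change of the first block's output between adjacent steps.
        change = ((first_out - cache["prev_first_out"]).abs().mean()
                  / cache["prev_first_out"].abs().mean())
        if change < rel_l1_thresh:
            # Redundant step: reuse the cached contribution of the later blocks.
            cache["prev_first_out"] = first_out
            return first_out + cache["residual"]
    out = first_out
    for block in blocks[1:]:
        out = block(out)
    cache["prev_first_out"] = first_out
    cache["residual"] = out - first_out  # what the remaining blocks added
    return out
```

TeaCache follows the same spirit but derives its skip decision from the timestep-embedding-modulated inputs rather than from the first block's raw output.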

<h3 id="meet-xdit-perf">Computing Acceleration</h3>

Optimization is orthogonal to parallel and focuses on accelerating performance on a single GPU.

First, xDiT employs a series of kernel acceleration methods. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`.

-Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https://github.com/thu-nics/DiTFastAttn), which exploits computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU.

<h2 id="updates">📢 Updates</h2>

@@ -147,12 +135,9 @@

</div>

-### Supported by legacy version only, including DistriFusion and Tensor Parallel as the standalone parallel strategies:

<div align="center">
+[🔴 DiT-XL](https://huggingface.co/facebook/DiT-XL-2-256) is supported by the legacy version only, with DistriFusion and Tensor Parallel as standalone parallel strategies.

-[🔴 DiT-XL](https://huggingface.co/facebook/DiT-XL-2-256)
</div>


<h2 id="comfyui">🖼️ TACO-DiT: ComfyUI with xDiT</h2>
@@ -297,9 +282,9 @@ You can also launch an HTTP service to generate images with xDiT.

<h2 id="dev-guide">📚 Develop Guide</h2>

-We provide different difficulty levels for adding new models, please refer to the following tutorial.
+We provide a step-by-step guide for adding new models; please refer to the following tutorial.

-[Manual for adding new models](./docs/developer/adding_models/readme.md)
+[Apply xDiT to new models](./docs/developer/adding_models/readme.md)

A high-level design of xDiT framework is provided below, which may help you understand the xDiT framework.

@@ -382,8 +367,10 @@ pip install -U nexfort

For usage instructions, refer to the [example/run.sh](./examples/run.sh). Simply append `--use_torch_compile` or `--use_onediff` to your command. Note that these options are mutually exclusive, and their performance varies across different scenarios.

<h4 id="cache_acceleration">Cache Acceleration</h4>

<h4 id="dittfastattn">DiTFastAttn</h4>
You can use `--use_teacache` or `--use_fbcache` in examples/run.sh, which applies TeaCache and First-Block-Cache respectively.
Note, cache method is only supported for FLUX model with USP. It is currently not applicable for PipeFusion.

xDiT also provides DiTFastAttn for single GPU acceleration. It can reduce the computation cost of attention layers by leveraging redundancies between different steps of the Diffusion Model.
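
As a rough illustration of one DiTFastAttn technique (attention sharing across timesteps), the sketch below wraps an attention module and reuses its cached output on steps where recomputation is redundant. This is a conceptual sketch with assumed names, not DiTFastAttn's actual implementation, which calibrates the reuse schedule offline and also applies windowed attention with residual sharing:

```python
import torch

class StepSharedAttention(torch.nn.Module):
    """Reuse an attention layer's output across diffusion steps (illustrative)."""

    def __init__(self, attn: torch.nn.Module, reuse_steps: set):
        super().__init__()
        self.attn = attn
        self.reuse_steps = reuse_steps  # steps on which the cache is reused
        self.cached = None

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        if step in self.reuse_steps and self.cached is not None:
            return self.cached  # skip the full attention computation this step
        self.cached = self.attn(x)
        return self.cached
```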

2 changes: 1 addition & 1 deletion docs/developer/adding_models/readme.md
@@ -1,4 +1,4 @@
-# Parallelize New Models with xDiT
+# Apply xDiT to new models

xDiT was initially developed to accelerate the inference process of Diffusion Transformers (DiTs) within Huggingface `diffusers`. However, with the rapid emergence of various DiT models, you may find yourself needing to support new models that xDiT hasn't yet accommodated or models that are not officially supported by `diffusers` at all.

17 changes: 16 additions & 1 deletion docs/performance/flux.md
@@ -105,7 +105,7 @@

By leveraging Parallel VAE, xDiT is able to demonstrate its capability for generating images at higher resolutions, enabling us to produce images with even greater detail and clarity. Applying `--use_parallel_vae` in the [running script](../../examples/run.sh).

prompt is "A hyperrealistic portrait of a weathered sailor in his 60s, with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin. Hes wearing a faded blue captains hat and a thick wool sweater. The background shows a misty harbor at dawn, with fishing boats barely visible in the distance."
prompt is "A hyperrealistic portrait of a weathered sailor in his 60s, with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin. He's wearing a faded blue captain's hat and a thick wool sweater. The background shows a misty harbor at dawn, with fishing boats barely visible in the distance."

The quality of image generation at 2048px, 3072px, and 4096px resolutions is as follows. It is evident that the quality of the 4096px generated images is significantly lower.

@@ -114,3 +114,18 @@
alt="latency-flux_l40">
</div>


+## Cache Methods
+
+We tested the performance of TeaCache and First-Block-Cache on 4xH20 GPUs with SP=4.
+The performance is shown below:
+
+<div align="center">
+
+| Method | Latency (s) |
+|----------------|-------------|
+| Baseline | 2.02 |
+| use_teacache | 1.58 |
+| use_fbcache | 0.93 |
+
+</div>
4 changes: 2 additions & 2 deletions examples/flux_example.py
@@ -16,7 +16,7 @@
get_tensor_model_parallel_world_size,
get_data_parallel_world_size,
)

+from xfuser.model_executor.cache.diffusers_adapters import apply_cache_on_transformer

def main():
parser = FlexibleArgumentParser(description="xFuser Arguments")
@@ -49,7 +49,7 @@ def main():
parameter_peak_memory = torch.cuda.max_memory_allocated(device=f"cuda:{local_rank}")

pipe.prepare_run(input_config, steps=1)
-from xfuser.model_executor.plugins.cache_.diffusers_adapters import apply_cache_on_transformer

use_cache = engine_args.use_teacache or engine_args.use_fbcache
if (use_cache
and get_pipeline_parallel_world_size() == 1
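
The condition above is truncated in the diff view; for orientation, here is a hedged reconstruction of how the adapter might be applied. The keyword arguments beyond `rel_l1_thresh` and `num_steps`, the selector value, and the threshold are assumptions for illustration, not the PR's verbatim code:

```python
# Assumed continuation of the truncated snippet: apply the cache adapter only
# when neither pipeline nor tensor parallelism is active.
if (use_cache
        and get_pipeline_parallel_world_size() == 1
        and get_tensor_model_parallel_world_size() == 1):
    pipe.transformer = apply_cache_on_transformer(
        pipe.transformer,
        rel_l1_thresh=0.6,  # hypothetical threshold; tune per model
        use_cache="Fb" if engine_args.use_fbcache else "Tea",  # hypothetical selector
        num_steps=input_config.num_inference_steps,
    )
```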
@@ -4,7 +4,7 @@
"""
import importlib
from typing import Type, Dict, TypeVar
-from xfuser.model_executor.plugins.cache_.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY
+from xfuser.model_executor.cache.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY


def apply_cache_on_transformer(transformer, *args, **kwargs):
@@ -8,9 +8,9 @@
import torch
from torch import nn
from diffusers import DiffusionPipeline, FluxTransformer2DModel
-from xfuser.model_executor.plugins.cache_.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY
+from xfuser.model_executor.cache.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY

-from xfuser.model_executor.plugins.cache_ import utils
+from xfuser.model_executor.cache import utils

def create_cached_transformer_blocks(use_cache, transformer, rel_l1_thresh, return_hidden_states_first, num_steps):
cached_transformer_class = {
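
The two adapter files above rely on a registry that maps a transformer class to its adapter module. Here is a self-contained sketch of that dispatch pattern; the registration helper and the dict shape are assumed rather than taken from xDiT:

```python
import importlib
from typing import Dict, Type

# Assumed shape: maps a transformer class to the module path of its adapter.
TRANSFORMER_ADAPTER_REGISTRY: Dict[Type, str] = {}

def register_transformer_adapter(transformer_cls: Type, module_path: str) -> None:
    """Hypothetical registration helper."""
    TRANSFORMER_ADAPTER_REGISTRY[transformer_cls] = module_path

def apply_cache_on_transformer(transformer, *args, **kwargs):
    """Dispatch to the adapter registered for this transformer's class."""
    module_path = TRANSFORMER_ADAPTER_REGISTRY.get(type(transformer))
    if module_path is None:
        raise ValueError(f"No cache adapter registered for {type(transformer).__name__}")
    adapter = importlib.import_module(module_path)
    return adapter.apply_cache_on_transformer(transformer, *args, **kwargs)
```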
4 changes: 0 additions & 4 deletions xfuser/model_executor/plugins/cache_/__init__.py

This file was deleted.
