Refactor cache directory and update readme #452

Merged · 2 commits · Feb 24, 2025
3 changes: 1 addition & 2 deletions .gitignore
@@ -9,5 +9,4 @@ profile/
.vscode/
xfuser.egg-info/
dist/*
-latte_output.mp4
-cache/
+*.mp4
29 changes: 8 additions & 21 deletions README.md
@@ -40,7 +40,7 @@
- [5. Parallel VAE](#parallel_vae)
- [Single GPU Acceleration](#1gpuacc)
- [Compilation Acceleration](#compilation)
-- [DiTFastAttn](#dittfastattn)
+- [Cache Acceleration](#cache_acceleration)
- [📚 Develop Guide](#dev-guide)
- [🚧 History and Looking for Contributions](#history)
- [📝 Cite Us](#cite-us)
@@ -86,26 +86,14 @@ We also have implemented the following parallel strategies for reference:

<h3 id="meet-xdit-cache">Cache Acceleration</h3>

-Cache method is inspired by work from [TeaCache](https://github.com/ali-vilab/TeaCache.git) and [ParaAttn](https://github.com/chengzeyi/ParaAttention.git); we adapted TeaCache and First-Block-Cache in xDiT.
-
-This method is not orthogonal to parallelism in xDiT. The cache function can be activated only with SP or with no parallelism.
-
-To use this functionality, activate it with `--use_teacache` or `--use_fbcache`, which enable TeaCache and First-Block-Cache respectively. Right now, this repo only supports the FLUX model.
-
-The performance shown below was tested on 4 H20 GPUs with SP=4:
-| Method | Latency |
-|----------------|--------|
-| Baseline | 2.02s |
-| use_teacache | 1.58s |
-| use_fbcache | 0.93s |
+Cache methods, including [TeaCache](https://github.com/ali-vilab/TeaCache.git), [First-Block-Cache](https://github.com/chengzeyi/ParaAttention.git), and [DiTFastAttn](https://github.com/thu-nics/DiTFastAttn), exploit computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU.
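
The added sentence summarizes the idea; as a minimal, framework-agnostic sketch of how first-block caching exploits that redundancy (the function and cache-key names and the threshold value here are illustrative, not xDiT's actual API):

```python
import torch

def forward_with_fb_cache(blocks, hidden_states, cache, rel_l1_thresh=0.12):
    """Run transformer blocks, but skip every block after the first one when
    the first block's output barely changed since the previous diffusion step."""
    first_out = blocks[0](hidden_states)
    if "prev_first_out" in cache:
        # Relative L1 change of the first block's output between adjacent steps.
        change = ((first_out - cache["prev_first_out"]).abs().mean()
                  / cache["prev_first_out"].abs().mean())
        if change < rel_l1_thresh:
            # Redundant step: reuse the cached contribution of the later blocks.
            cache["prev_first_out"] = first_out
            return first_out + cache["residual"]
    out = first_out
    for block in blocks[1:]:
        out = block(out)
    cache["prev_first_out"] = first_out
    cache["residual"] = out - first_out  # what the remaining blocks added
    return out
```

TeaCache follows the same spirit but derives its skip decision from the timestep-embedding-modulated inputs rather than from the first block's raw output.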

<h3 id="meet-xdit-perf">Computing Acceleration</h3>

Optimization is orthogonal to parallel and focuses on accelerating performance on a single GPU.

First, xDiT employs a series of kernel acceleration methods. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`.

-Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https://github.com/thu-nics/DiTFastAttn), which exploits computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU.

<h2 id="updates">📢 Updates</h2>

@@ -147,12 +135,9 @@

</div>

-### Supported by legacy version only, including DistriFusion and Tensor Parallel as the standalone parallel strategies:

<div align="center">
+[🔴 DiT-XL](https://huggingface.co/facebook/DiT-XL-2-256) is supported by the legacy version only, with DistriFusion and Tensor Parallel as standalone parallel strategies.

-[🔴 DiT-XL](https://huggingface.co/facebook/DiT-XL-2-256)
</div>


<h2 id="comfyui">🖼️ TACO-DiT: ComfyUI with xDiT</h2>
@@ -297,9 +282,9 @@ You can also launch an HTTP service to generate images with xDiT.

<h2 id="dev-guide">📚 Develop Guide</h2>

-We provide different difficulty levels for adding new models, please refer to the following tutorial.
+We provide a step-by-step guide for adding new models; please refer to the following tutorial.

-[Manual for adding new models](./docs/developer/adding_models/readme.md)
+[Apply xDiT to new models](./docs/developer/adding_models/readme.md)

A high-level design of xDiT framework is provided below, which may help you understand the xDiT framework.

@@ -382,8 +367,10 @@ pip install -U nexfort

For usage instructions, refer to the [example/run.sh](./examples/run.sh). Simply append `--use_torch_compile` or `--use_onediff` to your command. Note that these options are mutually exclusive, and their performance varies across different scenarios.

<h4 id="cache_acceleration">Cache Acceleration</h4>

<h4 id="dittfastattn">DiTFastAttn</h4>
You can use `--use_teacache` or `--use_fbcache` in examples/run.sh, which applies TeaCache and First-Block-Cache respectively.
Note, cache method is only supported for FLUX model with USP. It is currently not applicable for PipeFusion.

xDiT also provides DiTFastAttn for single GPU acceleration. It can reduce the computation cost of attention layers by leveraging redundancies between different steps of the Diffusion Model.
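
As a rough illustration of one DiTFastAttn technique (attention sharing across timesteps), the sketch below wraps an attention module and reuses its cached output on steps where recomputation is redundant. This is a conceptual sketch with assumed names, not DiTFastAttn's actual implementation, which calibrates the reuse schedule offline and also applies windowed attention with residual sharing:

```python
import torch

class StepSharedAttention(torch.nn.Module):
    """Reuse an attention layer's output across diffusion steps (illustrative)."""

    def __init__(self, attn: torch.nn.Module, reuse_steps: set):
        super().__init__()
        self.attn = attn
        self.reuse_steps = reuse_steps  # steps on which the cache is reused
        self.cached = None

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        if step in self.reuse_steps and self.cached is not None:
            return self.cached  # skip the full attention computation this step
        self.cached = self.attn(x)
        return self.cached
```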

2 changes: 1 addition & 1 deletion docs/developer/adding_models/readme.md
@@ -1,4 +1,4 @@
-# Parallelize New Models with xDiT
+# Apply xDiT to new models

xDiT was initially developed to accelerate the inference process of Diffusion Transformers (DiTs) within Huggingface `diffusers`. However, with the rapid emergence of various DiT models, you may find yourself needing to support new models that xDiT hasn't yet accommodated or models that are not officially supported by `diffusers` at all.

17 changes: 16 additions & 1 deletion docs/performance/flux.md
@@ -105,7 +105,7 @@

By leveraging Parallel VAE, xDiT is able to demonstrate its capability for generating images at higher resolutions, enabling us to produce images with even greater detail and clarity. Applying `--use_parallel_vae` in the [running script](../../examples/run.sh).

prompt is "A hyperrealistic portrait of a weathered sailor in his 60s, with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin. Hes wearing a faded blue captains hat and a thick wool sweater. The background shows a misty harbor at dawn, with fishing boats barely visible in the distance."
prompt is "A hyperrealistic portrait of a weathered sailor in his 60s, with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin. He's wearing a faded blue captain's hat and a thick wool sweater. The background shows a misty harbor at dawn, with fishing boats barely visible in the distance."

The quality of image generation at 2048px, 3072px, and 4096px resolutions is as follows. It is evident that the quality of the 4096px generated images is significantly lower.

@@ -114,3 +114,18 @@
alt="latency-flux_l40">
</div>


+## Cache Methods
+
+We tested the performance of TeaCache and First-Block-Cache on 4xH20 GPUs with SP=4.
+The performance is shown below:
+
+<div align="center">
+
+| Method | Latency (s) |
+|----------------|-------------|
+| Baseline | 2.02 |
+| use_teacache | 1.58 |
+| use_fbcache | 0.93 |
+
+</div>
4 changes: 2 additions & 2 deletions examples/flux_example.py
@@ -16,7 +16,7 @@
get_tensor_model_parallel_world_size,
get_data_parallel_world_size,
)

+from xfuser.model_executor.cache.diffusers_adapters import apply_cache_on_transformer

def main():
parser = FlexibleArgumentParser(description="xFuser Arguments")
@@ -49,7 +49,7 @@ def main():
parameter_peak_memory = torch.cuda.max_memory_allocated(device=f"cuda:{local_rank}")

pipe.prepare_run(input_config, steps=1)
-from xfuser.model_executor.plugins.cache_.diffusers_adapters import apply_cache_on_transformer

use_cache = engine_args.use_teacache or engine_args.use_fbcache
if (use_cache
and get_pipeline_parallel_world_size() == 1
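
The condition above is truncated in the diff view; for orientation, here is a hedged reconstruction of how the adapter might be applied. The keyword arguments beyond `rel_l1_thresh` and `num_steps`, the selector value, and the threshold are assumptions for illustration, not the PR's verbatim code:

```python
# Assumed continuation of the truncated snippet: apply the cache adapter only
# when neither pipeline nor tensor parallelism is active.
if (use_cache
        and get_pipeline_parallel_world_size() == 1
        and get_tensor_model_parallel_world_size() == 1):
    pipe.transformer = apply_cache_on_transformer(
        pipe.transformer,
        rel_l1_thresh=0.6,  # hypothetical threshold; tune per model
        use_cache="Fb" if engine_args.use_fbcache else "Tea",  # hypothetical selector
        num_steps=input_config.num_inference_steps,
    )
```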
@@ -4,7 +4,7 @@
"""
import importlib
from typing import Type, Dict, TypeVar
-from xfuser.model_executor.plugins.cache_.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY
+from xfuser.model_executor.cache.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY


def apply_cache_on_transformer(transformer, *args, **kwargs):
@@ -8,9 +8,9 @@
import torch
from torch import nn
from diffusers import DiffusionPipeline, FluxTransformer2DModel
-from xfuser.model_executor.plugins.cache_.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY
+from xfuser.model_executor.cache.diffusers_adapters.registry import TRANSFORMER_ADAPTER_REGISTRY

-from xfuser.model_executor.plugins.cache_ import utils
+from xfuser.model_executor.cache import utils

def create_cached_transformer_blocks(use_cache, transformer, rel_l1_thresh, return_hidden_states_first, num_steps):
cached_transformer_class = {
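
The two adapter files above rely on a registry that maps a transformer class to its adapter module. Here is a self-contained sketch of that dispatch pattern; the registration helper and the dict shape are assumed rather than taken from xDiT:

```python
import importlib
from typing import Dict, Type

# Assumed shape: maps a transformer class to the module path of its adapter.
TRANSFORMER_ADAPTER_REGISTRY: Dict[Type, str] = {}

def register_transformer_adapter(transformer_cls: Type, module_path: str) -> None:
    """Hypothetical registration helper."""
    TRANSFORMER_ADAPTER_REGISTRY[transformer_cls] = module_path

def apply_cache_on_transformer(transformer, *args, **kwargs):
    """Dispatch to the adapter registered for this transformer's class."""
    module_path = TRANSFORMER_ADAPTER_REGISTRY.get(type(transformer))
    if module_path is None:
        raise ValueError(f"No cache adapter registered for {type(transformer).__name__}")
    adapter = importlib.import_module(module_path)
    return adapter.apply_cache_on_transformer(transformer, *args, **kwargs)
```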
4 changes: 0 additions & 4 deletions xfuser/model_executor/plugins/cache_/__init__.py

This file was deleted.
