update docs and remove figure (#287)
xibosun authored Sep 24, 2024
1 parent 6cf25eb commit 9675e39
Showing 54 changed files with 142 additions and 103 deletions.
16 changes: 8 additions & 8 deletions .github/workflows/build.yml
@@ -23,20 +23,20 @@ jobs:
continue-on-error: true
steps:
- name: unzip
run: rm -rf ~/xDiT
- run: mkdir ~/xDiT
- run: unzip ~/xDiT.zip -d ~/xDiT
run: rm -rf ~/xDiT_${{github.run_number}}
- run: mkdir ~/xDiT_${{github.run_number}}
- run: unzip ~/xDiT.zip -d ~/xDiT_${{github.run_number}}
- name: Setup docker
run: docker run --rm --name xfuser_test_docker_${{github.repository_owner_id}} -d -i -t --runtime=nvidia --gpus all -v /cfs:/cfs -v /mnt:/mnt -v ~/xDiT:/code xfuser_cicd/test-py_3_11-torch_2_4_1 /bin/bash
run: docker run --rm --name xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} -d -i -t --runtime=nvidia --gpus all -v /cfs:/cfs -v /mnt:/mnt -v ~/xDiT_${{github.run_number}}:/code xfuser_cicd/test-py_3_11-torch_2_4_1 /bin/bash
- name: Install xfuser
run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}} pip3.11 install -e .
run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} pip3.11 install -e .
- name: Test xfuser
run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}} sh -c "torchrun --nproc_per_node=8 ./examples/sd3_example.py --model /cfs/dit/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1 --height 1024 --width 1024 --no_use_resolution_binning --num_inference_steps 20 --warmup_steps 0 --prompt 'A small dog' --use_cfg_parallel"
run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} sh -c "torchrun --nproc_per_node=8 ./examples/sd3_example.py --model /cfs/dit/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1 --height 1024 --width 1024 --no_use_resolution_binning --num_inference_steps 20 --warmup_steps 0 --prompt 'A small dog' --use_cfg_parallel"
clear-env:
needs: setup-env-and-test
runs-on: [self-hosted, linux, x64]
steps:
- name: Remove Files
run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}} sh -c "rm -r *"
run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} sh -c "rm -r *"
- name: Destroy docker
run: docker stop xfuser_test_docker_${{github.repository_owner_id}}
run: docker stop xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}}
61 changes: 31 additions & 30 deletions README.md
@@ -4,7 +4,6 @@

<picture>
<img alt="xDiT" src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/XDiTlogo.png" width="50%">
</picture>

</p>
<h3>A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters</h3>
@@ -76,13 +75,13 @@ In addition to utilizing well-known Attention optimization libraries, we leverag
The overview of xDiT is shown as follows.

<picture>
<img alt="xDiT" src="./assets/methods/xdit_overview.png">
<img alt="xDiT" src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/xdit_overview.png">
</picture>


<h2 id="updates">📢 Updates</h2>

* ⚙️**August 30, 2024**: Supporting(WIP) CogVideoX. The inference scripts are [examples/latte_example](examples/cogvideox_example.py).
* 🎉**September 23, 2024**: Support CogVideoX. The inference scripts are [examples/cogvideox_example](examples/cogvideox_example.py).
* 🎉**August 26, 2024**: We apply torch.compile and [onediff](https://github.com/siliconflow/onediff) nexfort backend to accelerate GPU kernels speed.
* 🎉**August 9, 2024**: Support Latte sequence parallel version. The inference scripts are [examples/latte_example](examples/latte_example.py).
* 🎉**August 8, 2024**: Support Flux sequence parallel version. The inference scripts are [examples/flux_example](examples/flux_example.py).
@@ -100,7 +99,7 @@ The overview of xDiT is shown as follows.

| Model Name | CFG | SP | PipeFusion |
| --- | --- | --- | --- |
| [🎬 CogVideoX](https://huggingface.co/THUDM/CogVideoX-2b) | | ||
| [🎬 CogVideoX](https://huggingface.co/THUDM/CogVideoX-2b) | ✔️ | ✔️ ||
| [🎬 Latte](https://huggingface.co/maxin-cn/Latte-1) || ✔️ ||
| [🔵 HunyuanDiT-v1.2-Diffusers](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers) | ✔️ | ✔️ | ✔️ |
| [🟠 Flux](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | NA | ✔️ ||
@@ -119,25 +118,29 @@ The overview of xDiT is shown as follows.

<h2 id="perf">📈 Performance</h2>

<h3 id="perf_flux">CogVideo</h3>

1. [CogVideo Performance Report](./docs/performance/cogvideo.md)

<h3 id="perf_flux">Flux.1</h3>

1. [Flux Performance Report](./docs/performance/flux.md)
2. [Flux Performance Report](./docs/performance/flux.md)

<h3 id="perf_latte">Latte</h3>

3. [Latte Performance Report](./docs/performance/latte.md)

<h3 id="perf_hunyuandit">HunyuanDiT</h3>

2. [HunyuanDiT Performance Report](./docs/performance/hunyuandit.md)
4. [HunyuanDiT Performance Report](./docs/performance/hunyuandit.md)

<h3 id="perf_sd3">SD3</h3>

3. [Stable Diffusion 3 Performance Report](./docs/performance/sd3.md)
5. [Stable Diffusion 3 Performance Report](./docs/performance/sd3.md)

<h3 id="perf_pixart">Pixart</h3>

4. [Pixart-Alpha Performance Report (legacy)](./docs/performance/pixart_alpha_legacy.md)

<h3 id="perf_latte">Pixart</h3>

5. [Latte Performance Report](./docs/performance/latte.md)
6. [Pixart-Alpha Performance Report (legacy)](./docs/performance/pixart_alpha_legacy.md)


<h2 id="QuickStart">🚀 QuickStart</h2>
@@ -282,34 +285,32 @@ For the VAE module, xDiT offers a parallel implementation, [DistVAE](https://git
The (<span style="color: red;">xDiT</span>) highlights the methods first proposed by us.
<div align="center">
<img src="assets/methods/xdit_method.png" alt="xdit methods">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/xdit_method.png" alt="xdit methods">
</div>
The communication and memory costs associated with the aforementioned intra-image parallelism, except for the CFG and DP (they are inter-image parallel), in DiTs are detailed in the table below. (* denotes that communication can be overlapped with computation.)
As we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost on different scales and hardware configurations, making them suitable foundational components for a hybrid approach.
𝒑: Number of pixels;
𝒉𝒔: Model hidden size;
𝑳: Number of model layers;
𝑷: Total model parameters;
𝑵: Number of parallel devices;
𝑴: Number of patch splits;
𝑸𝑶: Query and Output parameter count;
𝑲𝑽: KV Activation parameter count;
𝒑: Number of pixels;\
𝒉𝒔: Model hidden size;\
𝑳: Number of model layers;\
𝑷: Total model parameters;\
𝑵: Number of parallel devices;\
𝑴: Number of patch splits;\
𝑸𝑶: Query and Output parameter count;\
𝑲𝑽: KV Activation parameter count;\
𝑨 = 𝑸 = 𝑶 = 𝑲 = 𝑽: Equal parameters for Attention, Query, Output, Key, and Value;
<div align="center">
| | attn-KV | communication cost | param memory | activations memory | extra buff memory |
|:--------:|:-------:|:-----------------:|:-----:|:-----------:|:----------:|
| Tensor Parallel | fresh | $4O(p \times hs)L$ | $\frac{1}{N}P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| DistriFusion* | stale | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $2AL = (KV)L$ |
| Ring Sequence Parallel* | fresh | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| Ulysses Sequence Parallel | fresh | $\frac{4}{N}O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| PipeFusion* | stale- | $2O(p \times hs)$ | $\frac{1}{N}P$ | $\frac{2}{M}A = \frac{1}{M}QO$ | $\frac{2L}{N}A = \frac{1}{N}(KV)L$ |
| | attn-KV | communication cost | param memory | activations memory | extra buff memory |
|:-------------------------:|:-------:|:----------------------------:|:--------------:|:------------------------------:|:----------------------------------:|
| Tensor Parallel | fresh | $4O(p \times hs)L$ | $\frac{1}{N}P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| DistriFusion* | stale | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $2AL = (KV)L$ |
| Ring Sequence Parallel* | fresh | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| Ulysses Sequence Parallel | fresh | $\frac{4}{N}O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| PipeFusion* | stale- | $2O(p \times hs)$ | $\frac{1}{N}P$ | $\frac{2}{M}A = \frac{1}{M}QO$ | $\frac{2L}{N}A = \frac{1}{N}(KV)L$ |
</div>
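
To make the communication-cost column easier to compare, the following Python sketch evaluates each entry up to the constant hidden in $O(p \times hs)$. The function name `relative_comm_cost` and the sample values of `L` and `N` are illustrative assumptions and are not part of xDiT; only the coefficients come from the table.

```python
from fractions import Fraction


def relative_comm_cost(num_layers: int, num_devices: int) -> dict:
    """Communication cost per method, in units of O(p x hs).

    Coefficients are copied from the table above. The constant hidden
    inside O(p x hs) is treated as 1, so only the ratios are meaningful.
    """
    L, N = num_layers, num_devices
    return {
        "Tensor Parallel": Fraction(4 * L),
        "DistriFusion": Fraction(2 * L),
        "Ring Sequence Parallel": Fraction(2 * L),
        "Ulysses Sequence Parallel": Fraction(4 * L, N),
        "PipeFusion": Fraction(2),
    }


# Hypothetical model depth and device count, chosen only for illustration.
costs = relative_comm_cost(num_layers=28, num_devices=8)
for method, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{method:26s} {float(cost):8.1f}")
```

Under these assumed values, PipeFusion is the only entry that does not grow with the number of layers, and the Ulysses cost shrinks as devices are added.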
<h4 id="PipeFusion">1.1. PipeFusion</h4>
Binary file removed assets/XDiTlogo.png
Binary file not shown.
Binary file removed assets/developer/class_structure.png
Binary file not shown.
Binary file removed assets/image_quality.png
Binary file not shown.
Binary file removed assets/latency-A100-NVLink.png
Binary file not shown.
Binary file removed assets/latency-A100-PCIe.png
Binary file not shown.
Binary file removed assets/latency-L20.jpg
Binary file not shown.
Binary file removed assets/latency-L20.png
Binary file not shown.
Binary file removed assets/latency-T4.png
Binary file not shown.
Binary file removed assets/methods/hybrid_pp_scheme.png
Binary file not shown.
Binary file removed assets/methods/hybrid_workflow.png
Binary file not shown.
Binary file removed assets/methods/kvbuffer_hybrid.png
Binary file not shown.
Binary file removed assets/methods/patchvaeconv.png
Binary file not shown.
Binary file removed assets/methods/xFuserSP.png
Binary file not shown.
Binary file removed assets/methods/xdit_method.png
Binary file not shown.
Binary file removed assets/methods/xdit_overview.png
Binary file not shown.
Binary file removed assets/overview.png
Binary file not shown.
Binary file removed assets/performance/flux/Flux-1K-A100.png
Binary file not shown.
Binary file removed assets/performance/flux/Flux-1k-L40.png
Binary file not shown.
Binary file removed assets/performance/flux/Flux-2K-A100.png
Binary file not shown.
Binary file removed assets/performance/flux/Flux-2k-L40.png
Binary file not shown.
Binary file removed assets/performance/flux/flux_a100.jpg
Binary file not shown.
Binary file removed assets/performance/flux/flux_image.png
Binary file not shown.
Binary file removed assets/performance/flux/flux_l40.jpg
Binary file not shown.
Binary file removed assets/performance/hunuyuandit/A100-HunyuanDiT.png
Binary file not shown.
Binary file removed assets/performance/hunuyuandit/L40-HunyuanDiT.png
Binary file not shown.
Binary file removed assets/performance/hunuyuandit/T4-HunyuanDiT.png
Diff not rendered.
Binary file removed assets/performance/hunuyuandit/V100-HunyuanDiT.png
Diff not rendered.
Binary file removed assets/performance/latte/Latte-L20-1024.png
Diff not rendered.
Binary file removed assets/performance/latte/Latte-L20-512.png
Diff not rendered.
Binary file removed assets/performance/sd3/A100-SD3.png
Diff not rendered.
Binary file removed assets/performance/sd3/L40-SD3.png
Diff not rendered.
Binary file removed assets/workflow.png
Diff not rendered.
2 changes: 1 addition & 1 deletion comfyui-xdit/README.md
@@ -55,4 +55,4 @@ python main.py

You can load the default workflow in the comfyui-xdit/workflows folder: xdit-comfyui-demo.json

![demo](https://raw.githubusercontent.com/xdit-project/comfyui/demo.png)
![demo](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/comfyui/demo.png)
2 changes: 1 addition & 1 deletion docs/developer/Manual_for_Adding_New_Models.md
@@ -14,7 +14,7 @@

The following diagram shows the calling relationships between different Classes in the xDiT project. If a new PixArt model is added, the added classes are circled in red.

![class_structure.png](../../assets/developer/class_structure.png)
![class_structure.png](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/developer/class_structure.png)

# 1. Pipeline Class

2 changes: 1 addition & 1 deletion docs/developer/The_implement_design_of_xdit_framework.md
@@ -50,7 +50,7 @@ In the xDiT framework, a wrapper approach is used to modify the required classes

The organizational structure of model-related files in xDiT is as follows:

![class_structure.png](../../assets/developer/class_structure.png)
![class_structure.png](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/developer/class_structure.png)

- All model classes inherit from the base class `xFuserBaseWrapper`, which provides basic features such as getattr and runtime condition checking.
- Four classes representing different model components inherit from `xFuserBaseWrapper`, including:
@@ -48,7 +48,7 @@ xfuser/model_executor

The organizational structure of model-related files in xDiT is as follows:

![class_structure.png](../../assets/developer/class_structure.png)
![class_structure.png](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/developer/class_structure.png)

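- All model classes inherit from the base class `xFuserBaseWrapper`, which provides basic features such as getattr and runtime condition checking.
- Four classes representing different model components inherit from `xFuserBaseWrapper`, including: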
6 changes: 3 additions & 3 deletions docs/methods/hybrid.md
@@ -11,7 +11,7 @@ PipeFusion leverages the characteristic of Input Temporal Redundancy, using Stal
We elaborate on this issue with the following illustration, which shows a mixed parallel method with pipe_degree=4 and sp_degree=2. Setting `num_pipeline_patch`=4, the image is divided into M=`num_pipeline_patch*sp_degree`=8 patches, labeled P0~P7.

<div align="center">
<img src="../../assets/methods/hybrid_pp_scheme.png" alt="hybrid process group config" width="60%">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_pp_scheme.png" alt="hybrid process group config" width="60%">
</div>
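
The patch count in this example can be reproduced with a short Python sketch. The interleaved assignment of patches to SP ranks (even patches on rank 0, odd patches on rank 1) is an assumption inferred from the surrounding text, which says device 0 cannot obtain the fresh KV of P1, P3, P5, P7. The function name `assign_patches` is hypothetical.

```python
def assign_patches(num_pipeline_patch: int, sp_degree: int) -> dict:
    """Map each SP rank to the patch indices it computes.

    Assumes an interleaved layout: rank r handles patches r, r + sp_degree, ...
    """
    total = num_pipeline_patch * sp_degree  # M in the text
    return {rank: list(range(rank, total, sp_degree)) for rank in range(sp_degree)}


assignment = assign_patches(num_pipeline_patch=4, sp_degree=2)
for rank, patches in assignment.items():
    print(f"SP rank {rank}: " + ", ".join(f"P{p}" for p in patches))
```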

In the implementation of Standard SP Attention, the inputs Q, K, V, and the output O are all split along the sequence dimension, with a consistent splitting pattern.
@@ -21,11 +21,11 @@ Within this diffusion step, device=0 cannot obtain the fresh KV of P1,3,5,7 for
Standard SP only has 1/sp_degree of the fresh KV buffer, so it cannot achieve the correct results for mixed parallel inference.

<div align="center">
<img src="../../assets/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
</div>

xDiT has customized the implementation of sequence parallelism to meet this mixed parallel requirement. xDiT uses `xFuserLongContextAttention` to store the intermediate results of SP in the KV Buffer. The effect is illustrated in the figure, where after each micro-step SP execution, the fresh KV of different rank devices within the SP Group is replicated. This way, after one diffusion step, the KV Buffer of all devices in the SP Group is updated to the latest, ready for use in the next Diffusion Step.

<div align="center">
<img src="../../assets/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
</div>
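
The difference between the two KV Buffer behaviours can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: it tracks which patch indices have fresh KV on each SP rank, it does not call `xFuserLongContextAttention`, and the function `run_diffusion_step` and its interleaved patch layout are hypothetical.

```python
def run_diffusion_step(sp_degree: int, num_patches: int, replicate: bool) -> list:
    """Return, per SP rank, the set of patches whose KV is fresh after one step.

    Toy model: in each micro-step, rank r computes fresh KV for one patch.
    With replicate=True, the fresh KV of every rank in the SP group is copied
    to all ranks after each micro-step. Assumes num_patches is a multiple
    of sp_degree.
    """
    fresh = [set() for _ in range(sp_degree)]
    for start in range(0, num_patches, sp_degree):  # one micro-step
        produced = {rank: start + rank for rank in range(sp_degree)}
        for rank, patch in produced.items():
            fresh[rank].add(patch)
        if replicate:
            for rank in range(sp_degree):
                fresh[rank].update(produced.values())
    return fresh


standard = run_diffusion_step(sp_degree=2, num_patches=8, replicate=False)
hybrid = run_diffusion_step(sp_degree=2, num_patches=8, replicate=True)
print("standard SP:", standard)
print("with replication:", hybrid)
```

Without replication each rank ends the step holding only 1/sp_degree of the fresh KV. With replication every rank holds all of it, which is what the next diffusion step needs.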
6 changes: 3 additions & 3 deletions docs/methods/hybrid_zh.md
@@ -10,13 +10,13 @@ PipeFusion exploits Input Temporal Redundancy, using stale KV
We elaborate on this issue with the following illustration, which shows a mixed parallel method with pipe_degree=4 and sp_degree=2. Setting `num_pipeline_patch`=4, the image is divided into M=`num_pipeline_patch*sp_degree`=8 patches, labeled P0~P7.

<div align="center">
<img src="../../assets/methods/hybrid_pp_scheme.png" alt="hybrid process group config" width="60%">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_pp_scheme.png" alt="hybrid process group config" width="60%">
</div>

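In the Standard SP Attention implementation, the inputs Q, K, V and the output O are all split along the sequence dimension, with a consistent splitting pattern. If the input patches of different ranks do not overlap, the positions updated with fresh KV in each micro-step do not overlap across ranks either. As shown in the figure below, in the KV Buffer of standard SP the yellow part is the fresh KV owned by SP0 rank=0 and the green part is the fresh KV owned by SP1 rank=1; the two are not the same. Within this diffusion step, device=0 cannot obtain the fresh KV of P1,3,5,7 for computation, but PipeFusion needs the complete KV of the previous diffusion step in the next diffusion step. Standard SP holds only 1/sp_degree of the fresh KV buffer, so it cannot produce correct results for mixed parallel inference.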

<div align="center">
<img src="../../assets/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
</div>


@@ -25,5 +25,5 @@ xDiT has customized its sequence parallel implementation to accommodate this mixed parallel


<div align="center">
<img src="../../assets/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
</div>
2 changes: 1 addition & 1 deletion docs/methods/parallel_vae.md
@@ -8,7 +8,7 @@ To address this limitation, we developed [DistVAE](https://github.com/xdit-proje
For the convolutional operator in VAE, we require the communication of the halo region data of the image as shown in the following figures.

<div align="center">
<img src="../../assets/methods/patchvaeconv.png" alt="hybrid process group config" width="60%">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/patchvaeconv.png" alt="hybrid process group config" width="60%">
</div>
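
Why the halo region is needed can be shown with a one-dimensional, single-process sketch in plain Python. It is not DistVAE code: the function names are hypothetical, and the halo is read directly from the full signal, whereas in a real multi-GPU setting it would be communicated between devices. Each patch borrows `len(kernel) // 2` elements from its neighbours, after which the patch-wise result matches the full convolution.

```python
def conv1d_same(signal, kernel):
    """'Same'-size 1D correlation with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(signal))]


def conv1d_patched(signal, kernel, num_patches):
    """Split the signal into patches, attach halo elements, convolve each patch."""
    half = len(kernel) // 2
    size = len(signal) // num_patches  # assumes the signal splits evenly
    out = []
    for p in range(num_patches):
        lo, hi = p * size, (p + 1) * size
        # Halo: `half` elements borrowed from each neighbour, zeros at the borders.
        left = list(signal[max(lo - half, 0):lo])
        right = list(signal[hi:hi + half])
        left = [0.0] * (half - len(left)) + left
        right = right + [0.0] * (half - len(right))
        local = left + list(signal[lo:hi]) + right
        # 'Valid' correlation over the halo-extended patch.
        out += [sum(local[i + j] * kernel[j] for j in range(len(kernel)))
                for i in range(size)]
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.25, 0.5, 0.25]
full = conv1d_same(signal, kernel)
patched = conv1d_patched(signal, kernel, num_patches=4)
print(full == patched)
```

Dropping the halo (using zeros at every patch boundary) would change the outputs next to each boundary, which is why the boundary data has to be exchanged.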

* Chunked Input Processing: Similar to [MIT-patch-conv](https://hanlab.mit.edu/blog/patch-conv), we split the input feature map into chunks and feed them into convolution operator sequentially. This approach minimizes temporary memory consumption.
6 changes: 3 additions & 3 deletions docs/methods/pipefusion.md
Expand Up @@ -8,7 +8,7 @@ PipeFusion innovatively harnesses input temporal redundancy—the similarity bet
It significantly surpasses other methods in communication efficiency, particularly in multi-node setups connected via Ethernet and multi-GPU configurations linked with PCIe.

<div align="center">
<img src="../../assets/overview.png" alt="PipeFusion Image">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/overview.png" alt="PipeFusion Image">
</div>

The above picture compares DistriFusion and PipeFusion.
@@ -25,14 +25,14 @@ Each device processes the computation task for one patch of its assigned stage i
The PipeFusion pipeline workflow when $M$ = $N$ =4 is shown in the following picture.

<div align="center">
<img src="../../assets/workflow.png" alt="Pipeline Image">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/workflow.png" alt="Pipeline Image">
</div>
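
As a rough aid for reading the figure, the sketch below prints a generic pipeline schedule for $M$ = $N$ = 4. It rests on the assumption that at tick `t` the device holding stage `s` works on patch `t - s`. It does not model PipeFusion's use of stale KV or the overlap between consecutive diffusion steps, and `pipeline_schedule` is a hypothetical helper.

```python
def pipeline_schedule(num_stages: int, num_patches: int) -> list:
    """Return (tick, {stage: patch}) rows for one pass of all patches."""
    rows = []
    for tick in range(num_stages + num_patches - 1):
        active = {stage: tick - stage
                  for stage in range(num_stages)
                  if 0 <= tick - stage < num_patches}
        rows.append((tick, active))
    return rows


schedule = pipeline_schedule(num_stages=4, num_patches=4)
for tick, active in schedule:
    cells = [f"P{active[s]}" if s in active else "--" for s in range(4)]
    print(f"t={tick}: " + "  ".join(cells))
```

Each column of the printed output corresponds to one device. Under this assumption all four devices are busy only at the middle tick of a single pass.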


We have evaluated the accuracy of PipeFusion, DistriFusion and the baseline as shown below. To conduct the FID experiment, follow the detailed instructions provided in the [documentation](../../docs/fid/FID.md).

<div align="center">
<img src="../../assets/image_quality.png" alt="image_quality">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/image_quality.png" alt="image_quality">
</div>


20 changes: 20 additions & 0 deletions docs/performance/cogvideo.md
@@ -0,0 +1,20 @@
## CogVideo Performance
[Chinese Version](./cogvideo_zh.md)

CogVideo functions as a text-to-video model. xDiT presently integrates USP techniques (including Ulysses attention and Ring attention) and CFG parallelism to enhance inference speed, while work on PipeFusion is ongoing. Due to constraints in video generation dimensions in CogVideo, the maximum parallelism level for USP is 2. Thus, xDiT can leverage up to 4 GPUs to execute CogVideo, despite the potential for additional GPUs within the machine.
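
The 4-GPU limit follows from multiplying the parallel degrees. The Python sketch below spells out that arithmetic. It assumes that CFG parallelism contributes a factor of 2, which is inferred from the text above rather than stated there, and `world_size` is a hypothetical helper, not an xDiT function.

```python
def world_size(ulysses_degree: int, ring_degree: int,
               pipefusion_degree: int = 1, use_cfg_parallel: bool = False) -> int:
    """Number of GPUs used when the parallel degrees are multiplied together."""
    cfg_degree = 2 if use_cfg_parallel else 1  # assumed factor for CFG parallelism
    return ulysses_degree * ring_degree * pipefusion_degree * cfg_degree


# CogVideo: USP degree (ulysses x ring) capped at 2, plus CFG parallelism.
cogvideo_gpus = world_size(ulysses_degree=2, ring_degree=1, use_cfg_parallel=True)
print(cogvideo_gpus)
```

The same product is consistent with the CI command in this commit's workflow, which launches 8 processes with `--pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1 --use_cfg_parallel`.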

On a machine with L40 (PCIe) GPUs, we test the inference latency for generating a video with 30 frames, 720px width, and 480px height with various DiT models.

The results for the CogVideoX-2b model are depicted in the following figure. As we can see, the latency decreases as the degree of parallelism grows. xDiT achieves up to a 3.1X speedup over the original inference implementation in the `diffusers` package.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

Similarly, for the CogVideoX-5b model, xDiT achieves up to a 3.9X speedup.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>
19 changes: 19 additions & 0 deletions docs/performance/cogvideo_zh.md
@@ -0,0 +1,19 @@
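## CogVideo Performance

CogVideo is a text-to-video model. xDiT currently integrates USP techniques (including Ulysses attention and Ring attention) and CFG parallelism to speed up inference, while work on PipeFusion is ongoing. Due to constraints on the video generation dimensions in CogVideo, the maximum parallelism level for USP is 2. Therefore, xDiT can use at most 4 GPUs to run CogVideo, even if the machine has more GPUs.

On a machine with L40 (PCIe) GPUs, we tested the inference latency of generating a video with 30 frames, 720px width, and 480px height using different DiT models.

The results for the CogVideoX-2b model are shown in the following figure. As we can see, the latency decreases effectively as the degree of parallelism increases. xDiT achieves up to a 3.1X speedup over the original inference in the diffusers package.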

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

Similarly, for the CogVideoX-5b model, xDiT achieves up to a 3.9X speedup.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>