update docs and remove figure (#287)

xdit-project · Sep 24, 2024 · 9675e39
1 parent 6cf25eb
commit 9675e39

Showing 54 changed files with 142 additions and 103 deletions.
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
@@ -23,20 +23,20 @@ jobs:
     continue-on-error: true
     steps:
       - name: unzip
-        run: rm -rf ~/xDiT
-      - run: mkdir ~/xDiT
-      - run: unzip ~/xDiT.zip -d ~/xDiT
+        run: rm -rf ~/xDiT_${{github.run_number}}
+      - run: mkdir ~/xDiT_${{github.run_number}}
+      - run: unzip ~/xDiT.zip -d ~/xDiT_${{github.run_number}}
       - name: Setup docker
-        run: docker run --rm --name xfuser_test_docker_${{github.repository_owner_id}} -d -i -t --runtime=nvidia --gpus all -v /cfs:/cfs -v /mnt:/mnt -v ~/xDiT:/code xfuser_cicd/test-py_3_11-torch_2_4_1 /bin/bash
+        run: docker run --rm --name xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} -d -i -t --runtime=nvidia --gpus all -v /cfs:/cfs -v /mnt:/mnt -v ~/xDiT_${{github.run_number}}:/code xfuser_cicd/test-py_3_11-torch_2_4_1 /bin/bash
       - name: Install xfuser
-        run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}} pip3.11 install -e .
+        run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} pip3.11 install -e .
       - name: Test xfuser
-        run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}} sh -c "torchrun --nproc_per_node=8 ./examples/sd3_example.py --model /cfs/dit/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1 --height 1024 --width 1024 --no_use_resolution_binning --num_inference_steps 20 --warmup_steps 0 --prompt 'A small dog' --use_cfg_parallel"
+        run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} sh -c "torchrun --nproc_per_node=8 ./examples/sd3_example.py --model /cfs/dit/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1 --height 1024 --width 1024 --no_use_resolution_binning --num_inference_steps 20 --warmup_steps 0 --prompt 'A small dog' --use_cfg_parallel"
   clear-env:
     needs: setup-env-and-test
     runs-on: [self-hosted, linux, x64]
     steps:
       - name: Remove Files
-        run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}} sh -c "rm -r *"
+        run: docker exec -w /code xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}} sh -c "rm -r *"
       - name: Destroy docker
-        run: docker stop xfuser_test_docker_${{github.repository_owner_id}}
+        run: docker stop xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}}
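
Note on the workflow change above: every path and container name now carries `github.run_number`, so two runs on the same self-hosted runner no longer share `~/xDiT` or a container name. A minimal Python sketch of that naming scheme follows; the helper `run_scoped_names` and the sample owner id are hypothetical and only mirror the string templates in the diff.

```python
def run_scoped_names(owner_id: str, run_number: int) -> dict:
    """Mirror the per-run names used in build.yml (illustrative helper only)."""
    return {
        # ~/xDiT_${{github.run_number}}
        "workspace": f"~/xDiT_{run_number}",
        # xfuser_test_docker_${{github.repository_owner_id}}_${{github.run_number}}
        "container": f"xfuser_test_docker_{owner_id}_{run_number}",
    }


if __name__ == "__main__":
    # "12345" is a made-up owner id; 101 and 102 stand for two consecutive runs.
    first = run_scoped_names("12345", 101)
    second = run_scoped_names("12345", 102)
    print(first)
    print(second)
    # Distinct runs get distinct workspaces and containers.
    print(first["container"] != second["container"])
```

With the old templates both runs would have resolved to the same workspace and container name, which is the collision the new suffix is meant to avoid.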
diff --git a/README.md b/README.md
@@ -4,7 +4,6 @@
 
   <picture>
     <img alt="xDiT" src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/XDiTlogo.png" width="50%">
-  </picture>
 
   </p>
   <h3>A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters</h3>
@@ -76,13 +75,13 @@ In addition to utilizing well-known Attention optimization libraries, we leverag
 The overview of xDiT is shown as follows.
 
 <picture>
-  <img alt="xDiT" src="./assets/methods/xdit_overview.png">
+  <img alt="xDiT" src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/xdit_overview.png">
 </picture>
 
 
 <h2 id="updates">📢 Updates</h2>
 
-* ⚙️**August 30, 2024**: Supporting(WIP) CogVideoX. The inference scripts are [examples/latte_example](examples/cogvideox_example.py).
+* 🎉**September 23, 2024**: Support CogVideoX. The inference scripts are [examples/cogvideox_example](examples/cogvideox_example.py).
 * 🎉**August 26, 2024**: We apply torch.compile and [onediff](https://github.com/siliconflow/onediff) nexfort backend to accelerate GPU kernels speed.
 * 🎉**August 9, 2024**: Support Latte sequence parallel version. The inference scripts are [examples/latte_example](examples/latte_example.py).
 * 🎉**August 8, 2024**: Support Flux sequence parallel version. The inference scripts are [examples/flux_example](examples/flux_example.py).
@@ -100,7 +99,7 @@ The overview of xDiT is shown as follows.
 
 | Model Name | CFG | SP | PipeFusion |
 | --- | --- | --- | --- |
-| [🎬 CogVideoX](https://huggingface.co/THUDM/CogVideoX-2b) | ❎ | ❎ | ❎ | 
+| [🎬 CogVideoX](https://huggingface.co/THUDM/CogVideoX-2b) | ✔️ | ✔️ | ❎ | 
 | [🎬 Latte](https://huggingface.co/maxin-cn/Latte-1) | ❎ | ✔️ | ❎ | 
 | [🔵 HunyuanDiT-v1.2-Diffusers](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers) | ✔️ | ✔️ | ✔️ |
 | [🟠 Flux](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | NA | ✔️ | ❎ |
@@ -119,25 +118,29 @@ The overview of xDiT is shown as follows.
 
 <h2 id="perf">📈 Performance</h2>
 
+<h3 id="perf_cogvideo">CogVideo</h3>
+
+1. [CogVideo Performance Report](./docs/performance/cogvideo.md)
+
 <h3 id="perf_flux">Flux.1</h3>
 
-1. [Flux Performance Report](./docs/performance/flux.md)
+2. [Flux Performance Report](./docs/performance/flux.md)
+
+<h3 id="perf_latte">Latte</h3>
+
+3. [Latte Performance Report](./docs/performance/latte.md)
 
 <h3 id="perf_hunyuandit">HunyuanDiT</h3>
 
-2. [HunyuanDiT Performance Report](./docs/performance/hunyuandit.md)
+4. [HunyuanDiT Performance Report](./docs/performance/hunyuandit.md)
 
 <h3 id="perf_sd3">SD3</h3>
 
-3. [Stable Diffusion 3 Performance Report](./docs/performance/sd3.md)
+5. [Stable Diffusion 3 Performance Report](./docs/performance/sd3.md)
 
 <h3 id="perf_pixart">Pixart</h3>
 
-4. [Pixart-Alpha Performance Report (legacy)](./docs/performance/pixart_alpha_legacy.md)
-
-<h3 id="perf_latte">Pixart</h3>
-
-5. [Latte Performance Report](./docs/performance/latte.md)
+6. [Pixart-Alpha Performance Report (legacy)](./docs/performance/pixart_alpha_legacy.md)
 
 
 <h2 id="QuickStart">🚀 QuickStart</h2>
@@ -282,34 +285,32 @@ For the VAE module, xDiT offers a parallel implementation, [DistVAE](https://git
 The (<span style="color: red;">xDiT</span>) highlights the methods first proposed by us.
 
 <div align="center">
-    <img src="assets/methods/xdit_method.png" alt="xdit methods">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/xdit_method.png" alt="xdit methods">
 </div>
 
 The table below details the communication and memory costs of the intra-image parallel methods above in DiTs; CFG and DP are excluded because they are inter-image parallel. (* denotes that communication can be overlapped with computation.)
 
 As we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost on different scales and hardware configurations, making them suitable foundational components for a hybrid approach.
 
-𝒑: Number of pixels;
-𝒉𝒔: Model hidden size;
-𝑳: Number of model layers;
-𝑷: Total model parameters;
-𝑵: Number of parallel devices;
-𝑴: Number of patch splits;
-𝑸𝑶: Query and Output parameter count;
-𝑲𝑽: KV Activation parameter count;
+𝒑: Number of pixels;\
+𝒉𝒔: Model hidden size;\
+𝑳: Number of model layers;\
+𝑷: Total model parameters;\
+𝑵: Number of parallel devices;\
+𝑴: Number of patch splits;\
+𝑸𝑶: Query and Output parameter count;\
+𝑲𝑽: KV Activation parameter count;\
 𝑨 = 𝑸 = 𝑶 = 𝑲 = 𝑽: Equal parameters for Attention, Query, Output, Key, and Value;
 
-<div align="center">
 
-|          | attn-KV | communication cost | param memory | activations memory | extra buff memory |
-|:--------:|:-------:|:-----------------:|:-----:|:-----------:|:----------:|
-| Tensor Parallel | fresh | $4O(p \times hs)L$ | $\frac{1}{N}P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
-| DistriFusion* | stale | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $2AL = (KV)L$ |
-| Ring Sequence Parallel* | fresh | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
-| Ulysses Sequence Parallel | fresh | $\frac{4}{N}O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
-| PipeFusion* | stale- | $2O(p \times hs)$ | $\frac{1}{N}P$ | $\frac{2}{M}A = \frac{1}{M}QO$ | $\frac{2L}{N}A = \frac{1}{N}(KV)L$ |
+|                           | attn-KV | communication cost           | param memory   | activations memory             | extra buff memory                  |
+|:-------------------------:|:-------:|:----------------------------:|:--------------:|:------------------------------:|:----------------------------------:|
+| Tensor Parallel           | fresh   | $4O(p \times hs)L$           | $\frac{1}{N}P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$     |
+| DistriFusion*             | stale   | $2O(p \times hs)L$           | $P$            | $\frac{2}{N}A = \frac{1}{N}QO$ | $2AL = (KV)L$                      |
+| Ring Sequence Parallel*   | fresh   | $2O(p \times hs)L$           | $P$            | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$     |
+| Ulysses Sequence Parallel | fresh   | $\frac{4}{N}O(p \times hs)L$ | $P$            | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$     |
+| PipeFusion*               | stale-  | $2O(p \times hs)$            | $\frac{1}{N}P$ | $\frac{2}{M}A = \frac{1}{M}QO$ | $\frac{2L}{N}A = \frac{1}{N}(KV)L$ |
 
-</div>
 
 <h4 id="PipeFusion">1.1. PipeFusion</h4>
 

diff --git a/assets/XDiTlogo.png b/assets/XDiTlogo.png
diff --git a/assets/developer/class_structure.png b/assets/developer/class_structure.png
diff --git a/assets/image_quality.png b/assets/image_quality.png
diff --git a/assets/latency-A100-NVLink.png b/assets/latency-A100-NVLink.png
diff --git a/assets/latency-A100-PCIe.png b/assets/latency-A100-PCIe.png
diff --git a/assets/latency-L20.jpg b/assets/latency-L20.jpg
diff --git a/assets/latency-L20.png b/assets/latency-L20.png
diff --git a/assets/latency-T4.png b/assets/latency-T4.png
diff --git a/assets/methods/hybrid_pp_scheme.png b/assets/methods/hybrid_pp_scheme.png
diff --git a/assets/methods/hybrid_workflow.png b/assets/methods/hybrid_workflow.png
diff --git a/assets/methods/kvbuffer_hybrid.png b/assets/methods/kvbuffer_hybrid.png
diff --git a/assets/methods/patchvaeconv.png b/assets/methods/patchvaeconv.png
diff --git a/assets/methods/xFuserSP.png b/assets/methods/xFuserSP.png
diff --git a/assets/methods/xdit_method.png b/assets/methods/xdit_method.png
diff --git a/assets/methods/xdit_overview.png b/assets/methods/xdit_overview.png
diff --git a/assets/overview.png b/assets/overview.png
diff --git a/assets/performance/flux/Flux-1K-A100.png b/assets/performance/flux/Flux-1K-A100.png
diff --git a/assets/performance/flux/Flux-1k-L40.png b/assets/performance/flux/Flux-1k-L40.png
diff --git a/assets/performance/flux/Flux-2K-A100.png b/assets/performance/flux/Flux-2K-A100.png
diff --git a/assets/performance/flux/Flux-2k-L40.png b/assets/performance/flux/Flux-2k-L40.png
diff --git a/assets/performance/flux/flux_a100.jpg b/assets/performance/flux/flux_a100.jpg
diff --git a/assets/performance/flux/flux_image.png b/assets/performance/flux/flux_image.png
diff --git a/assets/performance/flux/flux_l40.jpg b/assets/performance/flux/flux_l40.jpg
diff --git a/assets/performance/hunuyuandit/A100-HunyuanDiT.png b/assets/performance/hunuyuandit/A100-HunyuanDiT.png
diff --git a/assets/performance/hunuyuandit/L40-HunyuanDiT.png b/assets/performance/hunuyuandit/L40-HunyuanDiT.png
diff --git a/assets/performance/hunuyuandit/T4-HunyuanDiT.png b/assets/performance/hunuyuandit/T4-HunyuanDiT.png
diff --git a/assets/performance/hunuyuandit/V100-HunyuanDiT.png b/assets/performance/hunuyuandit/V100-HunyuanDiT.png
diff --git a/assets/performance/latte/Latte-L20-1024.png b/assets/performance/latte/Latte-L20-1024.png
diff --git a/assets/performance/latte/Latte-L20-512.png b/assets/performance/latte/Latte-L20-512.png
diff --git a/assets/performance/sd3/A100-SD3.png b/assets/performance/sd3/A100-SD3.png
diff --git a/assets/performance/sd3/L40-SD3.png b/assets/performance/sd3/L40-SD3.png
diff --git a/assets/workflow.png b/assets/workflow.png
diff --git a/comfyui-xdit/README.md b/comfyui-xdit/README.md
@@ -55,4 +55,4 @@ python main.py
 
 You can load the default workflow in the comfyui-xdit/workflows folder: xdit-comfyui-demo.json
 
-![demo](https://raw.githubusercontent.com/xdit-project/comfyui/demo.png)
+![demo](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/comfyui/demo.png)
diff --git a/docs/developer/Manual_for_Adding_New_Models.md b/docs/developer/Manual_for_Adding_New_Models.md
@@ -14,7 +14,7 @@
 
 The following diagram shows the calling relationships between different Classes in the xDiT project. If a new PixArt model is added, the added classes are circled in red.
 
-![class_structure.png](../../assets/developer/class_structure.png)
+![class_structure.png](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/developer/class_structure.png)
 
 # 1. Pipeline Class
 

diff --git a/docs/developer/The_implement_design_of_xdit_framework.md b/docs/developer/The_implement_design_of_xdit_framework.md
@@ -50,7 +50,7 @@ In the xDiT framework, a wrapper approach is used to modify the required classes
 
 The organizational structure of model-related files in xDiT is as follows:
 
-![class_structure.png](../../assets/developer/class_structure.png)
+![class_structure.png](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/developer/class_structure.png)
 
 - All model classes inherit from the base class `xFuserBaseWrapper`, which provides basic features such as getattr and runtime condition checking.
 - Four classes representing different model components inherit from `xFuserBaseWrapper`, including:

diff --git a/docs/developer/The_implement_design_of_xdit_framework_zh.md b/docs/developer/The_implement_design_of_xdit_framework_zh.md
@@ -48,7 +48,7 @@ xfuser/model_executor
 
 xDiT中模型相关文件的组织结构如下：
 
-![class_structure.png](../../assets/developer/class_structure.png)
+![class_structure.png](https://raw.githubusercontent.com/xdit-project/xdit_assets/main/developer/class_structure.png)
 
 - 所有模型类均会继承于基类`xFuserBaseWrapper`，用以提供基础的getattr和运行时条件检查等特性
 - 四个代表不同模型组分的类分别继承于`xFuserBaseWrapper`，其中：

diff --git a/docs/methods/hybrid.md b/docs/methods/hybrid.md
@@ -11,7 +11,7 @@ PipeFusion leverages the characteristic of Input Temporal Redundancy, using Stal
 We elaborate on this issue with the following illustration, which shows a mixed parallel method with pipe_degree=4 and sp_degree=2. Setting `num_pipeline_patch`=4, the image is divided into M=`num_pipeline_patch*sp_degree`=8 patches, labeled P0~P7.
 
 <div align="center">
-    <img src="../../assets/methods/hybrid_pp_scheme.png" alt="hybrid process group config" width="60%">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_pp_scheme.png" alt="hybrid process group config" width="60%">
 </div>
 
 In the implementation of Standard SP Attention, the inputs Q, K, V, and the output O are all split along the sequence dimension, with a consistent splitting pattern. 
@@ -21,11 +21,11 @@ Within this diffusion step, device=0 cannot obtain the fresh KV of P1,3,5,7 for
 Standard SP only has 1/sp_degree of the fresh KV buffer, so it cannot achieve the correct results for mixed parallel inference.
 
 <div align="center">
-    <img src="../../assets/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
 </div>
 
 xDiT has customized the implementation of sequence parallelism to meet this mixed parallel requirement. xDiT uses `xFuserLongContextAttention` to store the intermediate results of SP in the KV Buffer. The effect is illustrated in the figure, where after each micro-step SP execution, the fresh KV of different rank devices within the SP Group is replicated. This way, after one diffusion step, the KV Buffer of all devices in the SP Group is updated to the latest, ready for use in the next Diffusion Step.
 
 <div align="center">
-    <img src="../../assets/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
 </div>
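
Note on the hybrid example in `docs/methods/hybrid.md`: with `num_pipeline_patch`=4 and sp_degree=2 there are M=8 patches, and the text says device 0 lacks the fresh KV of P1,3,5,7 under standard SP. The sketch below assumes an interleaved patch-to-rank assignment inferred from that statement. The function names are hypothetical, and this is not the `xFuserLongContextAttention` implementation.

```python
def patch_schedule(num_pipeline_patch: int, sp_degree: int) -> list[list[int]]:
    """For each micro-step, list the patch index handled by each SP rank.

    Assumes patches are interleaved across SP ranks within a micro-step.
    """
    return [
        [step * sp_degree + rank for rank in range(sp_degree)]
        for step in range(num_pipeline_patch)
    ]


def fresh_kv_without_replication(schedule: list[list[int]]) -> dict[int, list[int]]:
    """Standard SP: each rank only refreshes the KV of patches it computed."""
    sp_degree = len(schedule[0])
    return {rank: [step[rank] for step in schedule] for rank in range(sp_degree)}


def fresh_kv_with_replication(schedule: list[list[int]]) -> dict[int, list[int]]:
    """Replicating fresh KV inside the SP group after every micro-step."""
    sp_degree = len(schedule[0])
    all_patches = [patch for step in schedule for patch in step]
    return {rank: list(all_patches) for rank in range(sp_degree)}


if __name__ == "__main__":
    schedule = patch_schedule(num_pipeline_patch=4, sp_degree=2)
    print(fresh_kv_without_replication(schedule))
    print(fresh_kv_with_replication(schedule))
```

Under this assumption each rank ends a diffusion step with fresh KV for only 1/sp_degree of the patches unless the KV is replicated, which matches the limitation the document describes.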
diff --git a/docs/methods/hybrid_zh.md b/docs/methods/hybrid_zh.md
@@ -10,13 +10,13 @@ PipeFusion利用Input Tempor Redundancy特点，使用过时的KV（Stale KV）
 我们对这个问题具体说明，下图展示了pipe_degree=4，sp_degree=2的混合并行方法。设置`num_pipeline_patch`=4，图片切分为M=`num_pipeline_patch*sp_degree`=8个Patch，分别是P0~P7。
 
 <div align="center">
-    <img src="../../assets/methods/hybrid_pp_scheme.png" alt="hybrid process group config"  width="60%">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_pp_scheme.png" alt="hybrid process group config"  width="60%">
 </div>
 
 Standard SP Attention实现，输入Q，K，V和输出O都是沿着序列维度切分，且切分方式一致。如果不同rank的输入patch没有重叠，每个micro step计算出fresh KV更新的位置在不同rank间也没有重叠。如下图所示，standard SP的KV Buffer中黄色部分是SP0 rank=0拥有的fresh KV，绿色部分是SP1 rank=1拥有的fresh KV，二者并不相同。在这个diffusion step内，device=0无法拿到P1,3,5,7的fresh KV进行计算，但是PipeFusion则需要在下一个diffusion step中，拥有上一个diffusion step全部的KV。standard SP只拥有1/sp_degree的fresh kv buffer，因此无法获得混合并行推理正确的结果。
 
 <div align="center">
-    <img src="../../assets/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/hybrid_workflow.png" alt="hybrid parallel workflow">
 </div>
 
 
@@ -25,5 +25,5 @@ xDiT专门定制了序列并行的实现方式，以适应这种混合并行的
 
 
 <div align="center">
-    <img src="../../assets/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/kvbuffer_hybrid.png" alt="kvbuffer in hybrid parallel">
 </div>
diff --git a/docs/methods/parallel_vae.md b/docs/methods/parallel_vae.md
@@ -8,7 +8,7 @@ To address this limitation, we developed [DistVAE](https://github.com/xdit-proje
 For the convolutional operator in VAE, we require the communication of the halo region data of the image as shown in the following figures.
 
 <div align="center">
-    <img src="../../assets/methods/patchvaeconv.png" alt="hybrid process group config" width="60%">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/methods/patchvaeconv.png" alt="hybrid process group config" width="60%">
 </div>
 
 * Chunked Input Processing: Similar to [MIT-patch-conv](https://hanlab.mit.edu/blog/patch-conv), we split the input feature map into chunks and feed them into convolution operator sequentially. This approach minimizes temporary memory consumption.

diff --git a/docs/methods/pipefusion.md b/docs/methods/pipefusion.md
@@ -8,7 +8,7 @@ PipeFusion innovatively harnesses input temporal redundancy—the similarity bet
 It significantly surpasses other methods in communication efficiency, particularly in multi-node setups connected via Ethernet and multi-GPU configurations linked with PCIe.
 
 <div align="center">
-    <img src="../../assets/overview.png" alt="PipeFusion Image">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/overview.png" alt="PipeFusion Image">
 </div>
 
 The above picture compares DistriFusion and PipeFusion.
@@ -25,14 +25,14 @@ Each device processes the computation task for one patch of its assigned stage i
 The PipeFusion pipeline workflow when $M$ = $N$ =4 is shown in the following picture.
 
 <div align="center">
-    <img src="../../assets/workflow.png" alt="Pipeline Image">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/workflow.png" alt="Pipeline Image">
 </div>
 
 
 We have evaluated the accuracy of PipeFusion, DistriFusion and the baseline as shown below. To conduct the FID experiment, follow the detailed instructions provided in the [documentation](../../docs/fid/FID.md).
 
 <div align="center">
-    <img src="../../assets/image_quality.png" alt="image_quality">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/image_quality.png" alt="image_quality">
 </div>
 
 

diff --git a/docs/performance/cogvideo.md b/docs/performance/cogvideo.md
@@ -0,0 +1,20 @@
+## CogVideo Performance
+[Chinese Version](./cogvideo_zh.md)
+
+CogVideo functions as a text-to-video model. xDiT presently integrates USP techniques (including Ulysses attention and Ring attention) and CFG parallelism to enhance inference speed, while work on PipeFusion is ongoing. Due to constraints in video generation dimensions in CogVideo, the maximum parallelism level for USP is 2. Thus, xDiT can leverage up to 4 GPUs to execute CogVideo, despite the potential for additional GPUs within the machine.
+
+On a machine with L40 (PCIe) GPUs, we test the inference latency for generating a video with 30 frames, 720px width and 480px height, with various DiT models.
+
+The results for the CogVideoX-2b model are depicted in the following figure. As we can see, the latency decreases as the degree of parallelism grows, and xDiT achieves up to a 3.1X speedup over the original inference implementation in the `diffusers` package.
+
+<div align="center">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png" 
+    alt="latency-cogvideo-l40-2b">
+</div>
+
+Similarly, for the CogVideoX-5b model, xDiT achieves up to a 3.9X speedup.
+
+<div align="center">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png" 
+    alt="latency-cogvideo-l40-5b">
+</div>
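
Note on the GPU limit stated in `docs/performance/cogvideo.md`: the 4-GPU ceiling follows from multiplying the parallel degrees. The sketch below assumes that the USP degree is the product of the Ulysses and Ring degrees and that CFG parallelism contributes a factor of 2 (one branch each for conditional and unconditional guidance). Both are assumptions made for illustration, and the names are hypothetical.

```python
MAX_USP_DEGREE = 2  # stated in the report for CogVideo
CFG_DEGREE = 2      # assumption: conditional + unconditional branches


def valid_configs(max_usp_degree: int, cfg_degree: int) -> list[dict]:
    """Enumerate (ulysses, ring, cfg) settings whose USP degree fits the limit."""
    configs = []
    for ulysses in range(1, max_usp_degree + 1):
        for ring in range(1, max_usp_degree + 1):
            if ulysses * ring > max_usp_degree:
                continue  # USP degree = ulysses * ring must stay within the limit
            for cfg in (1, cfg_degree):
                configs.append({
                    "ulysses_degree": ulysses,
                    "ring_degree": ring,
                    "cfg_degree": cfg,
                    "gpus": ulysses * ring * cfg,
                })
    return configs


if __name__ == "__main__":
    configs = valid_configs(MAX_USP_DEGREE, CFG_DEGREE)
    for cfg in configs:
        print(cfg)
    print("max GPUs:", max(c["gpus"] for c in configs))
```

Under these assumptions the largest configuration uses 2 (USP) x 2 (CFG) = 4 GPUs, consistent with the limit the report gives.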
diff --git a/docs/performance/cogvideo_zh.md b/docs/performance/cogvideo_zh.md
@@ -0,0 +1,19 @@
+## CogVideo 性能表现
+
+CogVideo 是一个文本到视频的模型。xDiT 目前整合了 USP 技术（包括 Ulysses 注意力和 Ring 注意力）和 CFG 并行来提高推理速度，同时 PipeFusion 的工作正在进行中。由于 CogVideo 在视频生成尺寸上的限制，USP 的最大并行级别为 2。因此，xDiT 可以利用最多 4 个 GPU 来执行 CogVideo，尽管机器内可能有更多的 GPU。
+
+在一台配备 L40（PCIe）GPU 的机器上，我们测试了使用不同 DiT 模型生成具有 30 帧、720px 宽和 480px 高的视频的推理延迟。
+
+CogVideoX-2b 模型的结果显示在下图中。我们可以看到，随着并行度的增加，延迟有效减少。而且 xDiT 具有相较于 diffusers 软件包中的原始推理最多 3.1 倍的加速。
+
+<div align="center">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png" 
+    alt="latency-cogvideo-l40-2b">
+</div>
+
+同样地，对于 CogVideoX-5b 模型，xDiT 实现了最多 3.9 倍的加速。
+
+<div align="center">
+    <img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png" 
+    alt="latency-cogvideo-l40-5b">
+</div>