Add CogVideoX results on A100 (#305)
xibosun authored Oct 12, 2024
1 parent 3bc5582 commit a354858
Showing 2 changed files with 20 additions and 6 deletions.
13 changes: 10 additions & 3 deletions docs/performance/cogvideo.md
@@ -3,18 +3,25 @@

CogVideo functions as a text-to-video model. xDiT presently integrates USP techniques (including Ulysses attention and Ring attention) and CFG parallelism to accelerate inference, while support for PipeFusion is still in progress. Because CogVideo constrains the dimensions of generated videos, the maximum parallel degree for USP is 2. Thus, xDiT can leverage at most 4 GPUs to run CogVideo, even when more GPUs are available on the machine.
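As a sketch, the constraint above (USP degree at most 2, CFG parallel degree at most 2, GPU count equal to their product) can be enumerated directly; the variable names below are illustrative, not part of the xDiT API:

```python
from itertools import product

# Illustrative only: enumerate the parallel configurations implied above.
# CogVideo's video-size limits cap the USP degree (Ulysses x Ring) at 2,
# and CFG parallelism contributes a factor of at most 2, so the GPU count
# (the product of the two degrees) never exceeds 4.
MAX_USP_DEGREE = 2
MAX_CFG_DEGREE = 2

configs = [
    {"usp_degree": usp, "cfg_degree": cfg, "gpus": usp * cfg}
    for usp, cfg in product(range(1, MAX_USP_DEGREE + 1),
                            range(1, MAX_CFG_DEGREE + 1))
]

for c in configs:
    print(c)  # 4 configurations, using 1, 2, 2, and 4 GPUs
```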

- On a machine with L40 (PCIe) GPUs, we test the inference latency for generating a video with 30 frames, 720px width and 480px height with various DiT models.
+ On a system equipped with L40 (PCIe) GPUs, we compared the inference performance of single-GPU CogVideoX using the `diffusers` library with our parallelized versions for generating 49-frame (6-second) 720x480 videos.

- The results for the CogVideoX-2b model are depicted in the following figure. As the figure shows, latency decreases as the degree of parallelism grows, and xDiT achieves up to a 3.1x speedup over the original inference implementation in the `diffusers` package.
+ As depicted in the figure, for the baseline model CogVideoX-2b, inference latency drops when employing Ulysses Attention, Ring Attention, or CFG parallelism. Notably, CFG parallelism demonstrates superior performance due to its lower communication overhead. Combining sequence parallelism with CFG parallelism further improves inference efficiency: as the degree of parallelism increases, latency consistently decreases. Under the optimal settings, xDiT achieves a 3.53x speedup over single-GPU inference, reducing each iteration to 0.6 seconds. Given CogVideoX's default of 50 iterations, a 6-second video can be generated end to end within 30 seconds.
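The arithmetic behind those figures can be checked directly. A back-of-the-envelope sketch, not xDiT code; the values are taken from the text above:

```python
# Figures quoted above for CogVideoX-2b on L40 GPUs.
iterations = 50        # CogVideoX's default number of denoising iterations
step_latency_s = 0.6   # per-iteration latency at the best xDiT setting
speedup = 3.53         # reported speedup over single-GPU diffusers inference

end_to_end_s = iterations * step_latency_s   # total time for a 6-second video
baseline_step_s = step_latency_s * speedup   # implied single-GPU step time

print(end_to_end_s)                # 30.0
print(round(baseline_step_s, 2))   # 2.12
```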

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

- Similarly, for the CogVideoX-5b model, xDiT achieves up to a 3.9x speedup.
+ For the more complex CogVideoX-5b model, which adds parameters for improved video quality and visual effects at a higher computational cost, similar performance trends hold, and the acceleration ratio of the parallel versions is even higher. Compared with single-GPU inference, xDiT attains a speedup of up to 3.91x, enabling end-to-end video generation in just over 80 seconds.
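Treating "just over 80 seconds" as 80 s, the reported 3.91x speedup implies a single-GPU baseline of roughly five minutes. An estimate for orientation only, not a measured figure:

```python
# Rough implied baseline for CogVideoX-5b (80 s is an approximation of the
# "just over 80 seconds" quoted above).
parallel_total_s = 80.0   # approximate xDiT end-to-end generation time
speedup = 3.91            # reported speedup over the single-GPU version

baseline_total_s = parallel_total_s * speedup
print(round(baseline_total_s))   # ~313 s, i.e. a bit over 5 minutes
```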

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>

+ Similarly, on systems equipped with A100 devices, xDiT exhibits comparable acceleration ratios.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
13 changes: 10 additions & 3 deletions docs/performance/cogvideo_zh.md
@@ -2,18 +2,25 @@

CogVideo is a text-to-video model. xDiT currently integrates USP techniques (including Ulysses attention and Ring attention) and CFG parallelism to accelerate inference, while work on PipeFusion is in progress. Because CogVideo constrains the dimensions of generated videos, the maximum parallel degree for USP is 2. Thus, xDiT can use at most 4 GPUs to run CogVideo, even when the machine has more GPUs.

- On a machine with L40 (PCIe) GPUs, we tested the inference latency of generating videos with 30 frames, 720px width, and 480px height using various DiT models.
+ On a computing platform equipped with L40 (PCIe) GPUs, we analyzed the performance gap between single-GPU CogVideoX inference based on the `diffusers` library and our parallelized versions when generating 49-frame (6-second) 720x480 videos.

- The results for the CogVideoX-2b model are shown in the figure below. We can see that latency decreases effectively as the degree of parallelism increases, and xDiT achieves up to a 3.1x speedup over the original inference in the `diffusers` package.
+ As shown in the figure, for the base model CogVideoX-2b, a significant reduction in inference latency is observed whether Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism is used. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism further improves inference efficiency, and latency keeps decreasing as the degree of parallelism increases. Under the optimal configuration, xDiT achieves a 3.53x speedup over single-GPU inference, so each iteration takes only 0.6 seconds. Given CogVideoX's default of 50 iterations, the end-to-end generation of a 6-second video finishes within 30 seconds.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

- Similarly, for the CogVideoX-5b model, xDiT achieves up to a 3.9x speedup.
+ For the more complex CogVideoX-5b model, which adds parameters to improve video quality and visual effects at a significantly higher computational cost, all methods maintain performance trends similar to those on CogVideoX-2b, and the acceleration ratio of the parallel versions improves further. Compared with the single-GPU version, xDiT delivers up to a 3.91x inference speedup, cutting end-to-end video generation time to around 80 seconds.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>

+ Similarly, on systems equipped with A100 GPUs, xDiT also demonstrates comparable acceleration.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
