
feat: Add support for Tensor Parallelism to the Step-Video-T2V model #454

Merged · 3 commits into xdit-project:main · Feb 25, 2025

Conversation

LiaoYuanF (Contributor) commented on Feb 25, 2025

Add Tensor Parallelism support for the Custom Model (Step-Video-T2V)

Implementation

TP support is implemented in the SelfAttention, CrossAttention, and FFN modules through the following improvements (a sketch of the sharding pattern follows the list):

  1. Parameter sharding: an optimized memory-alignment strategy for initializing weights across devices
  2. Gradient synchronization: a staged synchronization strategy that reduces communication overhead
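
For orientation, here is a minimal Megatron-style sketch of the sharding pattern described above: the first projection of each block is column-sharded and the second row-sharded, so a single all-reduce per block restores the full activation. This is an illustration, not the PR's actual code; the class names `ColumnParallelLinear`, `RowParallelLinear`, and `ParallelFeedForward` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """Keeps 1/tp_size of the output features on each rank; outputs stay sharded."""

    def __init__(self, in_features: int, out_features: int, tp_size: int):
        super().__init__()
        assert out_features % tp_size == 0
        self.shard = nn.Linear(in_features, out_features // tp_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shard(x)  # shape [..., out_features // tp_size]


class RowParallelLinear(nn.Module):
    """Keeps 1/tp_size of the input features; an all-reduce sums partial outputs."""

    def __init__(self, in_features: int, out_features: int, tp_size: int):
        super().__init__()
        assert in_features % tp_size == 0
        self.shard = nn.Linear(in_features // tp_size, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.shard(x)  # partial result from this rank's shard
        if dist.is_initialized():
            dist.all_reduce(y)  # one communication point per block
        return y


class ParallelFeedForward(nn.Module):
    """FFN sharded the same way attention q/k/v (column) and the output
    projection (row) would be: only the final projection communicates."""

    def __init__(self, dim: int, hidden_dim: int, tp_size: int):
        super().__init__()
        self.up = ColumnParallelLinear(dim, hidden_dim, tp_size)
        self.act = nn.GELU()
        self.down = RowParallelLinear(hidden_dim, dim, tp_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```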

Benchmarks (NVIDIA H20)

Per-module memory analysis (bf16)

| Component | Class | Parameters | Memory usage |
| --- | --- | --- | --- |
| attn1 | SelfAttention | 150,995K | 13.44 GB |
| attn2 | CrossAttention | 150,995K | 13.44 GB |
| ff | FeedForward | 301,990K | 26.88 GB |
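
As a sanity check on how TP shrinks these footprints, the snippet below divides each module's bf16 weight footprint (parameter count × 2 bytes) by the TP degree. Note this covers weights only, which is far smaller than the table's totals, so those totals presumably include activations and other buffers; the numbers here are illustrative arithmetic, not measurements.

```python
# Illustrative arithmetic: per-rank bf16 weight footprint of each module
# from the table above, after sharding across a TP group.
BYTES_PER_PARAM = 2  # bf16

modules = {
    "attn1 (SelfAttention)": 150_995_000,
    "attn2 (CrossAttention)": 150_995_000,
    "ff (FeedForward)": 301_990_000,
}

for tp_size in (1, 2, 4, 8):
    for name, params in modules.items():
        shard_mib = params * BYTES_PER_PARAM / tp_size / 2**20
        print(f"TP{tp_size} {name}: {shard_mib:,.1f} MiB of weights per rank")
```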

Parallel efficiency comparison

| GPUs | Strategy | Parallel dims | Time per iteration | Scaling efficiency | Memory usage |
| --- | --- | --- | --- | --- | --- |
| 1 | Baseline | TP1 SP1 | 213.60 s | 1.00x | 92,170M |
| 2 | TP | TP2 (Self+Cross+FFN) | 108.97 s | 0.98x | 57,458M (-37.7%) |
| 2 | SP | SP2 | 108.13 s | 0.99x | 86,258M (-6.4%) |
| 4 | TP | TP4 (Self+Cross+FFN) | 57.61 s | 0.93x | 36,566M (-60.3%) |
| 4 | SP | SP4 | 57.01 s | 0.94x | 78,226M (-15.1%) |
| 8 | TP | TP8 (Self+Cross+FFN) | 30.40 s | 0.88x | 30,028M (-67.4%) |
| 8 | SP | SP8 | 30.10 s | 0.89x | 79,684M (-13.5%) |

Notes

  1. TP dimension: multi-dimensional sharding spanning the SelfAttention/CrossAttention/FFN layers
  2. SP dimension: the sequence-parallel sharding dimension
  3. The percentages in parentheses in the memory column are reductions relative to the baseline (the efficiency column can be recomputed from the timings; see the check below)
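
The efficiency column can be reproduced from the timings alone: speedup is the baseline time divided by the parallel time, and the tabulated x-value is that speedup divided by the GPU count. A small check script, with values copied from the table:

```python
# Recompute speedup and per-GPU scaling efficiency from the benchmark table.
baseline_s = 213.60

runs = [
    ("TP2", 2, 108.97),
    ("SP2", 2, 108.13),
    ("TP4", 4, 57.61),
    ("SP4", 4, 57.01),
    ("TP8", 8, 30.40),
    ("SP8", 8, 30.10),
]

for name, gpus, seconds in runs:
    speedup = baseline_s / seconds
    efficiency = speedup / gpus  # matches the table's x-values
    print(f"{name}: speedup {speedup:.2f}x, efficiency {efficiency:.2f}x")
```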

Key benefits

  • Memory savings: under TP8, memory usage drops by 49,656M relative to SP8, a saving of roughly 54% of the baseline footprint
  • Hardware fit
    • Consumer GPUs: full inference support on an 8×32GB setup
    • Inference cards: full inference support on a 4×48GB setup

@feifeibear merged commit 6875fca into xdit-project:main on Feb 25, 2025
2 of 3 checks passed