Fix inaccurate comment in `parallel_state`. #473

c8ef · 2025-03-06T16:23:52Z

Fixes: #472.

c8ef · 2025-03-06T16:27:44Z

I'm not sure if this is the right approach. Could you please take a look? Thanks! @feifeibear

feifeibear · 2025-03-07T01:58:41Z

xfuser/core/distributed/parallel_state.py

@@ -338,12 +337,12 @@ def initialize_model_parallel(

    dp_degree (2) * cfg_degree (2) * sp_degree (2) * pp_degree (2) = 16.

-    The present function will create 2 data parallel-groups,
+    The present function will create 8 data-parallel groups,


The original description is correct. It is 2 DP groups. The degree means the number of processes in a group.

Are the CFG, PP, and SP groups also two-degree? Should I update those as well?

I see your confusion. The dp degree is special here. It means the number of groups.

The degree means the NCCL communication group size.

from xfuser.core.distributed.utils import RankGenerator

rank = RankGenerator(tp=1, dp=2, cfg=2, sp=2, pp=2, order="tp-sp-pp-cfg-dp")
print("dp :", rank.get_ranks("dp"))
print("cfg:", rank.get_ranks("cfg"))
print("sp :", rank.get_ranks("sp"))
print("pp :", rank.get_ranks("pp"))
# dp : [[0, 8], [1, 9], [2, 10], [3, 11], [4, 12], [5, 13], [6, 14], [7, 15]]
# cfg: [[0, 4], [1, 5], [2, 6], [3, 7], [8, 12], [9, 13], [10, 14], [11, 15]]
# sp : [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
# pp : [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]

The code above produces ranks arranged as [[0, 8], [1, 9], [2, 10], [3, 11], [4, 12], [5, 13], [6, 14], [7, 15]], but the current comment suggests a different arrangement: [g0, g1, g2, g3, g4, g5, g6, g7], [g8, g9, g10, g11, g12, g13, g14, g15]. I'm not sure which is correct. If it's the former arrangement, does that still mean we have 8 groups?

The code is correct. Would you like to update the MR to match the real outputs?

I'm still a bit confused. If the dp degree refers to the number of groups, the code's output won't match the current comment. If the dp degree refers to the NCCL group size, then the current MR will be sufficient. Which part do I need to revise: the description above, the output below, or both?
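To make the two readings of "degree" concrete, here is a minimal pure-Python sketch that reproduces the rank lists quoted in this thread. It is not the `RankGenerator` implementation: `group_ranks` is a hypothetical helper, and it assumes that the first name in the `order` string varies fastest across ranks. Under that assumption, each group has `degree` members and there are `world_size / degree` groups per dimension.

```python
from itertools import product


def group_ranks(degrees, order, token):
    """Return the rank groups for one parallel dimension.

    Hypothetical helper for illustration only. Assumes the first name in
    `order` is the fastest-varying dimension when ranks are laid out.
    """
    # Stride of each dimension in the flat rank index.
    strides = {}
    stride = 1
    for name in order:
        strides[name] = stride
        stride *= degrees[name]

    # Enumerate every combination of the other dimensions. Listing them
    # slowest-first makes itertools.product vary the fastest one last,
    # so groups come out in ascending order of their first rank.
    others = [name for name in reversed(order) if name != token]
    groups = []
    for coords in product(*(range(degrees[name]) for name in others)):
        base = sum(c * strides[name] for c, name in zip(coords, others))
        groups.append([base + i * strides[token] for i in range(degrees[token])])
    return groups


degrees = {"tp": 1, "sp": 2, "pp": 2, "cfg": 2, "dp": 2}
order = "tp-sp-pp-cfg-dp".split("-")
world_size = 16

for token in ("dp", "cfg", "sp", "pp"):
    groups = group_ranks(degrees, order, token)
    # Group size equals the degree; group count equals world_size // degree.
    print(f"{token:3}: {len(groups)} groups of size {len(groups[0])}: {groups}")
```

With 16 ranks and every degree set to 2, each of dp, cfg, sp, and pp yields 8 groups of 2 ranks in this sketch, which agrees with the output pasted earlier in the thread.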

fix comment & style

6162cc5

feifeibear requested changes Mar 7, 2025

c8ef requested a review from feifeibear March 7, 2025 15:43

feifeibear approved these changes Mar 9, 2025

feifeibear merged commit 124c822 into xdit-project:main Mar 9, 2025

c8ef deleted the patch-1 branch March 9, 2025 04:52
