Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inaccurate comment in parallel_state. #473

Merged
merged 1 commit into from
Mar 9, 2025

Conversation

c8ef
Copy link
Contributor

@c8ef c8ef commented Mar 6, 2025

Fixes: #472.

@c8ef
Copy link
Contributor Author

c8ef commented Mar 6, 2025

I'm not sure if this is the right approach. Could you please take a look? Thanks! @feifeibear

@@ -338,12 +337,12 @@ def initialize_model_parallel(
dp_degree (2) * cfg_degree (2) * sp_degree (2) * pp_degree (2) = 16.
The present function will create 2 data parallel-groups,
The present function will create 8 data-parallel groups,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original description is correct. It is 2 DP groups. The degree means the number of processes in a group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the CFG, PP, and SP groups also two-degree? Should I update those as well?

Copy link
Collaborator

@feifeibear feifeibear Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your confusion. The dp degree is special here. It means the number of groups.

The degree means the nccl communication group size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from xfuser.core.distributed.utils import RankGenerator

rank = RankGenerator(tp=1, dp=2, cfg=2, sp=2, pp=2, order="tp-sp-pp-cfg-dp")
print("dp :", rank.get_ranks("dp"))
print("cfg:", rank.get_ranks("cfg"))
print("sp :", rank.get_ranks("sp"))
print("pp :", rank.get_ranks("pp"))

# dp : [[0, 8], [1, 9], [2, 10], [3, 11], [4, 12], [5, 13], [6, 14], [7, 15]]
# cfg: [[0, 4], [1, 5], [2, 6], [3, 7], [8, 12], [9, 13], [10, 14], [11, 15]]
# sp : [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
# pp : [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]

The code above produces ranks arranged as [[0, 8], [1, 9], [2, 10], [3, 11], [4, 12], [5, 13], [6, 14], [7, 15]], but the current comment suggests a different arrangement: [g0, g1, g2, g3, g4, g5, g6, g7], [g8, g9, g10, g11, g12, g13, g14, g15]. I'm not sure which is correct. If it's the former arrangement, does that still mean we have 8 groups?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code is correct. Would you like to update the MR w.r.t the real outputs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit confused. If the dp degree refers to the number of groups, the code's output won't match the current comment. If the dp degree refers to the NCCL group size, then the current MR will be sufficient. Which part do I need to revise: the description above, the output below, or both?

@c8ef c8ef requested a review from feifeibear March 7, 2025 15:43
@feifeibear feifeibear merged commit 124c822 into xdit-project:main Mar 9, 2025
@c8ef c8ef deleted the patch-1 branch March 9, 2025 04:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Comment in parallel_state is inaccurate.
2 participants