[training] fix: validate hierarchical_context_parallel_sizes for a2a+p2p cp_comm_type (#2665)
…p2p cp_comm_type

Prevent silent training degradation when `cp_comm_type='a2a+p2p'` is used without `hierarchical_context_parallel_sizes`. Without this validation, context parallel communication is silently disabled: each CP rank attends only to its local chunk, producing artificially high throughput but broken training.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…ore submodule

Revert the Megatron-LM submodule bump and move the `hierarchical_context_parallel_sizes` / `cp_comm_type` validations into `ConfigContainer._validate_cp_comm_type()` on the bridge side instead.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
/ok to test b32f868
Summary
- Validate that `hierarchical_context_parallel_sizes` is set when `cp_comm_type='a2a+p2p'` is used
- Guard the decentralized PG path (`use_decentralized_pg=True`), which does not support hierarchical CP groups

Problem
Setting `cp_comm_type: "a2a+p2p"` without `hierarchical_context_parallel_sizes` causes context parallel communication to be silently disabled. Each CP rank only attends to its local `1/CP` chunk instead of the full sequence, so throughput looks artificially high while training is broken.

Megatron-LM catches this via an assertion in `arguments.py:460-462`, but that check only runs through the CLI path. Megatron Bridge (and any code that constructs `TransformerConfig` directly) bypasses it entirely.
What changed

MCore (`TransformerConfig.__post_init__`)

Two assertions were added alongside the existing `cp_comm_type` type checks:

- If `cp_comm_type` contains `"a2a+p2p"`, require `hierarchical_context_parallel_sizes` to be set (with a descriptive error explaining the failure mode)
- If `hierarchical_context_parallel_sizes` is set, validate that `prod(sizes) == context_parallel_size`
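The checks described above can be sketched as follows. This is a simplified, hypothetical stand-in (`CPConfig` is not the real class; the actual assertions live in `TransformerConfig.__post_init__` in Megatron Core, where `cp_comm_type` may also be a list):

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Optional


@dataclass
class CPConfig:
    """Simplified stand-in for the context-parallel fields of TransformerConfig."""

    context_parallel_size: int = 1
    cp_comm_type: str = "p2p"
    hierarchical_context_parallel_sizes: Optional[List[int]] = None

    def __post_init__(self):
        # a2a+p2p is hierarchical: it cannot work without sub-group sizes.
        if "a2a+p2p" in self.cp_comm_type:
            assert self.hierarchical_context_parallel_sizes is not None, (
                "cp_comm_type='a2a+p2p' requires hierarchical_context_parallel_sizes; "
                "without it, CP communication is silently disabled and each rank "
                "attends only to its local chunk."
            )
        # The sub-group sizes must multiply out to the full CP size.
        if self.hierarchical_context_parallel_sizes is not None:
            assert prod(self.hierarchical_context_parallel_sizes) == self.context_parallel_size, (
                f"prod({self.hierarchical_context_parallel_sizes}) must equal "
                f"context_parallel_size={self.context_parallel_size}"
            )
```

A valid config such as `CPConfig(context_parallel_size=16, cp_comm_type="a2a+p2p", hierarchical_context_parallel_sizes=[8, 2])` passes, while omitting the sizes now fails loudly at construction time instead of silently degrading training.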
initialize.py)The decentralized PG path (
_create_pg_collection/HyperCommGrid) hardcodeshcp=Noneand cannot create hierarchical CP groups. Added aNotImplementedErrorguard so users get a clear error instead of silent breakage.How to use
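The guard might look roughly like this. The field names (`use_decentralized_pg`, `cp_comm_type`) come from the PR text, but the function shape is illustrative, not the actual bridge code:

```python
def check_decentralized_pg_supports_cp(use_decentralized_pg: bool, cp_comm_type: str) -> None:
    """Fail fast when a2a+p2p is combined with the decentralized PG path.

    That path hardcodes hcp=None and cannot build the hierarchical CP
    groups that a2a+p2p requires, so it would otherwise break silently.
    """
    if use_decentralized_pg and "a2a+p2p" in cp_comm_type:
        raise NotImplementedError(
            "cp_comm_type='a2a+p2p' requires hierarchical CP groups, which the "
            "decentralized PG path (use_decentralized_pg=True) does not support. "
            "Set use_decentralized_pg=False to use a2a+p2p."
        )
```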
How to use `a2a+p2p` correctly

`a2a+p2p` is a hierarchical CP communication strategy: a2a (all-to-all, like DeepSpeed Ulysses) within a sub-group, typically intra-node over NVLink, and p2p (ring) between sub-groups, typically inter-node over InfiniBand links. To enable it, both `cp_comm_type: "a2a+p2p"` and `hierarchical_context_parallel_sizes` must be set.
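In YAML form, the two required settings might look like this (the field names are from the PR; the surrounding config structure and the 2-node × 8-GPU layout are illustrative):

```yaml
# 16-way context parallelism across 2 nodes with 8 GPUs each (illustrative)
context_parallel_size: 16
cp_comm_type: "a2a+p2p"
hierarchical_context_parallel_sizes: [8, 2]  # a2a within a node, p2p across nodes
```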
The product of `hierarchical_context_parallel_sizes` must equal `context_parallel_size` (e.g. `8 * 2 = 16`). The first value is the a2a sub-group size (typically the number of GPUs per node), and the second is the p2p sub-group size (the number of nodes in the CP group).
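The product rule is a one-liner to check ahead of time, e.g. with `math.prod` (illustrative values matching the `8 * 2 = 16` example above):

```python
from math import prod

# [a2a sub-group size, p2p sub-group size]: 8 GPUs/node x 2 nodes
hierarchical_sizes = [8, 2]
context_parallel_size = 16

# Must hold, or Megatron's validation will reject the config.
assert prod(hierarchical_sizes) == context_parallel_size
```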
Note: `use_decentralized_pg` must be `False` (the default) when using `a2a+p2p`, since the decentralized PG path does not support hierarchical CP groups.

Test plan

- Existing recipes are unaffected (`a2a+p2p` is not used in any existing recipe)
- `cp_comm_type="a2a+p2p"` without `hierarchical_context_parallel_sizes` now raises a clear validation error
- `cp_comm_type="a2a+p2p"` with `hierarchical_context_parallel_sizes=[8, 2]` verified on multi-node

Made with Cursor