
[training] fix: validate hierarchical_context_parallel_sizes for a2a+p2p cp_comm_type #2665

Draft
yaoyu-33 wants to merge 2 commits into main from yuya/validate-hcp-cp-comm-type

Conversation


@yaoyu-33 yaoyu-33 commented Mar 5, 2026

Summary

  • Add config-time validation to prevent silent training degradation when cp_comm_type='a2a+p2p' is used without hierarchical_context_parallel_sizes
  • Add a guard in the Bridge's decentralized PG path (use_decentralized_pg=True), which does not support hierarchical CP groups

Problem

Setting cp_comm_type: "a2a+p2p" without hierarchical_context_parallel_sizes causes context parallel communication to be silently disabled. Each CP rank only attends to its local 1/CP chunk instead of the full sequence. Symptoms include:

  • Suspiciously high throughput (e.g. 4200 TPS/GPU instead of ~1200)
  • Impossibly high TFLOP/s (e.g. 1937 TFLOP/s/GPU — over 2x the H200 bf16 theoretical max)
  • Degraded model quality — loss curve is smooth but higher than expected; model appears to "learn" in a fragmented fashion

Megatron-LM catches this via an assertion in arguments.py:460-462, but that only runs through the CLI path. Megatron Bridge (and any code that constructs TransformerConfig directly) bypasses that check entirely.

What changed

MCore (TransformerConfig.__post_init__)

Two assertions added alongside the existing cp_comm_type type checks:

  1. When cp_comm_type contains "a2a+p2p" → require hierarchical_context_parallel_sizes to be set (with a descriptive error explaining the failure mode)
  2. When hierarchical_context_parallel_sizes is set → validate prod(sizes) == context_parallel_size
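The two checks above can be sketched as a standalone function. This is a hypothetical, simplified sketch of the validation logic described in this PR, not the actual MCore code; the function name and error messages are illustrative.

```python
import math


def validate_cp_comm_type(cp_comm_type, hierarchical_context_parallel_sizes,
                          context_parallel_size):
    """Sketch of the two config-time checks (illustrative, not MCore's code)."""
    # Check 1: "a2a+p2p" requires hierarchical CP sizes, otherwise CP
    # communication is silently disabled and each rank sees only its chunk.
    if cp_comm_type is not None and "a2a+p2p" in cp_comm_type:
        if hierarchical_context_parallel_sizes is None:
            raise ValueError(
                "cp_comm_type='a2a+p2p' requires hierarchical_context_parallel_sizes; "
                "without it, context-parallel communication is silently disabled."
            )
    # Check 2: the hierarchical group sizes must multiply out to the CP size.
    if hierarchical_context_parallel_sizes is not None:
        if math.prod(hierarchical_context_parallel_sizes) != context_parallel_size:
            raise ValueError(
                f"prod({hierarchical_context_parallel_sizes}) must equal "
                f"context_parallel_size={context_parallel_size}"
            )
```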

Bridge (initialize.py)

The decentralized PG path (_create_pg_collection / HyperCommGrid) hardcodes hcp=None and cannot create hierarchical CP groups. Added a NotImplementedError guard so users get a clear error instead of silent breakage.
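The guard might look roughly like the following. This is an illustrative sketch only; the function name and message are assumptions, not the exact code in initialize.py.

```python
def check_decentralized_pg(use_decentralized_pg, cp_comm_type):
    """Sketch of the Bridge-side guard (illustrative, not the actual code)."""
    # The decentralized PG path hardcodes hcp=None, so the hierarchical CP
    # groups required by "a2a+p2p" can never be created there. Fail loudly
    # at init time instead of silently breaking training.
    if use_decentralized_pg and cp_comm_type is not None and "a2a+p2p" in cp_comm_type:
        raise NotImplementedError(
            "cp_comm_type='a2a+p2p' requires hierarchical CP groups, which the "
            "decentralized PG path does not support; set use_decentralized_pg=False."
        )
```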

How to use a2a+p2p correctly

a2a+p2p is a hierarchical CP communication strategy: a2a (all-to-all, like DeepSpeed Ulysses) within a sub-group (typically intra-node via NVLink), and p2p (ring) between sub-groups (typically inter-node, e.g. over InfiniBand).

To enable it, both settings are required:

model:
  context_parallel_size: 16
  cp_comm_type: "a2a+p2p"
  hierarchical_context_parallel_sizes: [8, 2]  # a2a group size, p2p group size

The product of hierarchical_context_parallel_sizes must equal context_parallel_size (e.g. 8 * 2 = 16). The first value is the a2a sub-group size (typically GPUs per node), and the second is the p2p sub-group size (number of nodes in the CP group).

Note: use_decentralized_pg must be False (the default) when using a2a+p2p, since the decentralized PG path does not support hierarchical CP groups.
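Following the sizing rule above, the hierarchical sizes can be derived from the CP size and the node layout. A minimal sketch, assuming the common mapping where the a2a sub-group spans one node (the helper name and its gpus_per_node default are illustrative):

```python
def hierarchical_sizes(context_parallel_size, gpus_per_node=8):
    """Derive [a2a_size, p2p_size] such that a2a_size * p2p_size == CP size."""
    # a2a sub-group: GPUs within one node (NVLink-connected).
    # p2p sub-group: number of nodes participating in the CP group.
    assert context_parallel_size % gpus_per_node == 0, (
        "CP size must be a multiple of the a2a sub-group size"
    )
    return [gpus_per_node, context_parallel_size // gpus_per_node]
```

For the example config above, hierarchical_sizes(16) yields [8, 2], matching hierarchical_context_parallel_sizes: [8, 2].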

Test plan

  • Verify existing unit tests pass (no regressions — a2a+p2p is not used in any existing recipe)
  • Verify assertion fires when cp_comm_type="a2a+p2p" without hierarchical_context_parallel_sizes
  • Verify training works correctly with cp_comm_type="a2a+p2p" + hierarchical_context_parallel_sizes=[8, 2] on multi-node

Made with Cursor

…p2p cp_comm_type

Prevent silent training degradation when cp_comm_type='a2a+p2p' is used
without hierarchical_context_parallel_sizes. Without this validation,
context parallel communication is silently disabled — each CP rank
attends only to its local chunk, producing artificially high throughput
but broken training.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…ore submodule

Revert the Megatron-LM submodule bump and move the
hierarchical_context_parallel_sizes / cp_comm_type validations into
ConfigContainer._validate_cp_comm_type() on the bridge side instead.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

yaoyu-33 commented Mar 5, 2026

/ok to test b32f868
