
[training] fix: validate hierarchical_context_parallel_sizes for a2a+p2p cp_comm_type #2665

Draft
yaoyu-33 wants to merge 2 commits into main from yuya/validate-hcp-cp-comm-type

Conversation


@yaoyu-33 yaoyu-33 commented Mar 5, 2026

Summary

  • Add config-time validation to prevent silent training degradation when cp_comm_type='a2a+p2p' is used without hierarchical_context_parallel_sizes
  • Add a guard in the Bridge's decentralized PG path (use_decentralized_pg=True), which does not support hierarchical CP groups

Problem

Setting cp_comm_type: "a2a+p2p" without hierarchical_context_parallel_sizes causes context parallel communication to be silently disabled. Each CP rank only attends to its local 1/CP chunk instead of the full sequence. Symptoms include:

  • Suspiciously high throughput (e.g. 4200 TPS/GPU instead of ~1200)
  • Impossibly high TFLOP/s (e.g. 1937 TFLOP/s/GPU — over 2x the H200 bf16 theoretical max)
  • Degraded model quality — loss curve is smooth but higher than expected; model appears to "learn" in a fragmented fashion

Megatron-LM catches this via an assertion in arguments.py:460-462, but that only runs through the CLI path. Megatron Bridge (and any code that constructs TransformerConfig directly) bypasses that check entirely.

What changed

MCore (TransformerConfig.__post_init__)

Two assertions added alongside the existing cp_comm_type type checks:

  1. When cp_comm_type contains "a2a+p2p" → require hierarchical_context_parallel_sizes to be set (with a descriptive error explaining the failure mode)
  2. When hierarchical_context_parallel_sizes is set → validate prod(sizes) == context_parallel_size
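The two checks above can be sketched as a standalone function. This is a hypothetical, simplified sketch of the validation logic described in this PR, not the actual MCore code; the function name and error messages are illustrative.

```python
import math


def validate_cp_comm_type(cp_comm_type, hierarchical_context_parallel_sizes,
                          context_parallel_size):
    """Sketch of the two config-time checks (illustrative, not MCore's code)."""
    # Check 1: "a2a+p2p" requires hierarchical CP sizes, otherwise CP
    # communication is silently disabled and each rank sees only its chunk.
    if cp_comm_type is not None and "a2a+p2p" in cp_comm_type:
        if hierarchical_context_parallel_sizes is None:
            raise ValueError(
                "cp_comm_type='a2a+p2p' requires hierarchical_context_parallel_sizes; "
                "without it, context-parallel communication is silently disabled."
            )
    # Check 2: the hierarchical group sizes must multiply out to the CP size.
    if hierarchical_context_parallel_sizes is not None:
        if math.prod(hierarchical_context_parallel_sizes) != context_parallel_size:
            raise ValueError(
                f"prod({hierarchical_context_parallel_sizes}) must equal "
                f"context_parallel_size={context_parallel_size}"
            )
```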

Bridge (initialize.py)

The decentralized PG path (_create_pg_collection / HyperCommGrid) hardcodes hcp=None and cannot create hierarchical CP groups. Added a NotImplementedError guard so users get a clear error instead of silent breakage.
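The guard might look roughly like the following. This is an illustrative sketch only; the function name and message are assumptions, not the exact code in initialize.py.

```python
def check_decentralized_pg(use_decentralized_pg, cp_comm_type):
    """Sketch of the Bridge-side guard (illustrative, not the actual code)."""
    # The decentralized PG path hardcodes hcp=None, so the hierarchical CP
    # groups required by "a2a+p2p" can never be created there. Fail loudly
    # at init time instead of silently breaking training.
    if use_decentralized_pg and cp_comm_type is not None and "a2a+p2p" in cp_comm_type:
        raise NotImplementedError(
            "cp_comm_type='a2a+p2p' requires hierarchical CP groups, which the "
            "decentralized PG path does not support; set use_decentralized_pg=False."
        )
```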

How to use a2a+p2p correctly

a2a+p2p is a hierarchical CP communication strategy: a2a (all-to-all, like DeepSpeed Ulysses) within a sub-group (typically intra-node via NVLink), and p2p (ring) between sub-groups (typically inter-node, e.g. over InfiniBand).

To enable it, both settings are required:

model:
  context_parallel_size: 16
  cp_comm_type: "a2a+p2p"
  hierarchical_context_parallel_sizes: [8, 2]  # a2a group size, p2p group size

The product of hierarchical_context_parallel_sizes must equal context_parallel_size (e.g. 8 * 2 = 16). The first value is the a2a sub-group size (typically GPUs per node), and the second is the p2p sub-group size (number of nodes in the CP group).

Note: use_decentralized_pg must be False (the default) when using a2a+p2p, since the decentralized PG path does not support hierarchical CP groups.
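Following the sizing rule above, the hierarchical sizes can be derived from the CP size and the node layout. A minimal sketch, assuming the common mapping where the a2a sub-group spans one node (the helper name and its gpus_per_node default are illustrative):

```python
def hierarchical_sizes(context_parallel_size, gpus_per_node=8):
    """Derive [a2a_size, p2p_size] such that a2a_size * p2p_size == CP size."""
    # a2a sub-group: GPUs within one node (NVLink-connected).
    # p2p sub-group: number of nodes participating in the CP group.
    assert context_parallel_size % gpus_per_node == 0, (
        "CP size must be a multiple of the a2a sub-group size"
    )
    return [gpus_per_node, context_parallel_size // gpus_per_node]
```

For the example config above, hierarchical_sizes(16) yields [8, 2], matching hierarchical_context_parallel_sizes: [8, 2].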

Test plan

  • Verify existing unit tests pass (no regressions — a2a+p2p is not used in any existing recipe)
  • Verify assertion fires when cp_comm_type="a2a+p2p" without hierarchical_context_parallel_sizes
  • Verify training works correctly with cp_comm_type="a2a+p2p" + hierarchical_context_parallel_sizes=[8, 2] on multi-node

Made with Cursor

…p2p cp_comm_type

Prevent silent training degradation when cp_comm_type='a2a+p2p' is used
without hierarchical_context_parallel_sizes. Without this validation,
context parallel communication is silently disabled — each CP rank
attends only to its local chunk, producing artificially high throughput
but broken training.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…ore submodule

Revert the Megatron-LM submodule bump and move the
hierarchical_context_parallel_sizes / cp_comm_type validations into
ConfigContainer._validate_cp_comm_type() on the bridge side instead.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

yaoyu-33 commented Mar 5, 2026

/ok to test b32f868
