
cp_comm_type=a2a+p2p silently disables context parallelism without hierarchical_context_parallel_sizes #2667

@shanecmoran

Description

Setting cp_comm_type: a2a+p2p without hierarchical_context_parallel_sizes causes context-parallel attention to be silently disabled. Training appears to work (loss decreases, metrics are logged), but each CP rank attends only to its local 1/CP_SIZE slice of the sequence instead of the full context. The reported throughput far exceeds the hardware limit (1937 TFLOP/s on an H200, whose peak is ~990), because the FLOP counter assumes full-context attention that is never actually computed.

MCore's CLI validates this combination (arguments.py:448-450), but Bridge bypasses arguments.py and has no equivalent check.

Root Cause

  1. Without hierarchical_context_parallel_sizes, _HIERARCHICAL_CONTEXT_PARALLEL_GROUPS is never set (parallel_state.py:983-996)
  2. TEDotProductAttention.__init__ overwrites the valid cp_group with the result of get_hierarchical_context_parallel_groups(check_initialized=False), which returns None (transformer_engine.py:1242-1250)
  3. TE's DotProductAttention.forward silently falls through to cp_size=1 when cp_group=None

No error, no warning, no crash — just incorrect attention.
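The fall-through in step 3 can be illustrated with a toy sketch. This is not Transformer Engine's actual code, and `effective_cp_size` is a hypothetical helper name; it only mirrors the behavior described above:

```python
# Toy sketch of the silent fall-through in step 3 above.
# NOT Transformer Engine's real implementation; `effective_cp_size`
# is a hypothetical helper used purely to illustrate the failure mode.
def effective_cp_size(cp_group, configured_cp_size):
    # A missing process group is treated as "no context parallelism":
    # the forward pass proceeds with cp_size=1 and raises nothing.
    if cp_group is None:
        return 1
    return configured_cp_size
```

With cp_group=None, this returns 1 regardless of the configured CP size of 16, so every rank silently attends only to its local slice.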

Reproduction

```yaml
model:
  cp_comm_type: a2a+p2p
  context_parallel_size: 16
  # hierarchical_context_parallel_sizes NOT set
```

4 nodes x 8 H200 GPUs, nvcr.io/nvidia/nemo:26.02.

Suggested Fix

Add validation in Bridge's config.py (or ModelConfig.finalize()):

```python
import numpy as np

if model_config.cp_comm_type == "a2a+p2p":
    assert model_config.hierarchical_context_parallel_sizes is not None, (
        "hierarchical_context_parallel_sizes must be set when cp_comm_type is a2a+p2p"
    )
    assert (
        np.prod(model_config.hierarchical_context_parallel_sizes)
        == model_config.context_parallel_size
    ), "Product of hierarchical_context_parallel_sizes must equal context_parallel_size"
```

This mirrors the existing MCore CLI check in arguments.py:448-450.
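A standalone form of the same check is sketched below. The function name `validate_hierarchical_cp` is hypothetical, and `math.prod` is used to avoid the NumPy dependency; raising `ValueError` instead of asserting is a design choice so the check survives `python -O`:

```python
import math

def validate_hierarchical_cp(cp_comm_type, context_parallel_size,
                             hierarchical_context_parallel_sizes):
    """Hypothetical standalone form of the suggested Bridge validation."""
    if cp_comm_type == "a2a+p2p":
        if hierarchical_context_parallel_sizes is None:
            raise ValueError(
                "hierarchical_context_parallel_sizes must be set when "
                "cp_comm_type is a2a+p2p"
            )
        if math.prod(hierarchical_context_parallel_sizes) != context_parallel_size:
            raise ValueError(
                "Product of hierarchical_context_parallel_sizes must equal "
                "context_parallel_size"
            )
```

For the reproduction above, `validate_hierarchical_cp("a2a+p2p", 16, None)` would raise instead of silently training with broken attention, while `("a2a+p2p", 16, [8, 2])` passes.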

Correct Configuration

```yaml
model:
  cp_comm_type: a2a+p2p
  context_parallel_size: 16
  hierarchical_context_parallel_sizes: [8, 2]  # 8 intra-node a2a, 2 inter-node p2p
```
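As a quick sanity check on these numbers (a standalone sketch; the variable names are illustrative, not Bridge config fields):

```python
# Sanity arithmetic for the config above.
hierarchical_sizes = [8, 2]   # a2a level, then p2p level
context_parallel_size = 16
gpus_per_node = 8             # reproduction setup: 4 nodes x 8 H200

# The product of the levels must equal the overall CP size.
assert hierarchical_sizes[0] * hierarchical_sizes[1] == context_parallel_size

# Keeping the first (a2a) level at or below the node size keeps the
# bandwidth-heavy all-to-all traffic on intra-node links, as the config
# comment intends (a placement heuristic, not a hard requirement).
assert hierarchical_sizes[0] <= gpus_per_node
```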

Environment

  • Container: nvcr.io/nvidia/nemo:26.02
  • Megatron-LM: core_r0.16.0
