Description
Setting `cp_comm_type: a2a+p2p` without `hierarchical_context_parallel_sizes` causes context-parallel attention to be silently disabled. Training appears to work — loss decreases, metrics are logged — but each CP rank only attends to its local `1/CP_SIZE` slice of the sequence instead of the full context. Reported TFLOP/s dramatically exceeds hardware limits (1937 TFLOP/s on an H200, whose peak is ~990).
MCore's CLI validates this combination (`arguments.py:448-450`), but Bridge bypasses `arguments.py` and has no equivalent check.
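The out-of-range TFLOP/s figure follows directly from the FLOP accounting: the throughput counter assumes full-sequence attention, while each rank actually computes only its local slice. A back-of-the-envelope sketch (all numbers, including the attention FLOP share, are illustrative assumptions, not measurements from this run):

```python
# Hedged sketch: why reported TFLOP/s can exceed hardware peak when CP
# attention silently degrades to local-slice attention. Illustrative
# numbers only.

def attention_flops(seq_len: int) -> float:
    # Self-attention cost scales quadratically with sequence length;
    # constant factors cancel in the ratio below.
    return seq_len ** 2

def inflation_factor(seq_len: int, cp_size: int, attn_share: float) -> float:
    """Reported/actual throughput ratio when each of cp_size ranks only
    attends to its local seq_len/cp_size slice.

    attn_share: assumed fraction of total FLOPs attributed to attention
    over the full sequence (hypothetical parameter).
    """
    full = attention_flops(seq_len)
    local = cp_size * attention_flops(seq_len // cp_size)  # = full / cp_size
    # Work actually done, relative to what the FLOP counter assumes:
    actual = attn_share * (local / full) + (1 - attn_share)
    return 1.0 / actual

# With CP=16 and attention assumed at ~55% of total FLOPs, the reported
# rate roughly doubles — the same ballpark as 1937 vs ~990 TFLOP/s.
print(inflation_factor(seq_len=65536, cp_size=16, attn_share=0.55))
```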
Root Cause
- Without `hierarchical_context_parallel_sizes`, `_HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` is never set (`parallel_state.py:983-996`)
- `TEDotProductAttention.__init__` overwrites a valid `cp_group` with `get_hierarchical_context_parallel_groups(check_initialized=False)`, which returns `None` (`transformer_engine.py:1242-1250`)
- TE's `DotProductAttention.forward` silently falls through to `cp_size=1` when `cp_group=None`
No error, no warning, no crash — just incorrect attention.
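The failure mode is the classic optional-argument fallthrough. A minimal sketch of the pattern (a simplified stand-in, not TE's actual code; `resolve_cp_group` is a hypothetical guard the bridge could apply):

```python
# Simplified stand-in for TE's behavior: cp_group=None is interpreted as
# "no context parallelism", so attention quietly runs with cp_size=1.
class DotProductAttentionSketch:
    def __init__(self, cp_group=None):
        self.cp_group = cp_group

    def effective_cp_size(self) -> int:
        # Silent fallthrough: a None group means each rank attends only
        # to its local slice, and no warning is raised.
        return self.cp_group["size"] if self.cp_group is not None else 1


# Hypothetical guard: fail fast instead of degrading silently.
def resolve_cp_group(hierarchical_groups, configured_cp_size: int):
    if hierarchical_groups is None and configured_cp_size > 1:
        raise ValueError(
            "hierarchical context-parallel groups were never initialized; "
            "refusing to fall back to cp_size=1 silently"
        )
    return hierarchical_groups
```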
Reproduction
```yaml
model:
  cp_comm_type: a2a+p2p
  context_parallel_size: 16
  # hierarchical_context_parallel_sizes NOT set
```
Run on 4 nodes x 8 H200 GPUs, nvcr.io/nvidia/nemo:26.02.
Suggested Fix
Add validation in Bridge's config.py (or ModelConfig.finalize()):
```python
if model_config.cp_comm_type == "a2a+p2p":
    assert model_config.hierarchical_context_parallel_sizes is not None, (
        "hierarchical_context_parallel_sizes must be set when cp_comm_type is a2a+p2p"
    )
    assert (
        np.prod(model_config.hierarchical_context_parallel_sizes)
        == model_config.context_parallel_size
    ), "Product of hierarchical_context_parallel_sizes must equal context_parallel_size"
```
This mirrors the existing MCore CLI check in `arguments.py:448-450`.
Correct Configuration
```yaml
model:
  cp_comm_type: a2a+p2p
  context_parallel_size: 16
  hierarchical_context_parallel_sizes: [8, 2]  # 8 intra-node a2a, 2 inter-node p2p
```
Environment
- Container: nvcr.io/nvidia/nemo:26.02
- Megatron-LM: core_r0.16.0