Description
Setting `cp_comm_type: a2a+p2p` without `hierarchical_context_parallel_sizes` causes context-parallel attention to be silently disabled. Training appears to work — loss decreases, metrics are logged — but each CP rank only attends to its local `1/CP_SIZE` slice of the sequence instead of the full context. Reported TFLOP/s dramatically exceeds hardware limits (1937 TFLOP/s on an H200, whose peak is ~990).
MCore's CLI validates this combination (`arguments.py:448-450`), but Bridge bypasses `arguments.py` and has no equivalent check.
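The out-of-range TFLOP/s figure follows directly from the FLOP accounting: the throughput counter assumes full-sequence attention, while each rank actually computes only its local slice. A back-of-the-envelope sketch (all numbers, including the attention FLOP share, are illustrative assumptions, not measurements from this run):

```python
# Hedged sketch: why reported TFLOP/s can exceed hardware peak when CP
# attention silently degrades to local-slice attention. Illustrative
# numbers only.

def attention_flops(seq_len: int) -> float:
    # Self-attention cost scales quadratically with sequence length;
    # constant factors cancel in the ratio below.
    return seq_len ** 2

def inflation_factor(seq_len: int, cp_size: int, attn_share: float) -> float:
    """Reported/actual throughput ratio when each of cp_size ranks only
    attends to its local seq_len/cp_size slice.

    attn_share: assumed fraction of total FLOPs attributed to attention
    over the full sequence (hypothetical parameter).
    """
    full = attention_flops(seq_len)
    local = cp_size * attention_flops(seq_len // cp_size)  # = full / cp_size
    # Work actually done, relative to what the FLOP counter assumes:
    actual = attn_share * (local / full) + (1 - attn_share)
    return 1.0 / actual

# With CP=16 and attention assumed at ~55% of total FLOPs, the reported
# rate roughly doubles — the same ballpark as 1937 vs ~990 TFLOP/s.
print(inflation_factor(seq_len=65536, cp_size=16, attn_share=0.55))
```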
Root Cause
- Without `hierarchical_context_parallel_sizes`, `_HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` is never set (`parallel_state.py:983-996`)
- `TEDotProductAttention.__init__` overwrites a valid `cp_group` with `get_hierarchical_context_parallel_groups(check_initialized=False)`, which returns `None` (`transformer_engine.py:1242-1250`)
- TE's `DotProductAttention.forward` silently falls through to `cp_size=1` when `cp_group=None`
No error, no warning, no crash — just incorrect attention.
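The failure mode is the classic optional-argument fallthrough. A minimal sketch of the pattern (a simplified stand-in, not TE's actual code; `resolve_cp_group` is a hypothetical guard the bridge could apply):

```python
# Simplified stand-in for TE's behavior: cp_group=None is interpreted as
# "no context parallelism", so attention quietly runs with cp_size=1.
class DotProductAttentionSketch:
    def __init__(self, cp_group=None):
        self.cp_group = cp_group

    def effective_cp_size(self) -> int:
        # Silent fallthrough: a None group means each rank attends only
        # to its local slice, and no warning is raised.
        return self.cp_group["size"] if self.cp_group is not None else 1


# Hypothetical guard: fail fast instead of degrading silently.
def resolve_cp_group(hierarchical_groups, configured_cp_size: int):
    if hierarchical_groups is None and configured_cp_size > 1:
        raise ValueError(
            "hierarchical context-parallel groups were never initialized; "
            "refusing to fall back to cp_size=1 silently"
        )
    return hierarchical_groups
```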
Reproduction
```yaml
model:
  cp_comm_type: a2a+p2p
  context_parallel_size: 16
  # hierarchical_context_parallel_sizes NOT set
```
Run on 4 nodes x 8 H200 GPUs, nvcr.io/nvidia/nemo:26.02.
Suggested Fix
Add validation in Bridge's config.py (or ModelConfig.finalize()):
```python
if model_config.cp_comm_type == "a2a+p2p":
    assert model_config.hierarchical_context_parallel_sizes is not None, (
        "hierarchical_context_parallel_sizes must be set when cp_comm_type is a2a+p2p"
    )
    assert (
        np.prod(model_config.hierarchical_context_parallel_sizes)
        == model_config.context_parallel_size
    ), "Product of hierarchical_context_parallel_sizes must equal context_parallel_size"
```
This mirrors the existing MCore CLI check in `arguments.py:448-450`.
Correct Configuration
```yaml
model:
  cp_comm_type: a2a+p2p
  context_parallel_size: 16
  hierarchical_context_parallel_sizes: [8, 2]  # 8 intra-node a2a, 2 inter-node p2p
```
Environment
- Container: nvcr.io/nvidia/nemo:26.02
- Megatron-LM: core_r0.16.0