Feature Request: Context Parallelism Support for NemotronV3
Summary
NeMo AutoModel's context parallelism (cp_size > 1) does not support NemotronV3 models. The apply_cp function in parallelizer.py assumes every block has a self_attn attribute, but NemotronV3's hybrid Mamba2 + Attention + MoE architecture uses NemotronV3Block, which exposes its token mixer as mixer rather than self_attn — and not every block contains attention at all.
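The mismatch can be sketched with stand-in classes (this is illustrative code, not the actual NeMo AutoModel or NemotronV3 implementation — class names, the cp_mesh attribute, and the helper are hypothetical):

```python
class _Attn:
    """Stand-in for the attention module that apply_cp needs to reconfigure."""

class Qwen3StyleBlock:
    """Homogeneous transformer block: attention always lives at .self_attn."""
    def __init__(self):
        self.self_attn = type("SelfAttn", (), {"attn_module": _Attn()})()

class NemotronV3AttentionBlock:
    """Hybrid block whose token mixer happens to be attention, under .mixer."""
    def __init__(self):
        self.mixer = type("Mixer", (), {"attn_module": _Attn()})()

class NemotronV3MambaBlock:
    """Attention-free block: a Mamba2 mixer with no attn_module at all."""
    def __init__(self):
        self.mixer = type("Mamba2Mixer", (), {})()

def find_attn_module(block):
    """Duck-typed lookup instead of the hard-coded block.self_attn.attn_module.

    Probes both layouts and returns None for attention-free blocks so a CP
    loop could skip them instead of raising AttributeError.
    """
    parent = getattr(block, "self_attn", None) or getattr(block, "mixer", None)
    return getattr(parent, "attn_module", None)

def apply_cp_sketch(blocks, cp_mesh):
    """Apply CP only to blocks that actually carry attention."""
    sharded = []
    for block in blocks:
        attn = find_attn_module(block)
        if attn is None:
            continue  # Mamba2 (or other attention-free) block: nothing to shard
        attn.cp_mesh = cp_mesh  # placeholder for the real CP wiring
        sharded.append(block)
    return sharded
```

The current apply_cp is effectively the Qwen3-only path: it dereferences block.self_attn.attn_module unconditionally, which is exactly the line that raises below.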
Why This Matters
NemotronV3-Nano-30B-A3B is a 30B-total / 3B-active MoE model competitive with Qwen3-30B-A3B, but without CP support the maximum feasible sequence length is ~16K-32K on H200 (with activation checkpointing). This makes it unusable for long-context fine-tuning tasks that require 200K+ context, which are straightforward with Qwen3 + CP.
Adding CP support for NemotronV3 would unlock long-context fine-tuning for this architecture.
Reproduction
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Image: nvcr.io/nvidia/nemo-automodel:25.11.00
Config: Identical to a working Qwen3-30B-A3B config (cp=8, ep=4, seq_length=225280) with only the model swapped.
Error with attn: te backend
File ".../nemo_automodel/components/moe/parallelizer.py", line 310, in parallelize_model
apply_cp(model, world_mesh[cp_axis_name])
File ".../nemo_automodel/components/moe/parallelizer.py", line 276, in apply_cp
attn_module = block.self_attn.attn_module
^^^^^^^^^^^^^^^
AttributeError: 'NemotronV3Block' object has no attribute 'self_attn'
Error with attn: sdpa backend
Identical failure — the error occurs before the attention backend matters, during CP mesh setup:
File ".../nemo_automodel/components/moe/parallelizer.py", line 310, in parallelize_model
apply_cp(model, world_mesh[cp_axis_name])
File ".../nemo_automodel/components/moe/parallelizer.py", line 276, in apply_cp
attn_module = block.self_attn.attn_module
^^^^^^^^^^^^^^^
AttributeError: 'NemotronV3Block' object has no attribute 'self_attn'
This confirms the issue is in the CP parallelization layer, not the attention backend.
Call Stack
nemo_trainer.py:142 run_training → recipe.setup()
→ train_ft.py:937 setup → build_model()
→ train_ft.py:225 build_model → cfg_model.instantiate()
→ auto_model.py:454 from_pretrained → cls._build_model()
→ auto_model.py:316 _build_model → apply_model_infrastructure()
→ infrastructure.py:448 → _shard_ep_fsdp()
→ infrastructure.py:124 → parallelize_fn()
→ parallelizer.py:310 parallelize_model → apply_cp()
→ parallelizer.py:276 apply_cp → block.self_attn.attn_module ← FAILS
Environment
- NeMo AutoModel: nvcr.io/nvidia/nemo-automodel:25.11.00 (r0.3.0)
- Hardware: 8x H200 per node, 8 nodes (64 GPUs)
- Parallelism: cp=8, ep=4, dp=8 (FSDP2)
Note
NemotronV3 with context parallelism appears to already be supported in Megatron-Bridge.