
[Feature Request] Context Parallelism Support for NemotronV3 #1409

@LoganVegnaSHOP

Description

Feature Request: Context Parallelism Support for NemotronV3

Summary

NeMo AutoModel's context parallelism (cp_size > 1) does not support NemotronV3 models. The apply_cp function in parallelizer.py assumes every block has a self_attn attribute, but NemotronV3's hybrid Mamba2 + Attention + MoE architecture uses NemotronV3Block, which does not expose self_attn on any block type (each block stores its layer under mixer instead), and not all blocks even contain attention.

Why This Matters

NemotronV3-Nano-30B-A3B is a 30B-total / 3B-active MoE model competitive with Qwen3-30B-A3B, but without CP support the maximum feasible sequence length is ~16K-32K on H200 (with activation checkpointing). This makes it unusable for long-context fine-tuning tasks that require 200K+ context, which are straightforward with Qwen3 + CP.

Adding CP support for NemotronV3 would unlock long-context fine-tuning for this architecture.

Reproduction

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Image: nvcr.io/nvidia/nemo-automodel:25.11.00
Config: Identical to a working Qwen3-30B-A3B config (cp=8, ep=4, seq_length=225280) with only the model swapped.

Error with attn: te backend

File ".../nemo_automodel/components/moe/parallelizer.py", line 310, in parallelize_model
    apply_cp(model, world_mesh[cp_axis_name])
  File ".../nemo_automodel/components/moe/parallelizer.py", line 276, in apply_cp
    attn_module = block.self_attn.attn_module
                  ^^^^^^^^^^^^^^^
AttributeError: 'NemotronV3Block' object has no attribute 'self_attn'

Error with attn: sdpa backend

Identical failure — the error occurs before the attention backend matters, during CP mesh setup:

File ".../nemo_automodel/components/moe/parallelizer.py", line 310, in parallelize_model
    apply_cp(model, world_mesh[cp_axis_name])
  File ".../nemo_automodel/components/moe/parallelizer.py", line 276, in apply_cp
    attn_module = block.self_attn.attn_module
                  ^^^^^^^^^^^^^^^
AttributeError: 'NemotronV3Block' object has no attribute 'self_attn'

This confirms the issue is in the CP parallelization layer, not the attention backend.

Call Stack

nemo_trainer.py:142 run_training → recipe.setup()
  → train_ft.py:937 setup → build_model()
  → train_ft.py:225 build_model → cfg_model.instantiate()
  → auto_model.py:454 from_pretrained → cls._build_model()
  → auto_model.py:316 _build_model → apply_model_infrastructure()
  → infrastructure.py:448 → _shard_ep_fsdp()
  → infrastructure.py:124 → parallelize_fn()
  → parallelizer.py:310 parallelize_model → apply_cp()
  → parallelizer.py:276 apply_cp → block.self_attn.attn_module  ← FAILS

Environment

  • NeMo AutoModel: nvcr.io/nvidia/nemo-automodel:25.11.00 (r0.3.0)
  • Hardware: 8x H200 per node, 8 nodes (64 GPUs)
  • Parallelism: cp=8, ep=4, dp=8 (FSDP2)

Note

Nemotron-V3 with context parallelism appears to already be supported in Megatron-Bridge.
