Feature Request: Context Parallelism Support for NemotronV3
Summary
NeMo AutoModel's context parallelism (cp_size > 1) does not support NemotronV3 models. The apply_cp function in parallelizer.py assumes every block has a self_attn attribute, but NemotronV3's hybrid Mamba2 + Attention + MoE architecture uses NemotronV3Block, which exposes its token mixer as mixer rather than self_attn — and not every block contains attention at all.
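The mismatch can be sketched with stand-in classes (this is illustrative code, not the actual NeMo AutoModel or NemotronV3 implementation — class names, the cp_mesh attribute, and the helper are hypothetical):

```python
class _Attn:
    """Stand-in for the attention module that apply_cp needs to reconfigure."""

class Qwen3StyleBlock:
    """Homogeneous transformer block: attention always lives at .self_attn."""
    def __init__(self):
        self.self_attn = type("SelfAttn", (), {"attn_module": _Attn()})()

class NemotronV3AttentionBlock:
    """Hybrid block whose token mixer happens to be attention, under .mixer."""
    def __init__(self):
        self.mixer = type("Mixer", (), {"attn_module": _Attn()})()

class NemotronV3MambaBlock:
    """Attention-free block: a Mamba2 mixer with no attn_module at all."""
    def __init__(self):
        self.mixer = type("Mamba2Mixer", (), {})()

def find_attn_module(block):
    """Duck-typed lookup instead of the hard-coded block.self_attn.attn_module.

    Probes both layouts and returns None for attention-free blocks so a CP
    loop could skip them instead of raising AttributeError.
    """
    parent = getattr(block, "self_attn", None) or getattr(block, "mixer", None)
    return getattr(parent, "attn_module", None)

def apply_cp_sketch(blocks, cp_mesh):
    """Apply CP only to blocks that actually carry attention."""
    sharded = []
    for block in blocks:
        attn = find_attn_module(block)
        if attn is None:
            continue  # Mamba2 (or other attention-free) block: nothing to shard
        attn.cp_mesh = cp_mesh  # placeholder for the real CP wiring
        sharded.append(block)
    return sharded
```

The current apply_cp is effectively the Qwen3-only path: it dereferences block.self_attn.attn_module unconditionally, which is exactly the line that raises below.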
Why This Matters
NemotronV3-Nano-30B-A3B is a 30B-total / 3B-active MoE model competitive with Qwen3-30B-A3B, but without CP support the maximum feasible sequence length is ~16K-32K on H200 (with activation checkpointing). This makes it unusable for long-context fine-tuning tasks that require 200K+ context, which are straightforward with Qwen3 + CP.
Adding CP support for NemotronV3 would unlock long-context fine-tuning for this architecture.
Reproduction
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Image: nvcr.io/nvidia/nemo-automodel:25.11.00
Config: Identical to a working Qwen3-30B-A3B config (cp=8, ep=4, seq_length=225280) with only the model swapped.
Error with attn: te backend
File ".../nemo_automodel/components/moe/parallelizer.py", line 310, in parallelize_model
apply_cp(model, world_mesh[cp_axis_name])
File ".../nemo_automodel/components/moe/parallelizer.py", line 276, in apply_cp
attn_module = block.self_attn.attn_module
^^^^^^^^^^^^^^^
AttributeError: 'NemotronV3Block' object has no attribute 'self_attn'
Error with attn: sdpa backend
Identical failure — the error occurs before the attention backend matters, during CP mesh setup:
File ".../nemo_automodel/components/moe/parallelizer.py", line 310, in parallelize_model
apply_cp(model, world_mesh[cp_axis_name])
File ".../nemo_automodel/components/moe/parallelizer.py", line 276, in apply_cp
attn_module = block.self_attn.attn_module
^^^^^^^^^^^^^^^
AttributeError: 'NemotronV3Block' object has no attribute 'self_attn'
This confirms the issue is in the CP parallelization layer, not the attention backend.
Call Stack
nemo_trainer.py:142 run_training → recipe.setup()
→ train_ft.py:937 setup → build_model()
→ train_ft.py:225 build_model → cfg_model.instantiate()
→ auto_model.py:454 from_pretrained → cls._build_model()
→ auto_model.py:316 _build_model → apply_model_infrastructure()
→ infrastructure.py:448 → _shard_ep_fsdp()
→ infrastructure.py:124 → parallelize_fn()
→ parallelizer.py:310 parallelize_model → apply_cp()
→ parallelizer.py:276 apply_cp → block.self_attn.attn_module ← FAILS
Environment
- NeMo AutoModel: nvcr.io/nvidia/nemo-automodel:25.11.00 (r0.3.0)
- Hardware: 8x H200 per node, 8 nodes (64 GPUs)
- Parallelism: cp=8, ep=4, dp=8 (FSDP2)
Note
NemotronV3 with context parallelism appears to already be supported in Megatron-Bridge.