46 commits
1e9a874
feat(modelopt): mamba/hybrid branch in quantization_layer_spec
mxinO May 6, 2026
4b13374
test(modelopt): cover hybrid_layer_pattern and mamba_num_heads-only c…
mxinO May 6, 2026
fa1c7bc
feat(modelopt): Nano3 NVFP4 quant recipes (default/input-only/weight-…
mxinO May 6, 2026
b85720d
docs(modelopt): add metadata-informational note and rename resolve test
mxinO May 6, 2026
103aafe
fix(modelopt): use deny-all-then-enable pattern in Nano3 NVFP4 recipes
mxinO May 6, 2026
99180f3
fix(modelopt): align Nano3 NVFP4 YAMLs with NANO3_NVFP4_CFG
mxinO May 6, 2026
57f2c6c
docs: open debug log for Nano3 W4A4 NaN-amax issue
mxinO May 6, 2026
d538006
docs: record W4A4 NaN-amax investigation steps 1-5
mxinO May 6, 2026
5b2788a
docs: confirm W4A4 NaN-amax root cause + record fix
mxinO May 6, 2026
44605ea
docs: redirect W4A4 NaN-amax fix to MaxCalibrator.collect
mxinO May 6, 2026
8a3c0be
docs: refine W4A4 fix to handle MoE cross-rank check
mxinO May 6, 2026
4f6ed1a
docs: complete debug log with correct W4A4 NaN-amax root cause
mxinO May 7, 2026
7487b21
fix(modelopt-vllm): tolerate dummy-weight NaN in fakequant prolog
mxinO May 7, 2026
b3ee539
Keep local debug logs out of the branch
mxinO May 8, 2026
0ebb7b6
Leave repository ignore rules unchanged
mxinO May 8, 2026
8d3d666
Keep only Nano3 weight-only quant recipe
mxinO May 21, 2026
c610245
Remove empty modelopt unit test package
mxinO May 21, 2026
2cef209
Add Nano3 ModelOpt training examples
mxinO May 22, 2026
3330b08
Restore Nano3 W4A4 ModelOpt recipe
mxinO May 22, 2026
b0deae2
Keep Nano3 QARL smoke on full-parameter path
mxinO May 22, 2026
660ac08
Exercise QADOPD through NeMo Gym
mxinO May 23, 2026
fe8ba62
Add Qwen3 MoE QARL nightly coverage
mxinO May 28, 2026
27706ba
Align Qwen3 QARL example naming
mxinO May 28, 2026
d4ea940
Tighten Qwen3 QARL smoke assertions
mxinO May 28, 2026
62b41bc
Relax Qwen3 QARL smoke loss window
mxinO May 28, 2026
a2bdc94
Account for Qwen3 QARL nightly recipe
mxinO May 29, 2026
cbcd6c2
Route quantized MoE generation through Triton by default
mxinO May 30, 2026
e8aa158
Drop redundant quant worker helper test
mxinO May 30, 2026
0f474c4
Advance ModelOpt dependency for quant MoE support
mxinO May 30, 2026
936d759
Keep Nano3 QAD example on tested CP path
mxinO May 30, 2026
c033e3c
Document MoE and Mamba QARL support accurately
mxinO Jun 4, 2026
ccd0677
Preserve portable ModelOpt layer spec serialization
mxinO Jun 5, 2026
f48e5f6
Support ModelOpt Mamba stack specs during import
mxinO Jun 5, 2026
6dd72b9
Stop advertising legacy QARL export compatibility
mxinO Jun 5, 2026
1304d2a
Document the faster QARL layer-spec default
mxinO Jun 5, 2026
699fdaf
Align Megatron import tests with Mamba spec forwarding
mxinO Jun 5, 2026
99d3305
Address QARL review coverage and logging gaps
mxinO Jun 9, 2026
25c86a4
Make QARL layer-spec selection config-only
mxinO Jun 9, 2026
ddf4638
Clarify QARL layer-spec guidance
mxinO Jun 9, 2026
44d13d2
Keep Nano3 QA distillation nightly measurable
mxinO Jun 9, 2026
84976ce
Keep Nano3 nightly recipe minimized
mxinO Jun 9, 2026
dcfd7ab
Keep Nano3 QARL nightly grouped with distillation
mxinO Jun 11, 2026
066c32c
Keep QARL suite coverage aligned after rebase
mxinO Jun 11, 2026
ef91285
Keep distillation aligned with Nemo Gym spinup
mxinO Jun 11, 2026
5654e57
Merge remote-tracking branch 'origin/main' into mxin/moe-mamba-sft
mxinO Jun 12, 2026
82bf068
Merge remote-tracking branch 'origin/main' into mxin/moe-mamba-sft
mxinO Jun 13, 2026
15 changes: 13 additions & 2 deletions docs/guides/quantization-aware-rl.md
@@ -17,9 +17,20 @@ The following workflow + quantization recipe combinations have been validated en
| QA-Distillation | W4A4 | `NVFP4_DEFAULT_CFG` (NVFP4 weights + NVFP4 activations) | ✅ Converges | `examples/modelopt/qa_distillation_math_megatron.yaml` |
| QA-GRPO | W4A16 | `examples/modelopt/quant_configs/nvfp4_a16.yaml` (NVFP4 weights, native-dtype activations) | ✅ Converges | `examples/modelopt/qa_grpo_llama8b_megatron.v2.yaml` |
| QA-GRPO | W4A4 | `NVFP4_DEFAULT_CFG` | ⚠️ Known convergence issue | `examples/modelopt/qa_grpo_math_megatron.yaml` |
| QA-Distillation | W4A4 | `examples/modelopt/quant_configs/nano3_nvfp4_default.yaml` | ✅ Converges | `examples/modelopt/qa_distillation_nano3_megatron.yaml` |
| QA-GRPO | W4A16 | `NVFP4_MLP_WEIGHT_ONLY_CFG` | ✅ Smoke tested on MoE | `examples/modelopt/qa_grpo_qwen3_30ba3b_megatron.yaml` |

The `nvfp4_a16.yaml` custom YAML enables NVFP4 e2m1 weight quantization (with dynamic e4m3 micro-block scales) and leaves activations unquantized; weights are still exercised through both Megatron training and vLLM generation.
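
To make the weight format above concrete, here is a minimal pure-Python sketch of block-wise e2m1 fake quantization. It is an illustration, not ModelOpt's implementation: the block size of 16 is an assumption about NVFP4, the per-block scale is kept as a Python float instead of being rounded to e4m3, and `fake_quant_block` and `fake_quant` are hypothetical names.

```python
# Magnitudes representable by an e2m1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 micro-block size


def fake_quant_block(block):
    """Quantize then dequantize one block that shares a single scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # A real recipe would also round this scale to e4m3; that step is skipped here.
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable e2m1 magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def fake_quant(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quant_block(weights[start:start + BLOCK_SIZE]))
    return out
```

Because the scale is recomputed from each block's own maximum, an outlier only costs precision inside its own 16-element block rather than across the whole tensor.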

## ModelOpt Layer Spec Toggle

For QARL configs, try setting `policy.disable_modelopt_layer_spec=true` first.
This keeps ModelOpt quantization enabled while using the standard Megatron layer
specs instead of ModelOpt's custom layer specs. This is usually faster and works
for most models, but it is not guaranteed for every architecture or recipe. If
you encounter errors with the standard Megatron layer specs, leave it unset or
set it to `false` to exercise ModelOpt's Megatron layer-spec path.
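
The guidance above amounts to a boolean toggle with a fallback order. The sketch below restates it in Python under stated assumptions: `layer_spec_path` and `first_working_setting` are hypothetical helpers written for this illustration, not NeMo RL APIs, and `run` stands in for whatever launches a short smoke run with the given policy config.

```python
from typing import Callable


def layer_spec_path(policy_cfg: dict) -> str:
    """Return which layer-spec path a policy config selects."""
    # Unset behaves like `false`: ModelOpt's custom layer specs are used.
    if policy_cfg.get("disable_modelopt_layer_spec", False):
        return "megatron"
    return "modelopt"


def first_working_setting(run: Callable[[dict], None], policy_cfg: dict) -> bool:
    """Try the standard Megatron specs first, then fall back to ModelOpt's."""
    last_error = None
    for disable in (True, False):
        candidate = {**policy_cfg, "disable_modelopt_layer_spec": disable}
        try:
            run(candidate)  # hypothetical launcher supplied by the caller
        except Exception as err:  # broad on purpose: any failure triggers the fallback
            last_error = err
            continue
        return disable
    raise RuntimeError("both layer-spec paths failed") from last_error
```

In both cases quantization itself stays enabled; only the source of the Megatron layer specs changes.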

## Quantization-Aware GRPO (QA-GRPO)

### Configuration
@@ -154,7 +165,7 @@ uv run --extra mcore --extra modelopt \
--tp 1 --pp <pipeline-parallel-size>
```

-- `examples/modelopt/export_quantized_to_hf.py` is a thin wrapper around `Megatron-Bridge/examples/quantization/export.py` that registers `nemo_rl.` as an allowed `_target_` prefix so the saved layer-spec callback in QARL checkpoints can be instantiated during export. All CLI flags pass through to the upstream script unchanged.
+- `examples/modelopt/export_quantized_to_hf.py` is a thin wrapper around `Megatron-Bridge/examples/quantization/export.py`. All CLI flags pass through to the upstream script unchanged.
- `--hf-model-id` should point to the original (pre-training) HuggingFace model so that the exporter knows the model architecture and tokenizer.
- The `PYTHONPATH` prefix exposes Megatron-LM's `megatron.training` to the bridge script.
- **`--tp 1` is required**: modelopt currently does not support TP>1 at export time. Training at TP>1 is fine; the bridge re-shards on load via `mp_overrides`.
@@ -165,4 +176,4 @@ uv run --extra mcore --extra modelopt \
- **Generation**: Currently only vLLM is supported for generation.
- **DTensor backend**: Quantization support for the DTensor policy worker is not yet implemented.
- **Input quantization**: Only per-tensor input (activation) quantization is supported.
-- **Model support**: MoE (Mixture of Experts) and Mamba models are currently not supported.
+- **Model support**: Dense Transformer, MoE (Mixture of Experts), and hybrid MoE/Mamba models are supported on the Megatron policy + vLLM generation path when Megatron-Bridge and ModelOpt support the model architecture and quantization recipe. MoE/Mamba support is currently covered by smoke-tested example configs rather than broad convergence guarantees.
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
defaults: ../../../../examples/modelopt/qa_distillation_nano3_megatron.yaml
distillation:
  num_prompts_per_step: 4
  max_num_steps: 1
  val_period: 0
policy:
  disable_modelopt_layer_spec: false
  train_global_batch_size: 16
  max_total_sequence_length: 2048
  quant_calib_size: 16
  quant_sequence_length: 1024
  megatron_cfg:
    tensor_model_parallel_size: 1
    scheduler:
      lr_warmup_iters: 1
  generation:
    vllm_kwargs:
      tokenizer: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
logger:
  log_dir: logs/distillation-nano3-30ba3b-4n4g-megatron-qa-nvfp4-modelopt-spec
  wandb:
    project: nemo-rl
    name: distillation-nano3-30ba3b-4n4g-megatron-qa-nvfp4-modelopt-spec
  tensorboard:
    log_dir: tb_logs-distillation-nano3-30ba3b-4n4g-megatron-qa-nvfp4-modelopt-spec
cluster:
  gpus_per_node: 4
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
defaults: ../../../../examples/modelopt/qa_grpo_qwen3_30ba3b_megatron.yaml
18 changes: 3 additions & 15 deletions examples/modelopt/export_quantized_to_hf.py
@@ -14,26 +14,15 @@

"""Wrapper around Megatron-Bridge's quantization/export.py for QARL checkpoints.

-QARL checkpoints written by ``MegatronQuantPolicyWorker`` store the layer-spec
-callback as ``nemo_rl.modelopt.models.policy.workers.utils.quantization_layer_spec``.
-Megatron-Bridge's instantiator rejects ``_target_`` strings outside its built-in
-allowlist (``megatron.``, ``nemo.``, ``torch.``, ``transformers.``, ``numpy.``,
-``nvidia.``), so the upstream export script fails on these checkpoints. The
-training worker registers ``nemo_rl.`` at ``__init__`` time, but the allowlist
-is process-local — the upstream export script runs in a fresh ``torchrun``
-process and never instantiates the worker.
-
-This wrapper registers ``nemo_rl.`` as an allowed prefix and then delegates to
-``Megatron-Bridge/examples/quantization/export.py`` unchanged. All CLI arguments
-pass through.
+This keeps the NeMo RL example entry point next to the QARL recipes while
+delegating to ``Megatron-Bridge/examples/quantization/export.py`` unchanged.
+All CLI arguments pass through.
"""

import runpy
import sys
from pathlib import Path

-from megatron.bridge.utils.instantiate_utils import register_allowed_target_prefix

UPSTREAM_EXPORT = (
    Path(__file__).resolve().parents[2]
    / "3rdparty"
@@ -46,7 +35,6 @@


def main() -> None:
-    register_allowed_target_prefix("nemo_rl.")
    if not UPSTREAM_EXPORT.is_file():
        raise FileNotFoundError(
            f"Megatron-Bridge export script not found at {UPSTREAM_EXPORT}. "
Expand Down
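
The hunk above is cut off before the delegation call, so the following is only a sketch of how a wrapper can hand control to another script while passing CLI arguments through. It assumes `runpy.run_path` is the mechanism (the file does import `runpy` and `sys`); `delegate` is a hypothetical helper name, and the temporary script stands in for the upstream export script.

```python
import runpy
import sys
import tempfile
from pathlib import Path


def delegate(script: Path, argv: list[str]) -> dict:
    """Run `script` as if it had been invoked directly with `argv`."""
    if not script.is_file():
        raise FileNotFoundError(f"script not found at {script}")
    saved_argv = sys.argv
    # The delegated script sees its own path as argv[0], followed by our flags.
    sys.argv = [str(script), *argv]
    try:
        # run_name="__main__" makes the script's `if __name__ == "__main__"` block run.
        return runpy.run_path(str(script), run_name="__main__")
    finally:
        sys.argv = saved_argv


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the upstream script: it just records the flags it received.
        target = Path(tmp) / "upstream.py"
        target.write_text("import sys\nseen = sys.argv[1:]\n")
        result = delegate(target, ["--tp", "1"])
        print(result["seen"])
```

Running the target in-process keeps the wrapper thin: it needs no knowledge of the upstream script's argument parser, so new upstream flags work without changes to the wrapper.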
179 changes: 179 additions & 0 deletions examples/modelopt/qa_distillation_nano3_megatron.yaml
@@ -0,0 +1,179 @@
# Nano3 Quantization-Aware Distillation / OPD (QADOPD) with Megatron,
# vLLM, and ModelOpt.
#
# The student policy and rollout worker use Nano3 NVFP4 W4A4 quantization; the
# teacher stays BF16. This keeps the on-policy distillation target full
# precision while training the quantized student in the same loop.
#
# Example usage:
# uv run examples/nemo_gym/run_distillation_nemo_gym.py --config examples/modelopt/qa_distillation_nano3_megatron.yaml

defaults: "./qa_distillation_math_megatron.yaml"

distillation:
  # Keep the example moderate; scale these for production runs.
  num_prompts_per_step: 16
  num_generations_per_prompt: 4
  max_val_samples: null

loss_fn:
  # Reverse KL is the usual OPD direction for matching the teacher on student
  # rollouts while avoiding full support coverage pressure from forward KL.
  kl_type: "reverse"
  mixed_kl_weight: 0.0

policy:
  model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
  tokenizer:
    name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
    chat_template_kwargs: null
  hf_config_overrides:
    router_aux_loss_coef: 0
  train_global_batch_size: 64
  train_micro_batch_size: 1
  generation_batch_size: 64
  logprob_batch_size: 1
  max_total_sequence_length: 8192
  logprob_chunk_size: 1024
  offload_optimizer_for_logprob: true
  disable_modelopt_layer_spec: true
  # Keep this example on the non-packed CP=1 path covered by the Nano3
  # QAD smoke. Re-enable CP>1 only with a targeted Nano3 Mamba+distillation
  # smoke test.
  sequence_packing:
    enabled: false

  # Nano3 is a hybrid MoE/Mamba model. This recipe keeps attention and the
  # known Nano3-sensitive layers in BF16, while applying NVFP4 to weights and
  # layer-input activations.
  quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_default.yaml"
  quant_calib_data: "cnn_dailymail"
  quant_calib_size: 512
  quant_batch_size: 1
  quant_sequence_length: 2048

  megatron_cfg:
    empty_unused_memory_level: 1
    activation_checkpointing: true
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 1
    context_parallel_size: 1
    expert_tensor_parallel_size: 1
    expert_model_parallel_size: 8
    sequence_parallel: true
    freeze_moe_router: true
    moe_router_dtype: "fp32"
    moe_router_load_balancing_type: "none"
    moe_router_bias_update_rate: 1e-3
    moe_permute_fusion: true
    moe_enable_deepep: false
    moe_token_dispatcher_type: "alltoall"
    moe_shared_expert_overlap: false
    apply_rope_fusion: true
    bias_activation_fusion: false
    defer_fp32_logits: true
    gradient_accumulation_fusion: false
    scheduler:
      lr_warmup_iters: 20

  make_sequence_length_divisible_by: ${mul:${mul:${.megatron_cfg.tensor_model_parallel_size}, ${.megatron_cfg.context_parallel_size}}, 2}

  generation:
    max_new_tokens: ${..max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_default.yaml"
    vllm_cfg:
      expose_http_server: true
      tensor_parallel_size: 4
      pipeline_parallel_size: 1
      expert_parallel_size: 1
      gpu_memory_utilization: 0.7
      max_model_len: ${...max_total_sequence_length}
      enforce_eager: false
    vllm_kwargs:
      # Nano3 Mamba cache is numerically safer in FP32 during generation.
      mamba_ssm_cache_dtype: "float32"
      compilation_config:
        backend: "eager"

teacher:
  model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
  tokenizer:
    name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
    chat_template_kwargs: null
  hf_config_overrides:
    router_aux_loss_coef: 0
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: ${policy.max_total_sequence_length}
  logprob_chunk_size: ${policy.logprob_chunk_size}
  offload_optimizer_for_logprob: true
  # Keep teacher logprob forwards on the same tested path as policy.
  sequence_packing:
    enabled: false
  megatron_cfg:
    empty_unused_memory_level: 1
    activation_checkpointing: true
    tensor_model_parallel_size: ${policy.megatron_cfg.tensor_model_parallel_size}
    pipeline_model_parallel_size: ${policy.megatron_cfg.pipeline_model_parallel_size}
    context_parallel_size: ${policy.megatron_cfg.context_parallel_size}
    expert_tensor_parallel_size: ${policy.megatron_cfg.expert_tensor_parallel_size}
    expert_model_parallel_size: ${policy.megatron_cfg.expert_model_parallel_size}
    sequence_parallel: ${policy.megatron_cfg.sequence_parallel}
    freeze_moe_router: ${policy.megatron_cfg.freeze_moe_router}
    moe_router_dtype: ${policy.megatron_cfg.moe_router_dtype}
    moe_router_load_balancing_type: ${policy.megatron_cfg.moe_router_load_balancing_type}
    moe_router_bias_update_rate: ${policy.megatron_cfg.moe_router_bias_update_rate}
    moe_permute_fusion: ${policy.megatron_cfg.moe_permute_fusion}
    moe_enable_deepep: ${policy.megatron_cfg.moe_enable_deepep}
    moe_token_dispatcher_type: ${policy.megatron_cfg.moe_token_dispatcher_type}
    moe_shared_expert_overlap: ${policy.megatron_cfg.moe_shared_expert_overlap}
    apply_rope_fusion: ${policy.megatron_cfg.apply_rope_fusion}
    bias_activation_fusion: ${policy.megatron_cfg.bias_activation_fusion}
    defer_fp32_logits: ${policy.megatron_cfg.defer_fp32_logits}
    gradient_accumulation_fusion: false

data:
  max_input_seq_length: null
  shuffle: false
  num_workers: 0
  # Existing Gym-format QA fixture. Real training should point to data produced
  # by `ng_prepare_data`.
  train:
    dataset_name: NemoGymDataset
    data_path: "3rdparty/Gym-workspace/Gym/responses_api_agents/verifiers_agent/data/acereason-math-example.jsonl"
  validation: null
  default:
    dataset_name: NemoGymDataset
    env_name: "nemo_gym"
    prompt_file: null
    system_prompt_file: null
    processor: "nemo_gym_data_processor"

env:
  should_use_nemo_gym: true
  should_log_nemo_gym_responses: false
  nemo_gym:
    rollout_max_attempts_to_avoid_lp_nan: 1
    config_paths:
      - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
      - responses_api_agents/verifiers_agent/configs/acereason-math.yaml

checkpointing:
  checkpoint_dir: "checkpoints/qa-distillation-nano3-megatron"

logger:
  log_dir: "logs/qa_distillation_nano3_megatron"
  wandb:
    project: "nemo-qa-distillation"
    name: "qa-distillation-nano3-megatron"
  tensorboard:
    log_dir: "tb_logs-qa-distillation-nano3-megatron"
  mlflow:
    run_name: "qa-distillation-nano3-megatron"

cluster:
  gpus_per_node: 8
  num_nodes: 4
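
The `make_sequence_length_divisible_by` interpolation in the policy section of this recipe can be worked through by hand. The sketch below assumes the `mul` resolver simply multiplies its two arguments, which gives TP x CP x 2; with this file's `tensor_model_parallel_size: 4` and `context_parallel_size: 1` that is 8. `sequence_length_multiple` and `pad_to_multiple` are hypothetical helpers for illustration, not NeMo RL functions.

```python
def sequence_length_multiple(tp: int, cp: int, factor: int = 2) -> int:
    """Mirror ${mul:${mul:tp, cp}, 2}, assuming `mul` multiplies its arguments."""
    return tp * cp * factor


def pad_to_multiple(length: int, multiple: int) -> int:
    """Round `length` up to the next multiple of `multiple`."""
    return -(-length // multiple) * multiple


if __name__ == "__main__":
    multiple = sequence_length_multiple(tp=4, cp=1)
    print(multiple)                         # 8
    print(pad_to_multiple(1001, multiple))  # 1008
    print(pad_to_multiple(8192, multiple))  # 8192
```

The configured `max_total_sequence_length` of 8192 is already a multiple of 8, so only shorter, odd-length batches are affected by the padding.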
75 changes: 75 additions & 0 deletions examples/modelopt/qa_grpo_nano3_megatron.yaml
@@ -0,0 +1,75 @@
# Nano3 Quantization-Aware RL (QARL) with Megatron, vLLM, and ModelOpt.
#
# This config layers Nano3 NVFP4 weight-only quantization onto a full-parameter
# Nano3 Megatron GRPO setup. LoRA/PEFT is intentionally out of scope here.
#
# Example usage:
# uv run examples/run_grpo.py --config examples/modelopt/qa_grpo_nano3_megatron.yaml

defaults: "../configs/grpo_math_1B.yaml"

grpo:
  num_prompts_per_step: 2
  num_generations_per_prompt: 8

policy:
  model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
  tokenizer:
    name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
  train_global_batch_size: 16
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  disable_modelopt_layer_spec: true

  # Nano3 is a hybrid MoE/Mamba model. This recipe keeps attention and the
  # known Nano3-sensitive layers in BF16, while applying NVFP4 to weights.
  quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml"
  quant_calib_data: "cnn_dailymail"
  quant_calib_size: 512
  quant_batch_size: 1
  quant_sequence_length: 2048

  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    bias_activation_fusion: false
    tensor_model_parallel_size: 2
    expert_model_parallel_size: 8
    sequence_parallel: true
    # ModelOpt quantization is incompatible with gradient accumulation fusion.
    gradient_accumulation_fusion: false
  sequence_packing:
    enabled: false

  # Nano3 GRPO uses sequence parallelism with TP=2 in the base recipe, so
  # rollout/logprob batches must pad sequence length to the TP multiple.
  make_sequence_length_divisible_by: ${.megatron_cfg.tensor_model_parallel_size}

  generation:
    # Match the student's quantization recipe during vLLM rollouts.
    quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml"
    vllm_cfg:
      tensor_parallel_size: 4
      gpu_memory_utilization: 0.7
    vllm_kwargs:
      # Nano3 Mamba cache is numerically safer in FP32 during generation.
      mamba_ssm_cache_dtype: "float32"
      compilation_config:
        backend: "eager"

checkpointing:
  checkpoint_dir: "results/qa_grpo_nano3_megatron"

logger:
  log_dir: "logs/qa_grpo_nano3_megatron"
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: "nemo-rl"
    name: "qa-grpo-nano3-megatron"

cluster:
  gpus_per_node: 8
  num_nodes: 2