NVIDIA-NeMo · mxinO · May 6, 2026
@@ -17,9 +17,20 @@ The following workflow + quantization recipe combinations have been validated en
 | QA-Distillation | W4A4 | `NVFP4_DEFAULT_CFG` (NVFP4 weights + NVFP4 activations) | ✅ Converges | `examples/modelopt/qa_distillation_math_megatron.yaml` |
 | QA-GRPO | W4A16 | `examples/modelopt/quant_configs/nvfp4_a16.yaml` (NVFP4 weights, native-dtype activations) | ✅ Converges | `examples/modelopt/qa_grpo_llama8b_megatron.v2.yaml` |
 | QA-GRPO | W4A4 | `NVFP4_DEFAULT_CFG` | ⚠️ Known convergence issue | `examples/modelopt/qa_grpo_math_megatron.yaml` |
+| QA-Distillation | W4A4 | `examples/modelopt/quant_configs/nano3_nvfp4_default.yaml` | ✅ Converges | `examples/modelopt/qa_distillation_nano3_megatron.yaml` |
+| QA-GRPO | W4A16 | `NVFP4_MLP_WEIGHT_ONLY_CFG` | ✅ Smoke tested on MoE | `examples/modelopt/qa_grpo_qwen3_30ba3b_megatron.yaml` |
 
 The `nvfp4_a16.yaml` custom YAML enables NVFP4 e2m1 weight quantization (with dynamic e4m3 micro-block scales) and leaves activations unquantized; weights are still exercised through both Megatron training and vLLM generation.
 
+## ModelOpt Layer Spec Toggle
+
+For QARL configs, try setting `policy.disable_modelopt_layer_spec=true` first.
+This keeps ModelOpt quantization enabled while using the standard Megatron layer
+specs instead of ModelOpt's custom layer specs. This is usually faster and works
+for most models, but it is not guaranteed for every architecture or recipe. If
+you encounter errors with the standard Megatron layer specs, leave it unset or
+set it to `false` to exercise ModelOpt's Megatron layer-spec path.
+
 ## Quantization-Aware GRPO (QA-GRPO)
 
 ### Configuration
@@ -154,7 +165,7 @@ uv run --extra mcore --extra modelopt \
   --tp 1 --pp <pipeline-parallel-size>
 ```
 
-- `examples/modelopt/export_quantized_to_hf.py` is a thin wrapper around `Megatron-Bridge/examples/quantization/export.py` that registers `nemo_rl.` as an allowed `_target_` prefix so the saved layer-spec callback in QARL checkpoints can be instantiated during export. All CLI flags pass through to the upstream script unchanged.
+- `examples/modelopt/export_quantized_to_hf.py` is a thin wrapper around `Megatron-Bridge/examples/quantization/export.py`. All CLI flags pass through to the upstream script unchanged.
 - `--hf-model-id` should point to the original (pre-training) HuggingFace model so that the exporter knows the model architecture and tokenizer.
 - The `PYTHONPATH` prefix exposes Megatron-LM's `megatron.training` to the bridge script.
 - **`--tp 1` is required**: ModelOpt currently does not support TP>1 at export time. Training at TP>1 is fine; the bridge re-shards on load via `mp_overrides`.
@@ -165,4 +176,4 @@ uv run --extra mcore --extra modelopt \
 - **Generation**: Currently only vLLM is supported for generation.
 - **DTensor backend**: Quantization support for the DTensor policy worker is not yet implemented.
 - **Input quantization**: Only per-tensor input (activation) quantization is supported.
-- **Model support**: MoE (Mixture of Experts) and Mamba models are currently not supported.
+- **Model support**: Dense Transformer, MoE (Mixture of Experts), and hybrid MoE/Mamba models are supported on the Megatron policy + vLLM generation path when Megatron-Bridge and ModelOpt support the model architecture and quantization recipe. MoE/Mamba support is currently covered by smoke-tested example configs rather than broad convergence guarantees.
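
The "try the standard layer specs first" guidance from the ModelOpt Layer Spec Toggle section can be sketched as a small fallback helper. This is a minimal sketch and not NeMo RL code: `launch` is a hypothetical callable that stands in for whatever starts a run with `policy.disable_modelopt_layer_spec` set to the given value, and `fake_launch` is a toy used only to exercise the fallback.

```python
from typing import Callable


def run_with_layer_spec_fallback(launch: Callable[[bool], None]) -> bool:
    """Return the disable_modelopt_layer_spec value that ended up being used.

    `launch` is a hypothetical stand-in for starting a QARL run; it receives
    the value to use for policy.disable_modelopt_layer_spec.
    """
    try:
        # First attempt: standard Megatron layer specs (usually faster).
        launch(True)
        return True
    except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
        print(f"standard layer specs failed ({exc}); retrying with ModelOpt specs")
        # Fallback: exercise ModelOpt's Megatron layer-spec path.
        launch(False)
        return False


# Toy launcher (assumption): pretend this model only works with ModelOpt's specs.
def fake_launch(disable_modelopt_layer_spec: bool) -> None:
    if disable_modelopt_layer_spec:
        raise RuntimeError("unsupported architecture for standard specs")


used_value = run_with_layer_spec_fallback(fake_launch)
```

In practice the same decision is usually made by hand: set the flag to `true`, and only switch to `false` (or leave it unset) if the run fails with the standard specs.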
@@ -0,0 +1,27 @@
+defaults: ../../../../examples/modelopt/qa_distillation_nano3_megatron.yaml
+distillation:
+  num_prompts_per_step: 4
+  max_num_steps: 1
+  val_period: 0
+policy:
+  disable_modelopt_layer_spec: false
+  train_global_batch_size: 16
+  max_total_sequence_length: 2048
+  quant_calib_size: 16
+  quant_sequence_length: 1024
+  megatron_cfg:
+    tensor_model_parallel_size: 1
+    scheduler:
+      lr_warmup_iters: 1
+  generation:
+    vllm_kwargs:
+      tokenizer: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
+logger:
+  log_dir: logs/distillation-nano3-30ba3b-4n4g-megatron-qa-nvfp4-modelopt-spec
+  wandb:
+    project: nemo-rl
+    name: distillation-nano3-30ba3b-4n4g-megatron-qa-nvfp4-modelopt-spec
+  tensorboard:
+    log_dir: tb_logs-distillation-nano3-30ba3b-4n4g-megatron-qa-nvfp4-modelopt-spec
+cluster:
+  gpus_per_node: 4
@@ -0,0 +1 @@
+defaults: ../../../../examples/modelopt/qa_grpo_qwen3_30ba3b_megatron.yaml
@@ -14,26 +14,15 @@
 
 """Wrapper around Megatron-Bridge's quantization/export.py for QARL checkpoints.
 
-QARL checkpoints written by ``MegatronQuantPolicyWorker`` store the layer-spec
-callback as ``nemo_rl.modelopt.models.policy.workers.utils.quantization_layer_spec``.
-Megatron-Bridge's instantiator rejects ``_target_`` strings outside its built-in
-allowlist (``megatron.``, ``nemo.``, ``torch.``, ``transformers.``, ``numpy.``,
-``nvidia.``), so the upstream export script fails on these checkpoints. The
-training worker registers ``nemo_rl.`` at ``__init__`` time, but the allowlist
-is process-local — the upstream export script runs in a fresh ``torchrun``
-process and never instantiates the worker.
-
-This wrapper registers ``nemo_rl.`` as an allowed prefix and then delegates to
-``Megatron-Bridge/examples/quantization/export.py`` unchanged. All CLI arguments
-pass through.
+This keeps the NeMo RL example entry point next to the QARL recipes while
+delegating to ``Megatron-Bridge/examples/quantization/export.py`` unchanged.
+All CLI arguments pass through.
 """
 
 import runpy
 import sys
 from pathlib import Path
 
-from megatron.bridge.utils.instantiate_utils import register_allowed_target_prefix
-
 UPSTREAM_EXPORT = (
     Path(__file__).resolve().parents[2]
     / "3rdparty"
@@ -46,7 +35,6 @@
 
 
 def main() -> None:
-    register_allowed_target_prefix("nemo_rl.")
     if not UPSTREAM_EXPORT.is_file():
         raise FileNotFoundError(
             f"Megatron-Bridge export script not found at {UPSTREAM_EXPORT}. "

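
The delegation pattern the wrapper's docstring describes (run an upstream script unchanged and pass all CLI arguments through) can be illustrated with `runpy.run_path`. This is a self-contained sketch, not the wrapper's actual body: `delegate` is a hypothetical helper, and the temporary `export.py` is a stand-in written only so the example runs without Megatron-Bridge.

```python
import runpy
import sys
import tempfile
from pathlib import Path


def delegate(script: Path, argv: list[str]) -> dict:
    """Run `script` as if it were invoked directly with `argv`."""
    if not script.is_file():
        raise FileNotFoundError(f"script not found at {script}")
    old_argv = sys.argv
    # The delegated script sees its own path as argv[0], then the passed flags.
    sys.argv = [str(script), *argv]
    try:
        # run_name="__main__" makes the script's `if __name__ == "__main__"` guard fire.
        return runpy.run_path(str(script), run_name="__main__")
    finally:
        sys.argv = old_argv  # restore the caller's argv


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the upstream export script (assumption, for illustration only).
    upstream = Path(tmp) / "export.py"
    upstream.write_text("import sys\nSEEN_ARGS = sys.argv[1:]\n")
    module_globals = delegate(upstream, ["--tp", "1"])
    seen_args = module_globals["SEEN_ARGS"]
```

Because `runpy.run_path` returns the executed script's globals, the sketch can confirm that the flags reached the delegated script untouched.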
@@ -0,0 +1,179 @@
+# Nano3 Quantization-Aware Distillation / OPD (QADOPD) with Megatron,
+# vLLM, and ModelOpt.
+#
+# The student policy and rollout worker use Nano3 NVFP4 W4A4 quantization; the
+# teacher stays BF16. This keeps the on-policy distillation target full
+# precision while training the quantized student in the same loop.
+#
+# Example usage:
+#   uv run examples/nemo_gym/run_distillation_nemo_gym.py --config examples/modelopt/qa_distillation_nano3_megatron.yaml
+
+defaults: "./qa_distillation_math_megatron.yaml"
+
+distillation:
+  # Keep the example moderate; scale these for production runs.
+  num_prompts_per_step: 16
+  num_generations_per_prompt: 4
+  max_val_samples: null
+
+loss_fn:
+  # Reverse KL is the usual OPD direction for matching the teacher on student
+  # rollouts while avoiding full support coverage pressure from forward KL.
+  kl_type: "reverse"
+  mixed_kl_weight: 0.0
+
+policy:
+  model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
+  tokenizer:
+    name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
+    chat_template_kwargs: null
+  hf_config_overrides:
+    router_aux_loss_coef: 0
+  train_global_batch_size: 64
+  train_micro_batch_size: 1
+  generation_batch_size: 64
+  logprob_batch_size: 1
+  max_total_sequence_length: 8192
+  logprob_chunk_size: 1024
+  offload_optimizer_for_logprob: true
+  disable_modelopt_layer_spec: true
+  # Keep this example on the non-packed, CP=1 path covered by the Nano3 QAD
+  # smoke test. Re-enable CP>1 only after a targeted Nano3 Mamba+distillation
+  # smoke test.
+  sequence_packing:
+    enabled: false
+
+  # Nano3 is a hybrid MoE/Mamba model. This recipe keeps attention and the
+  # known Nano3-sensitive layers in BF16, while applying NVFP4 to weights and
+  # layer-input activations.
+  quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_default.yaml"
+  quant_calib_data: "cnn_dailymail"
+  quant_calib_size: 512
+  quant_batch_size: 1
+  quant_sequence_length: 2048
+
+  megatron_cfg:
+    empty_unused_memory_level: 1
+    activation_checkpointing: true
+    tensor_model_parallel_size: 4
+    pipeline_model_parallel_size: 1
+    context_parallel_size: 1
+    expert_tensor_parallel_size: 1
+    expert_model_parallel_size: 8
+    sequence_parallel: true
+    freeze_moe_router: true
+    moe_router_dtype: "fp32"
+    moe_router_load_balancing_type: "none"
+    moe_router_bias_update_rate: 1e-3
+    moe_permute_fusion: true
+    moe_enable_deepep: false
+    moe_token_dispatcher_type: "alltoall"
+    moe_shared_expert_overlap: false
+    apply_rope_fusion: true
+    bias_activation_fusion: false
+    defer_fp32_logits: true
+    gradient_accumulation_fusion: false
+    scheduler:
+      lr_warmup_iters: 20
+
+  make_sequence_length_divisible_by: ${mul:${mul:${.megatron_cfg.tensor_model_parallel_size}, ${.megatron_cfg.context_parallel_size}}, 2}
+
+  generation:
+    max_new_tokens: ${..max_total_sequence_length}
+    temperature: 1.0
+    top_p: 1.0
+    top_k: null
+    quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_default.yaml"
+    vllm_cfg:
+      expose_http_server: true
+      tensor_parallel_size: 4
+      pipeline_parallel_size: 1
+      expert_parallel_size: 1
+      gpu_memory_utilization: 0.7
+      max_model_len: ${...max_total_sequence_length}
+      enforce_eager: false
+    vllm_kwargs:
+      # Nano3 Mamba cache is numerically safer in FP32 during generation.
+      mamba_ssm_cache_dtype: "float32"
+      compilation_config:
+        backend: "eager"
+
+teacher:
+  model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
+  tokenizer:
+    name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
+    chat_template_kwargs: null
+  hf_config_overrides:
+    router_aux_loss_coef: 0
+  train_micro_batch_size: 1
+  logprob_batch_size: 1
+  max_total_sequence_length: ${policy.max_total_sequence_length}
+  logprob_chunk_size: ${policy.logprob_chunk_size}
+  offload_optimizer_for_logprob: true
+  # Keep teacher logprob forwards on the same tested path as policy.
+  sequence_packing:
+    enabled: false
+  megatron_cfg:
+    empty_unused_memory_level: 1
+    activation_checkpointing: true
+    tensor_model_parallel_size: ${policy.megatron_cfg.tensor_model_parallel_size}
+    pipeline_model_parallel_size: ${policy.megatron_cfg.pipeline_model_parallel_size}
+    context_parallel_size: ${policy.megatron_cfg.context_parallel_size}
+    expert_tensor_parallel_size: ${policy.megatron_cfg.expert_tensor_parallel_size}
+    expert_model_parallel_size: ${policy.megatron_cfg.expert_model_parallel_size}
+    sequence_parallel: ${policy.megatron_cfg.sequence_parallel}
+    freeze_moe_router: ${policy.megatron_cfg.freeze_moe_router}
+    moe_router_dtype: ${policy.megatron_cfg.moe_router_dtype}
+    moe_router_load_balancing_type: ${policy.megatron_cfg.moe_router_load_balancing_type}
+    moe_router_bias_update_rate: ${policy.megatron_cfg.moe_router_bias_update_rate}
+    moe_permute_fusion: ${policy.megatron_cfg.moe_permute_fusion}
+    moe_enable_deepep: ${policy.megatron_cfg.moe_enable_deepep}
+    moe_token_dispatcher_type: ${policy.megatron_cfg.moe_token_dispatcher_type}
+    moe_shared_expert_overlap: ${policy.megatron_cfg.moe_shared_expert_overlap}
+    apply_rope_fusion: ${policy.megatron_cfg.apply_rope_fusion}
+    bias_activation_fusion: ${policy.megatron_cfg.bias_activation_fusion}
+    defer_fp32_logits: ${policy.megatron_cfg.defer_fp32_logits}
+    gradient_accumulation_fusion: false
+
+data:
+  max_input_seq_length: null
+  shuffle: false
+  num_workers: 0
+  # Existing Gym-format QA fixture. Real training should point to data produced
+  # by `ng_prepare_data`.
+  train:
+    dataset_name: NemoGymDataset
+    data_path: "3rdparty/Gym-workspace/Gym/responses_api_agents/verifiers_agent/data/acereason-math-example.jsonl"
+  validation: null
+  default:
+    dataset_name: NemoGymDataset
+    env_name: "nemo_gym"
+    prompt_file: null
+    system_prompt_file: null
+    processor: "nemo_gym_data_processor"
+
+env:
+  should_use_nemo_gym: true
+  should_log_nemo_gym_responses: false
+  nemo_gym:
+    rollout_max_attempts_to_avoid_lp_nan: 1
+    config_paths:
+    - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
+    - responses_api_agents/verifiers_agent/configs/acereason-math.yaml
+
+checkpointing:
+  checkpoint_dir: "checkpoints/qa-distillation-nano3-megatron"
+
+logger:
+  log_dir: "logs/qa_distillation_nano3_megatron"
+  wandb:
+    project: "nemo-qa-distillation"
+    name: "qa-distillation-nano3-megatron"
+  tensorboard:
+    log_dir: "tb_logs-qa-distillation-nano3-megatron"
+  mlflow:
+    run_name: "qa-distillation-nano3-megatron"
+
+cluster:
+  gpus_per_node: 8
+  num_nodes: 4
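
The `make_sequence_length_divisible_by` expression in this recipe multiplies the tensor-parallel size, the context-parallel size, and 2. The arithmetic can be checked with a short sketch. The function names here are hypothetical, and `pad_to_multiple` is only an assumption about how a "divisible by" value is typically consumed (rounding a sequence length up); it is not taken from NeMo RL.

```python
def sequence_length_multiple(tensor_parallel: int, context_parallel: int) -> int:
    """Mirror the recipe's ${mul:${mul:TP, CP}, 2} expression."""
    return tensor_parallel * context_parallel * 2


def pad_to_multiple(length: int, multiple: int) -> int:
    """Round `length` up to the nearest multiple (ceiling division)."""
    return -(-length // multiple) * multiple


# Values from this recipe: tensor_model_parallel_size=4, context_parallel_size=1.
multiple = sequence_length_multiple(tensor_parallel=4, context_parallel=1)
padded = pad_to_multiple(1001, multiple)
unchanged = pad_to_multiple(8192, multiple)
```

With TP=4 and CP=1 the multiple is 8, so the recipe's `max_total_sequence_length` of 8192 already satisfies the constraint and needs no extra padding.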
@@ -0,0 +1,75 @@
+# Nano3 Quantization-Aware RL (QARL) with Megatron, vLLM, and ModelOpt.
+#
+# This config layers Nano3 NVFP4 weight-only quantization onto a full-parameter
+# Nano3 Megatron GRPO setup. LoRA/PEFT is intentionally out of scope here.
+#
+# Example usage:
+#   uv run examples/run_grpo.py --config examples/modelopt/qa_grpo_nano3_megatron.yaml
+
+defaults: "../configs/grpo_math_1B.yaml"
+
+grpo:
+  num_prompts_per_step: 2
+  num_generations_per_prompt: 8
+
+policy:
+  model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
+  tokenizer:
+    name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
+  train_global_batch_size: 16
+  train_micro_batch_size: 1
+  logprob_batch_size: 1
+  max_total_sequence_length: 2048
+  disable_modelopt_layer_spec: true
+
+  # Nano3 is a hybrid MoE/Mamba model. This recipe keeps attention and the
+  # known Nano3-sensitive layers in BF16, while applying NVFP4 to weights.
+  quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml"
+  quant_calib_data: "cnn_dailymail"
+  quant_calib_size: 512
+  quant_batch_size: 1
+  quant_sequence_length: 2048
+
+  dtensor_cfg:
+    enabled: false
+  megatron_cfg:
+    enabled: true
+    bias_activation_fusion: false
+    tensor_model_parallel_size: 2
+    expert_model_parallel_size: 8
+    sequence_parallel: true
+    # ModelOpt quantization is incompatible with gradient accumulation fusion.
+    gradient_accumulation_fusion: false
+  sequence_packing:
+    enabled: false
+
+  # Nano3 GRPO uses sequence parallelism with TP=2 in the base recipe, so
+  # rollout/logprob batches must pad sequence length to the TP multiple.
+  make_sequence_length_divisible_by: ${.megatron_cfg.tensor_model_parallel_size}
+
+  generation:
+    # Match the student's quantization recipe during vLLM rollouts.
+    quant_cfg: "examples/modelopt/quant_configs/nano3_nvfp4_weightonly.yaml"
+    vllm_cfg:
+      tensor_parallel_size: 4
+      gpu_memory_utilization: 0.7
+    vllm_kwargs:
+      # Nano3 Mamba cache is numerically safer in FP32 during generation.
+      mamba_ssm_cache_dtype: "float32"
+      compilation_config:
+        backend: "eager"
+
+checkpointing:
+  checkpoint_dir: "results/qa_grpo_nano3_megatron"
+
+logger:
+  log_dir: "logs/qa_grpo_nano3_megatron"
+  wandb_enabled: true
+  tensorboard_enabled: true
+  wandb:
+    project: "nemo-rl"
+    name: "qa-grpo-nano3-megatron"
+
+cluster:
+  gpus_per_node: 8
+  num_nodes: 2
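
In both Nano3 recipes in this diff, `train_global_batch_size` equals `num_prompts_per_step` times `num_generations_per_prompt`. The diff does not state whether that equality is required, so the sketch below treats it only as a sanity check when scaling a recipe. The helper name and the `recipes` dictionary are illustrative; the numbers are copied from the configs above.

```python
def samples_per_step(num_prompts_per_step: int, num_generations_per_prompt: int) -> int:
    """Number of generated samples produced in one step."""
    return num_prompts_per_step * num_generations_per_prompt


# (num_prompts_per_step, num_generations_per_prompt, train_global_batch_size)
recipes = {
    "qa_grpo_nano3_megatron": (2, 8, 16),
    "qa_distillation_nano3_megatron": (16, 4, 64),
}

# True where the global batch size matches the samples generated per step.
matches = {
    name: samples_per_step(prompts, generations) == global_batch
    for name, (prompts, generations, global_batch) in recipes.items()
}
```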
@@ -0,0 +1 @@
+defaults: ../../../../examples/modelopt/qa_grpo_qwen3_30ba3b_megatron.yaml