Binary file added docs/assets/muon-dapo-reward.png
Binary file added docs/assets/muon-dapo-val-acc.png
Binary file added docs/assets/muon-sft-comparison.png
139 changes: 139 additions & 0 deletions docs/guides/muon-optimizer.md
@@ -0,0 +1,139 @@
# Muon Optimizer

This guide explains how to use the Muon optimizer with NeMo RL for training large language models.

## What is Muon?

[**Muon** (MomentUm Orthogonalized by Newton-Schulz)](https://arxiv.org/abs/2502.16982) is an optimizer, scaled to large-model training by Moonshot AI, that achieves higher sample efficiency than AdamW. It applies Newton-Schulz orthogonalization to momentum-based updates, which helps prevent weight matrices from becoming poorly conditioned during training. Muon is applied to the weight matrices of linear layers, while Adam handles the remaining parameters (embeddings, layer norms, and other non-matrix parameters).
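
For intuition, here is a minimal PyTorch sketch of the per-matrix update that Muon performs. It is based on publicly available Muon implementations, not on the NeMo RL or Megatron source: the Newton-Schulz coefficients, the shape-dependent scale, and the function names are illustrative assumptions. The actual `dist_muon` implementation additionally handles tensor-parallel sharding (`muon_tp_mode`) and QKV splitting (`muon_split_qkv`), which this sketch omits.

```python
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update (push its singular values toward 1)
    using a quintic Newton-Schulz iteration. Coefficients follow public Muon
    implementations and are illustrative only."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.to(torch.bfloat16)
    x = x / (x.norm() + 1e-7)  # Frobenius norm bounds the spectral norm, keeping it <= 1
    transposed = x.size(0) > x.size(1)
    if transposed:  # iterate on the wide orientation so the Gram matrix stays small
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)


def muon_step(param, grad, momentum_buf, lr=1e-4, momentum=0.95, nesterov=False, ns_steps=5):
    """One Muon-style update for a single 2D weight matrix (sketch only)."""
    momentum_buf.mul_(momentum).add_(grad)
    update = grad.add(momentum_buf, alpha=momentum) if nesterov else momentum_buf
    update = newton_schulz_orthogonalize(update, steps=ns_steps)
    # Shape-dependent scaling keeps the step size comparable to AdamW's; the exact
    # rule differs between implementations (cf. muon_scale_mode, muon_extra_scale_factor).
    scale = max(param.size(0), param.size(1)) ** 0.5
    param.data.add_(update, alpha=-lr * scale)
```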

## Requirements

Muon is only supported with the **Megatron backend**. Ensure you have:

1. Megatron submodules initialized: `git submodule update --init --recursive`
2. Megatron backend enabled in your configuration: `policy.megatron_cfg.enabled=True`

## Basic Usage

To use Muon with NeMo RL, you need to configure the optimizer through the Megatron configuration. Here's an example command for SFT:

```bash
uv run examples/run_sft.py \
    policy.megatron_cfg.enabled=true \
    policy.dtensor_cfg.enabled=false \
    ++policy.megatron_cfg.optimizer.optimizer=dist_muon \
    ++policy.megatron_cfg.optimizer.muon_scale_mode=spectral \
    ++policy.megatron_cfg.optimizer.muon_momentum=0.9 \
    ++policy.megatron_cfg.optimizer.muon_use_nesterov=False \
    ++policy.megatron_cfg.optimizer.muon_extra_scale_factor=0.5 \
    policy.megatron_cfg.optimizer.use_precision_aware_optimizer=false \
    policy.megatron_cfg.optimizer.use_distributed_optimizer=false
```

For a full list of Muon-related arguments and a description of each, please refer to the [Megatron documentation](https://github.com/terrykong/Megatron-LM/blob/25a62edf77b5130f888328ca8119d6a76117cf23/megatron/core/optimizer/optimizer_config.py#L128-L150).

_NOTE_: `use_precision_aware_optimizer` and `use_distributed_optimizer` are not supported with Muon and must both be set to `false`, as shown in the command above.

## Example YAML Configuration

Here's an example of a complete Megatron optimizer configuration for Muon:

```yaml
policy:
  megatron_cfg:
    enabled: true

    optimizer:
      optimizer: "dist_muon"
      lr: 1e-4
      min_lr: 1e-5
      weight_decay: 0.1
      bf16: true
      fp16: false

      # Muon-specific settings
      muon_momentum: 0.95
      muon_use_nesterov: false
      muon_scale_mode: "spectral"
      muon_fp32_matmul_prec: "medium"
      muon_num_ns_steps: 5
      muon_tp_mode: "blockwise"
      muon_extra_scale_factor: 1.0
      muon_split_qkv: true

      # Disable for Muon
      use_distributed_optimizer: false
      use_precision_aware_optimizer: false

      clip_grad: 1.0

    scheduler:
      lr_decay_style: "cosine"
      lr_warmup_iters: 100
      lr_decay_iters: 1000
      weight_decay_incr_style: "constant"
      start_weight_decay: 0.1
      end_weight_decay: 0.1
```
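
If you keep the block above in its own YAML file, a small pre-flight check such as the sketch below can catch the two unsupported flags before a multi-node job is launched. It assumes PyYAML is installed and that the file is named `muon_sft.yaml`; both are illustrative choices, not part of NeMo RL.

```python
import yaml  # PyYAML, assumed to be available in your environment

# "muon_sft.yaml" is a hypothetical file containing the configuration block above.
with open("muon_sft.yaml") as f:
    cfg = yaml.safe_load(f)

opt = cfg["policy"]["megatron_cfg"]["optimizer"]
if opt.get("optimizer") == "dist_muon":
    # Muon does not support these two features; both must be disabled (see the note above).
    assert opt.get("use_distributed_optimizer") is False, "set use_distributed_optimizer: false for Muon"
    assert opt.get("use_precision_aware_optimizer") is False, "set use_precision_aware_optimizer: false for Muon"
print("Muon optimizer settings look consistent.")
```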

## Experimental Results

Muon support in NeMo-RL is experimental. We have tested Muon for SFT and RL on models pre-trained with Adam. While Muon is expected to show the greatest benefit when used for both pre-training and post-training, we have observed minor improvements even when applying Muon only during post-training.

For example, the following is a comparison between Adam and Muon for running SFT on Qwen3-235B-A22B:

<p align="center">
<img src="../assets/muon-sft-comparison.png" alt="Muon vs Adam SFT Comparison" width="600">
</p>

The full Muon command used for this run is:

```bash
uv run examples/run_sft.py \
    --config examples/configs/sft_openmathinstruct2_megatron.yaml \
    ++policy.megatron_cfg.optimizer.optimizer=dist_muon \
    ++policy.megatron_cfg.optimizer.muon_scale_mode=spectral \
    ++policy.megatron_cfg.optimizer.muon_momentum=0.9 \
    ++policy.megatron_cfg.optimizer.muon_use_nesterov=False \
    ++policy.megatron_cfg.optimizer.muon_extra_scale_factor=0.2 \
    policy.megatron_cfg.optimizer.use_precision_aware_optimizer=false \
    ++policy.megatron_cfg.optimizer.lr=2e-5 \
    policy.megatron_cfg.optimizer.use_distributed_optimizer=False \
    cluster.num_nodes=16 \
    cluster.gpus_per_node=8 \
    policy.megatron_cfg.pipeline_model_parallel_size=8 \
    policy.megatron_cfg.sequence_parallel=True \
    policy.megatron_cfg.expert_model_parallel_size=8 \
    policy.megatron_cfg.tensor_model_parallel_size=8 \
    policy.sequence_packing.enabled=True \
    policy.model_name=Qwen/Qwen3-235B-A22B \
    policy.tokenizer.name=Qwen/Qwen3-235B-A22B \
    checkpointing.enabled=True \
    policy.megatron_cfg.num_layers_in_first_pipeline_stage=11 \
    policy.megatron_cfg.num_layers_in_last_pipeline_stage=11
```


Here is a comparison of Muon vs Adam for DAPO with Qwen2.5-7B:

<p align="center">
<img src="../assets/muon-dapo-reward.png" alt="Muon vs Adam DAPO Train Reward" height="300">
<img src="../assets/muon-dapo-val-acc.png" alt="Muon vs Adam DAPO Validation Accuracy" height="300">
</p>

The command to generate the Muon results is:

```bash
uv run examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml \
    ++policy.megatron_cfg.optimizer.optimizer=dist_muon \
    ++policy.megatron_cfg.optimizer.muon_scale_mode=spectral \
    ++policy.megatron_cfg.optimizer.muon_momentum=0.9 \
    ++policy.megatron_cfg.optimizer.muon_use_nesterov=False \
    ++policy.megatron_cfg.optimizer.muon_extra_scale_factor=0.5 \
    policy.megatron_cfg.optimizer.use_precision_aware_optimizer=false \
    policy.megatron_cfg.optimizer.use_distributed_optimizer=False \
    cluster.num_nodes=16 \
    cluster.gpus_per_node=8 \
    policy.sequence_packing.enabled=True \
    ~checkpointing.model_save_format
```
1 change: 1 addition & 0 deletions docs/index.md
@@ -217,6 +217,7 @@
guides/eval.md
guides/deepseek.md
model-quirks.md
guides/async-grpo.md
guides/muon-optimizer.md
```

```{toctree}
1 change: 1 addition & 0 deletions pyproject.toml
@@ -102,6 +102,7 @@ mcore = [
"flash-attn==2.8.1",
# Remove this once https://github.com/NVIDIA-NeMo/RL/issues/501 resolved
"vllm==0.11.2",
"emerging-optimizers==0.1.0",
]
nemo_gym = ["nemo_gym"]

52 changes: 27 additions & 25 deletions uv.lock
