fix(optimizer): route GatedDeltaNet in_proj to Adam instead of orthogonalizing it (Muon) by yuchenwang3 · Pull Request #5400 · NVIDIA/Megatron-LM

yuchenwang3 · 2026-06-18T07:33:24Z

Summary

Routes the GatedDeltaNet in_proj weight to the fallback (Adam) optimizer so it is not orthogonalized by Muon. Addresses the open question in #2885.

Background (re #2885)

#2885 asks whether linear-attention / gated-attention params could be "routed to Adam while the rest of the layers continue to use Muon", and a reply notes this "would require explicit parameter tagging, split optimizer param groups, and coordinated stepping and checkpointing."

Since then the old hard guards (assert args.linear_attention_type is None, ... / assert not args.attention_output_gate, ...) have been removed from main. But the routing that the discussion called for was not added, so today a GatedDeltaNet model trained with --optimizer muon silently sends in_proj.weight to Muon:

in_proj packs the per-head q, k, v, conv, gate, beta projections into one matrix (in_proj_dim = qk_dim*2 + v_dim*2 + num_value_heads*2, gated_delta_net.py).
The Muon routing predicate _is_nonlinear_or_embedding (megatron/core/optimizer/emerging_optimizers.py) only excludes embedding/output params and non-2D params. in_proj.weight is 2D and untagged → it goes to Muon and is orthogonalized as a single matrix.
Orthogonalizing this heterogeneous fused weight as one matrix is not meaningful, because the update is normalized jointly across sub-blocks that play unrelated roles. Attention linear_qkv avoids this through the --muon-split-qkv path; in_proj has no such split and additionally fuses conv/gate/beta.
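The routing described above can be sketched in a few lines. This is a simplified stand-in, not the code in emerging_optimizers.py: the function name routes_to_adam_before_fix, the SimpleNamespace parameters, and all sizes are assumptions chosen for illustration. Only the in_proj_dim formula and the is_embedding_or_output_parameter attribute come from the description.

```python
from types import SimpleNamespace


def in_proj_dim(qk_dim, v_dim, num_value_heads):
    # Formula quoted in the PR description (gated_delta_net.py).
    return qk_dim * 2 + v_dim * 2 + num_value_heads * 2


def routes_to_adam_before_fix(param):
    """Hypothetical stand-in for the pre-fix predicate: True means Adam."""
    is_embedding = getattr(param, "is_embedding_or_output_parameter", False)
    return is_embedding or len(param.shape) != 2


hidden = 64  # illustrative size, not taken from any real config
rows = in_proj_dim(qk_dim=32, v_dim=64, num_value_heads=4)

# Fake parameters: only .shape and the optional tag matter to the predicate.
params = {
    "embedding.weight": SimpleNamespace(
        shape=(1000, hidden), is_embedding_or_output_parameter=True
    ),
    "norm.weight": SimpleNamespace(shape=(hidden,)),
    "in_proj.weight": SimpleNamespace(shape=(rows, hidden)),
}

routes = {
    name: "adam" if routes_to_adam_before_fix(p) else "muon"
    for name, p in params.items()
}
print(rows, routes)
```

Under these assumptions in_proj.weight is the only one of the three that lands on Muon, since it is 2D and carries no tag.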

Change

Reuses the existing Adam-routing infrastructure (so no new "split groups / checkpoint coordination" is needed):

gated_delta_net.py: tag in_proj.weight with skip_orthogonalization = True (mirroring how embeddings set is_embedding_or_output_parameter).
emerging_optimizers.py: _is_nonlinear_or_embedding also returns True for params flagged skip_orthogonalization, so they route to Adam.

out_proj (a regular 2D projection) is intentionally left on Muon.
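A minimal sketch of the change, assuming the predicate simply gains one more attribute check. The helper names (tag_skip_orthogonalization, routes_to_adam, split_params) and the toy shapes are hypothetical; only the skip_orthogonalization and is_embedding_or_output_parameter attribute names come from this PR.

```python
from types import SimpleNamespace


def tag_skip_orthogonalization(param):
    # Mirrors how embeddings are tagged: a plain attribute on the parameter.
    setattr(param, "skip_orthogonalization", True)
    return param


def routes_to_adam(param):
    """Hypothetical stand-in for the updated predicate: True means Adam."""
    return (
        getattr(param, "is_embedding_or_output_parameter", False)
        or getattr(param, "skip_orthogonalization", False)
        or len(param.shape) != 2
    )


def split_params(named_params):
    # Partition parameter names into the Muon group and the Adam fallback group.
    muon, adam = [], []
    for name, param in named_params:
        (adam if routes_to_adam(param) else muon).append(name)
    return muon, adam


in_proj_weight = tag_skip_orthogonalization(SimpleNamespace(shape=(200, 64)))
out_proj_weight = SimpleNamespace(shape=(64, 64))  # untagged, stays on Muon

muon_names, adam_names = split_params(
    [("in_proj.weight", in_proj_weight), ("out_proj.weight", out_proj_weight)]
)
print(muon_names, adam_names)
```

Because the tag is read with getattr and a False default, untagged parameters keep their current routing.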

Notes / open to alternatives

Introduced a generic skip_orthogonalization attribute rather than overloading is_embedding_or_output_parameter (semantically clearer; other fused modules could reuse it). Happy to rename or switch to a name-based ParamKey override if you prefer.
The alternative — splitting in_proj like QKV and orthogonalizing the q/k/v sub-blocks while keeping conv/gate/beta on Adam — is more invasive; this PR is the conservative routing fix. Glad to pursue the split approach instead if that's the preferred direction.
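To illustrate why the split alternative would behave differently from orthogonalizing the fused matrix, here is a toy comparison. It uses the classic cubic Newton-Schulz iteration as a stand-in for Muon's orthogonalization step (real implementations typically use a tuned variant), and the two 2x2 blocks, their names, and their values are made up for illustration.

```python
import math


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(w, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix using
    the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    fro = math.sqrt(sum(v * v for row in w for v in row))
    x = [[v / fro for v in row] for row in w]  # singular values now in (0, 1]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxtx)
        ]
    return x


# Two toy sub-blocks with different scales, stacked row-wise like a fused weight.
q_block = [[1.0, 0.0], [0.0, 1.0]]
gate_block = [[3.0, 0.0], [0.0, 3.0]]
fused = q_block + gate_block

joint = orthogonalize(fused)  # one orthogonalization over the fused matrix
split = orthogonalize(q_block) + orthogonalize(gate_block)  # one per block

print([round(row[i % 2], 4) for i, row in enumerate(joint)])
print([round(row[i % 2], 4) for i, row in enumerate(split)])
```

In the joint result the two blocks share a single normalization, so the larger block dominates; in the split result each block is normalized on its own. Which behavior is preferable for the q/k/v sub-blocks of in_proj is the open design question, and this sketch does not settle it.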

Verified the predicate and tagging compile; could not run the full optimizer test suite locally — relying on CI.

…onalizing (Muon)

The old hard guards blocking Muon for linear/gated attention were removed, but GatedDeltaNet in_proj.weight (a 2D fused q/k/v/conv/gate/beta matrix) is now silently routed to Muon and orthogonalized as a single matrix, which is not meaningful for this heterogeneous fused projection. Tag it with skip_orthogonalization so the existing predicate routes it to Adam, the same way embedding/output params are excluded. See NVIDIA#2885.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

copy-pr-bot · 2026-06-18T07:33:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-18T07:33:35Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

yuchenwang3 requested review from a team as code owners June 18, 2026 07:33

svcnvidia-nemo-ci marked this pull request as draft June 18, 2026 07:33

github-actions Bot added the community-request label Jun 18, 2026

yuchenwang3 mentioned this pull request Jun 18, 2026

Is there a hard blocker preventing Muon to be used with gated deltanet and gated attention? #2885

Open

yuchenwang3 marked this pull request as ready for review June 18, 2026 17:19

svcnvidia-nemo-ci requested a review from a team June 18, 2026 17:19

yuchenwang3 wants to merge 1 commit into NVIDIA:main from yuchenwang3:fix/gdn-in-proj-skip-muon
