fix(optimizer): route GatedDeltaNet in_proj to Adam instead of orthogonalizing it (Muon) #5400

Open
yuchenwang3 wants to merge 1 commit into NVIDIA:main from yuchenwang3:fix/gdn-in-proj-skip-muon

@yuchenwang3

Summary

Routes the GatedDeltaNet in_proj weight to the fallback (Adam) optimizer so it is not orthogonalized by Muon. Addresses the open question in #2885.

Background (re #2885)

#2885 asks whether linear-attention / gated-attention params could be "routed to Adam while the rest of the layers continue to use Muon", and a reply notes this "would require explicit parameter tagging, split optimizer param groups, and coordinated stepping and checkpointing."

Since then the old hard guards (assert args.linear_attention_type is None, ... / assert not args.attention_output_gate, ...) have been removed from main. But the routing that the discussion called for was not added, so today a GatedDeltaNet model trained with --optimizer muon silently sends in_proj.weight to Muon:

  • in_proj packs the per-head q, k, v, conv, gate, beta projections into one matrix (in_proj_dim = qk_dim*2 + v_dim*2 + num_value_heads*2, gated_delta_net.py).
  • The Muon routing predicate _is_nonlinear_or_embedding (megatron/core/optimizer/emerging_optimizers.py) only excludes embedding/output params and non-2D params. in_proj.weight is 2D and untagged → it goes to Muon and is orthogonalized as a single matrix.
  • Orthogonalizing this heterogeneous fused weight as one matrix is not meaningful (unlike attention linear_qkv, which is handled by the --muon-split-qkv path; in_proj has no such split and additionally fuses conv/gate/beta).
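
The routing described in the bullets above can be illustrated with a small stand-alone sketch. It is not the Megatron code: `FakeParam`, the function bodies, and all sizes are hypothetical stand-ins, and the predicate is only a paraphrase of the rule as described here (exclude embedding/output params and non-2D params). The `in_proj_dim` formula is the one quoted above.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Hypothetical stand-in for a parameter; holds only what the predicate reads."""
    shape: tuple
    is_embedding_or_output_parameter: bool = False


def in_proj_dim(qk_dim: int, v_dim: int, num_value_heads: int) -> int:
    # Formula as quoted in the PR description.
    return qk_dim * 2 + v_dim * 2 + num_value_heads * 2


def is_nonlinear_or_embedding(param) -> bool:
    """Paraphrase of the pre-fix rule. True means 'route to Adam'."""
    if getattr(param, "is_embedding_or_output_parameter", False):
        return True
    return len(param.shape) != 2


# Illustrative sizes only; they do not come from any real config.
hidden = 2048
rows = in_proj_dim(qk_dim=1024, v_dim=2048, num_value_heads=16)

in_proj_weight = FakeParam(shape=(rows, hidden))
embedding = FakeParam(shape=(32000, hidden), is_embedding_or_output_parameter=True)
norm_weight = FakeParam(shape=(hidden,))

for name, p in [("in_proj", in_proj_weight), ("embedding", embedding), ("norm", norm_weight)]:
    target = "Adam" if is_nonlinear_or_embedding(p) else "Muon"
    print(f"{name}: shape={p.shape} -> {target}")
```

Under these assumptions the fused `in_proj` weight is 2D and untagged, so the predicate returns False and the weight lands in the Muon group, which is the behavior this PR changes.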

Change

Reuses the existing Adam-routing infrastructure (so no new "split groups / checkpoint coordination" is needed):

  1. gated_delta_net.py: tag in_proj.weight with skip_orthogonalization = True (mirroring how embeddings set is_embedding_or_output_parameter).
  2. emerging_optimizers.py: _is_nonlinear_or_embedding also returns True for params flagged skip_orthogonalization, so they route to Adam.

out_proj (a regular 2D projection) is intentionally left on Muon.
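
A minimal sketch of the two-step change, assuming the predicate reads the new flag with `getattr` in the same way it reads `is_embedding_or_output_parameter`. `FakeParam`, `split_for_muon`, and the parameter names and shapes are hypothetical and only illustrate the intended routing; the actual diff may be structured differently.

```python
class FakeParam:
    """Hypothetical stand-in for a parameter with optional routing flags."""

    def __init__(self, name, shape, **flags):
        self.name = name
        self.shape = shape
        for key, value in flags.items():
            setattr(self, key, value)


def is_nonlinear_or_embedding(param) -> bool:
    """Sketch of the post-fix rule. True means 'route to Adam'."""
    if getattr(param, "is_embedding_or_output_parameter", False):
        return True
    # Step 2 of the change: tagged params also skip Muon.
    if getattr(param, "skip_orthogonalization", False):
        return True
    return len(param.shape) != 2


def split_for_muon(params):
    """Partition parameter names into (muon, adam) lists using the predicate."""
    muon, adam = [], []
    for p in params:
        (adam if is_nonlinear_or_embedding(p) else muon).append(p.name)
    return muon, adam


params = [
    # Step 1 of the change: the module tags its fused in_proj weight.
    FakeParam("in_proj.weight", (6176, 2048), skip_orthogonalization=True),
    FakeParam("out_proj.weight", (2048, 2048)),
    FakeParam("embedding.weight", (32000, 2048), is_embedding_or_output_parameter=True),
    FakeParam("norm.weight", (2048,)),
]

muon_names, adam_names = split_for_muon(params)
print("Muon:", muon_names)
print("Adam:", adam_names)
```

Because the flag is read with a default of False, untagged parameters keep their current routing, and only modules that opt in are moved to Adam.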

Notes / open to alternatives

  • Introduced a generic skip_orthogonalization attribute rather than overloading is_embedding_or_output_parameter (semantically clearer; other fused modules could reuse it). Happy to rename or switch to a name-based ParamKey override if you prefer.
  • The alternative — splitting in_proj like QKV and orthogonalizing the q/k/v sub-blocks while keeping conv/gate/beta on Adam — is more invasive; this PR is the conservative routing fix. Glad to pursue the split approach instead if that's the preferred direction.

Verified that the predicate and tagging changes compile. I could not run the full optimizer test suite locally, so this relies on CI.

…onalizing (Muon)

The old hard guards blocking Muon for linear/gated attention were removed,
but GatedDeltaNet in_proj.weight (a 2D fused q/k/v/conv/gate/beta matrix) is
now silently routed to Muon and orthogonalized as a single matrix, which is
not meaningful for this heterogeneous fused projection. Tag it with
skip_orthogonalization so the existing predicate routes it to Adam, the same
way embedding/output params are excluded. See NVIDIA#2885.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
@yuchenwang3 yuchenwang3 requested review from a team as code owners June 18, 2026 07:33
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft June 18, 2026 07:33
@github-actions

Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.
