fix(optimizer): route GatedDeltaNet in_proj to Adam instead of orthogonalizing it (Muon)#5400
Open
yuchenwang3 wants to merge 1 commit into
Open
fix(optimizer): route GatedDeltaNet in_proj to Adam instead of orthogonalizing it (Muon)#5400yuchenwang3 wants to merge 1 commit into
yuchenwang3 wants to merge 1 commit into
Conversation
…onalizing (Muon) The old hard guards blocking Muon for linear/gated attention were removed, but GatedDeltaNet in_proj.weight (a 2D fused q/k/v/conv/gate/beta matrix) is now silently routed to Muon and orthogonalized as a single matrix, which is not meaningful for this heterogeneous fused projection. Tag it with skip_orthogonalization so the existing predicate routes it to Adam, the same way embedding/output params are excluded. See NVIDIA#2885. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
Contributor
|
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. This will:
See the contribution guide for more details. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Routes the GatedDeltaNet
in_projweight to the fallback (Adam) optimizer so it is not orthogonalized by Muon. Addresses the open question in #2885.Background (re #2885)
#2885 asks whether linear-attention / gated-attention params could be "routed to Adam while the rest of the layers continue to use Muon", and a reply notes this "would require explicit parameter tagging, split optimizer param groups, and coordinated stepping and checkpointing."
Since then the old hard guards (
assert args.linear_attention_type is None, .../assert not args.attention_output_gate, ...) have been removed frommain. But the routing that the discussion called for was not added, so today a GatedDeltaNet model trained with--optimizer muonsilently sendsin_proj.weightto Muon:in_projpacks the per-head q, k, v, conv, gate, beta projections into one matrix (in_proj_dim = qk_dim*2 + v_dim*2 + num_value_heads*2,gated_delta_net.py)._is_nonlinear_or_embedding(megatron/core/optimizer/emerging_optimizers.py) only excludes embedding/output params and non-2D params.in_proj.weightis 2D and untagged → it goes to Muon and is orthogonalized as a single matrix.linear_qkv, which is handled by the--muon-split-qkvpath;in_projhas no such split and additionally fuses conv/gate/beta).Change
Reuses the existing Adam-routing infrastructure (so no new "split groups / checkpoint coordination" is needed):
gated_delta_net.py: tagin_proj.weightwithskip_orthogonalization = True(mirroring how embeddings setis_embedding_or_output_parameter).emerging_optimizers.py:_is_nonlinear_or_embeddingalso returnsTruefor params flaggedskip_orthogonalization, so they route to Adam.out_proj(a regular 2D projection) is intentionally left on Muon.Notes / open to alternatives
skip_orthogonalizationattribute rather than overloadingis_embedding_or_output_parameter(semantically clearer; other fused modules could reuse it). Happy to rename or switch to a name-basedParamKeyoverride if you prefer.in_projlike QKV and orthogonalizing the q/k/v sub-blocks while keeping conv/gate/beta on Adam — is more invasive; this PR is the conservative routing fix. Glad to pursue the split approach instead if that's the preferred direction.Verified the predicate and tagging compile; could not run the full optimizer test suite locally — relying on CI.