Opt muon settings #698
base: master
Conversation
Pull request overview
This pull request introduces optimizer hyperparameter presets to streamline experiment configuration and improve reproducibility. The main addition is a "speedrun" preset that applies predefined hyperparameters for AdamW and Muon optimizers, along with YAML configuration files to leverage this new functionality.
Key Changes:
- Added `--optimizer_preset` CLI argument with "speedrun" preset support for AdamW and Muon optimizers
- Implemented automatic hyperparameter application through an `_apply_optimizer_presets()` method called during optimizer creation (a sketch follows below)
- Created new experiment configurations comparing optimizer performance with and without presets
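For orientation, here is a minimal sketch of what the preset application could look like, using only the values visible in the diff and review comments (AdamW: learning rate 6e-4, betas 0.9/0.95; Muon: momentum 0.95); the actual `_apply_optimizer_presets()` in train.py may set additional fields:

```python
def _apply_optimizer_presets(self):
    """Sketch: apply "speedrun" hyperparameters before the optimizer is built."""
    if self.args.optimizer_preset != "speedrun":
        return  # "none" keeps whatever hyperparameters the user supplied
    if self.args.optimizer == "adamw":
        # Values visible in the diff under review
        self.args.learning_rate = 6e-4
        self.args.beta1, self.args.beta2 = 0.9, 0.95
    elif self.args.optimizer == "muon":
        # Momentum value referenced by the review comment on train.py;
        # any further Muon-specific values are not shown in this excerpt
        self.args.muon_momentum = 0.95
```

With this in place, an experiment would presumably be launched as `python train.py --optimizer muon --optimizer_preset speedrun`, with the preset overwriting conflicting values, which is why the reviewer flags the redundant `muon_momentum: [0.95]` in the YAML below.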
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| train_args.py | Adds --optimizer_preset argument to support predefined optimizer hyperparameter configurations |
| train.py | Implements _apply_optimizer_presets() method to apply speedrun preset values for AdamW and Muon optimizers |
| explorations/muon_vs_adamw.yaml | Adds AdamW parameter group to enable comparison experiments |
| explorations/muon_speedrun_preset.yaml | New configuration file to compare speedrun presets for Muon and AdamW on minipile dataset |
```python
if self.args.optimizer == "adamw":
    self.args.learning_rate = 6e-4
    self.args.beta1, self.args.beta2 = 0.9, 0.95
```
Copilot AI · Dec 12, 2025
The attributes beta1 and beta2 are being set here, but based on the codebase, the AdamW optimizer in optimizer_variants.py uses args.beta1 and args.beta2 (not args.adamw_betas). While this appears correct, it's worth noting that there's also an --adamw_betas argument defined in train_args.py that takes 2 values, which could create confusion. Consider whether the preset should also update adamw_betas for consistency, or if the naming convention should be clarified.
Suggested change:
```python
self.args.beta1, self.args.beta2 = 0.9, 0.95
self.args.adamw_betas = (0.9, 0.95)
```
```yaml
- optimizer: ["muon"]
  optimizer_preset: ["speedrun"]
  muon_momentum: [0.95]
- optimizer: ["adamw"]
```
Copilot AI · Dec 12, 2025
There are three parameter groups defined here, with line 9 appearing to be a duplicate adamw entry without any configuration. The first two groups (lines 4-5 and 6-8) have specific configurations with the speedrun preset, but the third group (line 9) only specifies the optimizer without any preset or other parameters. This creates ambiguity about what configuration this third group represents. Consider removing this duplicate entry or clarifying its purpose if it's intended to represent a baseline adamw configuration without the preset.
Suggested change (remove this line):
```yaml
- optimizer: ["adamw"]
```
```yaml
  optimizer_preset: ["speedrun"]
- optimizer: ["muon"]
  optimizer_preset: ["speedrun"]
  muon_momentum: [0.95]
```
Copilot AI · Dec 12, 2025
The parameter group on line 8 explicitly sets muon_momentum: [0.95], but the speedrun preset (line 7) already sets this exact value in train.py line 512. This creates redundancy where the same value is specified in both the YAML configuration and the preset code. Consider removing the explicit muon_momentum setting from the YAML since the preset handles this, unless there's a specific reason to document or override the preset value.
Suggested change (remove this line):
```yaml
muon_momentum: [0.95]
```
```python
training_group.add_argument(
    "--optimizer_preset",
    type=str,
    default="none",
    choices=["none", "speedrun"],
    help="Optional preset hyperparameters for supported optimizers (currently AdamW and Muon).",
)
```
Copilot AI · Dec 12, 2025
The --optimizer_preset argument is positioned in the middle of optimizer-specific sections (between MUON and ADAMW). Since this preset applies to multiple optimizers (AdamW and Muon as noted in the help text), consider moving this argument to a more general location, such as right after the --optimizer argument or in a separate "Optimizer Presets" section, to better reflect its cross-optimizer nature and improve discoverability.
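As an illustration of that suggestion, a sketch of the relocated registration; the `--optimizer` argument's default and help text here are assumptions, not copied from train_args.py:

```python
# Sketch: keep the cross-optimizer preset flag next to --optimizer rather than
# inside the Muon- or AdamW-specific argument sections.
training_group.add_argument(
    "--optimizer",
    type=str,
    default="adamw",  # default assumed for illustration
    help="Which optimizer variant to use.",
)
training_group.add_argument(
    "--optimizer_preset",
    type=str,
    default="none",
    choices=["none", "speedrun"],
    help="Optional preset hyperparameters for supported optimizers (currently AdamW and Muon).",
)
```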
```yaml
  adamw_eps: ["1e-8"]
- optimizer: ["muon"]
  learning_rate: ["0.05"]
  muon_momentum: [0.95]
```
Copilot AI · Dec 12, 2025
This parameter group only specifies the optimizer without any configuration parameters. While this might be intentional to test default AdamW settings alongside the configured groups above, it creates ambiguity about the purpose of this third group. Consider either removing this if it's not needed, or adding a comment to clarify that this represents a baseline/default AdamW configuration.
Suggested change:
```yaml
muon_momentum: [0.95]
# Baseline/default AdamW configuration (uses all default AdamW settings)
```
This pull request introduces support for optimizer hyperparameter presets, specifically a "speedrun" preset for both AdamW and Muon optimizers, and updates configuration files to utilize this feature for more streamlined experiment setup and reproducibility. The changes also add a new YAML preset for comparing Muon and AdamW optimizers on the minipile dataset.
Optimizer preset support and integration:
- Added an `--optimizer_preset` argument to the training script, allowing users to select from predefined optimizer hyperparameter configurations (currently "none" or "speedrun"). (train_args.py)
- Introduced an `_apply_optimizer_presets` method in the training logic to automatically set optimizer hyperparameters when the "speedrun" preset is selected, with specific values for AdamW and Muon. This method is now called during optimizer creation. (train.py)

Experiment configuration updates:
- Added a new configuration file, `muon_speedrun_preset.yaml`, to compare NanoGPT speedrun optimizer presets for Muon and AdamW on the minipile dataset, leveraging the new preset functionality.
- Updated `muon_vs_adamw.yaml` to include an AdamW optimizer group, enabling direct comparison with Muon in experiments.
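For readers skimming the diff, a rough sketch of the call-site described above ("called during optimizer creation"); the surrounding method name and the `build_optimizer_variant` helper are hypothetical placeholders, and only `_apply_optimizer_presets` comes from this PR:

```python
def create_optimizer(self):
    # Apply preset values first so downstream code reads the (possibly
    # overridden) hyperparameters from self.args.
    self._apply_optimizer_presets()
    # build_optimizer_variant stands in for however optimizer_variants.py
    # actually constructs the optimizer from self.args.
    return build_optimizer_variant(self.args, self.model)
```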