
Conversation

@klei22 (Collaborator) commented Dec 12, 2025

This pull request introduces support for optimizer hyperparameter presets, specifically a "speedrun" preset for both the AdamW and Muon optimizers, and updates configuration files to use this feature for more streamlined experiment setup and reproducibility. The changes also add a new YAML configuration for comparing Muon and AdamW on the minipile dataset.

Optimizer preset support and integration:

  • Added a new --optimizer_preset argument to the training script, allowing users to select from predefined optimizer hyperparameter configurations (currently "none" or "speedrun"). (train_args.py)
  • Implemented the _apply_optimizer_presets method in the training logic to automatically set optimizer hyperparameters when the "speedrun" preset is selected, with specific values for AdamW and Muon. This method is now called during optimizer creation. (train.py)
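
For reference, a minimal sketch of what the preset hook might look like, reconstructed only from the values quoted in this PR and its review (AdamW: learning rate 6e-4 and betas 0.9/0.95; Muon: momentum 0.95). The actual method in train.py may set additional hyperparameters or be structured differently.

```python
def _apply_optimizer_presets(self):
    # Sketch only: built from the values quoted in this PR, not the full method.
    if self.args.optimizer_preset != "speedrun":
        return  # "none" leaves user-supplied hyperparameters untouched

    if self.args.optimizer == "adamw":
        # Speedrun AdamW values shown in the diff excerpt reviewed below
        self.args.learning_rate = 6e-4
        self.args.beta1, self.args.beta2 = 0.9, 0.95
    elif self.args.optimizer == "muon":
        # The review notes the speedrun preset sets this value in train.py
        self.args.muon_momentum = 0.95
```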

Experiment configuration updates:

  • Added a new experiment configuration file, muon_speedrun_preset.yaml, to compare NanoGPT speedrun optimizer presets for Muon and AdamW on the minipile dataset, leveraging the new preset functionality.
  • Updated muon_vs_adamw.yaml to include an AdamW optimizer group, enabling direct comparison with Muon in experiments.
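
For context, the optimizer groups in the new file appear to look roughly like the excerpt below. This is pieced together from the fragments quoted in the review comments further down; surrounding keys (dataset, model settings, and any top-level group name) are omitted, and the exact contents of the real file may differ.

```yaml
# Illustrative excerpt only, reconstructed from fragments quoted in the review.
- optimizer: ["adamw"]
  optimizer_preset: ["speedrun"]
- optimizer: ["muon"]
  optimizer_preset: ["speedrun"]
  muon_momentum: [0.95]
- optimizer: ["adamw"]   # bare group; flagged in the review as a possible baseline or duplicate
```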

Copilot AI left a comment

Pull request overview

This pull request introduces optimizer hyperparameter presets to streamline experiment configuration and improve reproducibility. The main addition is a "speedrun" preset that applies predefined hyperparameters for AdamW and Muon optimizers, along with YAML configuration files to leverage this new functionality.

Key Changes:

  • Added --optimizer_preset CLI argument with "speedrun" preset support for AdamW and Muon optimizers
  • Implemented automatic hyperparameter application through _apply_optimizer_presets() method called during optimizer creation
  • Created new experiment configurations comparing optimizer performance with and without presets

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File | Description
--- | ---
train_args.py | Adds --optimizer_preset argument to support predefined optimizer hyperparameter configurations
train.py | Implements _apply_optimizer_presets() method to apply speedrun preset values for AdamW and Muon optimizers
explorations/muon_vs_adamw.yaml | Adds AdamW parameter group to enable comparison experiments
explorations/muon_speedrun_preset.yaml | New configuration file to compare speedrun presets for Muon and AdamW on the minipile dataset



if self.args.optimizer == "adamw":
    self.args.learning_rate = 6e-4
    self.args.beta1, self.args.beta2 = 0.9, 0.95
Copilot AI Dec 12, 2025

The attributes beta1 and beta2 are being set here, but based on the codebase, the AdamW optimizer in optimizer_variants.py uses args.beta1 and args.beta2 (not args.adamw_betas). While this appears correct, it's worth noting that there's also an --adamw_betas argument defined in train_args.py that takes 2 values, which could create confusion. Consider whether the preset should also update adamw_betas for consistency, or if the naming convention should be clarified.

Suggested change
- self.args.beta1, self.args.beta2 = 0.9, 0.95
+ self.args.beta1, self.args.beta2 = 0.9, 0.95
+ self.args.adamw_betas = (0.9, 0.95)

- optimizer: ["muon"]
  optimizer_preset: ["speedrun"]
  muon_momentum: [0.95]
- optimizer: ["adamw"]
Copilot AI Dec 12, 2025

There are three parameter groups defined here, with line 9 appearing to be a duplicate adamw entry without any configuration. The first two groups (lines 4-5 and 6-8) have specific configurations with the speedrun preset, but the third group (line 9) only specifies the optimizer without any preset or other parameters. This creates ambiguity about what configuration this third group represents. Consider removing this duplicate entry or clarifying its purpose if it's intended to represent a baseline adamw configuration without the preset.

Suggested change (remove this line)
- optimizer: ["adamw"]

  optimizer_preset: ["speedrun"]
- optimizer: ["muon"]
  optimizer_preset: ["speedrun"]
  muon_momentum: [0.95]
Copilot AI Dec 12, 2025

The parameter group on line 8 explicitly sets muon_momentum: [0.95], but the speedrun preset (line 7) already sets this exact value in train.py line 512. This creates redundancy where the same value is specified in both the YAML configuration and the preset code. Consider removing the explicit muon_momentum setting from the YAML since the preset handles this, unless there's a specific reason to document or override the preset value.

Suggested change (remove this line)
  muon_momentum: [0.95]

Comment on lines +345 to +351
training_group.add_argument(
    "--optimizer_preset",
    type=str,
    default="none",
    choices=["none", "speedrun"],
    help="Optional preset hyperparameters for supported optimizers (currently AdamW and Muon).",
)
Copilot AI Dec 12, 2025

The --optimizer_preset argument is positioned in the middle of optimizer-specific sections (between MUON and ADAMW). Since this preset applies to multiple optimizers (AdamW and Muon as noted in the help text), consider moving this argument to a more general location, such as right after the --optimizer argument or in a separate "Optimizer Presets" section, to better reflect its cross-optimizer nature and improve discoverability.
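
A possible arrangement along those lines is sketched below, under the assumption that train_args.py defines --optimizer in the same training_group; its exact signature is not shown in this PR, so it is elided here. Only the --optimizer_preset definition is taken verbatim from the diff; the placement and section comment are illustration.

```python
# --- Optimizer selection and presets ---
# ... existing --optimizer argument definition would sit here ...

training_group.add_argument(
    "--optimizer_preset",
    type=str,
    default="none",
    choices=["none", "speedrun"],
    help="Optional preset hyperparameters for supported optimizers (currently AdamW and Muon).",
)
```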

  adamw_eps: ["1e-8"]
- optimizer: ["muon"]
  learning_rate: ["0.05"]
  muon_momentum: [0.95]
Copilot AI Dec 12, 2025

This parameter group only specifies the optimizer without any configuration parameters. While this might be intentional to test default AdamW settings alongside the configured groups above, it creates ambiguity about the purpose of this third group. Consider either removing this if it's not needed, or adding a comment to clarify that this represents a baseline/default AdamW configuration.

Suggested change
-   muon_momentum: [0.95]
+   muon_momentum: [0.95]
+   # Baseline/default AdamW configuration (uses all default AdamW settings)
