
Non-record: MLX tuned hyperparameters — 1.5096 BPB local (H100 pending)#1612

Open
seekerPrice wants to merge 1 commit into openai:main from seekerPrice:submission/mlx-tuned-hparams-1.5096

Conversation

@seekerPrice

Summary

  • Pure hyperparameter tuning gives -0.0500 BPB at 5000-step MLX scale (local A/B test)
  • Same architecture, same training config, only 4 hyperparameters changed
  • Local MLX only (M5 MacBook Pro) — H100 3-seed validation pending compute credit approval

A/B Evidence

| Experiment | Matrix LR | Muon Momentum | QK-Gain | val_bpb |
|---|---|---|---|---|
| EXP-042 (SOTA defaults) | 0.022 | 0.99 | 5.25 | 1.5596 |
| EXP-048 (tuned) | 0.02 | 0.95 | 4.0 | 1.5096 |

Δ: -0.0500 BPB

Rationale

SOTA hyperparameters are tuned for H100 large-batch training (524K tokens). At our small-batch MLX runs (8K tokens), they're too aggressive:

  • Matrix LR 0.022 → 0.02: smaller update magnitude
  • Muon momentum 0.99 → 0.95: relies less on stale history when small-batch gradients are noisy
  • Muon momentum warmup 0.95 → 0.90: slower warmup reduces early loss spikes
  • QK-Gain 5.25 → 4.0: softer attention (pairs with Partial RoPE 16d; overly sharp attention overreacts to the non-rotated content dims)
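To make the four changes concrete, here is a minimal Python sketch of the two configurations and the momentum warmup they imply. The names (`SOTA`, `TUNED`, `muon_momentum_at`) and the linear warmup shape are illustrative assumptions, not the actual train_gpt.py API.

```python
# Hypothetical sketch: the tuned values vs. SOTA defaults, plus a linear
# Muon momentum warmup. Names and warmup shape are assumed, not from the repo.
SOTA = dict(matrix_lr=0.022, muon_momentum=0.99, muon_momentum_warmup=0.95, qk_gain=5.25)
TUNED = dict(matrix_lr=0.02, muon_momentum=0.95, muon_momentum_warmup=0.90, qk_gain=4.0)

def muon_momentum_at(step, warmup_steps, cfg):
    """Linearly ramp momentum from its warmup value to its final value."""
    if step >= warmup_steps:
        return cfg["muon_momentum"]
    frac = step / warmup_steps
    return cfg["muon_momentum_warmup"] + frac * (
        cfg["muon_momentum"] - cfg["muon_momentum_warmup"]
    )
```

Under this sketch the tuned config starts at momentum 0.90 and ramps to 0.95, so early steps lean less on accumulated gradient history than the 0.95 → 0.99 default schedule.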

3-AI Methodology

Claude, Gemini (3 Pro), and Codex (GPT-5 Codex) independently recommended these exact hyperparameters via theoretical analysis; the values were then empirically validated with 10+ local experiments.

Test plan

  • Local MLX A/B test (EXP-042 vs EXP-048, 5000 steps each)
  • Experimental logs included (exp042_train.log, exp048_train.log)
  • Reproducible script (run_experiment.sh)
  • H100 20K-step 3-seed validation (seeds 42, 314, 999) — pending compute credits
  • Combined with SOTA architecture stack

Relation to open PR #1595

PR #1595 (3x MLP + QAT) is my previous non-record submission. This new submission demonstrates a different (simpler) technique with larger measured improvement. I'll update or close PR #1595 after reviewer feedback on this one.

🤖 Generated with Claude Code

…ing)

Pure hyperparameter tuning gives -0.05 BPB at 5000-step MLX scale:
- Matrix LR 0.022 → 0.02
- Muon momentum 0.99 → 0.95
- Muon warmup start 0.95 → 0.90
- QK-Gain 5.25 → 4.0

Same architecture, same training config, different hyperparameters.

Evidence:
- EXP-042 (SOTA defaults): val_bpb = 1.5596
- EXP-048 (tuned):         val_bpb = 1.5096  ← -0.050 BPB

Local MLX only (M5 MacBook, 40M tokens). H100 3-seed validation
pending compute credit approval.

Rationale: SOTA hyperparameters are tuned for H100 large-batch
(524K) training. At our small-batch MLX (8K), they're too aggressive.
Lower Muon momentum and softer QK-Gain (pairs with Partial RoPE) help.

3-AI methodology: Claude + Gemini + Codex independently recommended
these exact values via theoretical analysis, then empirically validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
seekerPrice added a commit to seekerPrice/parameter-golf that referenced this pull request Apr 14, 2026
…rallel residuals + tuned hparams)

Ports the MLX-validated recipe from PR openai#1612 (val_bpb=1.5096 local) to
PyTorch/CUDA for H100 validation. All new features are opt-in via env vars;
without any env vars set, behavior is identical to upstream train_gpt.py.

New env var toggles:
- RECUR_LAYERS: comma-separated physical layer indices to reuse (e.g., "3,4,5")
- RECUR_START_STEP: step at which to activate recurrence (0 = from init)
- PARALLEL_RESIDUAL: enable GPT-J style parallel attn+MLP (0/1)
- PARALLEL_START_LAYER: first physical layer to use parallel residuals
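The toggles above could be parsed along these lines. The env-var names match the list in this commit message; the parsing code itself is an assumed sketch, not the actual train_gpt.py implementation.

```python
import os

# Sketch of parsing the documented env-var toggles. Variable names come
# from the commit message; defaults and parsing logic are assumptions.
def parse_toggles(env=None):
    if env is None:
        env = os.environ
    recur = env.get("RECUR_LAYERS", "")
    return dict(
        # e.g. "3,4,5" -> [3, 4, 5]; empty string -> recurrence disabled
        recur_layers=[int(i) for i in recur.split(",")] if recur else [],
        recur_start_step=int(env.get("RECUR_START_STEP", "0")),
        parallel_residual=env.get("PARALLEL_RESIDUAL", "0") == "1",
        parallel_start_layer=int(env.get("PARALLEL_START_LAYER", "0")),
    )
```

With no env vars set, every toggle resolves to its off/zero default, matching the "identical to upstream" claim.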

Code changes (additive, non-breaking):
- Hyperparameters: 4 new env-var fields
- Block.forward: accepts parallel=bool, branches between sequential/parallel residuals
- GPT.__init__: builds virtual-to-physical layer mapping
- GPT.set_recurrence_active: toggles recurrence on/off
- GPT.forward: iterates virtual layers via v2p map
- Training loop: activates recurrence at RECUR_START_STEP
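The sequential-vs-parallel branch in Block.forward can be sketched as below. `attn` and `mlp` stand in for the real sublayers (including their norms), so this is the GPT-J-style dataflow only, not the actual train_gpt.py code.

```python
# Minimal sketch of the residual branch described above. In the parallel
# (GPT-J style) case both sublayers read the same input; sequentially,
# the MLP sees the post-attention stream. Sublayers are placeholders.
def block_forward(x, attn, mlp, parallel=False):
    if parallel:
        return x + attn(x) + mlp(x)
    x = x + attn(x)
    return x + mlp(x)
```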

Tuned hyperparameters (PR openai#1612) work via existing env vars:
  MATRIX_LR=0.02 MUON_MOMENTUM=0.95 QK_GAIN_INIT=4.0

Files:
- train_gpt.py (1194 lines, +68 vs upstream)
- run_sota_match.sh: reference run with SOTA defaults
- run_tuned_hparams.sh: our tuned config for H100 validation
- README.md, submission.json

Status: code ready, syntax-checked. Awaiting Quick-start H100 credits
to validate hyperparameter transfer from 40M → 2.4B tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
