Non-record: MLX tuned hyperparameters — 1.5096 BPB local (H100 pending)#1612
Open
seekerPrice wants to merge 1 commit into openai:main from
Conversation
…ing)

Pure hyperparameter tuning gives -0.05 BPB at 5000-step MLX scale:
- Matrix LR 0.022 → 0.02
- Muon momentum 0.99 → 0.95
- Muon warmup start 0.95 → 0.90
- QK-Gain 5.25 → 4.0

Same architecture, same training config, different hyperparameters.

Evidence:
- EXP-042 (SOTA defaults): val_bpb = 1.5596
- EXP-048 (tuned): val_bpb = 1.5096 ← -0.050 BPB

Local MLX only (M5 MacBook, 40M tokens). H100 3-seed validation pending compute credit approval.

Rationale: SOTA hyperparameters are tuned for H100 large-batch (524K) training. At our small-batch MLX (8K), they're too aggressive. Lower Muon momentum and a softer QK-Gain (which pairs with Partial RoPE) help.

3-AI methodology: Claude + Gemini + Codex independently recommended these exact values via theoretical analysis; the values were then empirically validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
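For reference, the default-vs-tuned values and the measured gap above can be summarized in a few lines of Python (a sketch; the dict keys are illustrative and do not necessarily match the field names in train_gpt.py):

```python
# Hypothetical summary of the sweep in this PR. Key names are
# illustrative; the real script reads these values from env vars.
SOTA_DEFAULTS = {
    "matrix_lr": 0.022,
    "muon_momentum": 0.99,
    "muon_warmup_start": 0.95,
    "qk_gain_init": 5.25,
}

TUNED = {
    "matrix_lr": 0.02,
    "muon_momentum": 0.95,
    "muon_warmup_start": 0.90,
    "qk_gain_init": 4.0,
}

# Measured validation bits-per-byte from the two local MLX runs.
VAL_BPB = {"EXP-042 (defaults)": 1.5596, "EXP-048 (tuned)": 1.5096}

delta = VAL_BPB["EXP-048 (tuned)"] - VAL_BPB["EXP-042 (defaults)"]
print(f"delta = {delta:+.3f} BPB")  # prints: delta = -0.050 BPB
```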
seekerPrice added a commit to seekerPrice/parameter-golf that referenced this pull request on Apr 14, 2026
…rallel residuals + tuned hparams)

Ports the MLX-validated recipe from PR openai#1612 (val_bpb=1.5096 local) to PyTorch/CUDA for H100 validation. All new features are opt-in via env vars; without any env vars set, behavior is identical to upstream train_gpt.py.

New env var toggles:
- RECUR_LAYERS: comma-separated physical layer indices to reuse (e.g., "3,4,5")
- RECUR_START_STEP: step at which to activate recurrence (0 = from init)
- PARALLEL_RESIDUAL: enable GPT-J style parallel attn+MLP (0/1)
- PARALLEL_START_LAYER: first physical layer to use parallel residuals

Code changes (additive, non-breaking):
- Hyperparameters: 4 new env-var fields
- Block.forward: accepts parallel=bool, branches between sequential/parallel residuals
- GPT.__init__: builds virtual-to-physical layer mapping
- GPT.set_recurrence_active: toggles recurrence on/off
- GPT.forward: iterates virtual layers via v2p map
- Training loop: activates recurrence at RECUR_START_STEP

Tuned hyperparameters (PR openai#1612) work via existing env vars:
MATRIX_LR=0.02 MUON_MOMENTUM=0.95 QK_GAIN_INIT=4.0

Files:
- train_gpt.py (1194 lines, +68 vs upstream)
- run_sota_match.sh: reference run with SOTA defaults
- run_tuned_hparams.sh: our tuned config for H100 validation
- README.md, submission.json

Status: code ready, syntax-checked. Awaiting Quick-start H100 credits to validate hyperparameter transfer from 40M → 2.4B tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
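The sequential-vs-parallel branch in Block.forward described above can be sketched as follows. This is a minimal stand-in, not the PR's implementation: `attn`, `mlp`, and `norm` are toy placeholders for the block's sublayers and pre-normalization, and their names are not guaranteed to match train_gpt.py.

```python
import numpy as np

def norm(x):
    # RMS-style normalization stand-in for the block's pre-norm.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6)

def block_forward(x, attn, mlp, parallel=False):
    """Residual block: sequential (GPT-2 style) or parallel (GPT-J style)."""
    if parallel:
        # Both sublayers read the same normalized input; their outputs
        # are added to the residual stream in a single step.
        h = norm(x)
        return x + attn(h) + mlp(h)
    # Sequential: the MLP sees the attention-updated stream.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

# Toy linear sublayers, just to exercise the wiring.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 8)) * 0.1
W_m = rng.normal(size=(8, 8)) * 0.1
attn = lambda h: h @ W_a
mlp = lambda h: h @ W_m

x = rng.normal(size=(4, 8))
y_seq = block_forward(x, attn, mlp, parallel=False)
y_par = block_forward(x, attn, mlp, parallel=True)
```

The two modes produce different outputs from the same weights, which is why the PR gates the parallel path behind PARALLEL_RESIDUAL so the default configuration stays bit-identical to upstream.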
Summary
A/B Evidence
Rationale
SOTA hyperparameters are tuned for H100 large-batch training (524K tokens). At our small-batch MLX runs (8K tokens), they're too aggressive.
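The opt-in override pattern the submission relies on (unset env vars reproduce upstream defaults) can be sketched like this. The helper name `env_float` is illustrative, not taken from train_gpt.py; only the env var names and values come from the PR description.

```python
import os

def env_float(name, default):
    """Read a float hyperparameter from an env var, falling back to the
    SOTA default so an unset environment reproduces upstream behavior."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

# SOTA defaults (per the PR description), overridable for small-batch runs.
matrix_lr = env_float("MATRIX_LR", 0.022)
muon_momentum = env_float("MUON_MOMENTUM", 0.99)
qk_gain_init = env_float("QK_GAIN_INIT", 5.25)

# Running with MATRIX_LR=0.02 MUON_MOMENTUM=0.95 QK_GAIN_INIT=4.0
# selects the tuned small-batch configuration.
```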
3-AI Methodology
Claude + Gemini (3 Pro) + Codex (GPT-5 Codex) independently recommended these exact hyperparameters via theoretical analysis; the values were then validated empirically with 10+ local experiments.
Test plan
Relation to open PR #1595
PR #1595 (3x MLP + QAT) is my previous non-record submission. This new submission demonstrates a different (simpler) technique with larger measured improvement. I'll update or close PR #1595 after reviewer feedback on this one.
🤖 Generated with Claude Code