Non-record: MLX tuned hyperparameters — 1.5096 BPB local (H100 pending)#1612
Open
seekerPrice wants to merge 1 commit into openai:main from
Conversation
…ing)

Pure hyperparameter tuning gives -0.05 BPB at 5000-step MLX scale:
- Matrix LR 0.022 → 0.02
- Muon momentum 0.99 → 0.95
- Muon warmup start 0.95 → 0.90
- QK-Gain 5.25 → 4.0

Same architecture, same training config, different hyperparameters.

Evidence:
- EXP-042 (SOTA defaults): val_bpb = 1.5596
- EXP-048 (tuned): val_bpb = 1.5096 ← -0.050 BPB

Local MLX only (M5 MacBook, 40M tokens). H100 3-seed validation pending compute credit approval.

Rationale: SOTA hyperparameters are tuned for H100 large-batch (524K) training. At our small-batch MLX (8K), they're too aggressive. Lower Muon momentum and a softer QK-Gain (which pairs with Partial RoPE) help.

3-AI methodology: Claude + Gemini + Codex independently recommended these exact values via theoretical analysis; the values were then empirically validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
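For reference, the default-vs-tuned values and the measured gap above can be summarized in a few lines of Python (a sketch; the dict keys are illustrative and do not necessarily match the field names in train_gpt.py):

```python
# Hypothetical summary of the sweep in this PR. Key names are
# illustrative; the real script reads these values from env vars.
SOTA_DEFAULTS = {
    "matrix_lr": 0.022,
    "muon_momentum": 0.99,
    "muon_warmup_start": 0.95,
    "qk_gain_init": 5.25,
}

TUNED = {
    "matrix_lr": 0.02,
    "muon_momentum": 0.95,
    "muon_warmup_start": 0.90,
    "qk_gain_init": 4.0,
}

# Measured validation bits-per-byte from the two local MLX runs.
VAL_BPB = {"EXP-042 (defaults)": 1.5596, "EXP-048 (tuned)": 1.5096}

delta = VAL_BPB["EXP-048 (tuned)"] - VAL_BPB["EXP-042 (defaults)"]
print(f"delta = {delta:+.3f} BPB")  # prints: delta = -0.050 BPB
```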
seekerPrice added a commit to seekerPrice/parameter-golf that referenced this pull request on Apr 14, 2026
…rallel residuals + tuned hparams)

Ports the MLX-validated recipe from PR openai#1612 (val_bpb=1.5096 local) to PyTorch/CUDA for H100 validation. All new features are opt-in via env vars; without any env vars set, behavior is identical to upstream train_gpt.py.

New env var toggles:
- RECUR_LAYERS: comma-separated physical layer indices to reuse (e.g., "3,4,5")
- RECUR_START_STEP: step at which to activate recurrence (0 = from init)
- PARALLEL_RESIDUAL: enable GPT-J style parallel attn+MLP (0/1)
- PARALLEL_START_LAYER: first physical layer to use parallel residuals

Code changes (additive, non-breaking):
- Hyperparameters: 4 new env-var fields
- Block.forward: accepts parallel=bool, branches between sequential/parallel residuals
- GPT.__init__: builds virtual-to-physical layer mapping
- GPT.set_recurrence_active: toggles recurrence on/off
- GPT.forward: iterates virtual layers via v2p map
- Training loop: activates recurrence at RECUR_START_STEP

Tuned hyperparameters (PR openai#1612) work via existing env vars:
MATRIX_LR=0.02 MUON_MOMENTUM=0.95 QK_GAIN_INIT=4.0

Files:
- train_gpt.py (1194 lines, +68 vs upstream)
- run_sota_match.sh: reference run with SOTA defaults
- run_tuned_hparams.sh: our tuned config for H100 validation
- README.md, submission.json

Status: code ready, syntax-checked. Awaiting Quick-start H100 credits to validate hyperparameter transfer from 40M → 2.4B tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
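The sequential-vs-parallel branch in Block.forward described above can be sketched as follows. This is a minimal stand-in, not the PR's implementation: `attn`, `mlp`, and `norm` are toy placeholders for the block's sublayers and pre-normalization, and their names are not guaranteed to match train_gpt.py.

```python
import numpy as np

def norm(x):
    # RMS-style normalization stand-in for the block's pre-norm.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6)

def block_forward(x, attn, mlp, parallel=False):
    """Residual block: sequential (GPT-2 style) or parallel (GPT-J style)."""
    if parallel:
        # Both sublayers read the same normalized input; their outputs
        # are added to the residual stream in a single step.
        h = norm(x)
        return x + attn(h) + mlp(h)
    # Sequential: the MLP sees the attention-updated stream.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

# Toy linear sublayers, just to exercise the wiring.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 8)) * 0.1
W_m = rng.normal(size=(8, 8)) * 0.1
attn = lambda h: h @ W_a
mlp = lambda h: h @ W_m

x = rng.normal(size=(4, 8))
y_seq = block_forward(x, attn, mlp, parallel=False)
y_par = block_forward(x, attn, mlp, parallel=True)
```

The two modes produce different outputs from the same weights, which is why the PR gates the parallel path behind PARALLEL_RESIDUAL so the default configuration stays bit-identical to upstream.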
Summary
A/B Evidence
Rationale
SOTA hyperparameters are tuned for H100 large-batch training (524K tokens). At our small-batch MLX runs (8K tokens), they're too aggressive.
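The opt-in override pattern the submission relies on (unset env vars reproduce upstream defaults) can be sketched like this. The helper name `env_float` is illustrative, not taken from train_gpt.py; only the env var names and values come from the PR description.

```python
import os

def env_float(name, default):
    """Read a float hyperparameter from an env var, falling back to the
    SOTA default so an unset environment reproduces upstream behavior."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

# SOTA defaults (per the PR description), overridable for small-batch runs.
matrix_lr = env_float("MATRIX_LR", 0.022)
muon_momentum = env_float("MUON_MOMENTUM", 0.99)
qk_gain_init = env_float("QK_GAIN_INIT", 5.25)

# Running with MATRIX_LR=0.02 MUON_MOMENTUM=0.95 QK_GAIN_INIT=4.0
# selects the tuned small-batch configuration.
```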
3-AI Methodology
Claude + Gemini (3 Pro) + Codex (GPT-5 Codex) independently recommended these exact hyperparameters via theoretical analysis; the values were then validated empirically with 10+ local experiments.
Test plan
Relation to open PR #1595
PR #1595 (3x MLP + QAT) is my previous non-record submission. This new submission demonstrates a different (simpler) technique with larger measured improvement. I'll update or close PR #1595 after reviewer feedback on this one.
🤖 Generated with Claude Code