Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) by dexhunter · Pull Request #1260 · openai/parameter-golf

dexhunter · 2026-04-02T17:35:11Z

Summary

val_bpb = 1.0929 (3-seed mean, std 0.0009) | 2.5145 nats | ~15.96 MB | 8xH100 SXM, 600s | No TTT
Adds the MuonEq-R optimizer, depth recurrence (layers 4,5 with a shared MLP), and mixed int5/int6 GPTQ to PR #1218's 4096-vocab high-WD stack
All 3 seeds under 16MB (max: 15,981,324 bytes)
No SLOT, no eval-time adaptation, fully legal

Key Innovations

MuonEq-R — Row-normalizes gradient matrices before Newton-Schulz orthogonalization. Zero-byte cost, ~0.001 BPB improvement.
Depth Recurrence — Layers 4,5 repeated with fully shared MLP weights (zero extra params). ~0.003 BPB improvement.
Mixed Int5/Int6 GPTQ — Hessian sensitivity ranking: 60 int6 + 6 int5 layers for optimal size/quality tradeoff.
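
A minimal sketch of the MuonEq-R idea described above, assuming it amounts to scaling each gradient row to unit L2 norm and then running the usual five-step quintic Newton-Schulz iteration from the public Muon reference code. It is written in NumPy for illustration rather than PyTorch; the function names, the `eps` values, and the toy gradient are hypothetical, and the PR's actual implementation may differ.

```python
import numpy as np

def row_normalize(grad, eps=1e-7):
    """Scale every row of a 2-D gradient to (approximately) unit L2 norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)

def newton_schulz5(mat, steps=5, eps=1e-7):
    """Approximately orthogonalize `mat` with a quintic Newton-Schulz iteration."""
    # Coefficients as used in the public Muon reference code (assumed unchanged here).
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm keeps the spectral norm at or below 1.
    x = mat / (np.linalg.norm(mat) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so the Gram matrix is small
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

def muoneq_r_direction(grad):
    """Row-normalize first, then orthogonalize (the order stated in the PR)."""
    return newton_schulz5(row_normalize(grad))

# Toy gradient whose rows have very different scales (hypothetical data).
rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 16)) * rng.uniform(0.01, 10.0, size=(8, 1))
update = muoneq_r_direction(grad)
singular_values = np.linalg.svd(update, compute_uv=False)
print(singular_values.round(3))
```

The iteration does not converge to exactly orthogonal output; it only pushes the singular values into a band around 1, which is the behavior Muon relies on.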

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	Steps	ms/step	Sliding BPB	val_loss (nats)	Artifact
1337	5,541	106.5	1.0939	2.51667	15,933,457
42	5,530	106.7	1.0922	2.51279	15,981,324
0	5,543	106.5	1.0927	2.51394	15,960,050
Mean	5,538	106.6	1.0929	2.51447	15,958,277
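
The summary statistics can be re-derived from the three per-seed rows above. The sketch below uses only the numbers in the table and the 16,000,000-byte limit named in the test plan; the variable names are illustrative.

```python
import statistics

# Per-seed results copied from the table above.
runs = {
    1337: {"bpb": 1.0939, "val_loss": 2.51667, "bytes": 15_933_457},
    42: {"bpb": 1.0922, "val_loss": 2.51279, "bytes": 15_981_324},
    0: {"bpb": 1.0927, "val_loss": 2.51394, "bytes": 15_960_050},
}
SIZE_LIMIT = 16_000_000  # bytes

bpb = [r["bpb"] for r in runs.values()]
mean_bpb = statistics.mean(bpb)
std_bpb = statistics.stdev(bpb)  # sample standard deviation (n - 1 denominator)
mean_loss = statistics.mean(r["val_loss"] for r in runs.values())
max_bytes = max(r["bytes"] for r in runs.values())
headroom = SIZE_LIMIT - max_bytes

print(f"mean bpb {mean_bpb:.4f}, std {std_bpb:.4f}, mean loss {mean_loss:.5f}")
print(f"largest artifact {max_bytes:,} bytes, headroom {headroom:,} bytes")
```

The reported std of 0.0009 matches the sample standard deviation; the population form would round to 0.0007.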

Changes from PR #1218

	PR #1218	This
val_bpb	1.09785	1.09290 (-0.00495)
val_loss	~2.526 nats	2.514 nats (-0.011)
Optimizer	Muon	MuonEq-R
Depth recurrence	None	Layers 4,5
Mixed quantization	No	60 int6 + 6 int5
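
To illustrate why repeating layers 4,5 with shared weights adds compute but no parameters, here is a toy sketch. Only the recurring layer indices come from the PR; the total depth, the widths, the point at which the repeat happens, and the simple residual MLP block are all assumptions, and attention is omitted entirely.

```python
import numpy as np

NUM_LAYERS = 8         # hypothetical depth; the PR does not state it here
RECUR_LAYERS = (4, 5)  # layers repeated, per the PR
DIM, HIDDEN = 16, 64   # hypothetical widths

def build_schedule(num_layers, recur_layers):
    """Return the layer execution order, replaying the recurrent layers once."""
    schedule = []
    for idx in range(num_layers):
        schedule.append(idx)
        if idx == recur_layers[-1]:
            schedule.extend(recur_layers)  # second pass reuses the same indices
    return schedule

rng = np.random.default_rng(0)
# One weight pair per distinct layer; repeated passes index into the same list.
mlp_weights = [
    (rng.standard_normal((DIM, HIDDEN)) * 0.02,
     rng.standard_normal((HIDDEN, DIM)) * 0.02)
    for _ in range(NUM_LAYERS)
]

def mlp_block(x, w_in, w_out):
    """Residual MLP with a ReLU nonlinearity (a stand-in for the real block)."""
    return x + np.maximum(x @ w_in, 0.0) @ w_out

def forward(x, schedule):
    for layer_idx in schedule:
        x = mlp_block(x, *mlp_weights[layer_idx])
    return x

schedule = build_schedule(NUM_LAYERS, RECUR_LAYERS)
param_count = sum(w_in.size + w_out.size for w_in, w_out in mlp_weights)
out = forward(rng.standard_normal((2, DIM)), schedule)
print(schedule, param_count)
```

The schedule grows from 8 to 10 block applications while `param_count` stays fixed, which is the sense in which the recurrence is free in artifact bytes.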

Credits

@clarkkev for PR #1218 (4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications, val_bpb 1.09785 3-seed mean; the foundation)
@abaybektursun for PR #1019 (AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112, val_bpb 1.11473 3-seed mean; the GPTQ + XSA + BigramHash baseline)
@msisovic for PR #1204 (ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA; the depth recurrence concept)

Test plan

3-seed verification (1337, 42, 0) — all pass artifact + time + score
All seeds under 16,000,000 bytes
Train < 600s, eval < 600s
No TTT, no SLOT, no forbidden techniques
Rule checker passed (log + script)

…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:
- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
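
The "Hessian sensitivity ranking" step can be pictured as a simple allocation: score each quantized tensor, then give the lowest-scoring ones the narrower bit width. The sketch below assumes the 60 + 6 split from the PR and a lower-is-safer score; the scores are random placeholders, the tensor names are hypothetical, and neither the Hessian-based scoring nor GPTQ itself is implemented here.

```python
import random

def assign_bit_widths(sensitivity, num_low_bit, low_bits=5, high_bits=6):
    """Give the `num_low_bit` least sensitive tensors the narrower width."""
    ranked = sorted(sensitivity, key=sensitivity.get)  # least sensitive first
    low = set(ranked[:num_low_bit])
    return {name: (low_bits if name in low else high_bits) for name in sensitivity}

# Placeholder scores for 66 tensors; a real run would derive these from
# Hessian statistics collected during GPTQ calibration (not shown).
random.seed(0)
sensitivity = {f"tensor_{i:02d}": random.random() for i in range(66)}

bits = assign_bit_widths(sensitivity, num_low_bit=6)
int5_names = sorted(name for name, width in bits.items() if width == 5)
print(len(int5_names), "tensors at int5:", int5_names)
```

Under this framing the number of int5 tensors is the knob that trades artifact size against quality.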

mikeapedia · 2026-04-02T20:56:46Z

Great submission @dexhunter! Did you happen to test muon column norm or row+column norm? I found R+C worked the best with the smaller vocab and I am wondering if that holds here as well.

dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-mixedquant
