Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) by dexhunter · Pull Request #1260 · openai/parameter-golf

dexhunter · 2026-04-02T17:35:11Z

Summary

val_bpb = 1.0929 (3-seed mean, std 0.0009) | 2.5145 nats | ~15.96 MB | 8xH100 SXM, 600s | No TTT
Adds MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP), and mixed int5/int6 GPTQ to PR Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) #1218's 4096-vocab high-WD stack
All 3 seeds under 16MB (max: 15,981,324 bytes)
No SLOT, no eval-time adaptation, fully legal

Key Innovations

MuonEq-R — Row-normalizes gradient matrices before Newton-Schulz orthogonalization. Zero-byte cost, ~0.001 BPB improvement.
Depth Recurrence — Layers 4,5 repeated with fully shared MLP weights (zero extra params). ~0.003 BPP improvement.
Mixed Int5/Int6 GPTQ — Hessian sensitivity ranking: 60 int6 + 6 int5 layers for optimal size/quality tradeoff.

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	Steps	ms/step	Sliding BPB	val_loss (nats)	Artifact
1337	5,541	106.5	1.0939	2.51667	15,933,457
42	5,530	106.7	1.0922	2.51279	15,981,324
0	5,543	106.5	1.0927	2.51394	15,960,050
Mean	5,538	106.6	1.0929	2.51447	15,958,277

Changes from PR #1218

	PR #1218	This
val_bpb	1.09785	1.09290 (-0.00495)
val_loss	~2.526 nats	2.514 nats (-0.011)
Optimizer	Muon	MuonEq-R
Depth recurrence	None	Layers 4,5
Mixed quantization	No	60 int6 + 6 int5

Credits

@clarkkev for PR Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) #1218 (4096-Vocab + MLP 4x + WD 0.085 — the foundation)
@abaybektursun for PR Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019 (GPTQ + XSA + BigramHash baseline)
@msisovic for PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204 (depth recurrence concept)

Test plan

3-seed verification (1337, 42, 0) — all pass artifact + time + score
All seeds under 16,000,000 bytes
Train < 600s, eval < 600s
No TTT, no SLOT, no forbidden techniques
Rule checker passed (log + script)

…1.0929 (3-seed mean) Adds three techniques to PR openai#1218's 4096-vocab high-WD stack: - MuonEq-R optimizer (row-norm before NS5 orthogonalization) - Depth recurrence on layers 4,5 (shared MLP, zero extra params) - Mixed int5/int6 GPTQ via Hessian sensitivity ranking 3-seed mean: 1.0929 BPB / 2.5145 nats All seeds under 16MB (max: 15,981,324 bytes) No TTT, no SLOT, no eval-time adaptation.

mikeapedia · 2026-04-02T20:56:46Z

Great submission @dexhunter! Did you happen to test muon column norm or row+column norm? I found R+C worked the best with the smaller vocab and I am wondering if that holds here as well.

Approaches revamped (old eval-only approaches removed): - 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors) - 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability) - 03: SVD + Quantized Factors (13 layers via spectral compression) - 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation) - 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min) Unmerged PR research saved to unmerged_runs/: - PR openai#1263: SLOT (0.9354 BPB, legality contested) - PR openai#1246: Trinity Ternary (0.9650 BPB) - PR openai#1241: MDLM Diffusion (0.9901 BPB) - PR openai#1252: WARP (1.0713 BPP) - PR openai#1257: Complement Training (1.0855 BPB) - PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB) - PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB) - PR openai#1254: XSA + LoRA TTT (1.1070 BPB) Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Row-normalize the gradient update before Newton-Schulz orthogonalization. From PR openai#1260: ~0.001 BPB free improvement, zero extra parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@clarkkev

… (3-seed mean) Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer) with a smaller mini runner (21,396 bytes) that creates enough headroom. 3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7) All seeds under 16MB (max: 15,996,591 bytes) No TTT, no SLOT, no eval-time adaptation. Techniques: MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP), 61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression. Built on PR openai#1218 by @clarkkev.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

…on for Muon optimizer From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration" (Mar 2026). Used in 40+ openai/parameter-golf PRs, top record PR openai#1260 = val_bpb 1.0929 (3-seed mean). Inserts row normalization between Patch 17 Mousse block and Newton-Schulz: row_norm[i] = sqrt(sum_j G[i,j]^2) G[i,j] = G[i,j] / row_norm[i] Distinct from Mousse: Mousse is row+col (G/||row||/||col||), MuonEq-R is row-only (G/||row||). They can stack independently. Gated by USE_MUONEQ_R=1, falls back gracefully when unset. 4 MR experiments queued for validation: MR0_alone, MR1_plus_leaky_ng, MR2_seed42, MR3_mousse_plus_muoneqr This is the second optimizer-side patch in two fires. Both patches fit our train_loss metric so they can validate on cheap GPU loop without H100 escalation. If either lands within champion noise band 3.27-3.30, defensible ship for final stack. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

MatoTeziTanka · 2026-04-11T20:04:31Z

Community Review — Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

SyntaxError: f-string: expecting '}' (line 563)

A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

PEP 701 f-string nesting — e.g. log(f" {cat}: {", ".join(...)}") is valid Python 3.12+ but invalid Python 3.10 because the inner ", " re-enters the outer double-quote context. One-character fix: ', ' instead of ", ". See PR Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean) #1541 / Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523 for reference.
Missing flash_attn variants — e.g. from flash_attn_interface import flash_attn_varlen_func when the wrapper script only stubs flash_attn_func. Not a PR defect on H100s, but the eval image / CPU preflight path needs a guarded import.
Local compiled extension — e.g. import cutlass_evt_fusion from a records/*/cutlass_evt_fusion/ subfolder that isn't on the import path at smoke time. Usually an import-order issue inside the script.
Actual syntax error — typo, missing bracket, etc.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: f-string: expecting '}' (line 563). Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

BiggerDABOSS mentioned this pull request Apr 3, 2026

Record: MuonEq-R + Context-Only SLOT + XSA-all + QK-Gain 5.0 #1276

Open

dexhunter mentioned this pull request Apr 3, 2026

Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean) #1279

Open

5 tasks

dexhunter mentioned this pull request Apr 3, 2026

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean) #1285

Merged

4 tasks

This was referenced Apr 4, 2026

Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean) #1338

Closed

Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean) #1339

Open

Rome-1 mentioned this pull request Apr 5, 2026

Non-record: GPTQ-lite Scale Clamp Fix + 6-bit Packing + Depth Recurrence on Stack B #1389

Open

Its-Just-Crump mentioned this pull request Apr 5, 2026

Add record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli (1.1020 BPB) #1392

Open

bigbag mentioned this pull request Apr 6, 2026

Record: SP4096 + 3-Layer Recurrence + GPTQ Embeddings + SDClip + ETLB — val_bpb 1.0913 (3-seed mean) #1415

Open

4 tasks

aryanbhosale mentioned this pull request Apr 6, 2026

Record: SP8192 + Pre-Quant TTT + QK-Gain 5.0 + Depth Recurrence + MuonEq-R — val_bpb 1.0791 (3-seed mean) #1423

Open

This was referenced Apr 7, 2026

13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence #1426

Closed

13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence #1429

Open

Mertyandimata mentioned this pull request Apr 7, 2026

[Submission] EngramLite + Mousse + Progressive Depth Recurrence + TTT — val_bpb 1.1026 | 15.95MB | 8×H100 #1440

Open

aryanbhosale mentioned this pull request Apr 8, 2026

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean) #1477

Merged

MatoTeziTanka mentioned this pull request Apr 11, 2026

Record: SLOT-32 + Partial Depth Recurrence — val_bpb 0.7736 (3-seed mean) #1278

Open

tashapais mentioned this pull request Apr 16, 2026

SP8192 + 4-Layer Depth Recurrence (loop_end=6) #1678

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)#1260

Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)#1260
dexhunter wants to merge 1 commit intoopenai:mainfrom
dexhunter:muoneqr-recurrence-mixedquant

dexhunter commented Apr 2, 2026

Uh oh!

mikeapedia commented Apr 2, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dexhunter commented Apr 2, 2026

Summary

Key Innovations

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Changes from PR #1218

Credits

Test plan

Uh oh!

mikeapedia commented Apr 2, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Community Review — Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MatoTeziTanka commented Apr 11, 2026 •

edited

Loading