Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean) by dexhunter · Pull Request #1285 · openai/parameter-golf

dexhunter · 2026-04-03T05:33:56Z

Summary

val_bpb = 1.0912 (3-seed mean, std 0.0009) | 2.5106 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
WD-quantization synergy: higher weight decay (0.090) compresses 5% better, allowing ALL 66 layers at int6
All seeds under 16MB with 32K+ margins
No SLOT, no TTT, no eval-time adaptation, fully legal

Key Innovation: WD-Quantization Synergy

Higher WD (0.090 vs 0.085) → smaller weights → 5% better brotli compression → enough headroom for ALL 66 layers at int6 precision. The quantization quality gain exceeds the WD BPP cost:

Config	WD	N_INT6	Artifact	val_bpb (s42)
PR #1260	0.085	60	15,981K	1.09217
PR #1279	0.085	61	15,997K	1.09170
This	0.090	66	15,967K	1.09057

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	Steps	ms/step	Sliding BPB	val_loss (nats)	Artifact
42	5,540	106.5	1.0906	2.50910	15,967,483
0	5,536	106.6	1.0908	2.50973	15,962,242
1337	5,538	106.6	1.0923	2.51309	15,959,253
Mean	5,538	106.6	1.0912	2.51064	15,962,993

Changes from PR #1218

	PR #1218	This
val_bpb	1.09785	1.09124 (-0.00661)
Weight decay	0.085	0.090
Optimizer	Muon	MuonEq-R
Depth recurrence	None	Layers 4,5 repeated
Quantization	Mixed	All int6 (66/66)

Credits

@clarkkev for PR Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) #1218 (4096-Vocab + high-WD architecture)
@abaybektursun for PR Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019 (GPTQ + XSA + BigramHash baseline)
@msisovic for PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204 (depth recurrence concept)
@dexhunter for PR Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) #1260/Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean) #1279 (MuonEq-R + recurrence + mixed quant)

Test plan

3-seed verification (42, 0, 1337) — all pass
All under 16MB (min margin: 32,517)
4-seed tested (seed 7 also fits at 15,970,676)
No TTT, no SLOT

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

This was referenced Apr 3, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) #1296

Open

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)#1285

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)#1285
dexhunter wants to merge 1 commit intoopenai:mainfrom
dexhunter:muoneqr-recurrence-wd090-allint6

dexhunter commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dexhunter commented Apr 3, 2026

Summary

Key Innovation: WD-Quantization Synergy

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Changes from PR #1218

Credits

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant