Record: MuonEq-R + Context-Only SLOT + XSA-all + QK-Gain 5.0 by BiggerDABOSS · Pull Request #1276 · openai/parameter-golf

BiggerDABOSS · 2026-04-03T01:43:28Z

Record Submission: MuonEq-R + Context-Only SLOT + XSA-all + QK-Gain 5.0

Target: ~1.110 val_bpb | 8xH100 SXM | <16 MB artifact

Summary

Four orthogonal improvements stacked on PR #549 (1.1194 BPB):

MuonEq-R (-0.001 BPB): Row-normalizes gradient matrices before Newton-Schulz orthogonalization (arXiv:2603.28254). Equalizes row norms so NS operates on better-conditioned matrices. Zero additional bytes.
Context-Only SLOT (-0.006 BPB): Per-batch additive delta vector (512d) optimized with AdamW on context-only positions during sliding-window eval. Delta is re-initialized to zeros each window. Causal by construction: new tokens are excluded from the optimization loss. (Hu et al., arXiv:2505.12392v2; PR #1217)
XSA all 11 layers (-0.001 BPB): Extended cross-sequence attention from the last 4 layers to all 11. No new parameters. (PR #1019)
QK_GAIN_INIT=5.0 (-0.001 BPB): Increased QK gain from 1.5 to 5.0, following the sweep results in PR #1217.
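
The MuonEq-R item above describes an ordering: equalize row norms first, then orthogonalize. The following is a minimal sketch of that ordering only, not the code in train_gpt.py. It assumes plain Python lists instead of tensors, and it swaps in the classic cubic Newton-Schulz iteration rather than whatever tuned polynomial the submission uses. All function names (`row_normalize`, `newton_schulz`, `muon_eq_r_direction`) are hypothetical.

```python
import math

def row_normalize(grad, eps=1e-8):
    """Scale each row of `grad` to (approximately) unit L2 norm."""
    out = []
    for row in grad:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / (norm + eps) for v in row])
    return out

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def newton_schulz(mat, steps=30):
    """Classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 (X X^T) X.

    Dividing by the Frobenius norm first keeps every singular value in [0, 1],
    which lies inside the iteration's convergence region.
    """
    fro = math.sqrt(sum(v * v for row in mat for v in row))
    x = [[v / fro for v in row] for row in mat]
    for _ in range(steps):
        xt = [list(col) for col in zip(*x)]
        gram_x = matmul(matmul(x, xt), x)  # (X X^T) X
        x = [[1.5 * xv - 0.5 * gv for xv, gv in zip(xr, gr)]
             for xr, gr in zip(x, gram_x)]
    return x

def muon_eq_r_direction(grad):
    """Row-normalize, then orthogonalize (the order described in the summary)."""
    return newton_schulz(row_normalize(grad))

# Toy gradient whose two rows differ in norm by more than 10x (5.0 vs 0.3).
grad = [[3.0, 0.0, 4.0], [0.1, 0.2, 0.2]]
update = muon_eq_r_direction(grad)
```

After row normalization both rows of the toy gradient have norm 1, so the small-norm row contributes as much to the orthogonalized update as the large-norm row does.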

Architecture (PR #549 stack)

Component	Setting
Layers	11 (512d, 8H, 4KV)
MLP	3x with LeakyReLU(0.5)^2
BigramHash	1536
XSA	All 11 layers
RoPE	Partial (16/64 dims)
LN Scale	1/sqrt(layer+1)
VE128	Layers 9-10
QK Gain	5.0
Weight avg	EMA(0.997) + Tight SWA(every 50)
Quantization	GPTQ-lite int6 + lzma
Optimizer	MuonEq-R + Parallel Muon
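
Three rows of the table above are compact formulas, and a small sketch may make them easier to read. It rests on assumptions: `layer` is taken to be 0-indexed, "LeakyReLU(0.5)^2" is read as a leaky ReLU with negative slope 0.5 followed by squaring, and EMA(0.997) is read as the standard exponential moving average with decay 0.997. The function names are hypothetical and not taken from train_gpt.py.

```python
import math

NUM_LAYERS = 11  # from the "Layers" row

def ln_scale(layer: int) -> float:
    """Per-layer scale 1/sqrt(layer+1); `layer` assumed 0-indexed."""
    return 1.0 / math.sqrt(layer + 1)

def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU with slope 0.5 on the negative side, then squared (assumed reading)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y

def ema_update(ema: float, weight: float, decay: float = 0.997) -> float:
    """One standard EMA step for a single scalar weight."""
    return decay * ema + (1.0 - decay) * weight

# Scale applied at each of the 11 layers under the 0-indexed assumption.
scales = [ln_scale(layer) for layer in range(NUM_LAYERS)]
```

Under these assumptions the first layer is unscaled (1.0) and the last layer is scaled by 1/sqrt(11).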

Legality

MuonEq-R: standard optimizer improvement
Context-Only SLOT: causal — delta optimized on past tokens only, new tokens excluded from loss
XSA-all: no new parameters, architectural choice
QK_GAIN=5.0: hyperparameter choice
Score-first TTT follows the legal protocol from PR #461
No n-gram cache, no two-pass rescoring, no eval-time GPTQ
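
The causality claim for Context-Only SLOT comes down to which positions feed the delta's optimization loss. The following is a minimal sketch of that split, assuming each sliding window scores only its last `stride` positions and treats everything before them as already-seen context. `slot_masks`, `masked_mean`, `window_len`, and `stride` are hypothetical names, and the AdamW update of the delta itself is omitted.

```python
def slot_masks(window_len: int, stride: int):
    """Return (optimize_mask, score_mask) for one sliding window.

    optimize_mask selects context positions (used to fit the delta);
    score_mask selects the new positions (used only for the reported loss).
    """
    if not 0 < stride <= window_len:
        raise ValueError("stride must be in 1..window_len")
    num_context = window_len - stride
    optimize_mask = [i < num_context for i in range(window_len)]
    score_mask = [i >= num_context for i in range(window_len)]
    return optimize_mask, score_mask

def masked_mean(losses, mask):
    """Mean of per-position losses where mask is True (0.0 if none selected)."""
    picked = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(picked) / len(picked) if picked else 0.0

# Toy window: 6 context positions followed by 2 new positions.
optimize_mask, score_mask = slot_masks(window_len=8, stride=2)
per_position_loss = [1.0] * 6 + [100.0] * 2
slot_objective = masked_mean(per_position_loss, optimize_mask)
```

In the toy window the two new positions carry a loss of 100.0, yet the objective the delta would be fitted on stays at the context-only mean, which is the property the legality note relies on.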

Credits

Base model + TTT: PR #549 (Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon; val_bpb 1.1194, 3-seed mean; @abaybektursun), PR #414 (Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15; val_bpb 1.1233; @signalrush), PR #461 (Non-record: 11L Depth Recurrence + High-Yield Legal TTT; 1.14458 BPB; @Christopher-Lee-McClendon)
MuonEq-R: arXiv:2603.28254, PR #1260 (Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ; val_bpb 1.0929, 3-seed mean)
SLOT: Hu et al., arXiv:2505.12392v2, PR #1217 (Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0; val_bpb 1.1027, 3-seed mean; @dexhunter)
QK-Gain sweep: PR #1217
XSA-all: PR #1019 (Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112; val_bpb 1.11473, 3-seed mean)

Files

README.md — detailed description
submission.json — metadata
train_gpt.py — full training + eval script
run.sh — launch script with all env vars

Built on the PR openai#549 stack (1.1194 BPB). Adds MuonEq-R optimizer (row-normalize before Newton-Schulz), Context-Only SLOT (causal per-window delta optimization on past tokens), XSA on all 11 layers (was 4), and QK_GAIN_INIT=5.0. Expected ~1.110 BPB on 8xH100 SXM. Made-with: Cursor

BiggerDABOSS force-pushed the submission/muoneqr-slot-xsa11-qkgain5 branch from 9c4d846 to 73de0be on April 3, 2026 01:44

BiggerDABOSS wants to merge 1 commit into openai:main from BiggerDABOSS:submission/muoneqr-slot-xsa11-qkgain5