
Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)#1130

Open
Gusanidas wants to merge 2 commits into openai:main from Gusanidas:alejandro/ksv2-improved-2-clean

Conversation


@Gusanidas commented Mar 30, 2026

Record: Kitchen Sink V2 — val_bpb 1.1140 (12-seed mean, std 0.0005)

val_bpb: 1.1140 | val_loss: 1.8809 nats | ~15.88 MB | 8×H100 SXM | No TTT

Built on PR #549 by @abaybektursun. 12-seed validation, all artifacts under 16,000,000 bytes, all training under 600s.

Results (12 seeds, sliding window eval, stride=64)

| Seed | val_loss (nats) | val_bpb | Artifact (bytes) |
|---|---|---|---|
| 2 | 1.8793 | 1.1130 | 15,869,516 |
| 9999 | 1.8800 | 1.1134 | 15,784,368 |
| 22 | 1.8801 | 1.1135 | 15,856,224 |
| 7 | 1.8807 | 1.1139 | 15,745,368 |
| 1337 | 1.8808 | 1.1139 | 15,806,284 |
| 2222 | 1.8807 | 1.1139 | 15,689,632 |
| 99 | 1.8808 | 1.1139 | 15,872,092 |
| 77 | 1.8815 | 1.1143 | 15,723,072 |
| 2026 | 1.8814 | 1.1143 | 15,751,888 |
| 42 | 1.8817 | 1.1145 | 15,736,768 |
| 777 | 1.8818 | 1.1145 | 15,884,408 |
| 222 | 1.8820 | 1.1147 | 15,734,064 |
| **Mean** | 1.8809 | 1.1140 | |
| **Std** | 0.0008 | 0.0005 | |

Statistical significance vs SOTA (PR #549, 1.8843 nats)

  • Δ = 0.0091 nats (threshold: 0.005)
  • Welch t-test: p < 0.0001
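The Welch t-test behind the p-value above can be sketched in pure Python. This is a minimal illustration, not the repo's actual analysis script; the baseline per-seed losses from PR #549 are not listed in this PR, so only this PR's 12 seed losses are shown, and a comparison requires the baseline's per-seed values:

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# The 12 per-seed val losses (nats) from the results table.
ours = [1.8793, 1.8800, 1.8801, 1.8807, 1.8808, 1.8807,
        1.8808, 1.8815, 1.8814, 1.8817, 1.8818, 1.8820]
```

With the baseline's per-seed losses as the second sample, `welch_t` gives the statistic that a t-distribution CDF then converts to the reported p-value.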

What's new (over PR #549)

  1. Residual lambdas — learnable per-sublayer residual scaling (init √1.1 ≈ 1.049, 5× scalar LR, no WD). Creates exponential recency bias across layers. From modded-nanogpt; novel in parameter-golf.
  2. Split early/late LR banks — layers 0–5 and 6–10 get separate Muon/Adam learning rates (matrix: 0.036/0.044, scalar: 0.028/0.018). Later layers benefit from higher LR.
  3. Train-data GPTQ within training budget — reserves 14s of the 600s budget for Hessian collection + Cholesky error compensation. Unambiguously legal (PR #1060 approach).
  4. Coprime-stride data loader — multi-shard sampling with coprime-stride block traversal for batch diversity (PR #726 / PR #1060 style).
  5. Bigger BigramHash — 6144 buckets (up from 1536), reducing hash collision ratio.
  6. Bigger Value Embeddings — dim=196 on layers 5,9,10 (up from dim=128 on layers 9,10).
  7. XSA on last 7 layers (up from 4).
  8. MiLe margin loss — entropy-weighted cross-entropy (gamma=0.75), disabled during warmdown.
  9. Cache + backout — layer 7 hidden state cached, subtracted via learnable gate before LM head.
  10. Flash Attention 3 via flash_attn_interface.
  11. No TTT — sliding window eval only (~98s), leaving eval budget unused.
  12. Tuned batch size — TRAIN_BATCH_TOKENS=548,864
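Item 4's traversal can be sketched in a few lines: a stride coprime with the block count is guaranteed to visit every block exactly once per pass while keeping consecutive batches far apart in the file. The stride and offset selection rules below are hypothetical stand-ins, not the PR's actual loader logic:

```python
import math

def coprime_stride_order(n_blocks, seed):
    """Return a permutation of block indices 0..n_blocks-1 produced by a
    stride coprime with n_blocks: every block is visited exactly once,
    and consecutive visits land at distant file offsets."""
    # Seed-dependent stride, bumped until coprime with n_blocks
    # (illustrative rule; the PR's selection may differ).
    stride = seed % n_blocks or 1
    while math.gcd(stride, n_blocks) != 1:
        stride += 1
    start = (seed * 7919) % n_blocks  # arbitrary seed-dependent offset
    return [(start + i * stride) % n_blocks for i in range(n_blocks)]
```

Coprimality is what makes this a permutation: the map `i -> (start + i*stride) mod n` is a bijection on `0..n-1` exactly when `gcd(stride, n) == 1`.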
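Item 5 trades artifact bytes for a lower collision ratio in the bigram table. A toy sketch of bigram bucketing follows; the multiplier and bit mixing here are illustrative assumptions, not the PR's actual hash:

```python
from collections import Counter

def bigram_bucket(prev_tok, cur_tok, n_buckets=6144):
    """Hash a (previous, current) token pair into one of n_buckets
    embedding slots (illustrative hash, not the PR's)."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16  # mix high bits into low bits before the modulo
    return h % n_buckets

def collision_ratio(pairs, n_buckets=6144):
    """Fraction of distinct bigrams that share a bucket with another bigram."""
    counts = Counter(bigram_bucket(p, c, n_buckets) for p, c in pairs)
    collided = sum(v for v in counts.values() if v > 1)
    return collided / len(pairs)
```

Raising the bucket count from 1536 to 6144 lowers the expected load per bucket 4x, which is where the reduced collision ratio comes from.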

Architecture

| Component | Setting |
|---|---|
| Layers | 11 (512d, 8H, 4KV) |
| MLP | 3× with LeakyReLU(0.5)² |
| BigramHash | 6144 buckets |
| XSA | Last 7 layers |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE196 | Layers 5, 9, 10 |
| Residual lambdas | Per-sublayer, init √1.1 |
| Cache + backout | Layer 7, learnable λ |
| Weight avg | EMA(0.997) + SWA(every 50) |
| Quantization | Full Hessian GPTQ int6 + LZMA |
| Optimizer | Parallel Muon (split early/late LR) |
| Late QAT | STE at lr_scale < 0.15 |
| Params | 27,605,108 |
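The residual-lambda setting can be unpacked with a small sketch: under the update x ← λ·x + sublayer(x), the output of sublayer k reaches the head scaled by the product of all lambdas applied after it, so a shared init of √1.1 per sublayer produces an exponential weighting across depth. The sublayer count below is illustrative; in training the lambdas are learnable per-sublayer scalars, not a fixed constant:

```python
import math

def residual_contribution_weights(lambdas):
    """Weight of each sublayer's output in the final residual stream under
    x <- lam * x + sublayer(x): sublayer k's output is scaled by the
    product of all lambdas applied after it."""
    n = len(lambdas)
    weights = []
    for k in range(n):
        w = 1.0
        for lam in lambdas[k + 1:]:
            w *= lam
        weights.append(w)
    return weights

init = math.sqrt(1.1)  # ≈ 1.0488, the per-sublayer init from this PR
lams = [init] * 22     # e.g. 11 layers × (attn + MLP); count is illustrative
w = residual_contribution_weights(lams)
```

With λ > 1 the weighting across sublayers spans a factor of λ^(n−1) between the first and last contribution, which is the exponential depth bias the PR description refers to.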

Timing

| Phase | Time |
|---|---|
| Training (incl. 14s GPTQ calibration) | 586s + 14s = 600s |
| Sliding window eval (stride=64) | ~98s |
| Total eval | ~98s |

Credits

Gusanidas and others added 2 commits March 30, 2026 10:42
PR openai#549 / KitchenSinkV2 base with:
- Residual lambdas: learnable per-sublayer scaling (init sqrt(1.1), 5x LR)
- Bigram hash: 6144 buckets (up from 2048)
- Value embeddings: dim=196 on layers 5,9,10
- Flash Attention 3 via flash_attn_interface
- Train-data GPTQ int6 calibration within training budget
- Sliding window eval stride=64
- Optuna-tuned LRs: matrix 0.036/0.044, scalar 0.028/0.018

12 seeds: mean 1.1140 bpb (1.8809 nats), std 0.0005
Improvement over leader: 0.0054 bpb / 0.0091 nats
p < 0.0001 for >= 0.005 nats improvement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
val_bpb: 1.1161 | val_loss: 1.884 nats | ~15.3 MB | 8×H100 SXM | Legal TTT

Seeds: 42=1.1163, 1337=1.1160, 2024=1.1161 | Mean=1.1161, Std=0.0001

Novel contribution: EGGROLL Antithetic Ternary Bin Search — post-GPTQ
quantization refinement that directly optimizes INT6 bin assignments
against BPB loss during eval. Zeroth-order, strictly additive (cannot
degrade quality), complementary to Hessian-based GPTQ.

Also adds missing TTT call to PR openai#1130's eval pipeline.

Built on PR openai#1130 by @Gusanidas (Kitchen Sink V2)
Foundation: PR openai#549 by @abaybektursun

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
… BPB)

ResidLambdas: per-sublayer residual scaling (init sqrt(1.1), 5x scalar_lr, no WD)
Tuned LRs: MATRIX_LR=0.036, SCALAR_LR=0.028, TIED_EMBED_LR=0.022
Bigger VE: dim=196 on layers 5,9,10 (was dim=128 on layers 9,10)
PR openai#1130 achieved 1.1140 (12-seed mean) with these innovations.