Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean) by Gusanidas · Pull Request #1130 · openai/parameter-golf

Gusanidas · 2026-03-30T11:03:22Z

Record: Kitchen Sink V2 — val_bpb 1.1140 (12-seed mean, std 0.0005)

val_bpb: 1.1140 | val_loss: 1.8809 nats | ~15.88 MB | 8×H100 SXM | No TTT

Built on PR #549 by @abaybektursun. 12-seed validation, all artifacts under 16,000,000 bytes, all training under 600s.

Results (12 seeds, sliding window eval, stride=64)

Seed	val_loss (nats)	val_bpb	Artifact (bytes)
2	1.8793	1.1130	15,869,516
9999	1.8800	1.1134	15,784,368
22	1.8801	1.1135	15,856,224
7	1.8807	1.1139	15,745,368
1337	1.8808	1.1139	15,806,284
2222	1.8807	1.1139	15,689,632
99	1.8808	1.1139	15,872,092
77	1.8815	1.1143	15,723,072
2026	1.8814	1.1143	15,751,888
42	1.8817	1.1145	15,736,768
777	1.8818	1.1145	15,884,408
222	1.8820	1.1147	15,734,064
Mean	1.8809	1.1140
Std	0.0008	0.0005

Statistical significance vs SOTA (PR #549, 1.8843 nats)

Δ = 0.0091 nats (threshold: 0.005)
Welch t-test: p < 0.0001

What's new (over PR #549)

Residual lambdas — learnable per-sublayer residual scaling (init √1.1 ≈ 1.049, 5× scalar LR, no WD). Creates exponential recency bias across layers. From modded-nanogpt; novel in parameter-golf.
Split early/late LR banks — layers 0–5 and 6–10 get separate Muon/Adam learning rates (matrix: 0.036/0.044, scalar: 0.028/0.018). Later layers benefit from higher LR.
Train-data GPTQ within training budget — reserves 14s from 600s for Hessian collection + Cholesky error compensation. Unambiguously legal (PR Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean) #1060 approach).
Coprime-stride data loader — multi-shard sampling with coprime-stride block traversal for batch diversity (PR Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon #726 / Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean) #1060 style).
Bigger BigramHash — 6144 buckets (up from 1536), reducing hash collision ratio.
Bigger Value Embeddings — dim=196 on layers 5,9,10 (up from dim=128 on layers 9,10).
XSA on last 7 layers (up from 4).
MiLe margin loss — entropy-weighted cross-entropy (gamma=0.75), disabled during warmdown.
Cache + backout — layer 7 hidden state cached, subtracted via learnable gate before LM head.
Flash Attention 3 via flash_attn_interface.
No TTT — sliding window eval only (~98s), leaving eval budget unused.
12 Tuned batch size — TRAIN_BATCH_TOKENS=548,864

Architecture

Component	Setting
Layers	11 (512d, 8H, 4KV)
MLP	3× with LeakyReLU(0.5)²
BigramHash	6144 buckets
XSA	Last 7 layers
RoPE	Partial (16/64 dims)
LN Scale	1/√(layer+1)
VE196	Layers 5, 9, 10
Residual lambdas	Per-sublayer, init √1.1
Cache + backout	Layer 7, learnable λ
Weight avg	EMA(0.997) + SWA(every 50)
Quantization	Full Hessian GPTQ int6 + LZMA
Optimizer	Parallel Muon (split early/late LR)
Late QAT	STE at lr_scale < 0.15
Params	27,605,108

Timing

Phase	Time
Training (incl. 14s GPTQ calibration)	586s + 14s = 600s
Sliding window eval (stride=64)	~98s
Total eval	~98s

Credits

Base model (11L 512d XSA EMA): PR #414 by @signalrush
LeakyReLU² activation: PR #493 by @parinzee, PR #518 by @sofiabod
Parallel Muon + Parameter Banking: PR #399 by @abaybektursun
XSA (Exclusive Self-Attention): PR #287 by @jfprincz, PR #478 by @gowtham0992
Partial RoPE + LN Scale: PR #315 by @jfprincz
SmearGate: PR #65 by @aquariouseworkman
BigramHash: PR #162 by @raahilshah
Value Embeddings: PR #374 by @unnir
EMA weight averaging: PR #401 by @newjordan
Late QAT: PR #286 by @chris-buckley
Flash Attention 3: PR #122 by @mtybadger
Coprime-stride data loader: PR #726 by @DeepReinforce, PR #1060 by @dexhunter
Full Hessian GPTQ: PR #535 by @raahilshah
Residual lambdas: modded-nanogpt by @KellerJordan
SOTA stack integration: PR #549 by @abaybektursun

PR openai#549 / KitchenSinkV2 base with: - Residual lambdas: learnable per-sublayer scaling (init sqrt(1.1), 5x LR) - Bigram hash: 6144 buckets (up from 2048) - Value embeddings: dim=196 on layers 5,9,10 - Flash Attention 3 via flash_attn_interface - Train-data GPTQ int6 calibration within training budget - Sliding window eval stride=64 - Optuna-tuned LRs: matrix 0.036/0.044, scalar 0.028/0.018 12 seeds: mean 1.1140 bpb (1.8809 nats), std 0.0005 Improvement over leader: 0.0054 bpb / 0.0091 nats p < 0.0001 for >= 0.005 nats improvement Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@Gusanidas

val_bpb: 1.1161 | val_loss: 1.884 nats | ~15.3 MB | 8×H100 SXM | Legal TTT Seeds: 42=1.1163, 1337=1.1160, 2024=1.1161 | Mean=1.1161, Std=0.0001 Novel contribution: EGGROLL Antithetic Ternary Bin Search — post-GPTQ quantization refinement that directly optimizes INT6 bin assignments against BPB loss during eval. Zeroth-order, strictly additive (cannot degrade quality), complementary to Hessian-based GPTQ. Also adds missing TTT call to PR openai#1130's eval pipeline. Built on PR openai#1130 by @Gusanidas (Kitchen Sink V2) Foundation: PR openai#549 by @abaybektursun Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… BPB) ResidLambdas: per-sublayer residual scaling (init sqrt(1.1), 5x scalar_lr, no WD) Tuned LRs: MATRIX_LR=0.036, SCALAR_LR=0.028, TIED_EMBED_LR=0.022 Bigger VE: dim=196 on layers 5,9,10 (was dim=128 on layers 9,10) PR openai#1130 achieved 1.1140 (12-seed mean) with these innovations.

@Gusanidas

val_bpb: 1.1161 | val_loss: 1.884 nats | ~15.3 MB | 8×H100 SXM | Legal TTT Seeds: 42=1.1163, 1337=1.1160, 2024=1.1161 | Mean=1.1161, Std=0.0001 Novel: EGGROLL Antithetic Ternary Bin Search — post-GPTQ bin refinement Also: adds missing TTT call to PR openai#1130 eval pipeline Built on PR openai#1130 by @Gusanidas, PR openai#549 by @abaybektursun Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@Gusanidas

val_bpb: 1.1161 | val_loss: 1.884 nats | ~15.3 MB | 8×H100 SXM | Legal TTT Seeds: 42=1.1163, 1337=1.1160, 2024=1.1161 | Mean=1.1161, Std=0.0001 Novel: EGGROLL Antithetic Ternary Bin Search — post-GPTQ bin refinement Also: adds missing TTT call to PR openai#1130 eval pipeline Built on PR openai#1130 by @Gusanidas, PR openai#549 by @abaybektursun Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-04-11T20:05:28Z

Community Review — Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)

BPB: 1.1140 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 521f1480ef87, file records/track_10min_16mb/2026-03-29_KitchenSinkV2/train_gpt.py):

The TTT path at line 1311 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.11s, dim=512, layers=11, vocab=1024, code=126292 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.11s, dim=512, layers=11, vocab=1024, code=126292 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Gusanidas and others added 2 commits March 30, 2026 10:42

Fix README: LeakyReLU squared, credit PR openai#1060 for GPTQ

521f148

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

haikosys mentioned this pull request Mar 30, 2026

Record: EGGROLL v2 — val_bpb 1.1161 (3-seed mean, std 0.0001) #1156

Open

notapplica mentioned this pull request Mar 31, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

Gusanidas added a commit to Gusanidas/parameter-golf that referenced this pull request Apr 1, 2026

Fix base PR reference: openai#1130 not openai#1179

2fc09fc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gusanidas mentioned this pull request Apr 1, 2026

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) #1212

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)#1130

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)#1130
Gusanidas wants to merge 2 commits intoopenai:mainfrom
Gusanidas:alejandro/ksv2-improved-2-clean

Gusanidas commented Mar 30, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Gusanidas commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: Kitchen Sink V2 — val_bpb 1.1140 (12-seed mean, std 0.0005)

Results (12 seeds, sliding window eval, stride=64)

Statistical significance vs SOTA (PR #549, 1.8843 nats)

What's new (over PR #549)

Architecture

Timing

Credits

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Gusanidas commented Mar 30, 2026 •

edited

Loading