
Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)#1130

Open
Gusanidas wants to merge 2 commits into openai:main from Gusanidas:alejandro/ksv2-improved-2-clean

Conversation


@Gusanidas commented Mar 30, 2026

Record: Kitchen Sink V2 — val_bpb 1.1140 (12-seed mean, std 0.0005)

val_bpb: 1.1140 | val_loss: 1.8809 nats | ~15.88 MB | 8×H100 SXM | No TTT

Built on PR #549 by @abaybektursun. 12-seed validation, all artifacts under 16,000,000 bytes, all training under 600s.

Results (12 seeds, sliding window eval, stride=64)

| Seed | val_loss (nats) | val_bpb | Artifact (bytes) |
|---|---|---|---|
| 2 | 1.8793 | 1.1130 | 15,869,516 |
| 9999 | 1.8800 | 1.1134 | 15,784,368 |
| 22 | 1.8801 | 1.1135 | 15,856,224 |
| 7 | 1.8807 | 1.1139 | 15,745,368 |
| 1337 | 1.8808 | 1.1139 | 15,806,284 |
| 2222 | 1.8807 | 1.1139 | 15,689,632 |
| 99 | 1.8808 | 1.1139 | 15,872,092 |
| 77 | 1.8815 | 1.1143 | 15,723,072 |
| 2026 | 1.8814 | 1.1143 | 15,751,888 |
| 42 | 1.8817 | 1.1145 | 15,736,768 |
| 777 | 1.8818 | 1.1145 | 15,884,408 |
| 222 | 1.8820 | 1.1147 | 15,734,064 |
| **Mean** | 1.8809 | 1.1140 | |
| **Std** | 0.0008 | 0.0005 | |

Statistical significance vs SOTA (PR #549, 1.8843 nats)

  • Δ = 0.0091 nats (threshold: 0.005)
  • Welch t-test: p < 0.0001
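The Welch t-test behind the p-value above can be sketched in pure Python. This is a minimal illustration, not the repo's actual analysis script; the baseline per-seed losses from PR #549 are not listed in this PR, so only this PR's 12 seed losses are shown, and a comparison requires the baseline's per-seed values:

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# The 12 per-seed val losses (nats) from the results table.
ours = [1.8793, 1.8800, 1.8801, 1.8807, 1.8808, 1.8807,
        1.8808, 1.8815, 1.8814, 1.8817, 1.8818, 1.8820]
```

With the baseline's per-seed losses as the second sample, `welch_t` gives the statistic that a t-distribution CDF then converts to the reported p-value.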

What's new (over PR #549)

  1. Residual lambdas — learnable per-sublayer residual scaling (init √1.1 ≈ 1.049, 5× scalar LR, no WD). Creates exponential recency bias across layers. From modded-nanogpt; novel in parameter-golf.
  2. Split early/late LR banks — layers 0–5 and 6–10 get separate Muon/Adam learning rates (matrix: 0.036/0.044, scalar: 0.028/0.018). Later layers benefit from higher LR.
  3. Train-data GPTQ within training budget — reserves 14s of the 600s budget for Hessian collection + Cholesky error compensation. Unambiguously legal (PR #1060 approach).
  4. Coprime-stride data loader — multi-shard sampling with coprime-stride block traversal for batch diversity (PR #726 / PR #1060 style).
  5. Bigger BigramHash — 6144 buckets (up from 1536), reducing hash collision ratio.
  6. Bigger Value Embeddings — dim=196 on layers 5,9,10 (up from dim=128 on layers 9,10).
  7. XSA on last 7 layers (up from 4).
  8. MiLe margin loss — entropy-weighted cross-entropy (gamma=0.75), disabled during warmdown.
  9. Cache + backout — layer 7 hidden state cached, subtracted via learnable gate before LM head.
  10. Flash Attention 3 via flash_attn_interface.
  11. No TTT — sliding window eval only (~98s), leaving eval budget unused.
  12. Tuned batch size — TRAIN_BATCH_TOKENS=548,864
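Item 4's traversal can be sketched in a few lines: a stride coprime with the block count is guaranteed to visit every block exactly once per pass while keeping consecutive batches far apart in the file. The stride and offset selection rules below are hypothetical stand-ins, not the PR's actual loader logic:

```python
import math

def coprime_stride_order(n_blocks, seed):
    """Return a permutation of block indices 0..n_blocks-1 produced by a
    stride coprime with n_blocks: every block is visited exactly once,
    and consecutive visits land at distant file offsets."""
    # Seed-dependent stride, bumped until coprime with n_blocks
    # (illustrative rule; the PR's selection may differ).
    stride = seed % n_blocks or 1
    while math.gcd(stride, n_blocks) != 1:
        stride += 1
    start = (seed * 7919) % n_blocks  # arbitrary seed-dependent offset
    return [(start + i * stride) % n_blocks for i in range(n_blocks)]
```

Coprimality is what makes this a permutation: the map `i -> (start + i*stride) mod n` is a bijection on `0..n-1` exactly when `gcd(stride, n) == 1`.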
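Item 5 trades artifact bytes for a lower collision ratio in the bigram table. A toy sketch of bigram bucketing follows; the multiplier and bit mixing here are illustrative assumptions, not the PR's actual hash:

```python
from collections import Counter

def bigram_bucket(prev_tok, cur_tok, n_buckets=6144):
    """Hash a (previous, current) token pair into one of n_buckets
    embedding slots (illustrative hash, not the PR's)."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16  # mix high bits into low bits before the modulo
    return h % n_buckets

def collision_ratio(pairs, n_buckets=6144):
    """Fraction of distinct bigrams that share a bucket with another bigram."""
    counts = Counter(bigram_bucket(p, c, n_buckets) for p, c in pairs)
    collided = sum(v for v in counts.values() if v > 1)
    return collided / len(pairs)
```

Raising the bucket count from 1536 to 6144 lowers the expected load per bucket 4x, which is where the reduced collision ratio comes from.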

Architecture

| Component | Setting |
|---|---|
| Layers | 11 (512d, 8H, 4KV) |
| MLP | 3× with LeakyReLU(0.5)² |
| BigramHash | 6144 buckets |
| XSA | Last 7 layers |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE196 | Layers 5, 9, 10 |
| Residual lambdas | Per-sublayer, init √1.1 |
| Cache + backout | Layer 7, learnable λ |
| Weight avg | EMA(0.997) + SWA(every 50) |
| Quantization | Full Hessian GPTQ int6 + LZMA |
| Optimizer | Parallel Muon (split early/late LR) |
| Late QAT | STE at lr_scale < 0.15 |
| Params | 27,605,108 |
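The residual-lambda setting can be unpacked with a small sketch: under the update x ← λ·x + sublayer(x), the output of sublayer k reaches the head scaled by the product of all lambdas applied after it, so a shared init of √1.1 per sublayer produces an exponential weighting across depth. The sublayer count below is illustrative; in training the lambdas are learnable per-sublayer scalars, not a fixed constant:

```python
import math

def residual_contribution_weights(lambdas):
    """Weight of each sublayer's output in the final residual stream under
    x <- lam * x + sublayer(x): sublayer k's output is scaled by the
    product of all lambdas applied after it."""
    n = len(lambdas)
    weights = []
    for k in range(n):
        w = 1.0
        for lam in lambdas[k + 1:]:
            w *= lam
        weights.append(w)
    return weights

init = math.sqrt(1.1)  # ≈ 1.0488, the per-sublayer init from this PR
lams = [init] * 22     # e.g. 11 layers × (attn + MLP); count is illustrative
w = residual_contribution_weights(lams)
```

With λ > 1 the weighting across sublayers spans a factor of λ^(n−1) between the first and last contribution, which is the exponential depth bias the PR description refers to.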

Timing

| Phase | Time |
|---|---|
| Training (incl. 14s GPTQ calibration) | 586s + 14s = 600s |
| Sliding window eval (stride=64) | ~98s |
| Total eval | ~98s |

Credits

Gusanidas and others added 2 commits March 30, 2026 10:42
PR openai#549 / KitchenSinkV2 base with:
- Residual lambdas: learnable per-sublayer scaling (init sqrt(1.1), 5x LR)
- Bigram hash: 6144 buckets (up from 2048)
- Value embeddings: dim=196 on layers 5,9,10
- Flash Attention 3 via flash_attn_interface
- Train-data GPTQ int6 calibration within training budget
- Sliding window eval stride=64
- Optuna-tuned LRs: matrix 0.036/0.044, scalar 0.028/0.018

12 seeds: mean 1.1140 bpb (1.8809 nats), std 0.0005
Improvement over leader: 0.0054 bpb / 0.0091 nats
p < 0.0001 for >= 0.005 nats improvement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
val_bpb: 1.1161 | val_loss: 1.884 nats | ~15.3 MB | 8×H100 SXM | Legal TTT

Seeds: 42=1.1163, 1337=1.1160, 2024=1.1161 | Mean=1.1161, Std=0.0001

Novel contribution: EGGROLL Antithetic Ternary Bin Search — post-GPTQ
quantization refinement that directly optimizes INT6 bin assignments
against BPB loss during eval. Zeroth-order, strictly additive (cannot
degrade quality), complementary to Hessian-based GPTQ.

Also adds missing TTT call to PR openai#1130's eval pipeline.

Built on PR openai#1130 by @Gusanidas (Kitchen Sink V2)
Foundation: PR openai#549 by @abaybektursun

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
… BPB)

ResidLambdas: per-sublayer residual scaling (init sqrt(1.1), 5x scalar_lr, no WD)
Tuned LRs: MATRIX_LR=0.036, SCALAR_LR=0.028, TIED_EMBED_LR=0.022
Bigger VE: dim=196 on layers 5,9,10 (was dim=128 on layers 9,10)
PR openai#1130 achieved 1.1140 (12-seed mean) with these innovations.