Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT#598

Open
Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/11L-gepa-mixed-quant-7k-legal-ttt

Conversation

@Christopher-Lee-McClendon commented Mar 24, 2026

Non-Record: 4×A100-40GB

val_bpb = 1.1334 | Pre-TTT: 1.1476 | Artifact: 15.70 MB (headroom: 297 KB)


What Changed

Extended training to 7000 steps (from the typical 5200) with a longer warmdown cosine anneal (step 3500→7000), combined with mixed int6/int8 quantization to keep the 27M-parameter model under 16 MB. Legal score-first TTT (10 epochs SGD with momentum) yields a further −0.0142 BPB improvement.
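The warmdown described above (constant LR, then a cosine anneal over steps 3500→7000) can be sketched as follows. This is a minimal illustration of that schedule shape only; the function name and `base_lr` value are hypothetical, not taken from the PR.

```python
import math

def lr_at_step(step, base_lr=1e-3, warmdown_start=3500, total_steps=7000):
    """Constant LR until warmdown_start, then cosine-anneal to zero at total_steps."""
    if step < warmdown_start:
        return base_lr
    # Fraction of the way through the warmdown window, clamped to [0, 1].
    t = min((step - warmdown_start) / (total_steps - warmdown_start), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

At step 3500 this returns the full base LR; at step 5250 (halfway through the warmdown) half of it; at step 7000 approximately zero.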

|                | This Work | Prior VE128+RoPE (30ep TTT) |
|----------------|-----------|------------------------------|
| Pre-TTT BPB    | 1.1476    | 1.1609                       |
| Post-TTT BPB   | 1.1334    | 1.1425                       |
| Training steps | 7,000     | 5,200                        |
| TTT epochs     | 10        | 30                           |
| Eval time      | 2,194 s   | 3,662 s                      |
| Artifact size  | 15.70 MB  | 15.65 MB                     |

The base model improvement (−0.0133 pre-TTT) comes from longer training plus the GEPA architecture. Fewer TTT epochs (10 vs 30) mean faster eval (40% less wall time) at the cost of a smaller TTT gain (−0.0142 vs −0.0184).

Architecture: 11L GEPA

  • 11 unique layers (no depth recurrence), d=512, 8Q/4KV GQA heads
  • ReLU² (Star-ReLU) activation in 3× MLP (1536 hidden)
  • Cross-sequence attention (XSA) on last 4 layers
  • Exponential moving average (decay 0.997)
  • Bigram hash embeddings (2048 buckets, 128d)
  • Partial RoPE (16/64 dims) with YARN scaling
  • Value embeddings (128d) on layers 9–10
  • U-Net skip connections across layer pairs
  • LN depth scaling (1/√(layer+1))
  • Late QAT with GPTQ-lite clip search (5 candidates/row), enabled at step 6476
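Of the components above, partial RoPE is easy to show concretely: only 16 of the 64 head dims are rotated, the rest pass through unchanged. The sketch below (NumPy, single head, YARN scaling omitted) is illustrative; the function name and pairing layout are assumptions, not the PR's actual implementation.

```python
import numpy as np

def partial_rope(x, n_rot=16, base=10000.0):
    """Rotate the first n_rot of the head dims with rotary embeddings;
    pass the remaining head_dim - n_rot dims through untouched.
    x: (seq_len, head_dim), a single attention head for illustration."""
    seq_len, head_dim = x.shape
    half = n_rot // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:n_rot]                     # the paired dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, n_rot:]], axis=-1)
```

Position 0 gets zero rotation (cos = 1, sin = 0), and dims 16–63 are identical before and after, which is what makes the RoPE "partial."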

Mixed Quantization

Dual-scheme compression for the 27M-parameter model:

  • Int6 per-row (GPTQ-lite): attention projections + MLP weights (bulk of params)
  • Int8 per-tensor (scalar scale): layer norms, value embeddings, biases, embedding tables

27.5 MB payload → 15.63 MB after zstd-22 (3.89× compression) + 76 KB code = 15.70 MB total.
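A minimal sketch of the int6 per-row scheme: one floating-point scale per weight row, integer codes in [−31, 31]. This shows only the basic symmetric round-to-nearest step; the PR's GPTQ-lite clip search (5 candidates/row) would additionally try shrunken clip ranges per row and keep whichever minimizes reconstruction error. Function names here are illustrative, not from the submission.

```python
import numpy as np

def quant_int6_per_row(w):
    """Symmetric per-row int6 quantization: codes in [-31, 31], fp scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequant(q, scale):
    """Reconstruct approximate fp weights from codes and per-row scales."""
    return q.astype(np.float32) * scale
```

The round-to-nearest step bounds per-element error by half a scale unit per row; the clip search trades a little extra clipping error on outliers for finer resolution on the bulk of each row.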

TTT Protocol (Legal Score-First)

SGD with momentum (0.9) at lr=0.002, 10 epochs per 32K-token chunk, stride=64, freezing first 2 blocks. Score-first: every token scored under torch.inference_mode() before any weight update.
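The "score-first" ordering is the legality-critical part: each chunk must be scored with the current weights before any gradient from that chunk touches them. A structural sketch of that loop, with `score_fn`/`grad_fn` as hypothetical stand-ins for the model's BPB evaluation and backward pass (chunking stride and the frozen first-2-blocks detail are omitted):

```python
import numpy as np

def ttt_score_first(chunks, score_fn, grad_fn, params,
                    lr=0.002, momentum=0.9, epochs=10):
    """Score-first TTT: score every chunk under frozen weights BEFORE applying
    any update derived from that chunk, then adapt with SGD + momentum."""
    scores = []
    velocity = np.zeros_like(params)
    for chunk in chunks:
        scores.append(score_fn(params, chunk))   # score first (no leakage)
        for _ in range(epochs):                  # then 10 epochs of adaptation
            velocity = momentum * velocity - lr * grad_fn(params, chunk)
            params = params + velocity
    return scores, params
```

In the real protocol the scoring pass runs under torch.inference_mode(), which is what guarantees no gradient state exists when the chunk's tokens are scored.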

Limitations

  • Single seed (42) — no variance estimate. Acceptable for non-record but results may vary across seeds.
  • No ablation of individual GEPA components. The architecture combines multiple techniques without isolating their contributions.
  • No LeakyReLU: PR #537 ("Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)") showed LeakyReLU(0.5)² helps (−0.0035 BPB). This submission uses standard ReLU² instead; combining the two is an obvious next step.

Credits

Thanks to all contributors to the parameter-golf competition.

- Non-record submission: 1.1334 BPB, 15.70 MB artifact (4×A100-40GB)
- Mixed quantization: int6 per-row for MLP/attn, int8 per-tensor for rest
- 7000 training steps (vs 5200 baseline) with GEPA architecture
- Legal score-first TTT: SGD 10 epochs, -0.0142 BPB gain
- Beats prior non-record best (1.1425) by 0.009 BPB