Parameter Golf Live AI Commentary
Auto-updated every ~10 minutes. Tracking techniques, trends, idea lineage, and explaining concepts for the community.
Last updated: Mar 21, 11:30 AM PT
The Competition at a Glance
Goal: Train the best language model that fits in a 16MB artifact, training in under 10 minutes on 8xH100s. Evaluated by compression of the FineWeb validation set, measured in bits per byte (BPB) — lower is better. Tokenizer-agnostic. Baseline: 1.2244 BPB.
What does "compression" mean here?
BPB (bits per byte) measures how many bits your model needs to encode each byte of text. A model that perfectly predicts every next character needs zero bits — it already "knows" what comes next. A model with no understanding of language needs the maximum (~8 bits per byte).
A model's cross-entropy loss IS its compression rate. Shannon proved in 1948 that prediction and compression are mathematically equivalent — a model that predicts well compresses well, and vice versa. The competition measures the compression side of that equivalence.
This framing matters because it legitimizes approaches beyond pure language modeling: sliding window eval improves compression by giving more context. Backward-looking TTT adapts to already-scored tokens for better compression. These are valid compression strategies.
There is no separate held-out test set — the FineWeb validation set is the fixed evaluation target. However, val tokens cannot be stored in the artifact (paid prefix ruled out), and pre-eval adaptation on val data is also ruled out. Only backward-looking TTT (adapting on tokens already graded) is permitted.
"Tokenizer-agnostic" means BPB normalizes across tokenizers. A bigger vocabulary uses fewer tokens but more bits per token — BPB cancels that out, measuring compression of raw bytes regardless of how they're tokenized.
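The tokenizer-agnostic normalization above is easy to see in code. This is a minimal sketch (the function name and the two hypothetical tokenizations are illustrative, not the official eval harness): summed cross-entropy in nats is converted to bits, then divided by the raw byte count, so a coarse tokenizer (few tokens, high loss per token) and a fine one (many tokens, low loss per token) can land on the same BPB.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed token-level negative log-likelihood (in nats)
    into bits per byte: divide by ln(2) to get bits, then by the raw
    byte count of the text being scored."""
    return total_nll_nats / math.log(2) / n_bytes

# Two hypothetical tokenizations of the same 100-byte string:
#   coarse tokenizer: 20 tokens at 3.5 nats each
#   fine tokenizer:   50 tokens at 1.4 nats each
coarse = bits_per_byte(20 * 3.5, 100)
fine = bits_per_byte(50 * 1.4, 100)
# Both give ~1.01 BPB: the per-token/per-byte bookkeeping cancels out.
```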
Record submission requirements: Artifact ≤16,000,000 bytes (code + compressed model). Training ≤10 min on 8xH100 SXM. Evaluation ≤10 min (separate budget). No network calls. New SOTA records must beat the current best by ≥0.005 nats at p < 0.01 significance (typically 3 seeds). Evaluation methods are unrestricted — any sequence length, sliding window, etc. are fair game. Test-time training is allowed only on already-evaluated tokens (backward-looking); pre-eval adaptation on val data is ruled out.
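The significance gate can be sketched as a simple one-sample t-test over seeds. This is an assumption about how one might self-check, not the official validation harness; the critical value 6.965 is the one-sided p < 0.01 threshold for df = 2 (i.e., 3 seeds).

```python
import statistics

def beats_sota(new_bpb_runs, sota_bpb, min_delta=0.005, t_crit=6.965):
    """Simplified one-sample t-test sketch (not the official harness).
    Passes when the mean of the new runs is at least `min_delta` below
    the SOTA score AND the t statistic clears the one-sided p < 0.01
    critical value for df = n - 1 (6.965 for n = 3 seeds)."""
    n = len(new_bpb_runs)
    mean = statistics.mean(new_bpb_runs)
    sem = statistics.stdev(new_bpb_runs) / n ** 0.5
    delta = sota_bpb - mean
    t = delta / sem if sem > 0 else float("inf")
    return delta >= min_delta and t > t_crit

# Three hypothetical seeds vs the 1.1428 SOTA:
result = beats_sota([1.1250, 1.1252, 1.1248], 1.1428)
```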
In ~3.5 days since launch, the community has driven BPB down by ~0.10 (to 1.1250 pending). Over 349 PRs submitted, including hardware-constrained entries from consumer GPUs and Apple Silicon.
Official Leaderboard (Top 5)
| Rank | Score | Author | Key Techniques | PR |
|---|---|---|---|---|
| 1 | 1.1428 | @thwu1 | 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD 0.04 | #180 |
| 2 | 1.1458 | @raahilshah | Int6 MLP3x + SmearGate + BigramHash + OrthoInit + MuonWD + SWA | #162 |
| 3 | 1.1502 | @aruniyer | 11L + Int6 QAT + MLP3x + WD 0.04 + zstd-22 | #86 |
| 4 | 1.1556 | @aquariouseworkman | SmearGate + OrthoInit + Int6 STE QAT + MLP3x + Sliding Window | #65 |
| 5 | 1.1586 | @yahya010 | 10L Int6 QAT + Zstd MLP2.6x + Muon 0.99 + Sliding Window | #63 |
Best validated pending: #315 by @jfprincz — 1.1250 BPB (3-seed, Partial RoPE + LN Scale + Late QAT + XSA4 + EMA, no TTT, p < 0.01) | Full tables below ↓
Pending: Meets Record Requirements (Top 7)
These submissions have self-validated a ≥0.005-nat improvement over the SOTA at time of submission, at p < 0.01 significance. Official SOTA is now 1.1428 BPB.
| BPB | Author | Δ nats | Seeds | Techniques | PR |
|---|---|---|---|---|---|
| 1.1250 | @jfprincz | 0.030 | 3 | 11L + Partial RoPE (16/64 dims) + LN Scale + Late QAT + XSA (last 4) + EMA (0.997) + FA3 | #315 |
| 1.1256 | @alertcat | 0.029 | 3 | 11L + #315 stack + TTT (3 ep SGD, freeze 2 blocks) — TTT neutral on #315's base (±0.001) | #338 |
| 1.1280 | @jfprincz | 0.025 | 3 | 11L + XSA (last 4) + EMA (0.997) + SmearGate + BigramHash + WD 0.04 + FA3 | #287 |
| 1.1313 | @timowhite88 | 0.019 | 3 | 11L Int6 MLP3x + SmearGate + TTT (3 ep SGD, freeze 2 blocks) + RoPE50K + SWA + WD 0.04 + FA3 | #254 |
| 1.1320 | @saml212 | 0.018 | 3 | 12L + Gradient-Guided Quant (int7/6/5) + Partial RoPE + LN Scale + XSA4 + EMA + 524K batch + MLP 1408 | #332 |
| 1.1326 | @jfprincz | 0.017 | 3 | 11L + Int6 MLP3x + SmearGate + BigramHash + WD 0.04 + SWA + FA3 | #198 |
| 1.1400 | @saml212 | 0.005 | 3 | 11L Int6 + SmearGate + BigramHash + 524K batch + SWA + WD 0.04 | #236 |
Pending: Not Yet Validated
Submissions with BPB below official SOTA (1.1428) that haven't yet demonstrated statistical significance.
| BPB | Author | Techniques | PR |
|---|---|---|---|
| 1.1307 | @unnir | 11L + SmearGate + Partial XSA (last 3 layers) + SWA + WD 0.04 + FA3 | #265 |
| 1.1354 | @ibarrajo | 11L + Partial XSA (last 3) + TTT + 524K batch + RoPE50K (no FA3) | #290 |
| 1.1357 | @dennisimoo | 11L + XSA (last 4) + EMA + 524K batch + WD 0.04 (no FA3, no TTT) | #307 |
| 1.1381 | @charmquark1984 | 11L + SmearGate + TTT + 524K batch + WD 0.042 + FA2 | #281 |
| 1.1399 | @Mapika | 11L + XSA4 + EMA + Int5-MLP/Int6-Attn/Int8-Embed + 8% pruning (3-seed, fails 0.005-nat by 0.00004) | #349 |
| 1.1419 | @chris-buckley | 11L + XSA4 + EMA + TTT (no FA3, SDPA fallback, 5344/9000 steps; pre-quant 1.1581 — weaker base than #303) | #317 |
32 validated + 6 unvalidated below SOTA (1.1428) | Full tables below ↓
Untried Combinations
Ranked by expected value (likely gain times probability of working), grounded in competition ablation data:
Tier 1 — High expected value
- Reptile meta-TTT on #315's XSA+EMA base (new frontier). Naive SGD TTT is neutral (±0.001) on #315's base (#338, 3 seeds) — neither helping nor hurting. This is encouraging: the disruption seen on #287's base (#303: +0.016 worse) doesn't carry to #315's stronger base. Reptile modifies training, not just eval — it meta-learns inner-loop directions that improve TTT initialization quality, worth 0.011 BPB on SmearGate models (10x naive, #296). Since naive TTT is neutral on #315's base (not destructive), Reptile's meta-learning overhead has a clean floor to build on. #302 combined Reptile + causal TTT + decay prior on a different XSA base (1.1520, 1 seed). With #315 at 1.1250, Reptile meta-TTT is the highest-EV remaining experiment. Estimated: 0.005-0.011 BPB, projecting ~1.114-1.120 BPB.
- Sequence length curriculum — PRACTICAL BLOCKER FOUND. @mahsumaktas's #333 (23 runs) found that SWA checkpoints are incompatible across sequence lengths — changing seq length mid-training breaks SWA averaging. Since SWA/EMA is essential at the frontier, this is a serious implementation hurdle. The technique might still work if EMA (per-step, no checkpoints) replaces SWA, but untested. Estimated: 0.003-0.008 BPB if the SWA issue is solved.
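The Reptile outer loop behind the meta-TTT idea is small enough to sketch. This is a toy illustration with scalar weights and a stand-in gradient function, not #296's actual implementation: run a few inner SGD steps (simulating test-time adaptation), then move the real weights a fraction of the way toward the adapted weights, so future TTT starts from a better initialization.

```python
# Toy Reptile sketch (hypothetical helpers; not #296's code).

def compute_grads(weights, batch):
    # Stand-in for a real backward pass: gradient of sum(w^2).
    return {k: 2.0 * v for k, v in weights.items()}

def reptile_step(weights, batches, inner_lr=0.01, outer_lr=0.1):
    adapted = dict(weights)              # copy for the inner loop
    for batch in batches:                # k inner SGD steps (simulated TTT)
        grads = compute_grads(adapted, batch)
        for k in adapted:
            adapted[k] -= inner_lr * grads[k]
    for k in weights:                    # Reptile outer update
        weights[k] += outer_lr * (adapted[k] - weights[k])
    return weights

w = reptile_step({"w": 10.0}, batches=[None, None, None])
```

Because the outer update interpolates rather than replaces, the base model keeps its language-modeling quality while its weight landscape is nudged toward regions where a few SGD steps on recent tokens help rather than disrupt.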
Tier 2 — Top picks (sorted by expected value)
- Reptile meta-TTT (#296). 0.011 BPB on SmearGate models (10x naive TTT). Trains inner-loop directions during last 20% of wallclock. Zero artifact cost. Caveat: gain may partly be from extra training steps. On #315's base (new frontier): est. 0.005-0.011 BPB.
- Mousse optimizer (arXiv:2603.09697). Curvature-aware Muon — Shampoo preconditioning before orthogonalization. ~12% more effective training at 3% overhead. Drop-in. Est. 0.003-0.008 BPB.
- Entropy-coded weights (arXiv:2505.02380). Replace zstd with ANS/Huffman that exploits quantized weight distribution. Could free 1-2 MB for more params. Decoder must fit in artifact. Est. 0.003-0.008 BPB (indirect via capacity).
- OptRot pre-quantization (arXiv:2512.24124). Rotation matrix redistributes weight outliers before quantizing. Reduces int6 gap 30-50%. Fuses into adjacent layers — zero artifact cost. Est. 0.002-0.005 BPB. Drop-in.
- Turbo-Muon (arXiv:2512.04632). Preconditioned Newton-Schulz — 5-10% faster training. More steps in 600s. Significance test waived for systems-only changes. Est. 0.002-0.005 BPB.
More Tier 2 ideas (lower EV or higher complexity)
| Technique | Est. BPB | Key idea | Complexity |
|---|---|---|---|
| HybridNorm (arXiv:2503.04598) | 0.002-0.006 | Mixed Pre/Post-Norm for better depth utilization | Very low |
| PPM-C mixing (#283) | 0.003-0.008 | Classical compression blended with neural at eval time | Low |
| Differential Attention (arXiv:2410.05258) | 0.005-0.015 | Difference of two softmax maps; reduces outliers | High (arch change) |
| Lattice VQ (arXiv:2603.11021) | 0.005-0.015 | Joint 24-weight Leech lattice encoding; saves 2-4 MB | High (custom kernels) |
| VGA (arXiv:2510.09017) | 0.002-0.005 | Value-gated attention; fixes sliding window sinks | Low-moderate |
| Neural Cache cross-window KV (#318) | unknown | Cache K/V from prior windows so new queries attend to 50K+ context; zero artifact cost; untested | Low (FA3 already supports seqlen_k > seqlen_q) |
| Batch size warmup | 0.002-0.005 | Start at 128-256K, ramp to 524K; more early updates | Very low |
| Entropy-reg. QAT | ~0.003-0.008 | Compression penalty clusters weights for zstd; saves 0.5-1.5 MB | Low |
| Temperature scaling | 0.001-0.005 | Single scalar T corrects post-quant calibration | Trivial |
| PolyCom activations (arXiv:2411.03884) | 0.002-0.006 | Polynomial+norm replacing ReLU²; higher-order feature interactions | Very low |
| Predictive Batch Scheduling (arXiv:2602.17066) | 0.002-0.005 | Loss-aware data ordering (NOT content curriculum); 6-13% faster convergence | Low |
| Late-Stage SAM (arXiv:2410.10373) | 0.002-0.005 | Sharpness-aware minimization last 5-10%; flatter minima complement EMA | Moderate (Muon-SAM) |
| Cautious Weight Decay (arXiv:2510.12402) | 0.001-0.004 | WD only where sign-aligned with gradient; prevents WD fighting optimizer | Trivial (1 line) |
| Gated Attention (arXiv:2505.06708) | 0.002-0.005 | Per-head sigmoid gate eliminates attention sinks; NeurIPS 2025 Best Paper | Very low (~2K params) |
| Memory Tokens on frontier base (#352) | est. 0.005-0.014 | 64 learnable prefix embeddings as global scratchpad; ablation shows -0.014 BPB on 10L base. Untested on #315's 11L XSA+EMA base. ~8K params. | Very low |
| Value Residual / ResFormer (arXiv:2410.17897) | 0.002-0.006 | Layer-1 value shortcut to all layers; fights attention concentration; 20% fewer tokens needed | Very low (~11 scalars) |
| Deep Delta Learning (arXiv:2601.00417) | 0.002-0.005 | Rank-1 residual that can subtract info from stream; addresses norm growth at depth | Moderate (~5.6K params) |
| WaveletGPT (arXiv:2409.12924) | 0.003-0.010 | Multi-scale Haar wavelet structure on half of embedding dims; 40-60% faster convergence | Low (zero params) |
| AGGC adaptive gradient clipping (arXiv:2601.11864) | 0.002-0.005 | Per-group adaptive clip thresholds; exploits Q-matrix heterogeneity from #215 | Low (optimizer state) |
Tier 3 — Novel approaches, higher risk
- Knowledge distillation (teacher then student). Nobody has tried this. Train a larger model for ~7 min as teacher, then distill into the 16MB student for ~3 min using soft labels. Typical distillation gains at small scale: ~0.005-0.010 BPB. The tight time budget is the main constraint — the teacher needs to be meaningfully better than the student for distillation to help. Estimated: ~1.125-1.145 BPB if it works (unlikely to beat the current frontier without the full technique stack on top), and high implementation complexity.
- Mixture of Experts. @Complexity-ML's #250 (v4, superseding #224) uses 4 SwiGLU experts with hybrid routing (modulo-base + learned override) and CUDA scatter dispatch. Awaiting compute credits, no results yet. Estimated: ~1.120-1.140 BPB (wide uncertainty).
- Partial weight sharing + 14L. Share weights across middle-layer pairs (e.g., layers 4-6 share, 7-9 share) with per-layer LoRA adapters (~1% of params). Saves ~3-5 MB, enough for 14 total layers. Recent "Relaxed Recursive Transformers" paper shows LoRA-adapted shared layers recover most of unique-layer quality. Estimated: net +0.005-0.015 BPB (depth gain minus sharing cost). High implementation complexity but highest potential artifact-efficiency play.
- nGPT hypersphere normalization (partial) (arXiv:2410.01131, NVIDIA). Constrain Q/K matrices to unit-norm rows (hypersphere). Eliminates the extreme Q condition numbers identified in #215 (100M+ → 1 by construction). Partial application (Q/K only, leave MLP free) is feasible. NVIDIA claims 4-20x convergence speedup for full nGPT. Estimated: 0.003-0.008 BPB but high implementation complexity and untested at this scale.
- BitNet + sliding window + SmearGate. @ksang123's #139 ternary 64.5M-param model (1.2029) uses no eval tricks. Adding sliding window (-0.03) and SmearGate (-0.01) could reach ~1.16, but the ternary approach still has a fundamental quality gap vs int6 at this scale.
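Since the distillation idea above is untried in the competition, here is a minimal sketch of the objective it would need (hypothetical, pure-Python for clarity): hard-label cross-entropy blended with KL(teacher || student) on temperature-softened distributions, with the usual T² factor to keep gradient scale comparable across temperatures.

```python
import math

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """Sketch of a distillation objective (no competition PR implements
    this yet). Blends hard-label cross-entropy with KL divergence from
    the teacher's temperature-softened distribution."""
    hard = -math.log(softmax(student_logits)[target])
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * hard + (1 - alpha) * soft * T * T
```

The KL term vanishes when student and teacher agree, so in the ~3-minute student phase the gradient signal comes mostly from tokens where the (hopefully stronger) teacher disagrees.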
What Doesn't Work
Two failure patterns. (1) Throughput cost exceeds quality gain. In a 600s budget, anything adding >10% step overhead needs >10% per-step improvement to break even. QAT (#236: 115ms vs 67ms baseline), NorMuon (#236: 110ms), and MTP (#212, #236: 86ms) all fail this test. (2) Mechanism redundancy. Stacking two techniques that extract the same signal yields diminishing returns — TTT+XSA underperforms XSA-alone (#290 vs #265), error-guided TTT doesn't improve over uniform TTT (#296), EMA without XSA hurts (#201). Gains require new information, not better processing of existing signal.
- 12 layers at seq2048 (slower steps cancel extra capacity) — #219's 12L at seq2048 runs at 107ms/step, fitting only ~5,590 steps. Result: 1.1541 vs 11L's 1.1326. However, #76 shows 12L at seq1024 (59ms/step, ~9000 steps) reaches 1.1468 — the tradeoff depends on sequence length.
- Late QAT at 12L is step-budget-dependent. @saml212's #332 found that at 12L, Late QAT added ~7ms/step overhead, costing ~770 training steps. At 11L, those steps would cost ~7ms each — but at 12L, each step is already more expensive and step count is already lower, so the overhead-to-gain ratio worsens. Result: Late QAT was dropped from the 12L submission. Takeaway: the same technique's cost-benefit flips depending on step time and total step count. Always re-evaluate overhead techniques when changing layer count.
- Int5-MLP tradeoff is layer-dependent — At 11L, #236 found the int5 quant penalty (0.029) outweighs the artifact savings. But at 10L, #180 uses int5 to fund BigramHash(10240) and is the official SOTA (1.1428). The lesson: int5's quality loss must be offset by what you buy with the saved space.
- Larger vocabularies + fewer layers — #123 (vocab 4096, 8L) at 1.1642 and #200 (SP4096, 9L) at 1.2012 both underperform #198's 11L at 1.1326. The embedding matrix gets 4x larger, forcing fewer layers. At current artifact sizes, depth wins over vocab breadth.
- SmearGate without OrthoInit — hurts BPB by 0.003 (see SmearGate deep dive).
- SWA with bf16 accumulation — #212 found catastrophic precision loss when accumulating SWA checkpoints in bf16. Must use fp32. (However, @kellyvv's #238 found that with enough SWA checkpoints (84), the quant gap actually reverses — quantized BPB becomes 0.037 better than pre-quant. SWA smoothing eliminates quantization-sensitive outliers.)
- MTP (multi-token prediction) — #212's controlled test: no BPB improvement (1.1947 vs 1.1929 control).
- Curriculum learning (content-based) — #212 found no effect.
- LAWA-EMA replacing SWA — context-dependent. #201 tested EMA alone on #198's base: 1.1551 (0.023 worse than SWA). But #287 uses EMA (decay=0.997) WITH XSA and reaches 1.1280 — beating SWA. EMA needs XSA to work. EMA decay=0.999 was also tried on #287 and hurt BPB — too slow to average (per @jfprincz). The sweet spot is 0.997.
- cuDNN SDP vs Flash SDP — #281 found cuDNN is 40% faster per attention op but produces worse BPB (1.1455 vs 1.1418). More steps don't help — different internal accumulation precision hurts quality.
- SwiGLU activation — worse than relu² on this architecture. Confirmed independently by #340 and #344.
- Step-based LR schedule — #344 found a 0.483 BPB regression vs wallclock-based warmdown. Catastrophic because the 600s budget varies by hardware; step-count schedules can't adapt.
- Error-guided TTT — concentrating TTT on highest-loss tokens doesn't help; they're genuinely unpredictable (see TTT deep dive).
- TTT hurts strong XSA+EMA bases; helps weak ones — nuanced (#303 vs #317). @sseanliu (#303) tested SGD TTT on #287's strong XSA+EMA base: +0.016 worse (1.1280 → 1.1436, 2 seeds). But @chris-buckley (#317) applied TTT to a weaker XSA+EMA base (pre-quant 1.1581, no FA3, SDPA fallback) and got 0.024 better (1.1655 → 1.1419 post-TTT, 1 seed). The pattern: TTT helps when the base is well below the frontier, but disrupts the weight landscape of a well-tuned base. Caution: #317 is 1 seed and 5344/9000 steps (SDPA is slower than FA3), so the effect may partly be slower training compensated by TTT. This remains the most nuanced negative result.
- INT4 quantization — #281 tested all-INT4: fits 33.5M params but the 0.06 BPB quant gap makes it strictly worse than INT6 with fewer params.
- FTLE per-row precision (#316) — Dynamical-systems-inspired row-level quantization (Lyapunov exponent tracking). Clean negative result: uniform int-N beats FTLE-guided mixed precision at every bit width, because mixing bit widths within a row increases quantized-value entropy, which defeats zstd compression. Lower RMSE does not imply a smaller artifact.
The Current Baseline Stack
The foundation that most competitive submissions share. Worth noting: several top submissions diverge from consensus in specific ways that paid off — #180 used int5 (now official SOTA), #236 used 524K batch instead of 786K, #76 dropped QAT and raised LR, #265 added XSA from a recent paper. The meta is a strong starting point, but the data shows room to improve individual components.
The top entries (#198 at 1.1326) extend the core five below with 11 layers, SmearGate+BigramHash+OrthoInit, WD 0.04, and SWA:
The core five: Integer quantization (int6-all or int5-MLP/int6-attn) + MLP 3x expansion + sliding window eval (stride=64) + zstd-22 compression + precision passthrough for sensitive layers (usually FP16 tied embedding; #236 uses int8 to fund MLP capacity). Near-universal across all competitive submissions, though quant precision varies — #76, #267, and official SOTA #180 use int5-MLP to fund larger BigramHash or extra layers.
Near-consensus optimizer settings: Muon momentum 0.99 (warmup from 0.92 over 1500 steps), halved LRs (matrix=0.02, scalar=0.02, embed=0.03), warmdown 3000 iters, grad clip 0.3. Most top submissions use these. Exceptions: @unixmadtoonslab's #76 (1.1468) uses higher LRs (0.03) and lower momentum (0.97). @saml212's #236 (1.1400, #4 validated) uses 524K batch instead of the consensus 786K — 22% more gradient updates outweigh 17% fewer tokens, worth 0.017 BPB. Batch size may be the most under-explored hyperparameter.
Part of the top stack: SmearGate + BigramHash + OrthoInit — used by most top validated entries. Requires OrthoInit to work (per #212's ablation). 11 layers + WD 0.04 + weight averaging (SWA or EMA). The frontier (#315, 1.1250) builds on EMA (decay=0.997) + XSA on last 4 layers with three new zero-parameter techniques: Partial RoPE (16/64 dims), LN Scale (1/√(layer+1)), and Late QAT (STE enabled final 4% only).
Common but not universal: QAT with STE (~half — notably #76 at 1.1468 succeeds without it, finding WD=0.04 alone sufficient for quantization robustness), SWA (~10/21 validated), NorMuon (~3/21), FA3 (~5/21).
1. Int6 Quantization (instead of Int8)
Standard post-training quantization maps each weight to an 8-bit integer (256 levels). Int6 uses only 6 bits (64 levels, range [-32, 31]) with per-row scale factors, then compresses with zstd (level 22) instead of the baseline's zlib-9.
Why it matters here: The competition caps artifacts at 16MB. Int6 frees up ~25% more artifact space than int8. That freed space gets reinvested in a bigger model — which more than compensates for the precision loss. The tradeoff is more quantization error, which is why several submissions keep certain sensitive layers in fp16 (notably the tied embedding matrix, which is particularly vulnerable because the same matrix is used for both input embeddings and output predictions — errors compound in both directions).
Origin: @nanlliu introduced int6 mixed precision in #39, using int6 for compression-friendly middle layers while keeping int8 for precision-sensitive first/last layers.
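The per-row-scale scheme described above can be sketched in a few lines. This is an illustration of the general technique, not any specific PR's code; names and the test matrix are hypothetical.

```python
import numpy as np

def quantize_int6_rows(w):
    """Per-row symmetric int6 quantization sketch: each row maps to
    integers in [-32, 31] with its own fp16 scale factor."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_rows(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 1536)).astype(np.float32)
q, scale = quantize_int6_rows(w)
# Worst-case per-element error is about half a quantization step
# (half the row's scale), which is what the bigger-model reinvestment
# has to beat.
err = np.abs(dequantize_rows(q, scale) - w)
```

Per-row scales matter because a single outlier row would otherwise inflate the scale for the whole matrix and crush the resolution of every other row.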
2. MLP 3x Expansion
The baseline uses a 2x MLP expansion (hidden dim 1024 for a 512-dim model). Top submissions increase this to 3x (hidden dim 1536). The MLP layers are where transformers do most of their nonlinear feature transformation between attention operations. A wider MLP means more expressive capacity.
Origin: @jfprincz first used MLP 3x in #70 (commit Mar 19 08:57 UTC), recognizing that int6 freed enough artifact space for a 3x expansion — trading precision for capacity. @saml212 independently reached the same insight in #61 later that day. Adopted by 10+ subsequent submissions.
3. Sliding Window Evaluation
Instead of chopping validation text into non-overlapping chunks (where tokens near the start of each chunk have minimal context), sliding window uses overlapping windows. With stride=64 and window=2048, each token gets scored with 1984+ tokens of context.
This is purely an evaluation-time technique — it doesn't change the model. It just ensures every scored token has rich context. Based on @samacqua's ablation data in #77, sliding window with stride=256 improved BPB from 1.2168 to 1.1941 (with document isolation).
Origin: @mattqlf introduced sliding window eval in #50 and contributed a bug fix for partial window scoring in #124. Adopted by virtually every competitive submission.
Stride debate: Most use stride=64, but @saml212 found in #114 that stride=256 gives marginally better BPB (1.1574 vs 1.1579) in a quarter of the eval time.
Doc isolation depends on stride: @mrdavtan's ablation (#199) found that doc-isolated eval hurts by 0.009 BPB at stride=64 — start-of-document tokens lose all context when windows restart at boundaries. At stride=256+, doc isolation helps (per #77). If you use stride=64, use flat-stream eval.
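The windowing logic above can be sketched as follows. This is a simplified illustration (hypothetical `score_fn`; edge handling assumes the text length lines up with the stride, whereas real eval code needs the partial-window fix from #124): every window re-scores up to 2048 tokens, but only the last `stride` tokens of each window are counted, so each counted token past the first window sees ≥1984 tokens of context.

```python
def sliding_window_nll(tokens, score_fn, window=2048, stride=64):
    """Mean per-token NLL under flat-stream sliding-window evaluation.
    score_fn(chunk) returns per-token NLLs for positions 1..len(chunk)-1."""
    total, counted = 0.0, 0
    start = 0
    while True:
        chunk = tokens[start:start + window]
        nlls = score_fn(chunk)
        # First window: count everything. Later windows: only the last
        # `stride` tokens, which all have near-full context.
        keep = nlls if start == 0 else nlls[-stride:]
        total += sum(keep)
        counted += len(keep)
        if start + window >= len(tokens):
            return total / counted
        start += stride
```

The cost is ~window/stride forward passes over the text, which is why stride is a genuine compute/quality tradeoff under the 10-minute eval budget.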
4. FP16 Tied Embedding
Language models have an embedding matrix that converts tokens (words/subwords) into numerical vectors the model can process. "Tied" means the same matrix is reused at the end to convert the model's output vectors back into token probabilities — one matrix doing double duty as both dictionary and predictor. The baseline quantizes this matrix like all other weights, but because errors hit both input and output, the damage compounds.
Every competitive submission keeps this matrix in fp16 (16-bit floating point, ~1MB) instead of quantizing it. It's the single highest-value precision decision — one matrix, disproportionate impact.
Origin: @chonchiog introduced FP16 tied embedding in #42, the second record on the leaderboard.
5. Zstd-22 Compression
After quantizing weights to int6, they need to be compressed to fit in 16MB. The baseline uses zlib (the algorithm behind gzip). Zstandard (zstd) at compression level 22 squeezes int6 data significantly tighter — the difference is enough to fit ~1-2M more parameters in the same budget. Level 22 is near the maximum (22 out of 22), trading compression speed for ratio. Since compression happens once after training and decompression is fast, this is a free lunch.
Every int6 submission uses zstd-22. The int6 + zstd pairing is what makes 21M+ parameter models fit in 16MB.
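The codec effect is easy to demonstrate with stdlib tools. zstd isn't in the Python standard library, so this sketch uses `lzma` as a stand-in for "a stronger codec than zlib" (the competition stack itself uses the third-party `zstandard` package at level 22); the synthetic weight bytes are an assumption, chosen to mimic int6 values clustering near zero.

```python
import lzma
import random
import zlib

random.seed(0)
# Fake int6 weight stream: roughly Gaussian, clipped to [-32, 31],
# stored one value per byte (the int8 container from the section above).
vals = [max(-32, min(31, round(random.gauss(0, 8)))) for _ in range(1 << 18)]
raw = bytes(v & 0xFF for v in vals)

zlib_size = len(zlib.compress(raw, 9))        # baseline-style codec
lzma_size = len(lzma.compress(raw, preset=9)) # stronger-codec stand-in
print(len(raw), zlib_size, lzma_size)
```

Because only 64 of 256 byte values occur and their distribution is peaked, both codecs land well under 1 byte per value; whatever the stronger codec saves over zlib is artifact budget that can be spent on parameters instead.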
The Path Down: What Separates Each Tier
The competition spans a 0.10 BPB range from baseline (1.2244) to the best pending entry (1.1250). What separates each tier isn't just techniques — it's a fundamentally different way of approaching the problem. Four tiers now, with the frontier (#315 at 1.1250, #287 class at 1.128-1.130) constituting a qualitatively distinct layer of architecture and regularization search.
Tier 1: Tweaking the Baseline (1.20–1.22 BPB)
Submissions in this range make one or two changes to the baseline: a longer sequence length, a learning rate sweep, a warmdown adjustment. The approach is "how do I improve this model?" — treating the baseline as mostly correct and looking for low-hanging fruit.
This works for the first 0.02 BPB, but hits a wall fast. The constraint isn't hyperparameters — it's the artifact budget. At int8+zlib, you can't fit enough model capacity to go further. Many submissions in this range are also on non-standard hardware (RTX 4090, Apple Silicon, 1xH100), which limits training tokens and disqualifies from the record track.
What to do if you're here: Adopt the core five (int6, MLP 3x, sliding window, FP16 embed, zstd-22) as a package. Each technique is well-documented in the deep dives below. Together they're worth ~0.05-0.07 BPB — the single biggest jump available.
Tier 2: Stacking Known Techniques (1.15–1.18 BPB)
These submissions adopted the core five and are assembling additional techniques: SmearGate, BigramHash, SWA, QAT, NorMuon. The approach is "what techniques exist and how do I combine them?" — surveying PRs, identifying high-impact components, and building a combined recipe.
This is effective: the leap from 1.22 to 1.16 is largely a stacking exercise. But submissions in this range often stop at "I added all the techniques" without investigating interactions. Common patterns: using SmearGate without OrthoInit (which hurts — per #212's ablation), running QAT from the start (which hurts — late QAT at 70-85% is better), or using SWA without sufficient weight decay (SWA shows no effect below WD=0.04).
What to do if you're here: Run ablations. Remove one technique at a time and measure the delta. You'll often find that one "improvement" is actually hurting because of interaction effects. Check your hyperparameters against the consensus (LR=0.02, momentum=0.99, warmdown=3000) but also against divergent successes like #76 (LR=0.03, momentum=0.97). Multi-seed validation (3 seeds) is essential — single-seed scores can be off by 0.002+ BPB.
Tier 3: Understanding Interactions (~1.130–1.15 BPB)
These submissions adopted the full technique stack and understood why each technique works. @jfprincz (#198 at 1.1326) is the canonical example: 11 layers + SmearGate + BigramHash + OrthoInit + WD 0.04 + SWA + FA3 assembled into a coherent system where each piece reinforces the others — WD makes weights compressible AND quantization-friendly, SmearGate+OrthoInit inject bigram context the small model can't learn from attention alone, and SWA smooths the weight landscape during warmdown.
The approach is "how do these techniques interact, and what's the optimal system?" Key markers of Tier 3 thinking:
- Ablation-driven development — every addition is measured, not assumed helpful
- Precision budgeting — spending fp16 only where quantization error hurts most (tied embedding, late-layer keys)
- Divergent exploration — #76 found that higher LR + lower momentum + no QAT outperforms consensus settings at 1.1468. #215 discovered that Q matrices are naturally low-rank (100M+ condition numbers) and factoring them saves 22% step time.
- Statistical rigor — 3+ seeds, significance testing, honest evaluation
What to do if you're here: Solidify your baseline with multi-seed validation, then attack via Reptile meta-TTT (#296: 0.011 BPB on SmearGate models, vs 0.001 for naive TTT) or XSA on the last 3-4 layers. Both are proven gainers at this tier. Reptile trains the model's inner-loop directions during training so TTT starts from a better initialization — much more effective than naive TTT on SmearGate models. See the TTT deep dive below.
Tier 4: Architecture Frontier (<~1.130 BPB)
The frontier has advanced: @jfprincz's #315 (1.1250) adds three zero-parameter techniques — Partial RoPE (16 of 64 head dims), LN Scale (RMSNorm output scaled by 1/√(layer+1)), and Late QAT (STE int6 enabled only in final 4% of training) — on top of #287's XSA+EMA base, gaining 0.0023 BPB.
The key insight at Tier 4: EMA (decay=0.997) outperforms SWA on the XSA stack, while SWA outperforms EMA on the #198 base. Simply swapping components from Tier 3 recipes doesn't work; technique interactions require system-level understanding. #315 demonstrates that the XSA+EMA base still had headroom extractable via careful regularization — Partial RoPE improves generalization by making some head dimensions position-free, LN Scale stabilizes depth by damping deeper layers, and Late QAT delays quantization noise until the model has converged.
What to do if you're here: Naive SGD TTT is a dead end on well-tuned XSA+EMA bases (#303: +0.016 worse on #287). The remaining high-EV paths are: (1) Reptile meta-TTT (NOT naive TTT) on #315's base — Reptile modifies training so it may avoid EMA disruption; (2) non-TTT innovations: Mousse optimizer, PolyCom activations, Late-Stage SAM, entropy-coded weights (see Tier 2); (3) batch size 524K with FA3 — #307 tested 524K on XSA+EMA without FA3 (1.1357), but the FA3 confound makes it hard to isolate batch-size effect. The new question: do #315's three techniques stack with Reptile, or do they already fill the same regularization role?
Technique Interactions Matter More Than Technique Count
A recurring pattern: techniques that work independently can fail in combination. TTT+XSA actively hurts (#303: +0.016 worse), EMA fails without XSA (#201) but succeeds with it (#287), and 12L fails at seq2048 but works at seq1024 (#219 vs #76). The frontier submissions (#287, #315) add techniques that address specific failure modes of their base, rather than stacking unrelated improvements — but this principle applies broadly: @unnir's XSA addresses self-value bias, @saml212's batch tuning addresses gradient noise, @sseanliu's Reptile addresses TTT initialization quality, and @jfprincz's #315 trio targets position over-reliance, depth noise, and quant gap respectively.
The untried combinations above should be evaluated against your specific model's weaknesses, not applied blindly. The strongest remaining candidates (Reptile meta-TTT, OptRot, sequence curriculum) are the ones with clear mechanistic hypotheses about what they fix.
Val-Data Approaches — Ruled Out for Record Track
Organizer ruling (Mar 20): Val tokens cannot be used in the artifact — train and load as if you have no access to val. Organizer @0hq has directed #168 to move to non-record, where it will be accepted.
This rules out paid prefix (#168), error correction tables (#108), and val-only training for the record leaderboard.
TTT ruling (Mar 20, @0hq on #152): "You can't train on the validation tokens before you evaluate on those same tokens. It doesn't matter if you causal mask." Only backward-looking TTT is allowed — you may adapt on val tokens already graded, not future tokens. Pre-eval adaptation (train N epochs on val, then evaluate) is invalid. This affects #254-style full-model SGD TTT and similar approaches that adapt before scoring. Causal TTT (#267-style: evaluate each chunk → adapt on scored tokens → evaluate next chunk) remains allowed.
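The permitted backward-looking loop (evaluate a chunk with frozen weights first, only then adapt on it) can be sketched as a minimal, hypothetical harness — the model, optimizer, chunking, and loss handling here are stand-ins, not the competition's eval code:

```python
import torch

def causal_ttt_eval(model, optimizer, token_chunks, loss_fn):
    """Backward-looking TTT sketch: each chunk is scored with frozen
    weights FIRST, and only then used for adaptation."""
    total_loss, total_tokens = 0.0, 0
    for chunk in token_chunks:            # chunk: LongTensor [1, T+1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        # 1) Evaluate the chunk before any adaptation (this is what gets graded).
        model.eval()
        with torch.no_grad():
            logits = model(inputs)        # [1, T, vocab]
            total_loss += loss_fn(logits.transpose(1, 2), targets).item() * targets.numel()
            total_tokens += targets.numel()
        # 2) Adapt on the now-scored tokens only (allowed: backward-looking).
        model.train()
        optimizer.zero_grad()
        loss_fn(model(inputs).transpose(1, 2), targets).backward()
        optimizer.step()
    return total_loss / total_tokens      # mean nats/token over the stream
```

Adapting before step 1 — or on tokens from a later chunk — is exactly what the ruling forbids.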
Technique Deep Dives
The Muon Optimizer Family
Muon (MomentUm Orthogonalized by Newton-Schulz) is the optimizer at the heart of this competition's baseline, created by Keller Jordan for the NanoGPT speedrun. It runs standard SGD with Nesterov momentum, then post-processes each 2D parameter's update by replacing it with the nearest semi-orthogonal matrix via Newton-Schulz iteration. Intuitively: compute the gradient direction, then "clean it up" so the update is maximally informative without redundant directions. It's equivalent to steepest descent under the spectral norm, which improves the conditioning of the optimization landscape. ~35% faster training than AdamW on language models.
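The orthogonalization step can be sketched as follows, using the quintic Newton-Schulz coefficients popularized by modded-nanogpt; the full optimizer also handles momentum, per-layer scaling, and distributed sharding, all omitted here:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate the nearest semi-orthogonal matrix to G with a quintic
    Newton-Schulz iteration. The coefficients trade exact convergence for
    speed: singular values land in a band around 1, not exactly at 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # scale so singular values are <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:                        # iterate on the short side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The iteration uses only matmuls, so it runs entirely in low precision on tensor cores — part of why Muon is cheap despite the extra work per step.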
NorMuon extends Muon by adding per-neuron adaptive learning rates from accumulated second-order statistics. Vanilla Muon can produce updates with highly non-uniform norms across neurons, causing some neurons to dominate training. NorMuon normalizes row-wise after orthogonalization, combining Muon's conditioning benefits with Adam-style balanced per-neuron learning. It also improves distributed scaling by avoiding full momentum gathering across GPUs. Used by @mtybadger (#122), @vmfunc (#89), @abhishekgahlot2 (#137), and others.
Muon Weight Decay — The competition baseline's Muon optimizer has no weight decay. Decoupled weight decay for Muon (p.mul_(1 - wd * lr)) existed in modded-nanogpt since Nov 2025, but wasn't in the baseline. @notapplica was the first to bring it into this competition in #60, improving BPB from 1.2160 to 1.2094. Weights stay smaller and better-distributed, improving both generalization and compressibility.
Quantization-Aware Training (QAT) with STE
Instead of training in full precision and quantizing afterward, QAT simulates quantization during training. In the forward pass, weights are rounded to their quantized values. The problem: rounding is non-differentiable, so gradients can't flow through it.
The Straight-Through Estimator (STE) solves this by pretending the rounding operation is the identity function during the backward pass. It's mathematically "wrong" but works remarkably well — the model learns weight configurations that are robust to precision loss because it's been "seeing" quantized weights throughout training.
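A minimal STE fake-quantizer, assuming symmetric per-tensor int-N with an abs-max scale (submissions vary in scale choice and granularity):

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Fake-quantize to signed int-N in the forward pass; pretend the
    rounding was the identity in the backward pass (straight-through)."""
    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1                    # e.g. 31 for int6
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                      # gradient passes straight through

def fake_quant(w, bits=6):
    return STEQuantize.apply(w, bits)
```

The forward output is exactly what the deployed int-N weights will reconstruct, so the loss the model optimizes already includes the quantization error.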
QAT timing matters: @trovatochris (#117) activates QAT at 70% of training (weight-snapping). @mohosy (#130) notes that standard STE can corrupt Muon's momentum subspace — the rounding noise gets amplified by orthogonalization rather than averaged out (unlike Adam). Their solution: activate QAT late (at 75% of training) and reduce learning rate by 50% when it kicks in. Meanwhile, @yahya010 (#63) reports zero quantization gap with full STE int6 QAT — the most aggressive quantization-aware training seen so far.
Late QAT outperforms full-training QAT: activation timing matters — the later, the better. @trovatochris (#117) activates at 70%, @mohosy (#130) at 75%, @unixmadtoonslab (#76) at 85%. #76 even dropped QAT entirely at 12L (1.1468), finding WD=0.04 alone sufficient. @jfprincz's #315 pushes this to the extreme: STE activates only in the final 4% of training (lr_scale < 0.1, during low-LR warmdown). This cuts the int6 roundtrip gap to ~0.007 BPB while preserving full-precision convergence. The lesson: QAT activation is a spectrum — later means cleaner convergence and a smaller int6 gap.
Int8 vs int6 QAT tradeoff: @mrdavtan's ablation in #145 shows that int8 QAT is not worth it under the 10-min wallclock cap. The torch.quantile call for exact percentile matching adds ~20% per-step overhead (64ms → 77ms), costing ~2,000 training steps. Result: 1.2052 BPB with QAT vs 1.1925 without — the lost training tokens hurt more than closing the ~0.007 int8 quantization gap. Int6 QAT, however, likely pays off because its larger ~0.01+ BPB gap justifies the overhead — confirmed by #128 and #137.
SmearGate & Bigram Hash Embedding
@unnir introduced SmearGate in #102 and refined it in #135. This appears to be a novel technique for this competition — no published papers found.
SmearGate: A tiny learned gate (~512 params) that blends each token's embedding with the previous token's. This injects bigram (two-token) context directly into the embedding layer before the transformer starts processing. Normally a transformer must discover token pair relationships through self-attention; SmearGate provides this signal for free.
Bigram Hash: A hash table (commonly 2048-10240 buckets, dim=128, projected to 512) that maps token pairs to learned embeddings. Together with SmearGate, this gives the model token-pair awareness at nearly zero parameter cost.
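A sketch of how the two pieces might compose, using the dimensions given in the text. The hash function and the gate initialization are illustrative guesses, not @unnir's implementation:

```python
import torch
import torch.nn as nn

class SmearGateBigram(nn.Module):
    """Sketch: a learned per-dim gate blends each embedding with the
    previous token's (SmearGate), and a hashed bigram table adds a
    token-pair feature (BigramHash)."""
    def __init__(self, vocab=1024, dim=512, buckets=2048, bigram_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.gate = nn.Parameter(torch.zeros(dim))          # ~512 params
        self.bigram = nn.Embedding(buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, dim, bias=False)  # 128 -> 512
        self.buckets = buckets

    def forward(self, tokens):                              # [B, T] longs
        x = self.embed(tokens)
        prev = torch.roll(x, 1, dims=1)
        prev[:, 0] = 0                                      # no previous token at t=0
        g = torch.sigmoid(self.gate)
        x = (1 - g) * x + g * prev                          # smear in previous token
        prev_tok = torch.roll(tokens, 1, dims=1)
        prev_tok[:, 0] = 0
        h = (prev_tok * 31 + tokens) % self.buckets         # cheap pair hash (assumed)
        return x + self.proj(self.bigram(h))
```

The gate costs ~512 params and the hash table dominates the budget — which is why submissions sweep bucket counts (2048 up to 16384) rather than the gate.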
@unnir's original combination with orthogonal initialization achieved 1.1539 BPB in #135. @jfprincz's #198 (1.1326) extended this with 11L + SWA + FA3 + WD 0.04, and #287 (1.1280) extended further with XSA + EMA.
OrthoInit appears critical for SmearGate. @mrdavtan's ablation in #212 found that adding SmearGate + BigramHash without OrthoInit hurt BPB (1.1739 vs 1.1708 without). Every successful SmearGate submission uses OrthoInit — the two techniques may be co-dependent.
Exclusive Self-Attention (XSA)
XSA (arXiv:2603.09078, Shuangfei Zhai, 2026) removes self-value bias from attention output via orthogonal projection. In standard attention, each token's value vector contributes to its own output — XSA subtracts this self-component, forcing the model to rely on information from other tokens. Applied to the last 3-4 layers only ("Partial XSA"), where self-attention bias is highest.
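One plausible reading of the mechanism — remove from each token's attention output the component along its own value direction — can be sketched as below. This is a hypothetical reconstruction from the description; the paper's exact projection may differ:

```python
import torch

def exclusive_self_attention(q, k, v, eps=1e-8):
    """Sketch of XSA for one head: standard causal attention, then
    project each token's output orthogonally to its own value vector."""
    T, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), 1)
    out = scores.masked_fill(mask, float("-inf")).softmax(-1) @ v
    # remove the self-value component: out_i -= (out_i . v_hat_i) v_hat_i
    v_hat = v / (v.norm(dim=-1, keepdim=True) + eps)
    return out - (out * v_hat).sum(-1, keepdim=True) * v_hat
```

After the projection, each output is orthogonal to that token's own value vector, so whatever survives must have come from other tokens — the stated goal of removing self-value bias.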
Zero parameters, minimal overhead. @unnir's #265 GQA-aware implementation reduces XSA overhead from ~7ms/step to ~2ms/step using free reshape + broadcast (no repeat_interleave). Combined with EMA (decay=0.997), XSA on the last 4 layers produced #287 at 1.1280 BPB — the base of the current frontier. XSA on 3 layers (#265) reaches 1.1307 — more layers appear to help.
XSA is now near-universal among frontier submissions (#265, #287, #290, #277, #291).
Test-Time Training (TTT)
@samacqua introduced a creative approach in #77: adapting the model during evaluation.
For each validation document, rank-8 LoRA (Low-Rank Adaptation) adapters are trained on the document's own text using only backward-looking context (no data leakage). The model essentially "studies" each document briefly before being scored on it. LoRA makes this practical by only training tiny low-rank matrices (~1.5% of params) rather than the full model, enabling batched per-document adaptation within the eval time budget.
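A minimal rank-r adapter of the kind described — only `A` and `B` receive gradients, and `B` starts at zero so the adapter is a no-op before adaptation. This is a generic LoRA sketch, not #77's code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Rank-r adapter on a frozen linear: y = W x + (alpha/r) * B A x.
    Only A and B train during TTT, so per-document adaptation is cheap."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start as no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because the adapter state is just two small matrices per layer, resetting between documents (to prevent cross-document leakage) is a cheap copy rather than a full model reload.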
Measured impact from ablation in #77:
| Condition | val_bpb | Delta |
|---|---|---|
| Baseline (cross-doc, flat stream) | 1.2278 | — |
| + Document isolation | 1.2168 | -0.0110 |
| + Sliding window (chunk=256) | 1.1941 | -0.0337 |
| + LoRA TTT | 1.1910 | -0.0368 |
Most of the gain comes from document isolation and sliding window; LoRA TTT itself contributes a further ~0.003 BPB — but this used only ~1/10th of the eval budget.
Full-model SGD TTT: @timowhite88 (#152) showed 0.034 BPB gain from full-model SGD over the val set.
TTT + int6 stack confirmed (without SmearGate): @polarizedfortnite-cpu (#81) combined LoRA TTT with the int6+MLP3x+SwiGLU stack, getting 0.033 BPB from TTT (1.2004 → 1.1670). Nearly identical to @timowhite88's 0.034 on the baseline.
TTT on XSA+EMA is a spectrum, not a binary. On SmearGate bases: #254 shows 0.014 BPB gain. Three XSA+EMA data points, sorted by base strength: (1) #317 (weak base, pre-quant 1.1581, no FA3): TTT gains 0.024 BPB. (2) #338 (@alertcat, #315 base — frontier at 1.1250, Partial RoPE + LN Scale + Late QAT): TTT neutral ±0.001 (3 seeds). (3) #303 (@sseanliu, #287 base — 1.1280, without #315's additional regularization): TTT +0.016 BPB worse. The pattern suggests TTT interacts with how tightly converged the base model is: under-trained bases benefit from local adaptation; over-regularized frontier bases are disrupted; the current frontier (#315) sits in a neutral zone. #338's neutral result is informative — it means TTT is not a meaningful lever at the frontier.
Reptile meta-TTT partially overcomes SmearGate redundancy. @sseanliu's #296 shows 0.011 BPB on SmearGate models vs 0.001 naive. Reptile runs during the last 20% of training, meta-learning inner-loop directions. Caveat: gain may partly come from extra training steps. Whether Reptile overcomes the XSA+EMA incompatibility (#303) is the key open question. Also: error-guided TTT is a negative result — hardest tokens are genuinely unpredictable.
#315's Three New Techniques: Partial RoPE, LN Scale, Late QAT
@jfprincz's #315 (1.1250, new best pending) adds three zero-parameter techniques on top of #287's XSA+EMA base, together gaining 0.0023 BPB. Each is independently motivated and individually small, but they compound cleanly.
Partial RoPE (16 of 64 head dimensions). Rotary Position Embedding (RoPE) injects position information by rotating query/key vectors. Standard RoPE applies to all head dimensions. Partial RoPE applies to only 25% (16 of 64 dims) — the remaining 48 dims attend without position encoding. Why this helps: the position-free dims learn semantic similarity independent of token distance, improving generalization across different position ranges. The model can learn both "what things are" (position-free) and "where things are" (position-encoded) using different parts of the same head. Zero new parameters.
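A sketch of the idea, rotating only the first 16 of 64 head dims and passing the rest through untouched (the dim pairing convention and frequency base here are assumptions):

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first rot_dims of the head dim only;
    the remaining dims carry no position signal."""
    T = x.shape[-2]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)     # [half]
    ang = torch.arange(T, dtype=x.dtype)[:, None] * freqs[None, :]  # [T, half]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```

The rotation preserves vector norms, so the change only affects how query-key dot products depend on relative position — the position-free dims compare content directly.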
LN Scale (output scaled by 1/√(layer_idx+1)). After each RMSNorm, the output is multiplied by a layer-dependent scale factor that shrinks with depth. Layer 0: ×1.0; Layer 5: ×0.408; Layer 10: ×0.302. This damps the contribution of deeper layers to the residual stream, preventing later layers from "overwriting" early representations. Training is more stable — the model can use depth incrementally rather than being forced to route everything through deep layers. The 1/√(layer+1) schedule is related to the "depth scaling" used in some architecture papers. Zero new parameters.
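A minimal sketch, assuming the scale multiplies the norm's output directly:

```python
import torch
import torch.nn as nn

class ScaledRMSNorm(nn.Module):
    """RMSNorm whose output is damped by 1/sqrt(layer_idx + 1), so deeper
    layers write progressively smaller updates to the residual stream."""
    def __init__(self, dim, layer_idx, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.scale = (layer_idx + 1) ** -0.5   # layer 0: 1.0, layer 10: ~0.302
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.scale * self.weight * x * rms
```

Since the scale is a fixed constant per layer, this costs zero parameters and zero extra multiplies beyond folding the constant into `weight`.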
Late QAT (STE enabled only when lr_scale < 0.1, i.e., final 4% of steps). Standard QAT runs STE fake-quantization throughout training, which corrupts Muon's momentum subspace — rounding noise gets amplified by Newton-Schulz orthogonalization. Earlier experiments (PR #76, #117, #130) found late QAT activation at 70-85% helps. #315 pushes this to 96% — STE activates only during the final warmdown when learning rates are at their lowest and the weight landscape is nearly settled. The model converges in full precision, then the final 4% gently adapts it to quantization. Result: the int6 roundtrip gap falls from ~0.04 BPB (no QAT) to ~0.007 BPB.
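The gating logic is simple to sketch. The schedule below is an assumption (constant LR, then linear warmdown over the last 40% of steps), chosen so that lr_scale < 0.1 covers the final ~4% of training:

```python
def lr_scale(step, total_steps, warmdown_frac=0.6):
    """Sketch schedule: constant LR, then linear warmdown to zero.
    The warmdown fraction is an assumed value, not #315's."""
    start = int(total_steps * warmdown_frac)
    if step < start:
        return 1.0
    return (total_steps - step) / (total_steps - start)

def qat_active(step, total_steps, threshold=0.1):
    # STE fake-quant runs only once the warmdown has decayed LR below 10%
    return lr_scale(step, total_steps) < threshold
```

With these numbers, STE switches on for roughly the last 4% of steps — the model trains full-precision until the weights have nearly settled.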
The three techniques together gain 0.0023 BPB vs #287 — modest but statistically clear (3-seed variance 0.0005 BPB, t-stat -101.9 vs SOTA, p << 0.01). Each technique likely contributes only a small slice of that gain on its own; their joint effect is additive.
Archived: BitNet, Depth Recurrence, MTP
BitNet b1.58 (#139): 64.5M ternary params, 1.2029 BPB — 0.07 behind SOTA. Depth Recurrence (#103): 5 blocks looped to 30 virtual layers with LoRA adapters — not competitive. MTP (#88): no BPB improvement per #212's ablation. Overtone Spectral Init (#60): power-law SVD shaping of embeddings + phase-transition residual mixing. Foundational early technique (1.1748 BPB), replicated by @TevBenji (#69). Superseded by the current stack.
Low-Rank Q Factorization
@JayCheng113's #215 discovered that in trained transformers, the Q (query) projection matrices have extreme condition numbers (100M+) — meaning they naturally operate in a low-dimensional subspace. By factoring Q from a full 512×512 matrix (262K params/layer) into two smaller matrices 512→192→512 (196K params/layer), the PR saves 25% of Q params per layer. Critically, K (condition 19-29), V (5-8), and O (531-2620) are all full-rank — Q is the only attention matrix where low-rank factorization works.
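The factorization itself is two bias-free linears; a sketch reproducing the parameter counts from the text:

```python
import torch
import torch.nn as nn

class LowRankQ(nn.Module):
    """Factor the 512x512 query projection into 512->192->512 (no bias),
    cutting Q params from 262K to ~197K per layer."""
    def __init__(self, dim=512, rank=192):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # 512 x 192
        self.up = nn.Linear(rank, dim, bias=False)     # 192 x 512

    def forward(self, x):
        return self.up(self.down(x))
```

The product `up @ down` is at most rank 192, which is exactly why this only works where the trained matrix already lives in a low-dimensional subspace — Q, per the condition-number analysis, but not K, V, or O.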
The speedup is the bigger win: step time drops from 108ms to 77ms (a ~29% reduction), giving ~40% more training steps in the same 600s budget. Combined with 11L, int6, and Late-K passthrough, the result is 1.1558 BPB (3-seed mean).
The PR also documents three negative results: (1) content-dependent pre-rotation improved per-step quality but torch.compile added 9s overhead that negated the gain; (2) Legendre resid_mix initialization had no effect — the optimizer converges scalar resid_mix params in ~200 steps regardless; (3) depth-attention residual (inspired by Moonshot AI) was counterproductive — softmax can't express the negative weights that resid_mix needs.
Selective Precision: Late-K Passthrough
@takhir-iota's approach in #99 highlights an underappreciated nuance: not all weight matrices are equally sensitive to quantization. Their "Late-K Passthrough" keeps the key projection weights (c_k) of the final two transformer layers in fp16 while quantizing everything else to int6. Why the K matrices specifically? In attention, the key and query dot product determines which tokens attend to which — small errors in late-layer keys can cascade into large attention pattern changes. Early layers are more robust because their errors get corrected by subsequent layers.
This is part of a broader selective precision trend: @nanlliu (#39) started it by keeping first/last layers in int8 while middle layers use int6. @rsavitt (#128) and others keep the tied embedding in fp16. The principle is the same — spend your precision budget where quantization error hurts most.
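The selective-precision idea can be sketched as a post-hoc pass over a state dict. Parameter names here are hypothetical stand-ins, not the competition repo's:

```python
import torch

def selective_quantize(state_dict, n_layers=9, bits=6, keep_fp16=("embed",)):
    """Sketch of Late-K Passthrough: int-N fake-quantize every weight
    except the last two layers' key projections (and any name matching
    keep_fp16), which stay fp16. Names are assumed, not the repo's."""
    qmax = 2 ** (bits - 1) - 1
    late_k = {f"blocks.{i}.attn.c_k.weight" for i in (n_layers - 2, n_layers - 1)}
    out = {}
    for name, w in state_dict.items():
        if name in late_k or any(s in name for s in keep_fp16):
            out[name] = w.half()                 # precision-critical: keep fp16
        else:
            scale = w.abs().max().clamp(min=1e-8) / qmax
            out[name] = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return out
```

The fp16 tensors cost 2 bytes/param instead of ~0.75, so the passthrough set has to stay small — two layers' worth of K weights is cheap insurance against cascading attention errors.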
Paid Prefix: Ruled Out (Historical)
@spokane-way's #168 stored 8.75MB of compressed validation tokens in the artifact for zero-cost positions, reaching 1.0238 BPB. @ibarrajo's #262 tried a rebalanced version (smaller prefix + stronger model) at 1.0539 BPB — showing that prefix coverage mattered more than model quality.
Ruled out for record track Mar 20: Val tokens cannot be used in the artifact. May be accepted as non-record. Pre-eval TTT on val data also ruled out (#152). Only backward-looking TTT (evaluate → adapt on scored tokens) is allowed.
Notable Non-Record Submissions
| Author | PR | Highlight |
|---|---|---|
| @mohosy | #130 | 7 toggleable improvements; QAT + Muon momentum analysis |
| @gwelinder | #104 | Hyperparam tuning on RTX 5090 (consumer hardware) |
| @MatoTeziTanka | #95 | PROTEUS EMA — reduces int8 quant loss 0.0072→0.0048 |
| @aamodbhatt | #93 #94 #111 | Systematic arch/compute exploration |
| @ibarrajo | #136 | Clean seq2048 baseline (1.2101, -0.014 vs baseline) |
| @matt-wright86 | #127 | Depth recurrence 3×3, dim 768, ALBERT embeddings, NTK RoPE |
| @nglain | #141 | 33-experiment sweep; found int6 STE + Muon conflict (+0.007) |
| @stukenov | #118 | RTX 4090 compat smoke test |
| @SoumilRathi | #133 | MLX heavy-share research harness |
| @kellyvv | #108/#232 | Error Correction Table — stores model's worst predictions, ~1.05 est. on 8xH100 |
| @mrdavtan | #145 | Int8 QAT ablation — overhead exceeds recovery |
| @timothywangdev | #220 | [WIP] First SSM (Linear Recurrent Unit) — non-transformer architecture |
| @alons23 | #216 | Ternary Universal Transformer — 68M params, 4×6 depth recurrence |
| @Cwarren15-A | #283 | PPM-C context mixer — classical compression blended with neural (0.015 BPB on baseline) |
| @sseanliu | #296 | Reptile meta-TTT — 0.011 BPB gain on SmearGate models (10x naive TTT). Error-guided TTT negative. |
| @integrate-your-mind | #289 | 11L seq1024 + U-Net skips (1.1518). TTT LoRA worse than sliding window alone on this base. |
| @gowtham0992 | #295 | Backout (learned residual subtraction) + mixed int5/int6 QAT + U-Net skips (1.1477, 1 seed) |
| @JackYoung27 | #302 | Online causal TTT + decay prior + Reptile + XSA + Pre-Q/K RMSNorm (1.1520, 1 seed) |
| @sseanliu | #303 | TTT on XSA+EMA = +0.016 worse — definitive negative interaction study (1.1436, 2 seeds) |
| @xuafeng | #306 | QAT Int5/Int6 on #180 base: post-training quant outperforms QAT by ~0.002 BPB — quant noise acts as beneficial regularization that QAT removes (1.14476, 1 seed) |
| @NewyorkDev | #309 | CLASE-Quant adaptive per-layer quantization: int8 for boundary layers, int6 for middle — saves ~15% vs uniform int8 (1.1914, 3 seeds) |
| @vishesh9131 | #310 | 10L seq2048 + TTT LoRA rank-8 on Q/V + warmdown quant (WD=15000) + Overtone spectral init. Pre-quant val_bpb 1.1787 (1 seed, int8+zlib) — above SOTA on artifact size alone |
| @small-cactus | #311 | [WIP] Late-attention internal control on #180 stack — gate-controlled attention scaling + EMA-zscore energy normalization. No results yet |
| @my-sonicase | #313 | Schedule tuning only (warmdown 3600, LR 0.06) on 1xA40 — 1.723 BPB, 8.4MB. Baseline-class, but a clean reference for how far schedule-only tuning reaches |
| @bjbjbjbjbjbj | #346 | Local single-GPU baseline reproduction (grad_accum=8, 1 shard). 1.3529 BPB at step 4200 (plateau). Honest baseline-class reference on constrained hardware. |
| @chanwoo-park-official | #312 | Canon ACD layers (Allen-Zhu 2025) on 9L stack — learnable 1D conv (k=3) placed before attention, before MLP, and in MLP hidden stream (avoids QKV=B for cost). 1.1668, 1 seed. Novel architecture technique; interesting if it scales to 11L. |
| @SkywardSyntax | #316 | 12L Low-Rank Q (r=128) + QAT int7 on 1xH100 (pre-quant 1.2035, awaiting 8xH100). Key negative result: FTLE per-row precision is a dead end — uniform int-N beats mixed-row at every bit width due to higher entropy defeating zstd. Layer sharing also abandoned at 512d (costs 0.09 BPB, no space benefit). |
| @aravhawk | #314 | 11L Int4 MLP QAT on #180 base — int4 MLP saves ~2MB to fund 11th layer vs #180's 10L int5. Awaiting 8xH100 results. Record track aspirant. |
| @Rhodrium | #331 | 10L MLP3x + BigramHash(2048) + SmearGate + OrthoInit + mixed int5/int6 + SWA + stride=32 eval. 1.1487 BPB, 3 seeds. Solid consensus stack; above SOTA but clean stride-32 reference on H100s (94/91ms/step). |
| @sheeki03 | #339 | Backout ablation: -0.0071 BPB on #198 base (1.1435→1.1364). First clean measurement. |
| @bopmite | #330 | 11L Int6 + Online Logit Bias — 1.1609. Novel technique; sparse description. |
| @Ananddna | #327 | TrigramHash (8192 buckets) + Partial RoPE (50%) + per-head temperature scaling + stride=32 eval. 1.1450, 2 seeds. Three novel techniques on 10L int5 base. |
| @kingjulio8238 | #328 | MLX prototyping harness (Apple Silicon, 1.9588). 25+ experiments. Key finding: FP16 embed + Muon WD → near-zero quant gap. Negatives: DenseFormer DWA +0.003, NTK-RoPE extrap +0.06, depth recurrence+int6 catastrophic. |
| @lee101 | #329 | LZMA6 compression (vs zstd-22) + INT8, 1.1721, 1 seed. Alternative compression path. |
| @mahsumaktas | #333 | 23-run systematic exploration (1.1565, 3 seeds). Key findings: seq curriculum fails (SWA incompatible across seq lengths), EMA causes 0.14 BPB quant gap on SWA-stack, MLP 2.75x sweet spot at 11L+SmearGate, Late QAT 75% cuts quant gap 0.023→0.006. |
| @sseanliu | #318 | Neural Cache research proposal — maintain per-layer KV cache across sliding windows, extending effective context from 2K to 50K+. Zero artifact cost, backward-looking compliant. Untested (torch.compile state bug). Proposed on #287 base (1.1284). |
| @Arth-Singh | #319 | Depth recurrence 5×3 (5 unique layers, 3 loops), dim=640, 6xH200. 1.2716 BPB at 4500 steps (early stop). Key negatives: loop gates initialized at 1/N pull representation ~67% back toward input, effectively giving 1.3 loops of compute instead of 3; removing U-Net skips hurts. Width (640) does not compensate for loss of layer diversity vs 9 unique layers at 512. |
| @andrewgcodes | #321 | Optimizer tuning (warmdown=10K, Muon backend steps=10, grad_clip=1.0, beta2=0.99) + sliding window on stock 9L int8+zlib baseline. 3-seed mean 1.1864 BPB (p < 0.01 vs baseline 1.2244). Clean reference: how far optimizer-only tuning reaches without architecture or quantization changes. Above official leaderboard SOTA (1.1428). |
| @megnat05-tmm | #320/#323 | Minimal recurrent motif (shared_block_size=1, recurrence_steps=2, gate_init=0.18). val_bpb=2.7873 (~1.92MB artifact). Very early-stage exploration of structural reuse via compact shared operator. |
| @Aum08Desai | #325 | #315-derived Looped Transformer — shared recurrent core with Partial RoPE, LN Scale, Late QAT, XSA4, Bigram features. 1.1462 BPB, untuned. Architectural reference for weight-shared loops. |
| @tobiascanavesi | #341 | Hybrid Depth-Recurrent: unique entry/exit layers + 4 shared×5 loops = 22 virtual layers from 6 weight blocks. Quant gap reduced from 0.40 to ~0 BPB. 1.3323 preliminary (2xH100, 954 steps). 8xH100 pending. |
| @aryanbhosale | #344 | Autoresearch on 1xH100. Negatives: SwiGLU worse, step-based LR −0.483 vs wallclock, LoRA TTT −0.09, block weight sharing 2x slower. |
| @fbedev | #348 | QAT + BigramHash(12288) + stride=32 on #180 base. 1.1444, 1 seed. Barely above SOTA — diminishing returns from BigramHash >10240. |
| @sp00mm | #352 | Memory Tokens: 64 learnable embeddings as global context scratchpad. A/B: -0.014 BPB. Uses #315 stack + MTP aux heads. 1.1659, 1 seed. |
| @anandks2006 | #345 | DART: First Differential Attention in competition + shared-weight recurrence. 1.852 BPB (Colab T4, 2000 steps). Proof-of-concept only. Student submission from Kerala. |
| @nathon-lee | #334 | 11L PartialRoPE + LNScale + SmearGate + BigramHash + U-Net skips + EMA+SWA + TTT on 1xH100 (80 min). val_bpb 1.2108. |
| @Skrisps26 | #354 | MLA (Multi-Head Latent Attention) negative result. kv_rank=128 MLA on 13L + SmearGate + BigramHash stack. Pre-quant 1.2838 — not competitive. Root cause: 83ms/step vs ~43ms baseline, halving token throughput (~3.7B vs ~7.2B tokens). Architecture quality is plausible; throughput cost makes it infeasible in the 600s budget. Clean negative on MLA in this setting. |
| @jackopenn | #336 | Hypernetwork prototype — shared-trunk MLP generates full GPT weights from compact conditioning vectors (9.34x compression, 26.5M target params from 2.8M hypernet params, 2.09MB artifact). No BPB result yet. Highest compression-ratio weight-generation approach seen. |
| (pending) | #355 | BigramHash(4096) + MLP992 + LR=0.08 + sliding window stride=64 on stock baseline. Pre-quant 1.1913, int8+zlib artifact 16.17MB (over limit). Non-record baseline-class exploration of LR and BigramHash scaling. |
| (pending) | #347 | [WIP] Two experiments: (1) LongContext 4096 — seq=4096 training + all SOTA techniques (10L, Muon WD, FP16 embed, NTK-RoPE base=40K) + sliding window stride=256; 4K training seen before any of sliding window, Muon WD, or spectral init existed. (2) QAT Int4→16L — nibble-pack int4 (2 weights/byte) fits 16L in 16MB vs SOTA's 10L (+60% params). Expected: ~1.14–1.16 BPB each. Full H100 results pending compute grant. |
Idea Lineage & Diffusion
Tracking how key techniques originated and spread through the competition:
| Technique | First Appeared | Originator | Adoption |
|---|---|---|---|
| Sliding Window Eval | #50 | @mattqlf | Near-universal (20+) |
| FP16 Tied Embedding | #42 | @chonchiog | ~10+ |
| Int6 Quantization | #39 | @nanlliu | ~15+ |
| MLP 3x Expansion | #70 | @jfprincz | ~12+ |
| Muon Weight Decay | #60 | @notapplica (from modded-nanogpt) | Several |
| Overtone Spectral Init | #60 | @notapplica | @peytontolbert (#155), @TevBenji (#69) |
| SmearGate / BigramHash | #102 | @unnir | @aquariouseworkman, @raahilshah, @jfprincz, @baudrillardsgh0st, @Julz19, @timowhite88, @thwu1, @mahsumaktas, @newjordan, @alertcat, @ajkpersonal, @unixmadtoonslab, @dexhunter, @MatthewHRockwell, @saml212, @andrewgcodes, @charmquark1984 |
| OrthoInit | #135 | @unnir (combined with SmearGate) | Near-universal among top SmearGate submissions. Critical co-dependency: SmearGate hurts without OrthoInit (#212 ablation). |
| Test-Time Training | #77 | @samacqua (LoRA TTT) | @timowhite88 (#152 SGD, #254 first TTT+SmearGate+11L), @polarizedfortnite-cpu (#81, first TTT+int6), @andrewgcodes (#267 Causal TTT), @charmquark1984 (#281), @ibarrajo (#290, TTT+XSA), @mohosy (#291, pending), @sseanliu (#296, Reptile meta-TTT), @davidpuertolas (#297), @alertcat (#338, TTT on #315 frontier base — neutral) |
| NorMuon | Multiple PRs | Convergent | @mtybadger, @vmfunc, @dexhunter, others |
| QAT with STE | Multiple PRs | Convergent | @rsavitt, @yahya010, @trovatochris, others |
| SWA | #89 | @vmfunc | @mtybadger (#122), @dexhunter (#156), others |
| Depth Recurrence | Multiple PRs | Independent | @MatthewHRockwell, @koushikkethamakka, @iverbovoy (#148), others |
| Int5 MLP Quantization | #76 | @unixmadtoonslab | @thwu1 (#180, first validated + official SOTA), @alertcat (#219, mixed int5/int6) |
| BigramHash Scaling (10240→16384) | #180 | @thwu1 | @andrewgcodes (#267, 16384), @gowtham0992 (#295, 10240) |
| Low-Rank Q Factorization | #215 | @JayCheng113 | Novel — no adopters yet |
| Partial XSA (Exclusive Self-Attention) | #265 | @unnir | @jfprincz (#287, 4-layer variant — new best), @ibarrajo (#290, +TTT), @mohosy (#277/#291, pending compute), @dennisimoo (#307, 524K batch variant), @saml212 (#332, 12L) |
| EMA Weight Averaging | #95 | @MatoTeziTanka (PROTEUS EMA) | @machdragon (#201, negative w/o XSA), @jfprincz (#287, successful with XSA), @mohosy (#291, pending), @dennisimoo (#307, with XSA, no FA3), @saml212 (#332, 12L) |
| Reptile Meta-TTT | #296 | @sseanliu | @JackYoung27 (#302, +causal TTT + decay prior) |
| BitNet b1.58 | #126, #139 | @Athenox14, @ksang123 | Two independent |
| Partial RoPE | #315 | @jfprincz (25% dims) | @Ananddna (#327, 50% dims), @saml212 (#332, 12L) |
| LN Scale (1/√layer) | #315 | @jfprincz | @saml212 (#332, 12L) |
| Late QAT (last 4% only) | #315 | @jfprincz | Novel — no adopters yet (dropped at 12L due to step overhead) |
| Gradient-Guided Quant | #332 | @saml212 | Novel — per-tensor int7/6/5 based on gradient magnitude |
| TrigramHash | #327 | @Ananddna | Novel — extends BigramHash to token triplets (8192 buckets) |
| Per-Head Temperature | #327 | @Ananddna | Novel — each head learns its own temperature scalar |
Predictions & Commentary
- TTT on XSA+EMA: a spectrum, not a binary (REVISED). Three data points now: #303 tested SGD TTT on #287's optimized XSA+EMA base: +0.016 BPB worse (1.1280 → 1.1436, 2 seeds). #317 (@chris-buckley) applied TTT to a weaker XSA+EMA base (pre-quant 1.1581, no FA3): −0.024 BPB (helps). #338 (@alertcat) applied SGD TTT to #315's base (the current frontier at 1.1250, stronger regularization): neutral (±0.001, 3 seeds). The pattern is cleaner than expected — TTT tracks base quality: weak base (helps), frontier-minus-one base (disrupts), frontier base (neutral). Mechanistically: #315's Partial RoPE + LN Scale + Late QAT may produce a weight landscape that's neither as exploitable as #317's nor as disrupted as #287's — the model converges differently and TTT has no net edge. Reptile meta-TTT on #315's base is still untried — Reptile's meta-learned inner-loop directions during training are a qualitatively different intervention and may still yield gains.
- Frontier endpoint: ~1.117-1.125 BPB (REVISED DOWN). #315 at 1.1250 extends the frontier 0.003 below the previous best (#287 at 1.1280) via three zero-parameter techniques: Partial RoPE, LN Scale, and Late QAT. This is a clean ablation-driven advance — 0.0023 BPB gain from a 3-technique package, each individually small but collectively meaningful. The path to sub-1.12 now looks more plausible: adding Reptile meta-TTT (~0.005-0.011) on #315's base would project to ~1.114-1.120. The improvement pace has restarted.
- Reptile meta-TTT: first adopter confirmed. @JackYoung27's #302 combined Reptile + causal TTT + decay prior on an XSA base (1.1520, 1 seed). The decay prior (`p += λ(p₀ - p)`) addresses catastrophic forgetting during TTT. Whether Reptile can overcome the TTT+XSA+EMA incompatibility (#303) remains the key open question — now targeted at #315's base.
- 12L+width vs 11L+depth: data from #332. @saml212's #332 (12L + Gradient-Guided Quant + wider MLP 1408 + full frontier stack) achieved 1.1320 — worse than #315's 11L at 1.1250. This is the clearest data point yet on depth vs width tradeoffs at the frontier: adding a 12th layer while widening the MLP to 1408 (vs 11L's 1344) does not compensate for slower step time and fewer total steps. 12L at seq2048 gets ~6,600 steps (94ms/step); 11L gets ~8,900 steps (67ms/step). The extra ~2,300 steps appear to matter more than the extra capacity. However, this is confounded by #332 not having Late QAT active — a direct controlled comparison is still needed.
Full Official Leaderboard (14 entries)
| Rank | Score | Author | Key Techniques | PR |
|---|---|---|---|---|
| 1 | 1.1428 | @thwu1 | 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD 0.04 | #180 |
| 2 | 1.1458 | @raahilshah | Int6 MLP3x + SmearGate + BigramHash + OrthoInit + MuonWD + SWA | #162 |
| 3 | 1.1502 | @aruniyer | 11L + Int6 QAT + MLP3x + WD 0.04 + zstd-22 | #86 |
| 4 | 1.1556 | @aquariouseworkman | SmearGate + OrthoInit + Int6 STE QAT + MLP3x + Sliding Window | #65 |
| 5 | 1.1586 | @yahya010 | 10L Int6 QAT + Zstd MLP2.6x + Muon 0.99 + Sliding Window | #63 |
| 6 | 1.1630 | @aquariouseworkman | Mixed int6/int8 + MLP3x + Sliding Window | #65 |
| 7 | 1.1748 | @notapplica | Sliding Window + FP16 Embed + 10L + Muon WD + Spectral Init | #60 |
| 8 | 1.1925 | @mattqlf | Sliding Window Eval (stride=64) | #50 |
| 9 | 1.1928 | @samacqua | LoRA Test-Time Training | #77 |
| 10 | 1.2014 | @spokane-way | 4k seq length + tuned hyperparams | #52 |
| 11 | 1.2060 | @spokane-way | 2048 seq length | #49 |
| 12 | 1.2147 | @nanlliu | 10 layers, mixed int8/int6 | #39 |
| 13 | 1.2197 | @chonchiog | FP16 Tied Embedding + LR/Warmdown Tuning | #42 |
| 14 | 1.2244 | Baseline | 9L 512dim 1024vocab TiedEmbed 4 KV heads | — |
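Several leaderboard entries rely on sliding-window evaluation: each small stride of new tokens is scored with a long window of left context, spending extra eval compute for better conditioning. A minimal sketch of the span bookkeeping; the window and stride defaults are illustrative, not any PR's settings:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (context_start, score_start, score_end) spans.
    Each token is scored exactly once, with up to `window` tokens of
    left context, advancing `stride` tokens per forward pass."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx = max(0, end - window)   # clip context at document start
        spans.append((ctx, pos, end))
        pos = end
    return spans
```

Smaller strides give every token more context at the cost of more forward passes, which is why the stride lives inside the separate 10-minute eval budget rather than the training budget.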
All Pending Validated Submissions
Each entry was validated against the SOTA in effect at the time of submission; Δ nats is shown vs the SOTA at the time of validation.
| BPB | Author | Δ nats | Seeds | Techniques | PR |
|---|---|---|---|---|---|
| 1.1250 | @jfprincz | 0.030 | 3 | 11L + Partial RoPE (16/64) + LN Scale + Late QAT + XSA (last 4) + EMA (0.997) + FA3 | #315 |
| 1.1256 | @alertcat | 0.029 | 3 | 11L + #315 stack + TTT (3 ep SGD, freeze 2 blocks) — TTT neutral on #315's base (±0.001) | #338 |
| 1.1280 | @jfprincz | 0.025 | 3 | 11L + XSA (last 4) + EMA (0.997) + SmearGate + BigramHash + WD 0.04 + FA3 | #287 |
| 1.1313 | @timowhite88 | 0.019 | 3 | 11L Int6 MLP3x + SmearGate + TTT (3 ep SGD, freeze 2 blocks) + RoPE50K + SWA + WD 0.04 + FA3 | #254 |
| 1.1320 | @saml212 | 0.018 | 3 | 12L + Gradient-Guided Quant (int7/6/5) + Partial RoPE + LN Scale + XSA4 + EMA + 524K batch + MLP 1408 | #332 |
| 1.1326 | @jfprincz | 0.017 | 3 | 11L + Int6 MLP3x + SmearGate + BigramHash + WD 0.04 + SWA + FA3 | #198 |
| 1.1400 | @saml212 | 0.005 | 3 | 11L Int6 + SmearGate + BigramHash + 524K batch + SWA + WD 0.04 | #236 |
| 1.1402 | @andrewgcodes | 0.017 | 3 | 10L Int5-MLP + BigramHash(16384) + Causal TTT + SWA(0.3) + WD 0.08 + 786K batch | #267 |
| 1.1468 | @unixmadtoonslab | 0.047 | 3 | 12L Int5-MLP + SmearGate + BigramHash + SWA + no QAT | #76 |
| 1.1472 | @devin-cog | — | 3 | 11L + Int6 + Muon WD 0.038 + LR 0.025 + Sliding Window | #179 |
| 1.1480 | @baudrillardsgh0st | 0.045 | 3 | 11L + Int6 QAT + Per-Dim SmearGate + SWA + WD 0.038 | #194 |
| 1.1507 | @dexhunter | 0.041 | 3 | Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 | #206 |
| 1.1538 | @jfprincz | 0.035 | 3 | OrthoInit + Int6 MLP3x + SmearGate + BigramHash + FA3 | #164 |
| 1.1541 | @alertcat | 0.035 | 3 | 12L Int5-MLP + Int6-Attn + SmearGate + BigramHash + SWA | #219 |
| 1.1546 | @tamoghnokandar | 0.034 | 3 | Int6 MLP3x + NorMuon + FA3 + selective precision | #173 |
| 1.1558 | @JayCheng113 | 0.032 | 3 | 11L + Low-Rank Q (r=192) + Int6 + Sliding Window | #215 |
| 1.1575 | @saml212 | 0.029 | 3 | Int6 + MLP 3x + selective precision + long-context | #114 |
| 1.1577 | @yahya010 | 0.029 | 3 | Int6 QAT + BigramHash + MLP 1344 + MuonWD 0.02 + Sliding Window | #150 |
| 1.1602 | @dexhunter | 0.025 | 3 | Int6 STE + NorMuon + SWA + MLP3x + Sliding Window + U-Net skips | #156 |
| 1.1605 | @seanward | 0.021 | 3 | Int6 MLP3x + MTP + Sliding Window (mean 1.1625) | #88 |
| 1.1605 | @takhir-iota | 0.022 | 3 | Int6 MLP3x + Late-K Passthrough + SlidingWindow | #99 |
| 1.1622 | @vmfunc | 0.021 | 3 | NorMuon + int6 STE + SWA + sliding window | #89 |
| 1.1632 | @arjun-krishna1 | 0.020 | 3 | AutoResearch agent + MLP 3x + STE int6 QAT + seq4096 | #66 |
| 1.1642 | @saikrishnarallabandi | 0.018 | 3 | Vocab 4096 + MLP 3x + Sliding Window | #123 |
All Not Yet Self-Validated Submissions
Only submissions with BPB < official SOTA (1.1428) that haven't yet demonstrated ≥0.005-nat significance.
| BPB | Author | Seeds | Techniques | PR |
|---|---|---|---|---|
| 1.1307 | @unnir | 1 | 11L + SmearGate + Partial XSA (last 3 layers) + SWA + WD 0.04 + FA3 | #265 |
| 1.1354 | @ibarrajo | 1 | 11L + Partial XSA (last 3) + TTT + 524K batch + RoPE50K (no FA3) | #290 |
| 1.1357 | @dennisimoo | 1 | 11L + XSA (last 4) + EMA + 524K batch + WD 0.04 (no FA3, no TTT) | #307 |
| 1.1381 | @charmquark1984 | 3 | 11L + SmearGate + TTT + 524K batch + WD 0.042 + FA2 | #281 |
| 1.1399 | @Mapika | 3 | 11L + XSA4 + EMA + Int5-MLP/Int6-Attn/Int8-Embed + 8% pruning (fails 0.005-nat by 0.00004) | #349 |
| 1.1419 | @chris-buckley | 1 | 11L + XSA4 + EMA + TTT (pre-quant 1.1581; no FA3, SDPA fallback, 5344/9000 steps; seeds 2/3 pending) | #317 |
Glossary
| Term | Meaning |
|---|---|
| BPB | Bits Per Byte — measures how well the model compresses text. Lower = better |
| val_bpb | BPB measured on the FineWeb validation set |
| Muon | Optimizer that orthogonalizes gradient updates via Newton-Schulz iteration |
| NorMuon | Muon + per-neuron adaptive learning rates |
| QAT | Quantization-Aware Training — simulates quantization during training |
| STE | Straight-Through Estimator — passes gradients through non-differentiable rounding |
| Int6/Int8 | 6-bit or 8-bit integer quantization of model weights |
| SWA | Stochastic Weight Averaging — averages weights across checkpoints |
| LAWA | Latest Weight Averaging — SWA but only late-stage checkpoints |
| MTP | Multi-Token Prediction — auxiliary objective for richer gradients |
| LoRA | Low-Rank Adaptation — tiny trainable matrices injected into layers |
| TTT | Test-Time Training — adapting the model during evaluation |
| zstd | Zstandard compression — better ratio than zlib for weights |
| FA3 | FlashAttention 3 — optimized attention kernel for H100s |
| XSA | Exclusive Self-Attention — removes self-value bias via orthogonal projection |
| EMA | Exponential Moving Average — smooth weight averaging every step (vs SWA's periodic) |
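The BPB entries above are just a unit conversion on cross-entropy: total nats over the evaluated bytes, divided by ln 2. A minimal sketch of that conversion:

```python
import math

def bits_per_byte(mean_ce_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte.
    Tokenizer-agnostic: a larger vocab lowers n_tokens but raises
    per-token loss, and the product cancels out in BPB."""
    total_bits = mean_ce_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes
```

This is why the competition can compare submissions with different tokenizers: only total bits per raw byte of FineWeb text matters.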
Changelog
| Time | Update |
|---|---|
| Mar 21, 11:30 AM | Count fix: 32 validated (was 24). +#355 non-record. |
| Mar 21, 10:30 AM | +#354 MLA negative (throughput). Memory Tokens Tier 2. |
| Mar 21, 9:30 AM | +#346 local baseline repro. PR count 349+. |
| Mar 21, 9:05 AM | +#347 WIP (LongContext4096 + Int4→16L, pending compute). |
| Mar 21, 4:34 AM | +#338 to full table. TTT spectrum (neutral on #315). P1 revised. Lineage. +fp16embed leaderboard. |
| Mar 21, 4:00 AM | Fix prediction numbering (1,2,4,3→1,2,3,4). +#334 hyperstack 1xH100. +#336 hypernetwork. |
| Mar 21, 3:02 AM | +#325 Looped Transformer on #315 stack (1.1462, untuned). |
| Mar 21, 2:50 AM | +#331 @Rhodrium (1.1487, 3s). +#330 @bopmite (Online Logit Bias). Lineage: @saml212→RoPE/LN/XSA/EMA. Depth vs width P4. |
| Mar 21, 2:30 AM | +#332 @saml212 12L + Gradient-Guided Quant (1.1320, validated). Novel: per-tensor int7/6/5. Late QAT negative at 12L. +#327-329 non-record. |
| Mar 21, 12:38 AM | +#321 optimizer-only baseline ref (1.1864, 3-seed). +#320/#323 recurrent motif. 323+ PRs. |
| Mar 21, 12:28 AM | +#318 Neural Cache, #319 depth recurrence negatives, #314 int4 pending. PR count 320+. |
| Mar 20, 11:32 PM | #315 deep dive added. +#316, #317 non-record. TTT pattern revised (helps weak bases). P1/P2 updated. |
| Mar 20, 11:20 PM | +#315 NEW BEST 1.1250! Partial RoPE + LN Scale + Late QAT on #287's base. +#317 (XSA+EMA+TTT, 1.1419, 1 seed). |
| Mar 20, 11:01 PM | +#313 non-record (1xA40, schedule tuning only). |
| Mar 20, 10:32 PM | +#312 Canon ACD layers (Allen-Zhu). PR count 313+. |
| Mar 20, 10:20 PM | Techniques: +Cautious WD, Gated Attention, Value Residual, Deep Delta Learning. Non-TTT paths. |
| Mar 20, 10:02 PM | +#310, #311 non-record. PR count updated to 310+. |
| Mar 20, 9:32 PM | +#306, #309, #310, #311 non-record. @dennisimoo→XSA+EMA. #307 batch test confounded by FA3. |
| Mar 20, 9:10 PM | Research: Moved dead-end TTT from Tier 1. Reptile on #287 is now Tier 1 #1. #254 warning strengthened. TTT v2 removed (moot post-#303). |
| Mar 20, 8:20 PM | Techniques: +PolyCom, PBS, Late-Stage SAM, nGPT. Non-TTT paths. |
| Mar 20, 8:10 PM | #303: TTT on XSA+EMA = +0.016 WORSE. Naive TTT dead end. Tier 1 closed. Predictions revised. |
| Mar 20, 8:00 PM | +#302 causal TTT + decay prior + Reptile (first Reptile adopter). |
| Mar 20, 7:35 PM | TTT ruling flagged. @0hq ruled #152 invalid (pre-eval TTT). #254 flagged. Only backward-looking TTT allowed. |
| Mar 20, 7:10 PM | Research: Tier ~1.130. P3 retired. Full leaderboard collapsed. KD estimate fixed. Negatives trimmed. |
| Mar 20, 6:40 PM | +XSA deep dive. Tier 3/4 boundary. Negative results header. Core five corrected. Glossary entries. |
| Mar 20, 6:25 PM | Techniques: +HybridNorm, Turbo-Muon, entropy-coded weights, VGA. |
| Mar 20, 6:15 PM | Research: Archived 3 deep dives. +#289, #295 non-record. EMA 0.999 negative. Tier 3/4 fixed. TTT ruling. Reptile lineage. |
| Mar 20, 5:40 PM | Research: Tier 4 added. TTT+XSA redundancy. Predictions pruned (7→4). Failure patterns. Lineage. |
| Mar 20, 5:30 PM | +#296 Reptile meta-TTT (0.011 on SmearGate, 10x naive). Error-guided TTT negative. |
| Mar 20, 5:18 PM | Research: +#76 validated (1.1468). Stale refs fixed. Δ nats footnote. Audit fixes. |
| Mar 20, 5:10 PM | +#290 (XSA+TTT, 1.1354). #194 over-limit flagged. Int5 lineage fix. EMA lineage. Paid prefix softened. |
| Mar 20, 5:00 PM | +#283 PPM-C (novel classical mixer). Updated PR count, timeframe. |
| Mar 20, 4:32 PM | +#287 1.1280 validated — NEW BEST! XSA + EMA (no TTT!). EMA negative result revised. |
| Mar 20, 3:15 PM | Research integration. Paid prefix ruled out. Audit fixes. |
| Mar 20, 2:42 PM | OFFICIAL LEADERBOARD UPDATED! New SOTA: 1.1428. #254 validated (1.1313). |
| Mar 20, 2:21 PM | +#267 1.1402 validated. +#265 (XSA 1.1307), +#262, +#264. |
| Mar 20, 1:12 PM | +#262 (opt. paid prefix, 1.0539). +#264 (Int5+TTT, 1.1455). |
| Mar 20, 12:21 PM | +#254 1.1303 (1 seed!) — TTT+SmearGate gains 0.014. LAWA negative result. |
| Mar 20, 10:52 AM | +#236 1.1400 validated (#2!). 524K batch = 0.017 BPB gain. |
| Mar 20, 10:11 AM | +#108 Error Correction Table (novel Tier 4 approach, est. ~1.05). |
| Mar 20, 9:42 AM | #180→1.1428 validated (#2!). BigramHash(10240). Tier analysis. |
| Mar 20, 8–10 AM | +#219, #220 (first SSM), #224 (first MoE), #76 validated. Table cleanup. TTT+SmearGate interaction. |
| Mar 20, 12–3 AM | +#198 (1.1326, major jump). +#180, #179, #194, #192, #191, #190, #186, #182, #76. |
| Mar 19, 4 PM–12 AM | Initial commentary through 174 PRs. Core technique deep dives. Validation tables. TTT+int6 confirmed. |
This commentary is generated by an AI (Claude) analyzing public PR data. No competition code is executed.