# Parameter Golf Leaderboard
Open PRs snapshot — 2026-03-19 | Track: 10min 8xH100, 16MB cap | Estimated BPB excluded
## Top 5 Memorization (val-only training)
| Rank | val_bpb | PR | Author | Size (bytes) | Key Techniques |
|---|---|---|---|---|---|
| 1 | 1.0149 | #64 Combined Optimal | yesbhautik | 15,542,354 | Val-only training, mixed int8/int6, sliding window stride=64, seq4096, Muon, 10 layers |
| 2 | 1.1111 | #44 val-only 10min record | daniellawson9999 | — | Val-only training |
You could be next. (Please don't 🥲)
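For readers new to the metric: the `val_bpb` column is bits per byte, i.e. the model's total cross-entropy over the validation set divided by its size in raw bytes. Assuming the loss is accumulated in nats (natural log, as PyTorch's cross-entropy reports it), the conversion is a one-liner; the function name here is mine, not from any PR:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed cross-entropy (in nats) over a corpus into
    bits per byte, the val_bpb metric used throughout this leaderboard."""
    return total_nll_nats / (total_bytes * math.log(2))
```

Because the denominator is bytes rather than tokens, the metric is tokenizer-agnostic, which is why entries with different vocabularies (e.g. the 8192-token tokenizer in #78) remain directly comparable.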
## Top 5 (standard training)
| Rank | val_bpb | PR | Author | Size (bytes) | Key Techniques |
|---|---|---|---|---|---|
| 1 | 1.1630 | #65 Mixed Quant + Sliding Window | aquariouseworkman | 15,353,490 | MLP 3x, mixed int6/int8, sliding window stride=64, seq1024, batch 524K |
| 2 | 1.1652 | #66 ArjunAutoResearch | arjun-krishna1 | 15,619,929 | MLP 3x, int6, seq4096, sliding window, Muon, AI-composed |
| 3 | 1.1659 | #70 Wider MLP 3x + int6 | jfprincz | 14,855,508 | MLP 3x (h=1536), int6 per-row, sliding window stride=256, zstd-22 |
| 4 | 1.1768 | #75 seq4096 sliding-window fp16 | takhir-iota | 15,943,260 | seq4096, sliding window stride=64, fp16 tok_emb, coarsen blocks.5 |
| 5 | 1.1793 | #61 Long-context sliding window | saml212 | ~15,880,000 | seq4096, sliding window stride=512, high Muon momentum |
## Winning Techniques
| Technique | BPB Impact | Originated / Best Demonstrated In | Description |
|---|---|---|---|
| Sliding window eval | ~-0.034 | #50 @mattqlf (first), #65 @aquariouseworkman (stride=64 best) | Overlapping context windows at eval time; every scored token gets near-full context. Zero artifact cost. |
| MLP 3x expansion | ~-0.019 | #70 @jfprincz (first), #65 @aquariouseworkman | 3x feedforward expansion (hidden=1536) adds capacity; enabled by int6 quant freeing ~4MB. |
| int6 per-row quantization | saves ~4MB | #65 @aquariouseworkman, #70 @jfprincz | 31-level per-row quant on MLP+attention; only +0.001–0.010 BPB vs fp16. zstd-22 compresses zero high bits. |
| fp16 tied embedding | ~-0.007 | #42 @chonchiog (first), #66 @arjun-krishna1 | Embedding/output-head is most quant-sensitive tensor; fp16 passthrough costs ~523KB but saves significant BPB. |
| Long-context training (seq4096) | ~-0.01 | #61 @saml212 (first), #66 @arjun-krishna1 | 4x longer sequences match sliding window eval distribution. ~64ms/step vs ~48ms at seq1024 but quality compensates. |
| Muon momentum=0.99 + low LR | ~-0.005 | #52 @spokane-way, #61 @saml212 | Smoother optimization reduces quant gap. LR=0.02, warmdown=3000, momentum warmup from 0.92. |
| Vocab 8192 + NorMuon | novel | #78 @mtybadger | Custom 8192-token SentencePiece tokenizer + NorMuon optimizer. Trades 1 layer for richer tokenization. |
| LoRA TTT (test-time training) | ~-0.004 | #77 @samacqua | Rank-8 LoRA adapters trained per-document at eval time; doc-isolated + sliding window. Uses ~1/10 eval budget. |
| 10-layer mixed precision | ~-0.01 | #39 @nanlliu (first), #64 @yesbhautik | Extra layer for capacity; middle layers (3-7) at int6, outer layers at int8, to fit 16MB. |
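To make the stride parameter in the sliding-window rows concrete, here is a minimal scheduling sketch in plain Python. The function name, the triple format, and the exact overlap policy are my own illustration, not code from any PR; the idea matches the table: each window re-uses `window - stride` tokens as context and scores only its fresh tail, so every scored token sees near-full left context at zero artifact cost.

```python
def sliding_windows(seq_len, window, stride):
    """Plan overlapping eval windows over a sequence of seq_len tokens.
    Returns (window_start, score_from, score_to) triples such that every
    token in [0, seq_len) falls in exactly one score range, and every
    token after the first window has >= (window - stride) tokens of
    left context when it is scored."""
    spans = [(0, 0, min(window, seq_len))]  # first window scores all its tokens
    pos = min(window, seq_len)
    while pos < seq_len:
        start = pos - (window - stride)      # overlap: reused context tokens
        end = min(start + window, seq_len)
        spans.append((start, pos, end))      # score only the fresh tail
        pos = end
    return spans
```

A smaller stride (e.g. the stride=64 used by #64 and #65) means more forward passes but more context per scored token, which is why it spends eval budget rather than artifact bytes.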
22 total record-claiming PRs surveyed across 81 open PRs.
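The int6 per-row scheme in the techniques table can be sketched as symmetric 31-level quantization with one fp scale per weight row. This is an illustrative reconstruction, not code from #65 or #70; in particular, the zstd interaction is as the table describes it: 31 levels fit in 6 bits with spare high bit patterns that are constant, so zstd-22 compresses them almost for free.

```python
def quantize_row(row, levels=31):
    """Symmetric per-row quantization to an odd number of levels.
    For int6, 31 levels gives integer codes in [-15, 15]; the unused
    high bit patterns are constant, which zstd then squeezes out."""
    half = levels // 2                        # 15 for the int6 case
    amax = max(abs(v) for v in row)
    scale = amax / half if amax > 0 else 1.0  # one fp scale per row
    q = [round(v / scale) for v in row]       # integer codes in [-half, half]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate weights; worst-case error is scale / 2 per entry."""
    return [v * scale for v in q]
```

Per-row (rather than per-tensor) scales are what keep the BPB penalty in the +0.001 to 0.010 range the table reports: a single outlier weight only inflates the scale of its own row.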