Unofficial Leaderboard #83

@jordankzf


Parameter Golf Leaderboard

Open PRs snapshot — 2026-03-19 | Track: 10min 8xH100, 16MB cap | Estimated BPB excluded

Top 5 Memorization (val-only training)

| Rank | val_bpb | PR | Author | Size (bytes) | Key Techniques |
|------|---------|----|--------|--------------|----------------|
| 1 | 1.0149 | #64 Combined Optimal | yesbhautik | 15,542,354 | Val-only training, mixed int8/int6, sliding window stride=64, seq4096, Muon, 10 layers |
| 2 | 1.1111 | #44 val-only 10min record | daniellawson9999 | | Val-only training |

You could be next. (Please don't 🥲)

Top 5 (standard training)

| Rank | val_bpb | PR | Author | Size (bytes) | Key Techniques |
|------|---------|----|--------|--------------|----------------|
| 1 | 1.1630 | #65 Mixed Quant + Sliding Window | aquariouseworkman | 15,353,490 | MLP 3x, mixed int6/int8, sliding window stride=64, seq1024, batch 524K |
| 2 | 1.1652 | #66 ArjunAutoResearch | arjun-krishna1 | 15,619,929 | MLP 3x, int6, seq4096, sliding window, Muon, AI-composed |
| 3 | 1.1659 | #70 Wider MLP 3x + int6 | jfprincz | 14,855,508 | MLP 3x (h=1536), int6 per-row, sliding window stride=256, zstd-22 |
| 4 | 1.1768 | #75 seq4096 sliding-window fp16 | takhir-iota | 15,943,260 | seq4096, sliding window stride=64, fp16 tok_emb, coarsen blocks.5 |
| 5 | 1.1793 | #61 Long-context sliding window | saml212 | ~15,880,000 | seq4096, sliding window stride=512, high Muon momentum |
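The "sliding window stride=64" used by most top entries can be sketched in a few lines. This is an illustrative helper (`sliding_windows` is not from any PR): windows of length `window` advance by `stride`, and only the tokens a previous window has not already covered are scored, so every scored token sees near-full left context.

```python
def sliding_windows(tokens, window=1024, stride=64):
    """Yield (window_tokens, n_new) pairs; the last n_new tokens are scored."""
    n = len(tokens)
    if n <= window:
        yield tokens, n
        return
    yield tokens[:window], window          # first window: score everything
    pos = window
    while pos < n:
        new = min(stride, n - pos)         # tokens not yet scored
        start = pos + new - window         # keep a full left context
        yield tokens[start:pos + new], new
        pos += new

# Example: 2200 tokens, window=1024, stride=64
toks = list(range(2200))
wins = list(sliding_windows(toks))
print(sum(n for _, n in wins))             # every token scored exactly once
```

A smaller stride means more forward passes but more context per scored token, which is why stride=64 beats stride=512 at equal artifact size.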

Winning Techniques

| Technique | BPB Impact | Originated / Best Demonstrated In | Description |
|-----------|------------|-----------------------------------|-------------|
| Sliding window eval | ~-0.034 | #50 @mattqlf (first), #65 @aquariouseworkman (stride=64 best) | Overlapping context windows at eval time; every scored token gets near-full context. Zero artifact cost. |
| MLP 3x expansion | ~-0.019 | #70 @jfprincz (first), #65 @aquariouseworkman | 3x feedforward expansion (hidden=1536) adds capacity; enabled by int6 quant freeing ~4MB. |
| int6 per-row quantization | saves ~4MB | #65 @aquariouseworkman, #70 @jfprincz | 31-level per-row quant on MLP+attention; only +0.001–0.010 BPB vs fp16. zstd-22 compresses zero high bits. |
| fp16 tied embedding | ~-0.007 | #42 @chonchiog (first), #66 @arjun-krishna1 | Embedding/output head is the most quant-sensitive tensor; fp16 passthrough costs ~523KB but saves significant BPB. |
| Long-context training (seq4096) | ~-0.01 | #61 @saml212 (first), #66 @arjun-krishna1 | 4x longer sequences match the sliding-window eval distribution. ~64ms/step vs ~48ms at seq1024, but the quality gain compensates. |
| Muon momentum=0.99 + low LR | ~-0.005 | #52 @spokane-way, #61 @saml212 | Smoother optimization reduces the quant gap. LR=0.02, warmdown=3000, momentum warmup from 0.92. |
| Vocab 8192 + NorMuon | novel | #78 @mtybadger | Custom 8192-token SentencePiece tokenizer + NorMuon optimizer. Trades 1 layer for richer tokenization. |
| LoRA TTT (test-time training) | ~-0.004 | #77 @samacqua | Rank-8 LoRA adapters trained per-document at eval time; doc-isolated + sliding window. Uses ~1/10 of the eval budget. |
| 10-layer mixed precision | ~-0.01 | #39 @nanlliu (first), #64 @yesbhautik | Extra layer for capacity; middle layers (3–7) at int6, outer layers at int8, to fit 16MB. |
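The 31-level per-row quantization above can be sketched as follows. This is a minimal numpy sketch under my own assumptions (the helper names are illustrative, not from #65 or #70): each weight row gets its own symmetric scale, values are rounded to integers in [-15, 15] (31 levels), and the unused high bits of the int8 container are all zero, which is what zstd-22 then compresses away.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric 31-level (-15..+15) per-row quantization of a 2D weight."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale[scale == 0] = 1.0                 # guard all-zero rows
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an fp32 approximation of the original weight."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)
# per-element error is bounded by half a quantization step
print(np.abs(w - w_hat).max() <= (s / 2 + 1e-6).max())
```

Per-row (rather than per-tensor) scales are what keep the BPB cost in the +0.001–0.010 range: a single outlier row cannot blow up the quantization step for the whole matrix.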

22 total record-claiming PRs surveyed across 81 open PRs.
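For readers unfamiliar with the LoRA TTT entry, here is a toy sketch of the core idea, assuming nothing about #77's actual implementation: a frozen weight `W` gets a rank-8 low-rank correction `A @ B` whose two small factors are the only parameters updated per document at eval time. The regression setup below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 8
W = rng.standard_normal((d, d)).astype(np.float32) * 0.1   # frozen base weight
A = rng.standard_normal((d, r)).astype(np.float32) * 0.01  # LoRA down-projection
B = np.zeros((r, d), dtype=np.float32)                     # zero-init: delta starts at 0

x = rng.standard_normal((16, d)).astype(np.float32)        # stand-in "document" activations
y = rng.standard_normal((16, d)).astype(np.float32)        # toy per-document target

def forward(x):
    return x @ (W + A @ B)       # base output + low-rank correction

def loss(pred):
    return float(np.mean((pred - y) ** 2))

l0 = loss(forward(x))
lr = 0.05
for _ in range(50):              # per-document adaptation: only A and B move
    err = (forward(x) - y) * (2.0 / y.size)
    g = x.T @ err                # gradient w.r.t. the low-rank delta A @ B
    gA, gB = g @ B.T, A.T @ g
    A -= lr * gA
    B -= lr * gB
l1 = loss(forward(x))
print(l1 < l0)                   # adapter reduced the per-document loss; W untouched
```

Only `r * 2 * d` parameters are trained per document, which is how the PR fits adaptation into ~1/10 of the eval budget while the 16MB artifact stores no adapter weights at all.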
