Unofficial Leaderboard #83

@jordankzf


Parameter Golf Leaderboard

Open PRs snapshot — 2026-03-19 | Track: 10min 8xH100, 16MB cap | Estimated BPB excluded

Top 5 Memorization (val-only training)

| Rank | val_bpb | PR | Author | Size (bytes) | Key Techniques |
|------|---------|----|--------|--------------|----------------|
| 1 | 1.0149 | #64 Combined Optimal | yesbhautik | 15,542,354 | Val-only training, mixed int8/int6, sliding window stride=64, seq4096, Muon, 10 layers |
| 2 | 1.1111 | #44 val-only 10min record | daniellawson9999 | | Val-only training |

You could be next. (Please don't 🥲)

Top 5 (standard training)

| Rank | val_bpb | PR | Author | Size (bytes) | Key Techniques |
|------|---------|----|--------|--------------|----------------|
| 1 | 1.1630 | #65 Mixed Quant + Sliding Window | aquariouseworkman | 15,353,490 | MLP 3x, mixed int6/int8, sliding window stride=64, seq1024, batch 524K |
| 2 | 1.1652 | #66 ArjunAutoResearch | arjun-krishna1 | 15,619,929 | MLP 3x, int6, seq4096, sliding window, Muon, AI-composed |
| 3 | 1.1659 | #70 Wider MLP 3x + int6 | jfprincz | 14,855,508 | MLP 3x (h=1536), int6 per-row, sliding window stride=256, zstd-22 |
| 4 | 1.1768 | #75 seq4096 sliding-window fp16 | takhir-iota | 15,943,260 | seq4096, sliding window stride=64, fp16 tok_emb, coarsen blocks.5 |
| 5 | 1.1793 | #61 Long-context sliding window | saml212 | ~15,880,000 | seq4096, sliding window stride=512, high Muon momentum |
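The "sliding window stride=64" used by most top entries can be sketched in a few lines. This is an illustrative helper (`sliding_windows` is not from any PR): windows of length `window` advance by `stride`, and only the tokens a previous window has not already covered are scored, so every scored token sees near-full left context.

```python
def sliding_windows(tokens, window=1024, stride=64):
    """Yield (window_tokens, n_new) pairs; the last n_new tokens are scored."""
    n = len(tokens)
    if n <= window:
        yield tokens, n
        return
    yield tokens[:window], window          # first window: score everything
    pos = window
    while pos < n:
        new = min(stride, n - pos)         # tokens not yet scored
        start = pos + new - window         # keep a full left context
        yield tokens[start:pos + new], new
        pos += new

# Example: 2200 tokens, window=1024, stride=64
toks = list(range(2200))
wins = list(sliding_windows(toks))
print(sum(n for _, n in wins))             # every token scored exactly once
```

A smaller stride means more forward passes but more context per scored token, which is why stride=64 beats stride=512 at equal artifact size.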

Winning Techniques

| Technique | BPB Impact | Originated / Best Demonstrated In | Description |
|-----------|------------|-----------------------------------|-------------|
| Sliding window eval | ~-0.034 | #50 @mattqlf (first), #65 @aquariouseworkman (stride=64 best) | Overlapping context windows at eval time; every scored token gets near-full context. Zero artifact cost. |
| MLP 3x expansion | ~-0.019 | #70 @jfprincz (first), #65 @aquariouseworkman | 3x feedforward expansion (hidden=1536) adds capacity; enabled by int6 quant freeing ~4MB. |
| int6 per-row quantization | saves ~4MB | #65 @aquariouseworkman, #70 @jfprincz | 31-level per-row quant on MLP+attention; only +0.001–0.010 BPB vs fp16. zstd-22 compresses zero high bits. |
| fp16 tied embedding | ~-0.007 | #42 @chonchiog (first), #66 @arjun-krishna1 | Embedding/output head is the most quant-sensitive tensor; fp16 passthrough costs ~523KB but saves significant BPB. |
| Long-context training (seq4096) | ~-0.01 | #61 @saml212 (first), #66 @arjun-krishna1 | 4x longer sequences match the sliding-window eval distribution. ~64ms/step vs ~48ms at seq1024, but the quality gain compensates. |
| Muon momentum=0.99 + low LR | ~-0.005 | #52 @spokane-way, #61 @saml212 | Smoother optimization reduces the quant gap. LR=0.02, warmdown=3000, momentum warmup from 0.92. |
| Vocab 8192 + NorMuon | novel | #78 @mtybadger | Custom 8192-token SentencePiece tokenizer + NorMuon optimizer. Trades 1 layer for richer tokenization. |
| LoRA TTT (test-time training) | ~-0.004 | #77 @samacqua | Rank-8 LoRA adapters trained per-document at eval time; doc-isolated + sliding window. Uses ~1/10 of the eval budget. |
| 10-layer mixed precision | ~-0.01 | #39 @nanlliu (first), #64 @yesbhautik | Extra layer for capacity; middle layers (3–7) at int6, outer layers at int8, to fit 16MB. |
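The 31-level per-row quantization above can be sketched as follows. This is a minimal numpy sketch under my own assumptions (the helper names are illustrative, not from #65 or #70): each weight row gets its own symmetric scale, values are rounded to integers in [-15, 15] (31 levels), and the unused high bits of the int8 container are all zero, which is what zstd-22 then compresses away.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric 31-level (-15..+15) per-row quantization of a 2D weight."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale[scale == 0] = 1.0                 # guard all-zero rows
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an fp32 approximation of the original weight."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)
# per-element error is bounded by half a quantization step
print(np.abs(w - w_hat).max() <= (s / 2 + 1e-6).max())
```

Per-row (rather than per-tensor) scales are what keep the BPB cost in the +0.001–0.010 range: a single outlier row cannot blow up the quantization step for the whole matrix.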

22 total record-claiming PRs surveyed across 81 open PRs.
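For readers unfamiliar with the LoRA TTT entry, here is a toy sketch of the core idea, assuming nothing about #77's actual implementation: a frozen weight `W` gets a rank-8 low-rank correction `A @ B` whose two small factors are the only parameters updated per document at eval time. The regression setup below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 8
W = rng.standard_normal((d, d)).astype(np.float32) * 0.1   # frozen base weight
A = rng.standard_normal((d, r)).astype(np.float32) * 0.01  # LoRA down-projection
B = np.zeros((r, d), dtype=np.float32)                     # zero-init: delta starts at 0

x = rng.standard_normal((16, d)).astype(np.float32)        # stand-in "document" activations
y = rng.standard_normal((16, d)).astype(np.float32)        # toy per-document target

def forward(x):
    return x @ (W + A @ B)       # base output + low-rank correction

def loss(pred):
    return float(np.mean((pred - y) ** 2))

l0 = loss(forward(x))
lr = 0.05
for _ in range(50):              # per-document adaptation: only A and B move
    err = (forward(x) - y) * (2.0 / y.size)
    g = x.T @ err                # gradient w.r.t. the low-rank delta A @ B
    gA, gB = g @ B.T, A.T @ g
    A -= lr * gA
    B -= lr * gB
l1 = loss(forward(x))
print(l1 < l0)                   # adapter reduced the per-document loss; W untouched
```

Only `r * 2 * d` parameters are trained per document, which is how the PR fits adaptation into ~1/10 of the eval budget while the 16MB artifact stores no adapter weights at all.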
