
Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)#1218

Open
clarkkev wants to merge 1 commit into openai:main from clarkkev:submission/vocab4096-mlpmult4-wd085

Conversation


@clarkkev clarkkev commented Apr 1, 2026

Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_bpb 1.09785

val bpb: 1.09785 (3-seed mean, std=0.0004)

| Seed | Steps | Pre-quant BPB | Post-quant BPB | Sliding BPB | Artifact (bytes) |
|------|-------|---------------|----------------|-------------|------------------|
| 42   | 5967  | 1.10411       | 1.11588        | 1.09744     | 15,915,268       |
| 1337 | 5962  | 1.10482       | 1.11631        | 1.09795     | 15,905,460       |
| 2025 | 5961  | 1.10507       | 1.11641        | 1.09816     | 15,927,782       |
| Mean |       | 1.10467       | 1.11620        | 1.09785     | 15,916,170       |

Overview

This script builds on the 03-23 leaderboard record. The main changes are:

Fixes

  • Fixed a small bug in the sliding-window evaluation that scored tokens at the end of the val dataset multiple times. The bug didn't significantly affect results: it added roughly 2k duplicate contributions to the total loss and byte counts over a validation set of about 6M tokens. The faulty line was
    `window_starts = [ws for ws in range(0, total_tokens, stride) if min(ws + seq_len, total_tokens) - ws >= 1]`
    and it should be
    `window_starts = [ws for ws in range(0, total_tokens, stride) if ws + seq_len - stride < total_tokens]`
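A small self-contained sketch of the fix (with made-up toy values for `total_tokens`, `seq_len`, and `stride`; the real run uses the repo's settings). The buggy condition keeps every start below `total_tokens`, so trailing windows re-score tokens already covered; the fixed condition keeps a window only if its freshly scored region (the last `stride` tokens) begins before the end of the dataset:

```python
# Toy configuration: 10 tokens, window of 4, stride of 2.
total_tokens, seq_len, stride = 10, 4, 2

# Buggy: every ws < total_tokens passes, including ws=8, whose window
# covers only tokens that earlier windows already scored.
buggy = [ws for ws in range(0, total_tokens, stride)
         if min(ws + seq_len, total_tokens) - ws >= 1]

# Fixed: keep a window only if the start of its new-token region
# (ws + seq_len - stride) still lies inside the dataset.
fixed = [ws for ws in range(0, total_tokens, stride)
         if ws + seq_len - stride < total_tokens]

print(buggy)  # [0, 2, 4, 6, 8]
print(fixed)  # [0, 2, 4, 6]  -- ws=8 dropped: it contributes no new tokens
```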

Simplifications

  • Use XSA in all layers instead of only the last 4.
  • Removed parameter banking and the distributed Muon implementation; instead, just use plain Muon + DDP.
  • Removed test time training. I doubt that 0.1% additional tokens will improve the model generally, and for long docs I think it makes more sense to work on extending the sequence length.
  • Removed quantization-aware training, since it appeared to provide little or no benefit.
  • Removed gated attention.
  • Removed value residuals.
  • Removed hash embeddings, which are probably less necessary after increasing the vocab size.
  • Removed the smear gate, for the same reason.

Additions

  • Increased the vocabulary size from 1024 to 4096, using the existing `data/download_hf_docs_and_tokenize.py` to build the sentencepiece tokenizer and pre-tokenized data. The tokenizer model grew by ~50 KB, but even with that added, the final artifacts stay below the 16 MB cap. A larger vocab means the model sees more context for the same sequence length and more training data per step.
  • Use a bigger but more strongly regularized model. I discovered that the compression ratio of a weight matrix (i.e., quantized-and-compressed MB / raw MB) correlates extremely well with the matrix's root mean square (`torch.sqrt(torch.mean(x**2))`), with an R^2 near 0.99. This suggests weight decay is a good lever for reducing the compressed size, which lets us add more parameters to the model. In particular, this script uses:
    • Higher weight decays: muon weight decay increased 0.04 -> 0.085, and added an embeddings weight decay of 0.085. Additionally, decreased the adam weight decay 0.04 -> 0.02, as scalar parameters shouldn't need to be low-magnitude.
    • Wider MLPs, increasing mlp_mult 3 -> 4.
    • A decreased learning rate 0.025 -> 0.02, as larger models generally benefit from smaller LRs.
  • Added the coprime-stride data loader from #726. The benefit is that it avoids showing the model sequences from the same document in the same/nearby minibatches by jumping around the data files.
  • Added GPTQ Hessian-aware quantization. My implementation is based on #1060 and reserves some time from training for Hessian computation.
  • Use more efficient byte shuffle + brotli compression from #1089.
  • Added sigmoid-gated skip connections to the unet, also from #1089.
  • Increased qk_gain_init 1.5 -> 4 following #1125.
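The RMS/compression-ratio relationship above can be illustrated with a toy experiment. This is not the repo's pipeline (which uses int6 quantization, byte shuffle, and brotli); int8 quantization plus zlib is an assumed stand-in that shows the same mechanism: lower-RMS matrices quantize to fewer distinct levels, which compress to a smaller fraction of their raw size.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

def compressed_ratio(w: np.ndarray, scale: float = 127.0) -> float:
    # Quantize with a fixed absolute scale, compress, and return
    # compressed bytes / raw float32 bytes. int8 + zlib stand in for
    # the repo's int6 + byte-shuffle + brotli pipeline.
    q = np.clip(np.round(w * scale), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes(), 9)) / w.nbytes

ratios = []
for target_rms in (0.01, 0.03, 0.1):
    w = rng.standard_normal((512, 512)).astype(np.float32)
    w *= target_rms / np.sqrt(np.mean(w**2))  # rescale to the target RMS
    ratios.append(compressed_ratio(w))

# Ratios grow monotonically with RMS: low-RMS weights concentrate near
# zero, so the quantized bytes have low entropy and compress harder.
print(ratios)
```

This is why raising weight decay (which shrinks weight magnitudes, hence RMS) frees up artifact budget for a wider model.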
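The coprime-stride idea from #726 can be sketched as follows. This is a minimal illustration, not the repo's loader: stepping through sequence indices with a stride coprime to the sequence count visits every index exactly once while consecutive picks land far apart in the files, so nearby minibatches rarely draw from the same document.

```python
import math

def coprime_stride_order(num_seqs: int, stride_hint: int) -> list[int]:
    """Return a permutation of range(num_seqs) produced by stepping with
    a stride coprime to num_seqs. Coprimality guarantees the walk hits
    every index exactly once before wrapping (a full cycle mod num_seqs).
    Hypothetical helper name; sketch of the idea from #726."""
    stride = stride_hint
    while math.gcd(stride, num_seqs) != 1:  # bump until coprime
        stride += 1
    return [(i * stride) % num_seqs for i in range(num_seqs)]

# With 10 sequences and a hint of 4, the first coprime stride is 7:
print(coprime_stride_order(10, 4))  # [0, 7, 4, 1, 8, 5, 2, 9, 6, 3]
```

Neighboring outputs are 7 apart in the data, so sequences cut from the same document no longer appear in the same or adjacent minibatches.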

@clarkkev clarkkev changed the title 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) Apr 1, 2026
@mikeapedia

Awesome results @clarkkev!

icryo added a commit to icryo/parameter-golf that referenced this pull request Apr 2, 2026
Strip complexity, bigger model, higher weight decay:
  MLP 3x → 4x (32.2M params vs 27M)
  MUON_WD 0.04 → 0.085 (better int6 compression)
  ADAM_WD 0.04 → 0.02 (scalars)
  BigramHash removed, VE removed
  QK_GAIN_INIT=4.0

PR openai#1218 proved this approach works: simplify + regularize = 1.098 on sp4096.
On Scylla (998 tokens): should fit ~15.9MB at high WD.
@abaybektursun
Contributor

abaybektursun commented Apr 2, 2026

so elegant, thing of beauty my friend.
How did you discover that the compression ratio correlates with the matrix's root-mean-square?
In general what did you test and what experiments did you conduct before this?
Because there were a lot of great decisions made and it's not obvious how you made them. Is it just established expertise? Would you mind sharing? Thanks!
