openai · clarkkev · Apr 1, 2026
diff --git a/records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/README.md b/records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/README.md
@@ -0,0 +1,87 @@
+# Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_bpb 1.09785
+
+**val bpb: 1.09785** (3-seed mean, std=0.0004)
+
+| Seed | Steps | Pre-quant BPB | Post-quant BPB | **Sliding BPB** | Artifact |
+|-|-|-|-|-|-|
+| 42   | 5967  | 1.10411 | 1.11588 | **1.09744** | 15,915,268 |
+| 1337 | 5962  | 1.10482 | 1.11631 | **1.09795** | 15,905,460 |
+| 2025 | 5961  | 1.10507 | 1.11641 | **1.09816** | 15,927,782 |
+| **Mean** | | 1.10467 | 1.11620 | **1.09785** | 15,916,170 |
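
The summary row can be re-derived from the three per-seed sliding BPB values in the table. The check below is a minimal sketch using Python's `statistics` module. Reading the reported std as the sample (n-1) standard deviation is an assumption, made only because that variant rounds to 0.0004 while the population variant rounds to 0.0003.

```python
import statistics

# Per-seed sliding-window BPB values copied from the results table.
sliding_bpb = [1.09744, 1.09795, 1.09816]  # seeds 42, 1337, 2025

mean_bpb = statistics.mean(sliding_bpb)
sample_std = statistics.stdev(sliding_bpb)       # n-1 denominator (assumed to be the reported one)
population_std = statistics.pstdev(sliding_bpb)  # n denominator, shown for comparison

print(f"mean={mean_bpb:.5f} sample_std={sample_std:.4f} population_std={population_std:.4f}")
```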
+
+## Overview
+
+This script builds on the 03-23 leaderboard [record](https://github.com/openai/parameter-golf/blob/main/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md). The main changes are:
+
+### Fixes
+* Fixed a small bug in the sliding-window evaluation that caused it to score tokens at the end of the val dataset multiple times. The bug didn't significantly affect results: it added roughly 2k duplicate contributions to the total loss and byte counts over a validation set of about 6M tokens. The faulty line was:
+ `window_starts = [ws for ws in range(0, total_tokens, stride) if min(ws + seq_len, total_tokens) - ws >= 1]`, and it should be:
+ `window_starts = [ws for ws in range(0, total_tokens, stride) if ws + seq_len - stride < total_tokens]`
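
To see where the duplicates come from, the sketch below counts scored positions under both filters. It assumes a scoring rule that is not shown in this README: the first window scores all of its tokens and every later window scores only its last `stride` tokens. The helper name `scored_token_count` and the settings `seq_len=2048`, `stride=64`, and a 6,000,000-token validation set are illustrative assumptions, not values taken from the script.

```python
def scored_token_count(total_tokens: int, seq_len: int, stride: int, fixed: bool) -> int:
    """Count how many token positions get scored, duplicates included."""
    count = 0
    for ws in range(0, total_tokens, stride):
        if fixed:
            keep = ws + seq_len - stride < total_tokens          # corrected filter
        else:
            keep = min(ws + seq_len, total_tokens) - ws >= 1     # original filter
        if not keep:
            continue
        wlen = min(ws + seq_len, total_tokens) - ws              # window may be truncated at the end
        # Assumed scoring rule: whole first window, then only the last `stride` tokens.
        first_scored = 0 if ws == 0 else max(wlen - stride, 0)
        count += wlen - first_scored
    return count


total_tokens, seq_len, stride = 6_000_000, 2048, 64  # assumed settings
old = scored_token_count(total_tokens, seq_len, stride, fixed=False)
new = scored_token_count(total_tokens, seq_len, stride, fixed=True)
print(f"old filter: {old}  new filter: {new}  duplicates: {old - new}")
```

Under these assumptions the corrected filter scores each position exactly once, while the original filter keeps the truncated windows near the end of the data and re-scores the final `stride` tokens `seq_len / stride - 1` times. That gives `seq_len - stride = 1984` extra contributions, in line with the "roughly 2k" figure.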
+
+### Simplifications
+* Use XSA in all layers instead of only the last 4.
+* Removed parameter banking and the distributed Muon implementation, and instead just used Muon + DDP.
+* Removed test-time training. I doubt that 0.1% additional tokens will improve the model
+  generally, and for long docs I think it makes more sense to work on extending the sequence length.
+* Removed quantization-aware training, since it appeared to provide little or no benefit.
+* Removed gated attention.
+* Removed value residuals.
+* Removed hash embeddings, which are probably less necessary after increasing the vocab size.
+* Removed the smear gate, for the same reason.
+
+### Additions
+* Increased the vocabulary size from 1024 to 4096. I used the existing `data/download_hf_docs_and_tokenize.py` to build the SentencePiece tokenizer and the pre-tokenized data. The tokenizer model grew by ~50 KB, but even with that added, the final artifacts are below the 16MB cap. Because each token covers more bytes on average with a larger vocab, the model sees more context for the same sequence length and more training data per step.
+* Use a bigger but more strongly regularized model. I discovered that the compressibility of a weight matrix (i.e., quantized-and-compressed-mb / raw-mb) correlates extremely well with the matrix's root-mean-square (`torch.sqrt(torch.mean(x**2))`) with an R^2 near 0.99. This suggests that the weight decay is a good lever for reducing the compressed size, which can let us add more parameters to the model. In particular this script uses:
+  * Higher weight decays: increased the Muon weight decay 0.04 -> 0.085 and added an embedding weight decay of 0.085. Additionally, decreased the Adam weight decay 0.04 -> 0.02, since scalar parameters shouldn't need to be low-magnitude.
+  * Wider MLPs, increasing `mlp_mult` 3 -> 4.
+  * A decreased learning rate 0.025 -> 0.02, as larger models generally benefit from smaller LRs.
+* Added the coprime-stride data loader from [#726](https://github.com/openai/parameter-golf/pull/726). The benefit is that it avoids showing the model sequences from the same document in the same/nearby minibatches by jumping around the data files.
+* Added GPTQ Hessian-aware quantization. My implementation is based on [#1060](https://github.com/openai/parameter-golf/pull/1060) and reserves some time from training for Hessian computation.
+* Switched to the more efficient byte shuffle + brotli compression from [#1089](https://github.com/openai/parameter-golf/pull/1089).
+* Added sigmoid-gated skip connections to the U-Net, also from [#1089](https://github.com/openai/parameter-golf/pull/1089).
+* Increased `qk_gain_init` 1.5 -> 4 following [#1125](https://github.com/openai/parameter-golf/pull/1125).
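
The RMS/compressibility relationship mentioned above can be illustrated with a toy experiment. This is a sketch of one plausible mechanism, not the measurement behind the reported R^2. It assumes a fixed quantization step shared across matrices (the script's GPTQ scaling may differ), uses synthetic Gaussian weights, substitutes `zlib` for brotli so that only the standard library is needed, and reports compressed size relative to the int8 payload rather than to the raw floating-point size. The helpers `rms` and `compressed_ratio` are hypothetical names.

```python
import math
import random
import zlib


def rms(values):
    """Root-mean-square, the same quantity as torch.sqrt(torch.mean(x**2))."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def compressed_ratio(values, step):
    """Quantize to int8 with a fixed step (assumption), then compress with zlib."""
    quantized = bytes(
        max(-127, min(127, round(v / step))) & 0xFF  # clamp, then map to an unsigned byte
        for v in values
    )
    return len(zlib.compress(quantized, 9)) / len(quantized)


rng = random.Random(0)
results = []
for std in (0.01, 0.02, 0.04, 0.08):  # synthetic "weight matrices" of growing magnitude
    weights = [rng.gauss(0.0, std) for _ in range(50_000)]
    results.append((rms(weights), compressed_ratio(weights, step=0.005)))

for weight_rms, ratio in results:
    print(f"rms={weight_rms:.4f}  compressed/quantized={ratio:.3f}")
```

In this toy setup, smaller-magnitude weights occupy fewer distinct quantization levels, so the byte stream has lower entropy and compresses better. That is the sense in which stronger weight decay can free up bytes for extra parameters.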
+
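The idea behind a coprime-stride loader can be sketched as follows. This is not the code from #726. It is a minimal illustration of the underlying arithmetic: stepping through `num_sequences` positions with a stride that shares no common factor with `num_sequences` visits every position exactly once, while consecutive reads land far apart. The function name `coprime_stride_order` and its parameters are hypothetical.

```python
import math


def coprime_stride_order(num_sequences: int, stride_hint: int, start: int = 0) -> list[int]:
    """Return a visiting order that jumps by a stride coprime to num_sequences."""
    stride = stride_hint % num_sequences or 1
    # Bump the stride until it is coprime with the number of sequences,
    # which guarantees the walk below is a permutation.
    while math.gcd(stride, num_sequences) != 1:
        stride += 1
    return [(start + i * stride) % num_sequences for i in range(num_sequences)]


order = coprime_stride_order(num_sequences=10, stride_hint=4)
print(order)
```

Because the stride is coprime to the sequence count, no index repeats before all of them have been seen, so one pass still covers the data once while adjacent minibatches draw from distant regions.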
+## Requirements
+
+Flash Attention 3 (Hopper) is required. The script imports `flash_attn_interface` directly and was run with PyTorch 2.11.0+cu130. Install commands:
+
+```bash
+pip install torch --index-url https://download.pytorch.org/whl/cu130
+pip install --no-cache-dir \
+  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"
+pip install -r requirements.txt
+```
+
+The tokenizer and pre-tokenized data (sp4096) are available on my [HuggingFace](https://huggingface.co/datasets/kevclark/parameter-golf). You can download them with:
+
+```bash
+rm -f data/manifest.json
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
+  python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 143
+```
+
+Note that this first deletes any existing `data/manifest.json`, because the download script caches the manifest locally and a stale one from the default repo won't include sp4096. Alternatively, to regenerate the tokenizer and dataset from scratch:
+
+```bash
+cat > data/tokenizer_specs_4096.json << 'EOF'
+[
+  {
+    "name": "sp_bpe_4096",
+    "kind": "sentencepiece_bpe",
+    "vocab_size": 4096,
+    "tokenizer_train_docs": 5000000
+  }
+]
+EOF
+python3 data/download_hf_docs_and_tokenize.py \
+  --output-root data \
+  --tokenizer-config data/tokenizer_specs_4096.json \
+  --skip-byte
+```
+
+## Run Command
+
+```bash
+RUN_ID=1337 SEED=1337 \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
diff --git a/records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/requirements.txt b/records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/requirements.txt
@@ -0,0 +1,6 @@
+# torch and flash-attn-3 are installed separately in setup.sh
+brotli
+huggingface-hub
+numpy
+sentencepiece
+tqdm
diff --git a/records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/submission.json b/records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/submission.json
@@ -0,0 +1,36 @@
+{
+  "author": "Kevin Clark",
+  "github_id": "clarkkev",
+  "name": "4096-Vocab + Larger Model + High WD + Simplifications",
+  "blurb": "Vocab 4096, MLP 4x, WD 0.085, co-prime data loader, GPTQ, brotli, sigmoid-gated UNet skips, simplified architecture",
+  "date": "2026-04-01",
+  "track": "10min_16mb",
+  "val_loss": 2.52618,
+  "val_bpb": 1.09785,
+  "val_bpb_std": 0.0004,
+  "seeds": [42, 1337, 2025],
+  "seed_results": {
+    "42": {
+      "val_loss": 2.52524,
+      "val_bpb": 1.09744,
+      "artifact_bytes": 15915268,
+      "steps": 5967
+    },
+    "1337": {
+      "val_loss": 2.52641,
+      "val_bpb": 1.09795,
+      "artifact_bytes": 15905460,
+      "steps": 5962
+    },
+    "2025": {
+      "val_loss": 2.52690,
+      "val_bpb": 1.09816,
+      "artifact_bytes": 15927782,
+      "steps": 5961
+    }
+  },
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.11.0+cu130",
+  "bytes_total": 15916170,
+  "bytes_code": 68206
+}