## Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ (val_bpb: 1.0929)

**val_bpb = 1.0929** (3-seed mean, std 0.0009) | **2.5145 nats** | **~15.96 MB** | 8xH100 SXM, ~590s train + ~83s eval | No TTT

Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev (4096-Vocab + 4.0-MLP-mult + 0.085-WD).

Previous: [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> this (1.0929)

### Changes from PR #1218

| | PR #1218 | This |
|---|---|---|
| val_bpb | 1.09785 | **1.09290** |
| Optimizer | Muon | **MuonEq-R** (row-norm before NS5) |
| Depth recurrence | None | **Layers 4,5 repeated** (RECUR_LAYERS=4,5) |
| Recurrence MLP sharing | N/A | **Fully shared** (REPEAT_UNTIE_MLP=none) |
| Mixed quantization | No | **Yes** (60 int6 + 6 int5 via Hessian sensitivity) |
| Recurrence activation | N/A | Step 3000 with 20-step warmup |
| Everything else | Same | Same |

### What's New

1. **MuonEq-R** — Row-normalizes gradient matrices before Newton-Schulz orthogonalization in the Muon optimizer. Improves conditioning of the NS5 iteration for non-square weight matrices. Zero-byte cost, ~0.001 BPB improvement. (Code sketches for all three items follow this list.)

2. **Depth Recurrence** — Layers 4 and 5 are repeated once after the initial forward pass (virtual layers 12-13 on top of 11 physical layers). MLP weights are fully shared during recurrence (REPEAT_UNTIE_MLP=none), so this adds zero extra parameters. Activated at step 3000 with a 20-step linear warmup. ~0.003 BPB improvement.

3. **Mixed Int5/Int6 GPTQ** — Hessian-based sensitivity ranking determines which layers get int6 (clip_range=31) vs int5 (clip_range=15). The 60 most sensitive layers keep int6 precision; the 6 least sensitive get int5 to save artifact bytes. Combined with full GPTQ and brotli-11 compression.
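
A minimal sketch of the MuonEq-R step (item 1), assuming the quintic Newton-Schulz coefficients from the public Muon implementation; `muon_eq_r_update` and the exact placement of the row normalization are a reading of the description above, not the PR's code:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate orthogonalization via the quintic Newton-Schulz iteration
    used by Muon (coefficients from the public Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)           # scale so the spectral norm is <= 1
    transposed = X.size(-2) > X.size(-1)
    if transposed:                     # keep the Gram matrix at the smaller size
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)

def muon_eq_r_update(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """MuonEq-R sketch: normalize each row of the 2-D gradient to unit L2 norm
    before NS5, which should help conditioning for non-square matrices."""
    grad_eq = grad / (grad.norm(dim=-1, keepdim=True) + eps)
    return newton_schulz5(grad_eq)
```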
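
For the depth recurrence (item 2), a sketch of the forward wiring and the step-3000 activation with 20-step warmup; the blend `x + scale * (block(x) - x)` is an assumed way to ramp the virtual layers in, and `blocks` stands in for the model's transformer blocks:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """After the 11 physical blocks, blocks 4 and 5 run once more with fully
    shared weights (virtual layers 12-13). The blending scheme is an assumption."""

    def __init__(self, blocks: nn.ModuleList, recur_layers=(4, 5),
                 recur_start_step=3000, recur_warmup_steps=20):
        super().__init__()
        self.blocks = blocks
        self.recur_layers = recur_layers
        self.recur_start_step = recur_start_step
        self.recur_warmup_steps = recur_warmup_steps

    def recurrence_scale(self, step: int) -> float:
        # 0 before step 3000, then a 20-step linear ramp up to 1.0
        return min(max(step - self.recur_start_step, 0) / self.recur_warmup_steps, 1.0)

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        for block in self.blocks:                  # 11 physical layers
            x = block(x)
        scale = self.recurrence_scale(step)
        if scale > 0.0:
            for i in self.recur_layers:            # virtual layers 12-13, zero new params
                x = x + scale * (self.blocks[i](x) - x)
        return x
```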
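
For the mixed int5/int6 split (item 3), a sketch of how a Hessian-based score could drive the 60/6 assignment; the scoring function below is an assumption (the description only says "Hessian sensitivity"), while clip_range 31/15 matches the symmetric int6/int5 ranges quoted above:

```python
import torch

def hessian_sensitivity(weight: torch.Tensor, hessian_diag: torch.Tensor,
                        clip_range: int = 15) -> float:
    """Assumed per-layer score: squared error from symmetric int5 rounding,
    weighted by the GPTQ Hessian diagonal (roughly E[x^2] per input feature)."""
    scale = (weight.abs().amax(dim=1, keepdim=True) / clip_range).clamp_min(1e-8)
    q = torch.clamp(torch.round(weight / scale), -clip_range, clip_range)
    err = (weight - q * scale) ** 2
    return float((err * hessian_diag).sum())        # broadcast over output rows

def assign_mixed_precision(sensitivity: dict, n_int6: int = 60) -> dict:
    """Keep int6 (clip_range=31) for the n_int6 most sensitive layers,
    int5 (clip_range=15) for the rest -> 60 + 6 with this submission's counts."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    return {name: (6, 31) if i < n_int6 else (5, 15)
            for i, name in enumerate(ranked)}
```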

### Carried from PR #1218

- 4096 SentencePiece BPE vocabulary
- 4.0x MLP multiplier with sigmoid-gated activation
- Weight decay 0.085 (high WD for better compression)
- Full Hessian GPTQ quantization
- XSA-all-11 attention pattern
- BigramHash embedding (2816x160)
- Sigmoid-gated skip connections
- Soft-round QAT
- Split-LR training
- Brotli-11 compression with byte shuffle (see the packing sketch below)
- EMA (decay 0.997)
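
A sketch of the byte-shuffle + brotli-11 packing carried over from PR #1218, assuming the quantized tensors are serialized as fixed-width integer arrays; the exact layout used in `train_gpt.py` may differ:

```python
import brotli
import numpy as np

def pack_artifact(quantized: np.ndarray) -> bytes:
    """Byte shuffle (group same-significance bytes of each element together,
    which usually helps the entropy coder), then brotli at quality 11."""
    flat = np.ascontiguousarray(quantized).reshape(-1)
    raw = flat.view(np.uint8).reshape(-1, flat.dtype.itemsize)
    shuffled = np.ascontiguousarray(raw.T).tobytes()
    return brotli.compress(shuffled, quality=11)

def unpack_artifact(blob: bytes, dtype, shape) -> np.ndarray:
    """Inverse: decompress, undo the byte shuffle, reinterpret as `dtype`."""
    raw = np.frombuffer(brotli.decompress(blob), dtype=np.uint8)
    itemsize = np.dtype(dtype).itemsize
    deshuffled = np.ascontiguousarray(raw.reshape(itemsize, -1).T)
    return deshuffled.view(dtype).reshape(shape)
```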

### Configuration

```bash
NCCL_NET=Socket \
DATA_DIR=./data \
SEED=1337 \
MIXED_QUANT=1 \
N_INT6_LAYERS=60 \
RECUR_LAYERS=4,5 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)

### Core Results

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|-------------|-----------------|------------------|
| 1337 | 5,541 | 106.5 | 1.1000 | 1.0939 | 2.51667 | 15,933,457 |
| 42 | 5,530 | 106.7 | 1.0987 | 1.0922 | 2.51279 | 15,981,324 |
| 0 | 5,543 | 106.5 | 1.0988 | 1.0927 | 2.51394 | 15,960,050 |
| **Mean** | **5,538** | **106.6** | **1.0992** | **1.0929** | **2.51447** | **15,958,277** |

### Supplemental Diagnostics

| Seed | Post-EMA BPB | Roundtrip BPB | Sliding BPB | val_loss (nats) | Code size (bytes) | Total submission (bytes) | Train time | Eval time |
|------|--------------|---------------|-------------|-----------------|-------------------|--------------------------|------------|-----------|
| 1337 | 1.1000 | 1.1122 | 1.0939 | 2.51667 | 21,084 | 15,933,457 | 590s | 83s |
| 42 | 1.0987 | 1.1106 | 1.0922 | 2.51279 | 21,084 | 15,981,324 | 590s | 83s |
| 0 | 1.0988 | 1.1113 | 1.0927 | 2.51394 | 21,084 | 15,960,050 | 590s | 83s |
| **Mean** | **1.0992** | **1.1114** | **1.0929** | **2.51447** | **21,084** | **15,958,277** | **590s** | **83s** |

### Rule Compliance

- No TTT (no test-time training or adaptation)
- No SLOT (no scored-position lookup table)
- No validation data during training
- No training data during evaluation
- Artifact < 16,000,000 bytes for ALL seeds (max: 15,981,324)
- Train < 600s on 8xH100 SXM (590s)
- Eval < 600s on 8xH100 SXM (~83s)

### Architecture

- 11 layers + 2 virtual (depth recurrence on layers 4,5)
- d_model = 512, MLP 4x (2048), 4 heads
- 4096 SentencePiece BPE vocabulary
- BigramHash (2816x160) token embedding
- Sigmoid-gated skip connections with soft-round QAT
- MuonEq-R optimizer with row normalization
- Full Hessian GPTQ with mixed int5/int6 precision via sensitivity ranking
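
For reference, the bullets above collected into a single config sketch; the field names are illustrative, not the actual identifiers in `train_gpt.py`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Architecture summary of this submission (names are assumptions)."""
    n_layers: int = 11                 # physical transformer blocks
    recur_layers: tuple = (4, 5)       # repeated once -> 13 virtual layers
    d_model: int = 512
    n_heads: int = 4
    mlp_mult: float = 4.0              # hidden size 2048, sigmoid-gated
    vocab_size: int = 4096             # SentencePiece BPE
    bigram_hash_shape: tuple = (2816, 160)
    weight_decay: float = 0.085
    ema_decay: float = 0.997
    n_int6_layers: int = 60            # mixed int5/int6 GPTQ split
    n_int5_layers: int = 6
```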

### Requirements

- PyTorch 2.9.1+cu128
- flash-attn 2.8.3
- sentencepiece
- brotli
- 8x H100 SXM 80GB

### Run Command (3-seed loop)

```bash
for SEED in 1337 42 0; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  SEED=$SEED \
  MIXED_QUANT=1 \
  N_INT6_LAYERS=60 \
  RECUR_LAYERS=4,5 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
    2>&1 | tee train_seed${SEED}.log
done
```

### Lineage

PR #1019 (ValCalib + GPTQ + XSA + BigramHash, 1.1147) -> PR #1218 (4096-Vocab + MLP 4x + WD 0.085, 1.0979) -> this (MuonEq-R + Depth Recurrence + Mixed Quant, 1.0929)

### Credits

- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture — the foundation)
- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
- @msisovic for PR #1204 (depth recurrence concept)
- MuonEq-R inspired by equalized gradient normalization literature

### Included Files

- `train_gpt.py` — full training + quantization + evaluation script (21,084 bytes, self-extracting)
- `train_seed1337.log`, `train_seed42.log`, `train_seed0.log` — all seed logs
- `submission.json` — leaderboard metadata