
Commit b955b6d

Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)
Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:

- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
## Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ (val_bpb: 1.0929)
**val_bpb = 1.0929** (3-seed mean, std 0.0009) | **2.5145 nats** | **~15.96 MB** | 8xH100 SXM, 590s train + ~83s eval | No TTT

Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev (4096-Vocab + 4.0-MLP-mult + 0.085-WD).

Previous: [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> this (1.0929)
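As a consistency check on the headline numbers: bits per byte and nats per token are related through the tokenizer's compression rate. The ~3.32 bytes/token figure below is back-computed from the two reported values, not stated anywhere in this PR:

```math
\text{bpb} = \frac{\text{val\_loss (nats/token)}}{\ln 2 \times \text{bytes/token}} = \frac{2.5145}{0.6931 \times 3.3193} \approx 1.0929
```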
### Changes from PR #1218
| Setting | PR #1218 | This |
|---|---|---|
| val_bpb | 1.09785 | **1.09290** |
| Optimizer | Muon | **MuonEq-R** (row-norm before NS5) |
| Depth recurrence | None | **Layers 4,5 repeated** (RECUR_LAYERS=4,5) |
| Recurrence MLP sharing | N/A | **Fully shared** (REPEAT_UNTIE_MLP=none) |
| Mixed quantization | No | **Yes** (60 int6 + 6 int5 via Hessian sensitivity) |
| Recurrence activation | N/A | Step 3000 with 20-step warmup |
| Everything else | Same | Same |
### What's New
1. **MuonEq-R** — Row-normalizes gradient matrices before Newton-Schulz orthogonalization in the Muon optimizer. Improves conditioning of the NS5 iteration for non-square weight matrices. Zero-byte cost, ~0.001 BPB improvement. (First sketch after this list.)
2. **Depth Recurrence** — Layers 4 and 5 are repeated once after the initial forward pass (virtual layers 12-13 on top of 11 physical layers). MLP weights are fully shared during recurrence (REPEAT_UNTIE_MLP=none), so this adds zero extra parameters. Activated at step 3000 with a 20-step linear warmup. ~0.003 BPB improvement. (Second sketch after this list.)
3. **Mixed Int5/Int6 GPTQ** — Hessian-based sensitivity ranking determines which layers get int6 (clip_range=31) vs int5 (clip_range=15). The 60 most sensitive layers keep int6 precision; the 6 least sensitive get int5 to save artifact bytes. Combined with full GPTQ and brotli-11 compression. (Third sketch after this list.)
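A minimal sketch of the MuonEq-R step, assuming the row norm is applied to the raw 2D gradient and reusing the quintic Newton-Schulz coefficients from the public Muon implementation; function and variable names are illustrative, not the actual `train_gpt.py` internals:

```python
import torch

def muoneq_r_orthogonalize(G: torch.Tensor, ns_steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Row-normalize a 2D gradient, then approximate its nearest semi-orthogonal matrix."""
    # The "Eq-R" step (assumed placement): give every row unit L2 norm first.
    X = G / (G.norm(dim=1, keepdim=True) + eps)
    # Work on the wide orientation so NS5 iterates on the smaller Gram matrix.
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    # Frobenius scaling keeps the spectral norm <= 1, which NS5 needs to converge.
    X = X / (X.norm() + eps)
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the public Muon optimizer
    for _ in range(ns_steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```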
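A sketch of the depth recurrence, under two stated assumptions: the whole block (attention + MLP) is reused, although the PR only guarantees the MLP is fully shared, and the 20-step warmup linearly blends the recurred output into the residual stream. Names are illustrative:

```python
import torch
import torch.nn as nn

class DepthRecurrence(nn.Module):
    """Re-runs selected blocks after the main forward pass; zero extra parameters."""
    def __init__(self, blocks: nn.ModuleList, recur_layers=(4, 5),
                 start_step: int = 3000, warmup_steps: int = 20):
        super().__init__()
        self.blocks = blocks              # 11 physical transformer blocks
        self.recur_layers = recur_layers  # RECUR_LAYERS=4,5
        self.start_step = start_step
        self.warmup_steps = warmup_steps

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        for block in self.blocks:         # physical layers 0..10
            x = block(x)
        # Gate ramps 0 -> 1 linearly over steps 3000..3020 (assumed blend form).
        alpha = min(max((step - self.start_step) / self.warmup_steps, 0.0), 1.0)
        if alpha > 0.0:
            for i in self.recur_layers:   # virtual layers 12-13, weights shared with 4,5
                x = x + alpha * (self.blocks[i](x) - x)
        return x
```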
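Finally, a sketch of the bit-width assignment. The PR specifies only "Hessian sensitivity ranking"; the concrete score used here (diagonal-Hessian-weighted int5 rounding error) and all names are assumptions. With 66 quantized tensors total, this reproduces the 60/6 split in the table above:

```python
import torch

def assign_clip_ranges(weights: dict[str, torch.Tensor],
                       hess_diag: dict[str, torch.Tensor],
                       n_int6: int = 60) -> dict[str, int]:
    """Top n_int6 most sensitive layers keep int6 (clip_range=31); the rest get int5 (15)."""
    def sensitivity(name: str) -> float:
        w, h = weights[name], hess_diag[name]
        # Assumed proxy: expected loss increase ~ sum_i h_ii * (int5 rounding error)^2.
        scale = w.abs().max() / 15                    # symmetric int5 grid
        err = w - (w / scale).round().clamp(-15, 15) * scale
        return (h * err.pow(2)).sum().item()
    ranked = sorted(weights, key=sensitivity, reverse=True)
    return {name: (31 if rank < n_int6 else 15)       # clip_range per layer
            for rank, name in enumerate(ranked)}
```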
### Carried from PR #1218
- 4096 SentencePiece BPE vocabulary
- 4.0x MLP multiplier with sigmoid-gated activation
- Weight decay 0.085 (high WD for better compression)
- Full Hessian GPTQ quantization
- XSA-all-11 attention pattern
- BigramHash embedding (2816x160)
- Sigmoid-gated skip connections
- Soft-round QAT
- Split-LR training
- Brotli-11 compression with byte shuffle
- EMA (decay 0.997; sketch below)
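The EMA entry above is plain exponential weight averaging. A minimal sketch, assuming a per-step update and illustrative names; the actual hook placement in `train_gpt.py` is not shown in this PR:

```python
import copy
import torch

class WeightEMA:
    """Exponential moving average of model weights, decay 0.997."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # assumption: eval/quantization reads these
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * live weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```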
### Configuration
```bash
NCCL_NET=Socket \
DATA_DIR=./data \
SEED=1337 \
MIXED_QUANT=1 \
N_INT6_LAYERS=60 \
RECUR_LAYERS=4,5 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)
### Core Results

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|-------------|-----------------|------------------|
| 1337 | 5,541 | 106.5 | 1.1000 | 1.0939 | 2.51667 | 15,933,457 |
| 42 | 5,530 | 106.7 | 1.0987 | 1.0922 | 2.51279 | 15,981,324 |
| 0 | 5,543 | 106.5 | 1.0988 | 1.0927 | 2.51394 | 15,960,050 |
| **Mean** | **5,538** | **106.6** | **1.0992** | **1.0929** | **2.51447** | **15,958,277** |
### Supplemental Diagnostics
| Seed | Post-EMA BPB | Roundtrip BPB | Sliding BPB | val_loss (nats) | Code size (bytes) | Total submission (bytes) | Train time | Eval time |
|------|--------------|---------------|-------------|-----------------|-------------------|--------------------------|------------|-----------|
| 1337 | 1.1000 | 1.1122 | 1.0939 | 2.51667 | 21,084 | 15,933,457 | 590s | 83s |
| 42 | 1.0987 | 1.1106 | 1.0922 | 2.51279 | 21,084 | 15,981,324 | 590s | 83s |
| 0 | 1.0988 | 1.1113 | 1.0927 | 2.51394 | 21,084 | 15,960,050 | 590s | 83s |
| **Mean** | **1.0992** | **1.1114** | **1.0929** | **2.51447** | **21,084** | **15,958,277** | **590s** | **83s** |
### Rule Compliance
- No TTT (no test-time training or adaptation)
- No SLOT (no scored-position lookup table)
- No validation data during training
- No training data during evaluation
- Artifact < 16,000,000 bytes for ALL seeds (max: 15,981,324)
- Train < 600s on 8xH100 SXM (590s)
- Eval < 600s on 8xH100 SXM (~83s)
### Architecture
- 11 physical layers + 2 virtual (depth recurrence on layers 4,5)
- d_model = 512, MLP 4x (2048), 4 heads
- 4096 SentencePiece BPE vocabulary
- BigramHash (2816x160) token embedding
- Sigmoid-gated skip connections with soft-round QAT
- MuonEq-R optimizer with row normalization
- Full Hessian GPTQ (int6) with mixed int5/int6 via sensitivity ranking
### Requirements
- PyTorch 2.9.1+cu128
- flash-attn 2.8.3
- sentencepiece
- brotli
- 8x H100 SXM 80GB
### Run Command (3-seed loop)
```bash
for SEED in 1337 42 0; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  SEED=$SEED \
  MIXED_QUANT=1 \
  N_INT6_LAYERS=60 \
  RECUR_LAYERS=4,5 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
    2>&1 | tee train_seed${SEED}.log
done
```
### Lineage
PR #1019 (ValCalib + GPTQ + XSA + BigramHash, 1.1147) -> PR #1218 (4096-Vocab + MLP 4x + WD 0.085, 1.0979) -> this (MuonEq-R + Depth Recurrence + Mixed Quant, 1.0929)
### Credits
- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture — the foundation)
- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
- @msisovic for PR #1204 (depth recurrence concept)
- MuonEq-R inspired by the equalized gradient normalization literature
### Included Files
- `train_gpt.py` — full training + quantization + evaluation script (21,084 bytes, self-extracting)
- `train_seed1337.log`, `train_seed42.log`, `train_seed0.log` — all seed logs
- `submission.json` — leaderboard metadata
`submission.json`:

```json
{
  "name": "Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ",
  "val_bpb": 1.0929,
  "bytes_total": 15981324,
  "blurb": "Adds MuonEq-R optimizer (row-norm before NS5), depth recurrence (layers 4,5 repeated with shared MLP), and mixed int5/int6 GPTQ to PR #1218's 4096-vocab high-WD stack. 3-seed mean 1.0929 BPB, all seeds under 16MB.",
  "author": "dexhunter",
  "github_id": "dexhunter",
  "date": "2026-04-02",
  "pre_quant_val_bpb": 1.0992,
  "bytes_model_compressed": 15960240,
  "bytes_code": 21084,
  "base_pr": 1218,
  "seeds": [1337, 42, 0],
  "seed_scores": [1.09386, 1.09217, 1.09267],
  "eval_time_seconds": [83, 83, 83]
}
```
