Commit 99b080b

Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean)
SLOT eval-time delta optimization + split early/late Muon LR + Full Hessian GPTQ int6 + sigmoid-gated skip connections + soft-round QAT + Brotli-11 + BigramHash(2816x160) + code minification. 3-seed mean: 1.1015 (std 0.0011), delta -0.0132 BPB / -0.0224 nats vs PR #1019.
1 parent 9d070df commit 99b080b

6 files changed, +707 −0 lines changed
# Record: SLOT + Split-LR + Full GPTQ + XSA-all (val_bpb: 1.1015)

**val_bpb: 1.1015** (3-seed mean, std 0.0011) | **1.8598 nats** | **~15.65 MB** | 8xH100 SXM, 600s train + 177s eval

Built on [PR #1019](https://github.com/openai/parameter-golf/pull/1019) by @abaybektursun.
Previous: [PR #549](https://github.com/openai/parameter-golf/pull/549) (1.1194) -> [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> this.

## Results (8xH100 SXM)

| Seed | Steps | ms/step | Post-EMA BPB | **Sliding+SLOT BPB** | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|----------------------|-----------------|------------------|
| 1337 | 6704 | 88.2 | 1.1309 | **1.10213** | 1.8609 | 15,647,124 |
| 42 | 6706 | 88.2 | 1.1289 | **1.10019** | 1.8576 | 15,658,061 |
| 2025 | 6684 | 88.4 | 1.1310 | **1.10216** | 1.8609 | 15,650,266 |
| **Mean** | **6698** | **88.3** | **1.1303** | **1.10149** | **1.8598** | **15,651,817** |
### Improvement vs SOTA

| Metric | Merged SOTA (PR #1019) | This submission | Delta |
|--------|------------------------|-----------------|-------|
| val_bpb (3-seed mean) | 1.1147 | **1.1015** | **-0.0132** |
| val_loss (nats) | 1.88218 | **1.85982** | **-0.02236** |

Clears the 0.005-nat record threshold by roughly 4.5x (0.02236 / 0.005).
## Changes vs Baseline (PR #1019)

### 1. SLOT: Sample-specific LM Optimization at Test-time

At eval time, for each sliding-window batch, we optimize a single additive delta vector (R^512) applied between the frozen hidden states and the logit projection. The forward pass is split into `forward_hidden()` (frozen, no grad) and `compute_logits()` (carries grad for the delta optimization).

- **Delta shape**: `[1, 1, 512]`, broadcasting across batch and sequence
- **Optimizer**: AdamW (lr=0.005, weight_decay=1e-8, eps=1e-5)
- **Steps**: 8 per batch
- **Eval-time overhead**: ~90s (well within the 600s eval budget)

SLOT is score-first: hidden states are computed under `torch.no_grad()`, the delta adapts through `compute_logits()` only, and final scoring uses the adapted logits. The model weights are never modified.

Reference: Hu et al., arXiv:2505.12392v2. Also used in PR #1128 and PR #1105.
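The loop can be sketched as follows. This is a minimal illustration, not the submission's code: the `forward_hidden()` / `compute_logits()` split and the AdamW settings match the description above, but `slot_eval_step` and its loss plumbing are assumed names for exposition.

```python
import torch
import torch.nn.functional as F


def slot_eval_step(model, inputs, targets, dim=512, lr=0.005, steps=8):
    """Optimize one additive delta in hidden space for a single eval batch.

    Model weights stay frozen; only `delta` ([1, 1, dim], broadcast over
    batch and sequence) is trained, then used for the final scoring pass.
    """
    with torch.no_grad():  # hidden states are computed once, frozen
        hidden = model.forward_hidden(inputs)

    delta = torch.zeros(1, 1, dim, dtype=hidden.dtype, device=hidden.device,
                        requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr, weight_decay=1e-8, eps=1e-5)

    for _ in range(steps):
        # Gradient flows only through compute_logits() into delta;
        # the optimizer never touches model parameters.
        logits = model.compute_logits(hidden + delta)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1))
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()

    with torch.no_grad():  # final scoring uses the adapted logits
        return model.compute_logits(hidden + delta)
```

In the real sliding-window eval, the adaptation objective and which tokens are scored per window (stride 64) add bookkeeping that is omitted here.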
### 2. Sigmoid-Gated Skip Connections

U-Net skip connections use learned sigmoid gates instead of simple addition:

```python
g = torch.sigmoid(skip_gates[i])
x = torch.lerp(skip_weights[i] * skip, x, g)
```

Each gate is initialized to 0, so it starts at sigmoid(0) = 0.5 (a balanced blend). Adds 2,560 params (5 gates x 512 dims).
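As a self-contained sketch of the blend above (module and parameter names are assumptions; only the gated-lerp formula comes from the text):

```python
import torch
import torch.nn as nn


class GatedSkip(nn.Module):
    """Sigmoid-gated U-Net skip: x = lerp(w * skip, x, sigmoid(gate)).

    Gates start at 0, so sigmoid(0) = 0.5 gives a balanced blend;
    5 gates x 512 dims = 2,560 extra parameters.
    """

    def __init__(self, n_skips=5, dim=512):
        super().__init__()
        self.skip_gates = nn.Parameter(torch.zeros(n_skips, dim))
        self.skip_weights = nn.Parameter(torch.ones(n_skips, dim))

    def forward(self, x, skip, i):
        g = torch.sigmoid(self.skip_gates[i])
        # torch.lerp(a, b, g) = a + g * (b - a):
        # g -> 0 keeps the (weighted) skip, g -> 1 keeps the main path.
        return torch.lerp(self.skip_weights[i] * skip, x, g)
```

At initialization this reduces to `0.5 * (skip + x)`, so training starts from something close to a plain additive skip.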
### 3. Soft-Round QAT with Alpha Ramp

Late QAT uses differentiable sigmoid rounding instead of a hard straight-through estimator:

```python
frac = scaled - floor(scaled)
soft_rounded = floor(scaled) + sigmoid(alpha * (frac - 0.5))
```

Alpha ramps from 1 (smooth) to 16 (near-hard) over 500 steps. This provides real gradients through the rounding op, letting weights adapt to the quantization grid.
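A runnable version of the two pieces, soft rounding plus the alpha ramp (function names and the linear ramp shape are assumptions; the formula and the 1 -> 16 / 500-step endpoints are from the text):

```python
import torch


def soft_round(scaled: torch.Tensor, alpha: float) -> torch.Tensor:
    """Differentiable rounding: approaches torch.round as alpha grows.

    floor() contributes zero gradient; the sigmoid term carries a real
    gradient with respect to `scaled` at every alpha.
    """
    frac = scaled - torch.floor(scaled)
    return torch.floor(scaled) + torch.sigmoid(alpha * (frac - 0.5))


def alpha_at(step: int, ramp_start: int, ramp_len: int = 500,
             lo: float = 1.0, hi: float = 16.0) -> float:
    """Linear ramp from smooth (1) to near-hard (16) over ramp_len steps."""
    t = min(max(step - ramp_start, 0) / ramp_len, 1.0)
    return lo + t * (hi - lo)
```

At alpha = 16 the sigmoid is already close to a hard step at frac = 0.5, so the final hard quantization changes the weights very little.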
### 4. Split Early/Late Muon Learning Rate

Bank gradients are scaled per-layer before the Muon reduce-scatter:

- Early layers (0-4): Muon LR = 0.025
- Late layers (5-10): Muon LR = 0.030

Late layers benefit from the higher LR (weaker gradient signal further from the loss).
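One way to realize this without touching the optimizer itself, sketched under assumptions: gradients live in a stacked per-layer "bank" tensor, and scaling each layer's slice by `target_lr / base_lr` before the reduce-scatter makes a single base Muon LR act as a per-layer LR. The function name and bank layout are illustrative.

```python
import torch

EARLY_LR, LATE_LR = 0.025, 0.030
BASE_LR = 0.030  # the LR actually passed to Muon


def scale_bank_grads(bank_grad: torch.Tensor, split: int = 5) -> torch.Tensor:
    """Scale per-layer gradient slices so layers 0..split-1 see an
    effective LR of EARLY_LR and layers split.. see LATE_LR.

    bank_grad: [n_layers, ...] stacked per-layer gradients, modified
    in place before the Muon reduce-scatter.
    """
    for i in range(bank_grad.size(0)):
        lr = EARLY_LR if i < split else LATE_LR
        bank_grad[i].mul_(lr / BASE_LR)
    return bank_grad
```

Note this pre-scaling trick only yields an exact per-layer LR if the optimizer's update is linear in the incoming gradient scale, which depends on how the Muon variant normalizes its updates.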
### 5. Warmdown = 4000 Steps

Extended warmdown from 3500 to 4000 estimated steps. This holds the LR higher for longer, giving the model more time at productive learning rates.
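For intuition, a constant-then-linear-decay schedule of the kind this tunes might look like the sketch below. The exact schedule shape, function name, and step counts other than the 4000-step warmdown are assumptions.

```python
def get_lr(step: int, base_lr: float, total_steps: int = 6700,
           warmdown: int = 4000) -> float:
    """Hold base_lr, then decay linearly to 0 over the final `warmdown`
    steps. A longer warmdown starts the decay earlier but keeps the LR
    above any given fraction of base_lr for more total steps."""
    decay_start = total_steps - warmdown
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown
```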
### 6. BigramHash(2816x160)

Enlarged the bigram embedding dimension from 112 to 160, keeping the same 2816 buckets. Richer per-bucket representation at minimal artifact cost.
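The idea, hashed (prev, current) token pairs into a small bucketed embedding table, can be sketched as below. The class name, hash constant, and padding choice are illustrative assumptions; only the 2816 x 160 shape is from the text.

```python
import torch
import torch.nn as nn


class BigramHash(nn.Module):
    """Hash each (previous, current) token pair into one of n_buckets
    and look up a dim-wide embedding for it."""

    def __init__(self, n_buckets=2816, dim=160):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:  # ids: [B, T]
        prev = torch.roll(ids, 1, dims=1)
        prev[:, 0] = 0  # no predecessor at position 0
        # Illustrative multiplicative hash; any pair-mixing hash works.
        h = (prev * 1000003 + ids) % self.n_buckets
        return self.emb(h)  # [B, T, dim]
```

Since the bucket count is fixed, growing `dim` from 112 to 160 only grows the table by 2816 x 48 parameters, which is small relative to the artifact.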
### 7. Code Minification

`pyminify` + LZMA2 + a base85 self-extracting wrapper reduce the code from 101KB to 23KB, freeing ~78KB of artifact budget for model weights.
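The self-extracting wrapper step can be done with the standard library alone; Python's `lzma` module uses the LZMA2-based xz container by default. A minimal sketch (the function name is an assumption, and the real pipeline would run `pyminify` on the source first):

```python
import base64
import lzma


def make_self_extracting(source: str) -> str:
    """Return a tiny Python stub that decompresses and exec()s `source`.

    The source is LZMA-compressed, then base85-encoded so the blob is a
    plain ASCII string literal safe to embed in a .py file.
    """
    blob = base64.b85encode(lzma.compress(source.encode())).decode()
    return (
        "import base64,lzma\n"
        f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())\n"
    )
```

Base85 is a good fit here because its alphabet contains no quote or backslash characters, so the blob embeds in a string literal without escaping overhead.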
### 8. Brotli-11 Compression with Byte-Shuffle

Replaces LZMA-6 with Brotli quality=11 plus stride-2 byte-shuffle preprocessing. Saves ~400KB vs LZMA.
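The byte-shuffle is a standard compression-friendly transform: with stride 2, all low bytes of (say) int16-packed weights are grouped together, then all high bytes, so the entropy coder sees longer runs of similar values. A sketch of the transform and its inverse (function names are assumptions; the shuffled stream would then go through the Brotli encoder at quality 11):

```python
import numpy as np


def byte_shuffle(raw: bytes, stride: int = 2) -> bytes:
    """Regroup bytes by position-within-element: all byte-0s first,
    then all byte-1s, etc. Feed the result to the Brotli compressor."""
    a = np.frombuffer(raw, dtype=np.uint8)
    assert len(a) % stride == 0
    return a.reshape(-1, stride).T.tobytes()


def byte_unshuffle(shuffled: bytes, stride: int = 2) -> bytes:
    """Exact inverse of byte_shuffle."""
    a = np.frombuffer(shuffled, dtype=np.uint8)
    return a.reshape(stride, -1).T.tobytes()
```

The transform is a pure permutation, so it costs nothing in fidelity and is trivially reversed at load time before dequantization.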
### 9. GPTQ Reserve 9s (was 14s)

Reduced the GPTQ calibration time reservation from 14s to 9s, gaining ~55 extra training steps.

## Negative Results (tested, did not help)

| Technique | Result | Notes |
|-----------|--------|-------|
| Turbo-Muon (AOL + Polar Express) | +2MB artifact bloat | Weight distribution changes break compression |
| No-GPTQ (PR #1120 style) | 0.005 BPB worse | GPTQ essential for our stack |
| Pure EngramLite swap | 0.003 BPB worse | Same-budget multi-head too diluted |
| ResidLambdas | 0.003 BPB worse | Quant error compounds through lambda scaling |
| LeakyReLU slope=0.3 | Neutral | |
| Partial key offset | Neutral | |
| BIGRAM_DIM=192 | 0.001 BPB worse | Diminishing returns past 160 |
| TTT (score-first SGD) | Neutral on Full GPTQ stack | Post-quant weights too well-optimized |
| Mixed int5/int6 GPTQ | Broken or worse | Needs full PR #1089-style pipeline |
## Architecture Summary

| Component | Setting | Source |
|-----------|---------|--------|
| Layers | 11 | PR #549 |
| Model dim | 512 | PR #549 |
| Heads / KV heads | 8 / 4 (GQA) | PR #549 |
| MLP mult | 3.0x (LeakyReLU(0.5)^2) | PR #549 |
| XSA | All 11 layers | PR #1019 |
| BigramHash | 2816 x 160 | **This submission** (dim=160) |
| ValueEmbedding | dim=128, layers 9,10 | PR #549 |
| SmearGate | F.pad causal shift | PR #549, optimized |
| Skip connections | Sigmoid-gated lerp | **This submission** |
| Quantization | Full Hessian GPTQ int6 | PR #1019 |
| Compression | Brotli-11 + byte-shuffle | **This submission** |
| Optimizer | Parallel Muon + Split-LR | **This submission** (split-LR) |
| QAT | Soft-round alpha ramp 1->16 | **This submission** |
| Eval | Sliding window stride=64 + SLOT | **This submission** (SLOT) |
| Code | LZMA2 self-extracting wrapper | **This submission** |
| Warmdown | 4000 steps | **This submission** |
| Params | 27.2M | |
## Setup & Reproduction

```bash
# Environment: 8xH100 SXM, PyTorch 2.9.1+cu128, flash-attn 2.8.3
export NCCL_NET=Socket  # Required on GCP H100
export SLOT_ENABLED=1
export BIGRAM_DIM=160
export WARMDOWN_ITERS=4000
export SLOT_LR=0.005
export SLOT_STEPS=8

# Run each seed with torchrun (evaluate.py handles the eval side)
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=2025 torchrun --standalone --nproc_per_node=8 train_gpt.py
```
## Acknowledgements

Thanks to **@0hq** and **@valerio-oai** for organizing and maintaining an excellent competition.

This submission builds directly on @abaybektursun's PR #549 and PR #1019, which established the LeakyReLU^2 + Parallel Muon + XSA + Full GPTQ stack. The SLOT technique follows Hu et al. (arXiv:2505.12392v2) and was independently validated by @AnubhavBharadwaaj (PR #1128) and @abaybektursun (PR #1105). The sigmoid-gated skip connection idea draws from @mikeapedia's PR #1089, and the code minification approach is adapted from the same PR's shrink pipeline.
Submission metadata (second changed file, 9 additions):

```json
{
  "name": "SLOT + Split-LR + Full GPTQ + Sigmoid-Gated Skips + Soft-Round QAT + XSA-all",
  "val_bpb": 1.1015,
  "bytes_total": 15658061,
  "blurb": "SLOT eval-time delta optimization (lr=0.005, 8 AdamW steps per batch) + split early/late Muon LR (0.025/0.030) + Full Hessian GPTQ int6 + sigmoid-gated U-Net skip connections + soft-round QAT with alpha ramp + Brotli-11 byte-shuffle compression + BigramHash(2816x160) + code minification (23KB wrapper). 3-seed mean: 1.1015 (std 0.0011). Built on PR #1019 by @abaybektursun.",
  "author": "dexhunter",
  "github_id": "dexhunter",
  "date": "2026-03-31"
}
```
