Commit cf068e9
Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1136 (3-seed mean)
3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003) Built on PR openai#549 + PR openai#1060 with optimized GPTQ reserve (10s vs 14s).
1 parent 50390d6 commit cf068e9

6 files changed: 2440 additions & 0 deletions
# Record: Coprime-Stride Loader + Full Hessian GPTQ + XSA-all + Optimized GPTQ Reserve (val_bpb 1.1136)

**val_bpb: 1.1136** (3-seed mean, std 0.0003) | **~15.90 MB** | 8xH100 SXM, 600s train, ~85s eval

Built on [PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun and [PR #1060](https://github.com/openai/parameter-golf/pull/1060) by @dexhunter.

## Results (8xH100 SXM, no TTT)

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 1337 | **1.1136** | 15,901,403 |
| 42 | **1.1133** | 15,888,867 |
| 999 | **1.1139** | 15,892,171 |
| **Mean ± Std** | **1.1136 ± 0.0003** | |
## What's New

This submission extends PR #1060's coprime-stride loader + Full Hessian GPTQ stack with two targeted improvements, plus one experimental change that was evaluated but not adopted:

### 1. GPTQ Reserve Optimization

Reduced the GPTQ calibration reserve from 14s to 10s. PR #1060's GPTQ calibration completes in ~8.4s, so a 14s reserve wastes ~4s of training budget. The shorter reserve recovers ~44 additional training steps at 91ms/step, which translates into a measurable BPB improvement.
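The ~44-step figure follows directly from the budget arithmetic:

```python
# Budget arithmetic for the reserve change: shrinking the GPTQ reserve
# from 14s to 10s returns 4s of wall clock to training at 91 ms/step.
old_reserve_ms = 14_000
new_reserve_ms = 10_000
step_ms = 91

recovered_steps = (old_reserve_ms - new_reserve_ms) / step_ms
print(round(recovered_steps))  # prints 44
```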
### 2. FA3/FA2 Graceful Fallback

Added a try/except import for `flash_attn_interface` (FA3) with a fallback to `flash_attn` (FA2). This allows the same script to run on pods with or without the FA3 Hopper kernels built.
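A minimal sketch of such a fallback (the helper name is illustrative, not the training script's actual code; `flash_attn_func` is the public entry point exposed by both packages):

```python
def load_flash_attn():
    """Try FA3 (Hopper-only `flash_attn_interface`) first, then fall back
    to FA2 (`flash_attn`). Returns (attention_fn, version); version is 0
    when neither package is installed."""
    try:
        from flash_attn_interface import flash_attn_func  # FA3 Hopper kernels
        return flash_attn_func, 3
    except ImportError:
        pass
    try:
        from flash_attn import flash_attn_func  # FA2 fallback
        return flash_attn_func, 2
    except ImportError:
        return None, 0
```

Resolving the import once at startup keeps the rest of the script agnostic to which kernel build is present on the pod.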
### 3. FP32 SWA Accumulation (experimental, not used in final)

Fixed SWA accumulation to use FP32 instead of the model dtype (BF16). A/B testing showed that EMA(0.997) still outperforms SWA on this stack by ~0.0006 BPB, so EMA is used for the final submission.
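A sketch of the two averaging variants, assuming the standard running-mean and EMA update rules (function names are illustrative, not the submission's actual code):

```python
import torch

def swa_update(swa_buf: torch.Tensor, param: torch.Tensor, n: int) -> None:
    """In-place running average over n+1 checkpoints.

    Keeping swa_buf in FP32 matters: once the increment weight 1/(n+1)
    drops below BF16's ~3 decimal digits of precision, a BF16 accumulator
    stops absorbing new checkpoints and the average silently freezes.
    """
    swa_buf.mul_(n / (n + 1)).add_(param.to(swa_buf.dtype), alpha=1.0 / (n + 1))

def ema_update(ema_buf: torch.Tensor, param: torch.Tensor,
               decay: float = 0.997) -> None:
    """In-place EMA(0.997) update, the variant kept for the final run."""
    ema_buf.mul_(decay).add_(param.to(ema_buf.dtype), alpha=1.0 - decay)
```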
## Architecture

The PR #549 + PR #1060 stack:

- 11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)^2
- Coprime-stride multi-shard data pipeline
- XSA on all 11 layers, BigramHash(2816x112), SmearGate
- Partial RoPE (16d), LN Scale, EMA(0.997)
- Full Hessian GPTQ int6 + LZMA compression (10s reserve)
- Parallel Muon + Parameter Banking, FA3 Hopper
## Timing

| Phase | Time |
|-------|------|
| Training (~6,479 steps @ 91ms) | 590s |
| GPTQ calibration + quantization | 10s (reserved from training) |
| Sliding window eval (stride=64) | ~85s |
| **Total eval** | **~85s** |
## Env Vars (overrides from defaults)

```
BIGRAM_VOCAB_SIZE=2816
BIGRAM_DIM=112
XSA_LAST_N=11
USE_GPTQ=1
GPTQ_RESERVE_MS=10000
WARMDOWN_ITERS=4000
SWA_APPLY=0
TTT_ENABLED=0
```
## Rule Compliance

- Standard F.cross_entropy scoring (softmax, sum=1)
- No TTT, no mixer, no eval-built adaptation, no unnormalized scoring
- Full `fineweb_val_*` split in canonical sorted order with tokenizer-derived byte accounting
- Artifact < 16,000,000 bytes (all 3 seeds)
- Training < 600s, eval < 600s
- Causal sliding-window evaluation on the full validation split (stride=64)
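The stride-64 sliding-window schedule can be sketched like this (a hypothetical chunking helper under assumed semantics; the real eval loop and its context-window length may differ). Each chunk scores only its last `stride` tokens, so every token is scored exactly once with the longest available causal context:

```python
def sliding_window_chunks(seq_len: int, window: int, stride: int = 64):
    """Yield (ctx_start, score_start, score_end) spans for causal
    sliding-window evaluation: tokens in [score_start, score_end) are
    scored using context from ctx_start onward."""
    chunks = []
    pos = 0
    while pos < seq_len:
        score_end = min(pos + stride, seq_len)
        ctx_start = max(0, score_end - window)  # clamp context to window
        chunks.append((ctx_start, pos, score_end))
        pos = score_end
    return chunks
```

Smaller strides score tokens with more context (better BPB) at the cost of more forward passes, which is why the eval phase dominates the ~85s budget.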
## Reproduction

```bash
# From this records folder (with data symlinked):
SEED=1337 \
BIGRAM_VOCAB_SIZE=2816 \
BIGRAM_DIM=112 \
XSA_LAST_N=11 \
USE_GPTQ=1 \
GPTQ_RESERVE_MS=10000 \
WARMDOWN_ITERS=4000 \
SWA_APPLY=0 \
TTT_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
Environment: PyTorch 2.6+, Flash Attention 3 (`flash_attn_interface`), 8xH100 SXM.

## Credits

- **Base scaffold**: [PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun (LeakyReLU^2 + Parallel Muon)
- **Coprime-stride loader + Full GPTQ + XSA-all**: [PR #1060](https://github.com/openai/parameter-golf/pull/1060) by @dexhunter
- **Data pipeline ideas**: [PR #726](https://github.com/openai/parameter-golf/pull/726) by @DeepReinforce
- **Full Hessian GPTQ**: [PR #634](https://github.com/openai/parameter-golf/pull/634) by @raahilshah
- **XSA**: [PR #287](https://github.com/openai/parameter-golf/pull/287) by @jfprincz
{
  "name": "Coprime-Stride Loader + Full Hessian GPTQ + XSA-all + FA3 Fallback",
  "val_bpb": 1.1136,
  "bytes_total": 15901403,
  "blurb": "Coprime-stride multi-shard data pipeline (PR #726 style) + Full Hessian GPTQ with Cholesky error compensation + XSA on all 11 layers + BigramHash(2816x112) + EMA(0.997) + Parallel Muon + FA3/FA2 graceful fallback. GPTQ reserve reduced from 14s to 10s for ~44 extra training steps. No TTT. 3-seed mean: 1.1136 (std 0.0003). Built on PR #549 and PR #1060.",
  "author": "Bortlesboat",
  "github_id": "Bortlesboat",
  "date": "2026-03-29"
}
