# Record: Coprime-Stride Loader + Full Hessian GPTQ + XSA-all + Optimized GPTQ Reserve (val_bpb 1.1136)

**val_bpb: 1.1136** (3-seed mean, std 0.0003) | **~15.90 MB** | 8xH100 SXM, 600s train, ~85s eval

Built on [PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun and [PR #1060](https://github.com/openai/parameter-golf/pull/1060) by @dexhunter.

## Results (8xH100 SXM, no TTT)

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 1337 | **1.1136** | 15,901,403 |
| 42 | **1.1133** | 15,888,867 |
| 999 | **1.1139** | 15,892,171 |
| **Mean +/- Std** | **1.1136 +/- 0.0003** | |
## What's New

This submission extends PR #1060's coprime-stride loader + Full Hessian GPTQ stack with two targeted improvements, plus one experimental change that did not make the final configuration:

### 1. GPTQ Reserve Optimization
Reduced the GPTQ calibration reserve from 14s to 10s. PR #1060's GPTQ calibration completes in ~8.4s, so a 14s reserve wastes ~4s of training budget. Reclaiming those 4s recovers ~44 additional training steps at 91ms/step, which translates to a measurable BPB improvement.
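The reserve arithmetic above can be checked in a few lines (step time and budgets taken from this record; the variable names are illustrative):

```python
STEP_MS = 91              # per-step time from the timing table
OLD_RESERVE_MS = 14_000   # PR #1060's GPTQ reserve
NEW_RESERVE_MS = 10_000   # this submission's reserve
MEASURED_GPTQ_MS = 8_400  # observed calibration time (~8.4s)

# Time returned to the training loop by shrinking the reserve.
reclaimed_ms = OLD_RESERVE_MS - NEW_RESERVE_MS
extra_steps = reclaimed_ms / STEP_MS  # ~44 extra optimizer steps

# The 10s reserve still leaves headroom over the measured calibration time.
headroom_ms = NEW_RESERVE_MS - MEASURED_GPTQ_MS

print(round(extra_steps), headroom_ms)  # 44 1600
```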
| 22 | + |
| 23 | +### 2. FA3/FA2 Graceful Fallback |
| 24 | +Added try/except import for `flash_attn_interface` (FA3) with fallback to `flash_attn` (FA2). Allows the same script to run on pods with or without FA3 Hopper kernels built. |
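A fallback import along these lines (module names as in the text; the final `None` branch is added here so the sketch also runs on machines with neither package installed):

```python
# Prefer FA3 Hopper kernels, fall back to FA2, else mark the fused backend absent.
try:
    from flash_attn_interface import flash_attn_func  # FA3 (Hopper)
    ATTN_BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func  # FA2
        ATTN_BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None  # no fused attention kernel available
        ATTN_BACKEND = "none"

print(ATTN_BACKEND)
```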
| 25 | + |
| 26 | +### 3. FP32 SWA Accumulation (experimental, not used in final) |
| 27 | +Fixed SWA accumulation to use FP32 instead of model dtype (BF16). A/B testing showed EMA(0.997) still outperforms SWA on this stack by ~0.0006 BPB, so EMA is used for the final submission. |
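For reference, the two averaging rules being compared, as a pure-Python scalar sketch (in the training script these updates run over parameter tensors, with the fix above keeping the SWA accumulator in FP32):

```python
def ema_update(avg: float, param: float, decay: float = 0.997) -> float:
    """EMA(0.997): exponential moving average, used in the final submission."""
    return decay * avg + (1.0 - decay) * param

def swa_update(avg: float, param: float, n_averaged: int) -> float:
    """SWA: equal-weight running mean over checkpoints seen so far."""
    return avg + (param - avg) / (n_averaged + 1)

# Toy trajectory of one scalar "parameter" over a few steps.
avg_ema, avg_swa = 0.0, 0.0
for step, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    avg_ema = ema_update(avg_ema, p)
    avg_swa = swa_update(avg_swa, p, n_averaged=step)

print(avg_swa)  # 2.5 -- the plain mean of the four values
```

With a decay as high as 0.997, the EMA moves very slowly, which is why it behaves so differently from the equal-weight SWA mean over a short run.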
| 28 | + |
| 29 | +## Architecture |
| 30 | + |
| 31 | +PR #549 + PR #1060 stack: |
| 32 | +- 11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)^2 |
| 33 | +- Coprime-stride multi-shard data pipeline |
| 34 | +- XSA on all 11 layers, BigramHash(2816x112), SmearGate |
| 35 | +- Partial RoPE (16d), LN Scale, EMA(0.997) |
| 36 | +- Full Hessian GPTQ int6 + LZMA compression (10s reserve) |
| 37 | +- Parallel Muon + Parameter Banking, FA3 Hopper |
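The 8H/4KV GQA layout means each KV head is shared by two query heads. A minimal index-mapping sketch (head counts from the list above; the mapping function is illustrative, not lifted from the training script):

```python
N_Q_HEADS = 8   # query heads
N_KV_HEADS = 4  # key/value heads (GQA)

GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """Which KV head a given query head attends with."""
    return q_head // GROUP_SIZE

# Query heads 0-7 map onto KV heads 0-3 in contiguous pairs.
mapping = [kv_head_for(h) for h in range(N_Q_HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the KV-cache and KV-projection parameters relative to full multi-head attention, which matters under a 16 MB artifact budget.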
| 38 | + |
| 39 | +## Timing |
| 40 | + |
| 41 | +| Phase | Time | |
| 42 | +|-------|------| |
| 43 | +| Training (~6,479 steps @ 91ms) | 590s | |
| 44 | +| GPTQ calibration + quantization | 10s (reserved from training) | |
| 45 | +| Sliding window eval (stride=64) | ~85s | |
| 46 | +| **Total eval** | **~85s** | |
| 47 | + |
| 48 | +## Env Vars (overrides from defaults) |
| 49 | + |
| 50 | +``` |
| 51 | +BIGRAM_VOCAB_SIZE=2816 |
| 52 | +BIGRAM_DIM=112 |
| 53 | +XSA_LAST_N=11 |
| 54 | +USE_GPTQ=1 |
| 55 | +GPTQ_RESERVE_MS=10000 |
| 56 | +WARMDOWN_ITERS=4000 |
| 57 | +SWA_APPLY=0 |
| 58 | +TTT_ENABLED=0 |
| 59 | +``` |
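These overrides can be consumed in the usual way; a sketch, assuming the script reads knobs via `os.environ` (the default values below are placeholders, not the script's actual defaults):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer knob from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

# Example: the GPTQ knobs from the list above.
os.environ["GPTQ_RESERVE_MS"] = "10000"  # set, as in this record
os.environ.pop("USE_GPTQ", None)         # unset for the demo

gptq_reserve_ms = env_int("GPTQ_RESERVE_MS", 14_000)  # -> 10000 (override wins)
use_gptq = env_int("USE_GPTQ", 0) == 1                # -> False (falls back to 0)

print(gptq_reserve_ms, use_gptq)
```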
| 60 | + |
| 61 | +## Rule Compliance |
| 62 | + |
| 63 | +- Standard F.cross_entropy scoring (softmax, sum=1) |
| 64 | +- No TTT, no mixer, no eval-built adaptation, no unnormalized scoring |
| 65 | +- Full `fineweb_val_*` split in canonical sorted order with tokenizer-derived byte accounting |
| 66 | +- Artifact < 16,000,000 bytes (all 3 seeds) |
| 67 | +- Training < 600s, eval < 600s |
| 68 | +- Causal sliding-window evaluation on the full validation split (stride=64) |
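Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to a full window of left context. A toy sketch of the bookkeeping (only the stride comes from this record; the window length and helper are illustrative):

```python
WINDOW = 256  # illustrative context length; the real script uses the model's block size
STRIDE = 64   # stride from the record: each window scores its last 64 positions

def score_spans(n_tokens: int):
    """Yield (ctx_start, score_start, score_end) so every token is scored once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        ctx_start = max(0, pos + STRIDE - WINDOW)  # left edge of the context window
        end = min(pos + STRIDE, n_tokens)
        spans.append((ctx_start, pos, end))  # score [pos, end) with context from ctx_start
        pos = end
    return spans

spans = score_spans(300)
covered = sum(end - start for _, start, end in spans)
print(covered)  # 300 -- every token scored exactly once, in stride-64 chunks
```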
| 69 | + |
| 70 | +## Reproduction |
| 71 | + |
| 72 | +```bash |
| 73 | +# From this records folder (with data symlinked): |
| 74 | +SEED=1337 \ |
| 75 | +BIGRAM_VOCAB_SIZE=2816 \ |
| 76 | +BIGRAM_DIM=112 \ |
| 77 | +XSA_LAST_N=11 \ |
| 78 | +USE_GPTQ=1 \ |
| 79 | +GPTQ_RESERVE_MS=10000 \ |
| 80 | +WARMDOWN_ITERS=4000 \ |
| 81 | +SWA_APPLY=0 \ |
| 82 | +TTT_ENABLED=0 \ |
| 83 | +torchrun --standalone --nproc_per_node=8 train_gpt.py |
| 84 | +``` |
| 85 | + |
| 86 | +Environment: PyTorch 2.6+, Flash Attention 3 (`flash_attn_interface`), 8xH100 SXM. |
| 87 | + |
| 88 | +## Credits |
| 89 | + |
| 90 | +- **Base scaffold**: [PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun (LeakyReLU^2 + Parallel Muon) |
| 91 | +- **Coprime-stride loader + Full GPTQ + XSA-all**: [PR #1060](https://github.com/openai/parameter-golf/pull/1060) by @dexhunter |
| 92 | +- **Data pipeline ideas**: [PR #726](https://github.com/openai/parameter-golf/pull/726) by @DeepReinforce |
| 93 | +- **Full Hessian GPTQ**: [PR #634](https://github.com/openai/parameter-golf/pull/634) by @raahilshah |
| 94 | +- **XSA**: [PR #287](https://github.com/openai/parameter-golf/pull/287) by @jfprincz |