# Record: SLOT + Split-LR + Full GPTQ + XSA-all (val_bpb: 1.1015)

**val_bpb: 1.1015** (3-seed mean, std 0.0011) | **1.8598 nats** | **~15.65 MB** | 8xH100 SXM, 600s train + 177s eval

Built on [PR #1019](https://github.com/openai/parameter-golf/pull/1019) by @abaybektursun.
Previous: [PR #549](https://github.com/openai/parameter-golf/pull/549) (1.1194) -> [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> this.

## Results (8xH100 SXM)

| Seed | Steps | ms/step | Post-EMA BPB | **Sliding+SLOT BPB** | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|----------------------|-----------------|------------------|
| 1337 | 6704 | 88.2 | 1.1309 | **1.10213** | 1.8609 | 15,647,124 |
| 42 | 6706 | 88.2 | 1.1289 | **1.10019** | 1.8576 | 15,658,061 |
| 2025 | 6684 | 88.4 | 1.1310 | **1.10216** | 1.8609 | 15,650,266 |
| **Mean** | **6698** | **88.3** | **1.1303** | **1.10149** | **1.8598** | **15,651,817** |

### Improvement vs SOTA

| Metric | Merged SOTA (PR #1019) | This submission | Delta |
|--------|------------------------|-----------------|-------|
| val_bpb (3-seed mean) | 1.1147 | **1.1015** | **-0.0132** |
| val_loss (nats) | 1.88218 | **1.85982** | **-0.02236** |

Clears the 0.005 nats threshold by ~4.5x.

## Changes vs Baseline (PR #1019)

### 1. SLOT: Sample-specific LM Optimization at Test-time

At eval time, for each sliding-window batch, we optimize a single additive delta vector in R^512, applied to the frozen hidden states just before the logit projection. The model forward is split into `forward_hidden()` (frozen, no grad) and `compute_logits()` (carries grad for delta optimization).

- **Delta shape**: `[1, 1, 512]` — broadcasts across batch and sequence
- **Optimizer**: AdamW (lr=0.005, weight_decay=1e-8, eps=1e-5)
- **Steps**: 8 per batch
- **Eval-time overhead**: ~90s (well within the 600s eval budget)

SLOT is score-first: hidden states are computed under `torch.no_grad()`, the delta adapts through `compute_logits()` only, and final scoring uses the adapted logits. The model weights are never modified.

Reference: Hu et al., arXiv:2505.12392v2. Also used in PR #1128 and PR #1105.

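The adaptation loop can be sketched with a minimal NumPy stand-in: a frozen hidden-state matrix, a frozen logit projection, and one shared delta. All names here are illustrative rather than the submission's API, and plain gradient descent with an analytic cross-entropy gradient replaces the actual AdamW-plus-autograd path:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_adapt(hidden, W, targets, lr=0.005, steps=8):
    """SLOT-style test-time adaptation of a single additive delta.

    hidden:  [T, d] frozen hidden states (computed under no-grad)
    W:       [d, V] frozen logit projection
    targets: [T]    next-token ids used for the adaptation loss
    Only `delta` is optimized; model weights are never modified.
    (The submission uses AdamW; gradient descent keeps the sketch short.)
    """
    T, d = hidden.shape
    delta = np.zeros(d)                      # one vector, broadcast over tokens
    for _ in range(steps):
        p = softmax((hidden + delta) @ W)    # [T, V]
        p[np.arange(T), targets] -= 1.0      # dL/dlogits for mean cross-entropy
        grad = (p @ W.T).mean(axis=0)        # chain rule back to delta
        delta -= lr * grad
    return (hidden + delta) @ W              # adapted logits used for scoring
```

Because the logits are linear in the delta and cross-entropy is convex in the logits, the inner problem is convex, which helps explain why a handful of steps per batch is stable.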
### 2. Sigmoid-Gated Skip Connections

U-Net skip connections use learned sigmoid gates instead of simple addition:
```python
g = torch.sigmoid(skip_gates[i])              # gate param init 0 -> g = 0.5
x = torch.lerp(skip_weights[i] * skip, x, g)  # blend skip branch into x
```
Gates start at sigmoid(0) = 0.5 (balanced blend). Adds 2,560 params (5 gates x 512 dims).

### 3. Soft-Round QAT with Alpha Ramp

Late QAT uses differentiable sigmoid rounding instead of a hard straight-through estimator (STE):
```python
soft_rounded = torch.floor(scaled) + torch.sigmoid(alpha * (frac - 0.5))
```
Alpha ramps from 1 (smooth) to 16 (near-hard) over 500 steps. This provides real gradients through the rounding op, letting weights adapt to the quantization grid.

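A minimal NumPy sketch of the soft-round and its ramp (function names are hypothetical; the training code applies this inside its autograd path):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_round(scaled, alpha):
    """Differentiable rounding: floor plus a sigmoid step centered at .5.

    alpha = 1  -> smooth, useful gradients everywhere
    alpha = 16 -> close to hard round-half-up
    """
    frac = scaled - np.floor(scaled)
    return np.floor(scaled) + sigmoid(alpha * (frac - 0.5))

def alpha_at(step, ramp_steps=500, lo=1.0, hi=16.0):
    """Linear alpha ramp over the final QAT steps, then held at hi."""
    t = min(step / ramp_steps, 1.0)
    return lo + t * (hi - lo)
```

At alpha = 16, `soft_round(2.3, 16.0)` is within ~0.04 of the hard-rounded value 2.0, while alpha = 1 leaves the output deliberately far from the grid so gradients stay informative.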
### 4. Split Early/Late Muon Learning Rate

Bank gradients are scaled per-layer before the Muon reduce-scatter:
- Early layers (0-4): Muon LR = 0.025
- Late layers (5-10): Muon LR = 0.030

In our sweeps, the late layers benefited from the slightly higher LR.

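One way to realize a split LR without forking the optimizer is to pre-scale each layer's gradient so that a single-LR Muon step becomes equivalent. A sketch under that assumption, with the submission's LR values but invented helper names:

```python
import numpy as np

# Submission values: layers 0-4 early (LR 0.025), layers 5-10 late (LR 0.030).
EARLY_LR, LATE_LR, BASE_LR = 0.025, 0.030, 0.025

def lr_scale(layer_idx):
    """Gradient multiplier that makes a single-LR (BASE_LR) optimizer step
    behave like the split LR for this layer."""
    lr = EARLY_LR if layer_idx <= 4 else LATE_LR
    return lr / BASE_LR

def scale_bank_grads(grads):
    """Apply the per-layer scale to a layer-indexed list of bank gradients,
    as would happen just before the reduce-scatter."""
    return [g * lr_scale(i) for i, g in enumerate(grads)]
```

This keeps the optimizer itself untouched, which matters when the update rule (orthogonalization in Muon's case) is not a simple gradient multiply downstream of the scaling point.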
### 5. Warmdown = 4000 Steps

Extended warmdown from 3500 to 4000 estimated steps. This holds the LR higher for longer, giving the model more time at productive learning rates.

### 6. BigramHash(2816x160)

Enlarged the bigram embedding dimension from 112 to 160, keeping the same 2816 buckets. Richer per-bucket representation at minimal artifact cost.

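The shape change is easiest to see in a toy lookup. Only the 2816x160 table shape comes from the submission; the hash mixing below is purely illustrative:

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM = 2816, 160   # dim raised from 112; bucket count unchanged

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    # Illustrative mixing only; the submission's hash may differ.
    return (prev_tok * 1_000_003 + tok) % n_buckets

table = np.zeros((N_BUCKETS, BIGRAM_DIM))   # the embedding table being enlarged

def bigram_embed(tokens):
    """One 160-dim embedding per consecutive (prev, cur) token pair."""
    idx = [bigram_bucket(p, t) for p, t in zip(tokens[:-1], tokens[1:])]
    return table[idx]
```

The artifact cost scales linearly in the embedding dim (2816 x 48 extra values for the 112 -> 160 bump, before quantization and compression).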
### 7. Code Minification

`pyminify` + LZMA2 + a base85 self-extracting wrapper reduces the code from 101KB to 23KB, freeing ~78KB of artifact budget for model weights.

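The self-extracting idea can be sketched with the stdlib `lzma` and base85 alone; the submission's pipeline also runs `pyminify` first and may differ in wrapper details:

```python
import base64
import lzma

def make_self_extracting(source: str) -> str:
    """Compress Python source and wrap it in a stub that exec()s itself.

    The stub is a small fixed cost; beyond that, shipped size tracks the
    LZMA-compressed size of the (already minified) source.
    """
    blob = base64.b85encode(lzma.compress(source.encode(), preset=9)).decode()
    return ("import base64,lzma\n"
            f"exec(lzma.decompress(base64.b85decode({blob!r})))")
```

Base85 keeps the blob printable at ~25% overhead over raw bytes, versus ~33% for base64, which is why it is the usual choice for this trick.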
### 8. Brotli-11 Compression with Byte-Shuffle

Replaces LZMA-6 with Brotli quality=11 plus stride-2 byte-shuffle preprocessing. Saves ~400KB vs LZMA.

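Stride-2 byte-shuffle groups the like-significance bytes of each little-endian 16-bit word, which typically helps the entropy coder find longer matches. A stdlib sketch of the transform and its inverse (the Brotli call itself needs the third-party `brotli` package, e.g. `brotli.compress(shuffled, quality=11)`, and is omitted here):

```python
def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group every stride-th byte together (stride=2 separates the low and
    high bytes of 16-bit words into two runs)."""
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Exact inverse of byte_shuffle, needed at decompression time."""
    n, rem = divmod(len(data), stride)
    lens = [n + (1 if i < rem else 0) for i in range(stride)]
    out, off = bytearray(len(data)), 0
    for i, length in enumerate(lens):
        out[i::stride] = data[off:off + length]
        off += length
    return bytes(out)
```

The inverse has to handle lengths that are not a multiple of the stride, which is why the per-run lengths are computed explicitly rather than assumed equal.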
### 9. GPTQ Reserve 9s (was 14s)

Reduced the GPTQ calibration time reservation from 14s to 9s, gaining ~55 extra training steps.

## Negative Results (tested, did not help)

| Technique | Result | Notes |
|-----------|--------|-------|
| Turbo-Muon (AOL + Polar Express) | +2MB artifact bloat | Weight-distribution changes break compression |
| No-GPTQ (PR #1120 style) | 0.005 bpb worse | GPTQ essential for our stack |
| Pure EngramLite swap | 0.003 bpb worse | Same-budget multi-head too diluted |
| ResidLambdas | 0.003 bpb worse | Quant error compounds through lambda scaling |
| LeakyReLU slope=0.3 | Neutral | |
| Partial key offset | Neutral | |
| BIGRAM_DIM=192 | 0.001 bpb worse | Diminishing returns past 160 |
| TTT (score-first SGD) | Neutral on the Full-GPTQ stack | Post-quant weights already too well optimized |
| Mixed int5/int6 GPTQ | Broken or worse | Needs the full PR #1089-style pipeline |

## Architecture Summary

| Component | Setting | Source |
|-----------|---------|--------|
| Layers | 11 | PR #549 |
| Model dim | 512 | PR #549 |
| Heads / KV heads | 8 / 4 (GQA) | PR #549 |
| MLP mult | 3.0x (LeakyReLU(0.5)^2) | PR #549 |
| XSA | All 11 layers | PR #1019 |
| BigramHash | 2816 x 160 | **This submission** (dim=160) |
| ValueEmbedding | dim=128, layers 9,10 | PR #549 |
| SmearGate | F.pad causal shift | PR #549, optimized |
| Skip connections | Sigmoid-gated lerp | **This submission** |
| Quantization | Full Hessian GPTQ int6 | PR #1019 |
| Compression | Brotli-11 + byte-shuffle | **This submission** |
| Optimizer | Parallel Muon + Split-LR | **This submission** (split-LR) |
| QAT | Soft-round alpha ramp 1->16 | **This submission** |
| Eval | Sliding window stride=64 + SLOT | **This submission** (SLOT) |
| Code | LZMA2 self-extracting wrapper | **This submission** |
| Warmdown | 4000 steps | **This submission** |
| Params | 27.2M | |

## Setup & Reproduction

```bash
# Environment: 8xH100 SXM, PyTorch 2.9.1+cu128, flash-attn 2.8.3
export NCCL_NET=Socket  # Required on GCP H100
export SLOT_ENABLED=1
export BIGRAM_DIM=160
export WARMDOWN_ITERS=4000
export SLOT_LR=0.005
export SLOT_STEPS=8

# Run each seed with torchrun (evaluate.py handles this)
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=2025 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Acknowledgements

Thanks to **@0hq** and **@valerio-oai** for organizing and maintaining an excellent competition.

This submission builds directly on @abaybektursun's PR #549 and PR #1019, which established the LeakyReLU^2 + Parallel Muon + XSA + Full GPTQ stack. The SLOT technique follows Hu et al. (arXiv:2505.12392v2) and was independently validated by @AnubhavBharadwaaj (PR #1128) and @abaybektursun (PR #1105). The sigmoid-gated skip-connection idea draws from @mikeapedia's PR #1089, and the code-minification approach adapts that PR's shrink pipeline.