# 11L Partial RoPE + XSA4 + VE128 + Tight SWA + Late QAT + GPTQ-lite

## Score: val_bpb = 1.1804 (post-quant, single seed)

Trained on 8×H100 SXM in 615 seconds. 15.95 MB artifact (int6+zstd-22).

## Approach

Combines the PR #374 SOTA stack with an MLP width reduction (1408 vs. 1536) to fit under 16 MB, plus GPTQ-lite quantization optimization.

### Architecture
- 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
- MLP hidden=1408 (2.75× expansion), relu-squared activation
- **Partial RoPE** (16/64 dims): only 25% of head dims get rotary embeddings; the remaining 75% are position-free, improving generalization.
- **LN Scale** (1/sqrt(layer_idx+1)): damps RMSNorm output in deeper layers, stabilizing gradient flow.
- **XSA** on the last 4 layers: Exclusive Self Attention removes self-value bias via a GQA-aware orthogonal projection. Zero new parameters, ~2 ms/step.
- **Shared Value Embedding** (dim=128, layers 9-10): a single embedding table projected to KV dim and added to V in the selected layers, with per-layer learned scales.
- SmearGate: a learned per-dim gate blending current and previous token embeddings.
- U-Net skip connections (5 encoder, 6 decoder), tied embeddings, logit softcap 30.

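The partial-RoPE split can be sketched per head vector as follows. This is a minimal pure-Python illustration; the pairwise rotation and `base=10000` frequency schedule are standard-RoPE assumptions, not confirmed details of this run:

```python
import math

def rope_partial(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` entries of a head vector; the rest
    pass through position-free (the 16/64 partial-RoPE split).

    x: list of floats (head_dim,); pos: integer token position.
    """
    out = list(x)
    for i in range(rot_dims // 2):
        # one frequency per rotated pair, decreasing with i
        theta = pos / base ** (2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

At `pos=0` the rotation is the identity, and dims 16-63 are untouched at every position.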
### Training
- Muon optimizer: lr=0.025, momentum=0.99 (warmup 0.92→0.99 over 1500 steps), WD=0.04
- AdamW: embed_lr=0.035, scalar_lr=0.025, WD=0.04
- Batch: 786,432 tokens/step, seq_len=2048
- Warmdown: 3000 iters (wallclock-based), grad_clip=0.3
- **Tight SWA**: uniform average of checkpoints collected every 50 steps once lr_scale < 0.2 (6 checkpoints total). Zero quality penalty vs. non-SWA.
- **Late QAT**: STE int6 fake-quantization activated once lr_scale < 0.1 (step 4070). The LR is halved at activation to avoid disrupting converged weights.

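The STE int6 fake-quantization in the Late QAT bullet can be sketched as below. This is a pure-Python sketch; symmetric max-abs scaling is an assumption, and the run's actual scaling may differ:

```python
def fake_quant_int6(w, levels=31):
    """Symmetric int6 fake-quantization of a weight vector (list of floats):
    scale by max-abs, round onto the +/-`levels` integer grid, scale back."""
    amax = max(abs(v) for v in w) or 1.0  # guard against all-zero tensors
    return [round(v / amax * levels) / levels * amax for v in w]

def ste_grad(grad_out):
    """Straight-through estimator: the backward pass treats the rounding as
    identity, so gradients flow to the full-precision master weights unchanged."""
    return grad_out
```

In training, the forward pass sees `fake_quant_int6(w)` while the optimizer updates the unquantized weights, so the model adapts to the int6 grid it will be deployed on.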
### Quantization
- **GPTQ-lite**: per-tensor clip-ratio search over 5 candidates (0.9999, 0.99999, 0.999999, 0.9999984, 1.0), selecting the clip that minimizes L2 reconstruction error. Zero training cost.
- Int6 with step=4 rounding on layers 1-9 (64 distinct values, for better compression)
- Int8 on layers 0 and 10 (input/output quality)
- FP16 tied embeddings (never quantized)
- zstd level-22 compression

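A sketch of the GPTQ-lite clip search combined with the int6 step=4 rounding. The candidate list comes from the run above; max-abs scaling and the exact code grid are illustrative assumptions:

```python
CANDIDATES = (0.9999, 0.99999, 0.999999, 0.9999984, 1.0)

def quantize(w, clip, levels=127, step=4):
    """Clip to `clip`×max-abs, map to signed int8 codes, then keep only every
    `step`-th code (step=4 -> 64 distinct values, the int6 trick for layers 1-9)."""
    amax = clip * max(abs(v) for v in w) or 1.0
    out = []
    for v in w:
        q = round(max(-amax, min(amax, v)) / amax * levels)
        q = max(-128, min(124, step * round(q / step)))  # 64-value grid
        out.append(q / levels * amax)
    return out

def clip_search(w, candidates=CANDIDATES):
    """GPTQ-lite per-tensor search: pick the clip ratio with the lowest
    L2 reconstruction error. No training involved."""
    return min(candidates,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(w, quantize(w, c))))
```

Because the search only re-runs rounding per candidate, it costs seconds per model, which matches the "zero training cost" claim above.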
## Key Metrics

| Metric | Value |
|--------|-------|
| Pre-quant val_bpb | 1.1770 |
| **Post-quant val_bpb** | **1.1804** |
| Quant gap | +0.0034 |
| Steps completed | 4,071 |
| Step time | 137 ms avg (151 ms after Late QAT) |
| Model parameters | 25,224,291 |
| Artifact size | 15,949,473 bytes (15.95 MB) |
| Peak GPU memory | 20,590 MiB |

## Convergence

| Step | val_bpb | train_time |
|------|---------|------------|
| 1000 | 1.3246 | 136 s |
| 2000 | 1.2551 | 274 s |
| 3000 | 1.2139 | 413 s |
| 4000 | 1.1793 | 551 s |
| 4071 | 1.1770 | 615 s (cap) |

## Lessons Learned

1. **MLP hidden=1408 > 1536 for artifact-constrained models**: the narrower MLP fits in 16 MB with int6+zstd while enabling ~33% more training steps (137 ms vs. 178 ms/step). The extra steps more than compensate for the reduced per-step capacity.

2. **Late QAT timing matters**: activating at lr_scale < 0.1 (the last ~1% of training) gives only 1 step of QAT adaptation. Earlier activation (lr_scale < 0.2) would give more adaptation time but risks disrupting Muon momentum.

3. **Tight SWA (scale<0.2) eliminates the SWA quality penalty**: standard SWA (scale<0.5) averages stale early-warmdown checkpoints that hurt final quality. Restricting collection to scale<0.2 yields weight averaging with zero quality loss.

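The tight-SWA rule can be sketched as a uniform average over late-warmdown snapshots. This pure-Python sketch treats checkpoints as name→weights dicts, an illustrative simplification; the trigger mirrors the every-50-steps, lr_scale < 0.2 rule from the run:

```python
def should_snapshot(step, lr_scale, every=50, thresh=0.2):
    """Collect a checkpoint every `every` steps once the LR schedule has
    decayed below `thresh` (the tight-SWA window)."""
    return lr_scale < thresh and step % every == 0

def swa_average(checkpoints):
    """Uniform average of checkpoints, each a dict: param name -> list of floats."""
    n = len(checkpoints)
    return {name: [sum(ck[name][i] for ck in checkpoints) / n
                   for i in range(len(checkpoints[0][name]))]
            for name in checkpoints[0]}
```

Tightening `thresh` from 0.5 to 0.2 changes only which snapshots enter the average, not the averaging itself.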
4. **GPTQ-lite clip search is free**: trying 5 clip ratios per tensor during quantization costs ~2 s total and reduces reconstruction error without any training cost.

## Setup

```bash
pip install --break-system-packages zstandard
# or: pip install -r requirements.txt
```

## Command

```bash
RUN_ID=pr374_8x_v2 MLP_HIDDEN=1408 \
DATA_PATH=../../../data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=../../../data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Status

**Non-record submission.** Single seed (1337). val_bpb 1.1804 does not beat the SOTA 1.1428 by the required 0.005 margin.

Submitted to document the systematic combination of frontier techniques (Partial RoPE, LN Scale, XSA, Shared VE, Tight SWA, Late QAT, GPTQ-lite), together with the insight that MLP hidden=1408 (vs. 1536) produces better results under the 16 MB constraint: the faster step time yields more training steps within the wallclock budget.