Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
## Record: Train Larger, Quantize Harder — 33.6M params in 15.9MB (val_bpb: 1.1164)

**val_bpb: 1.1164** (3-seed mean) | **15.6 MB** (mean) | 8xH100 SXM, 600s train + ~380s eval

### Core Idea

Standard int6 submissions cap at ~22M params in 16MB. Int5 quantization-aware training (pioneered in PR #469) fits **33.6M params** — 50% more than int6 baselines — into 15.6MB. More parameters means more model capacity per training step, overcoming the slightly coarser quantization grid.

### Innovations

**1. Int5 QAT (PR #469)**: STE fake-quantization matching the int5 export grid. Weights cluster near int5 levels during training, minimizing the quant gap at export. Late activation (LR scale < 0.50) avoids disrupting early training.

**2. Post-TTT Temperature Calibration**: Score-first TTT systematically makes the model ~2% overconfident as it adapts to validation data. A fixed T=0.98 applied to logits recovers ~0.003 BPB. Zero cost: no additional parameters, no sweep at submission time.

| Stage | Seed 1337 BPB | Delta |
|-------|---------------|-------|
| Post-quant sliding (s=64) | 1.1259 | baseline |
| + Score-first TTT | 1.1190 | -0.0069 |
| + T=0.98 calibration | **1.1157** | **-0.0033** |

### Temperature Analysis

Pre-TTT, the model is slightly underconfident (best T=1.01). TTT overcorrects, shifting the optimum to T≈0.98-1.01. This asymmetry confirms TTT shifts the confidence distribution in a predictable direction.

| Post-TTT T | BPB |
|------------|-----|
| 0.96 | 1.1165 |
| 0.97 | 1.1161 |
| **0.98** | **1.1157** |
| 0.99 | 1.1156 |
| 1.00 | 1.1190 |

### Architecture

- 11 transformer layers, 512-dim, 8/8 heads (full MHA)
- 3.5x MLP (1792 hidden), LeakyReLU(0.5)² activation
- U-Net skip connections, XSA on all layers
- Partial RoPE (16/64), LN Scale, SmearGate, BigramHash(8192), VE128
- 33.6M total parameters

### Training (600s, 8xH100 SXM)

- Muon (matrices, lr=0.025, mom=0.99, WD=0.04) + AdamW (embeddings/scalars)
- 786K tokens/step, seq_len=2048, ~6200 steps at 96.7ms/step
- Late QAT at LR scale < 0.50, EMA(0.997), warmdown 3500 iters

### Quantization (~15.6MB artifact)

EMA → 2% magnitude pruning → full Hessian GPTQ (256-sample, column reorder) → int5 per-row → zstd-22

### Evaluation (~380s)

Sliding window (s=64, 81s) → score-first TTT at T=1.0 (last 2 blocks, lr=1e-4, chunk=131K, 3 epochs, 298s) → re-score at T=0.98 (81s)

### Results (3 seeds, 8xH100 SXM)

| Seed | Steps | Pre-TTT BPB | TTT BPB (T=1.0) | Final BPB (T=0.98) | val_loss | Artifact |
|------|-------|-------------|-----------------|---------------------|----------|----------|
| 1337 | 6203 | 1.1259 | 1.1190 | **1.1157** | 1.8856 | 15.89 MB |
| 42 | 6204 | 1.1264 | 1.1196 | **1.1163** | 1.8863 | 15.30 MB |
| 7 | 6198 | 1.1271 | 1.1204 | **1.1172** | 1.8863 | 15.58 MB |

**Mean: 1.1164 | Std: 0.0008 | Mean val_loss: 1.8861**

SOTA improvement: 1.8958 - 1.8861 = **0.0097 nats** (threshold: 0.005 nats, p << 0.01)

### Lineage

- Int5 QAT: PR #469 (cmcdnd)
- Int5 GPTQ + 33.6M + TTT: PR #545
- Full GPTQ + LeakyReLU²: PR #535
- Score-first TTT: PR #505
- Base architecture: PR #414 / #315

### Run

```bash
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
{
"author": "cmcdnd",
"github_id": "cmcdnd",
"name": "Record: Train Larger, Quantize Harder — 33.6M params in 15.9MB via Int5 GPTQ",
"blurb": "Train Larger, Quantize Harder: 33.6M params (50% more than int6 baselines) via int5 QAT (PR #469) + full Hessian GPTQ + 2% magnitude pruning. Post-TTT temperature calibration (T=0.98) corrects TTT-induced overconfidence for -0.003 BPB. MHA 8/8, MLP 3.5x, BigramHash(8192), XSA-all, LeakyReLU², EMA, VE128, Partial RoPE, LN Scale, U-Net skips.",
"date": "2026-03-23T00:00:00Z",
"val_loss": 1.88607,
"val_bpb": 1.1164,
"pre_ttt_val_bpb": 1.1264,
"bytes_total": 15885838,
"seeds": {
"1337": {"val_bpb": 1.1157, "val_loss": 1.88564, "bytes_total": 15885838},
"42": {"val_bpb": 1.1163, "val_loss": 1.88630, "bytes_total": 15296667},
"7": {"val_bpb": 1.1172, "val_loss": 1.88629, "bytes_total": 15576393}
},
"mean_val_bpb": 1.1164,
"std_val_bpb": 0.0008,
"bytes_total_max": 15885838
}
Loading