
Non-record: 27M params at Int5 QAT / train larger, quantize harder (val_bpb=1.1418)#469

Closed
cmcdnd wants to merge 1 commit into openai:main from cmcdnd:submission-d576-int5-qat

Conversation


@cmcdnd cmcdnd commented Mar 22, 2026

Non-record: 27M params at Int5 QAT / train larger, quantize harder

val_bpb: 1.1418 (sliding window, stride=64) | 15.7 MB artifact | 8xH100 SXM, 600s

Approach

Train a larger 27M-param model (d=576 vs standard d=512, +23% parameters) and compress to int5 (32 levels)
instead of int6 (64 levels). Early int5 QAT activation (threshold 0.50 vs standard 0.10) gives ~1,700 steps of
adaptation instead of ~300.
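A minimal sketch of what 32-level (int5) fake quantization inside a QAT forward pass might look like. Symmetric per-tensor scaling is an assumption here; the submission's actual quantizer may use a different grid or scaling.

```python
def fake_quant_int5(weights, bits=5):
    """Round weights onto a symmetric 2**bits-level grid (fake quantization).

    Hypothetical sketch: per-tensor symmetric scaling is assumed; the
    submission's actual QAT scheme may differ.
    """
    qmax = 2 ** (bits - 1) - 1          # 15 for int5
    qmin = -(2 ** (bits - 1))           # -16 for int5
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / qmax
    # Quantize, clamp to the int5 range, then dequantize back to floats.
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]

# During QAT the forward pass sees the quantized weights while gradients
# flow to the full-precision copy (straight-through estimator).
w = [0.8, -0.33, 0.05, -0.8]
wq = fake_quant_int5(w)
```

The point of the longer QAT window is that the model gets many more optimizer steps to adapt to this coarse 32-level grid before training ends.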

Li et al. (ICML 2020) showed compressed large models beat lightly compressed small models at the same final size.
This submission validates that principle for parameter golf.

|                | Standard approach | This submission  |
| -------------- | ----------------- | ---------------- |
| Parameters     | 22M               | 27M (+23%)       |
| Model dim      | 512               | 576              |
| Heads / KV     | 8 / 4             | 9 / 3            |
| Quantization   | int6 (64 levels)  | int5 (32 levels) |
| QAT activation | Last ~4%          | Last ~25%        |

Results

| Seed | val_bpb (sliding s64) | artifact_bytes |
| ---- | --------------------- | -------------- |
| 1337 | 1.1418                | 15,713,507     |

Pre-quant: 1.1515. Quantization gap: 0.010.

Single-seed submission — artifact size varies by seed (15.2–16.5 MB range).

Architecture

11 layers, d=576, 9 heads (hd=64), 3 KV (GQA 3:1), MLP 3x, relu², SmearGate, BigramHash, XSA last 4, Partial RoPE
16/64, U-Net skips, OrthoInit, Muon, FA3, SWA, warmdown 3500.
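The 9-head / 3-KV-head GQA can be pictured as a simple query-to-KV-head mapping. A sketch only, assuming the common contiguous grouping (each KV head serving `n_heads // n_kv_heads` consecutive query heads); the submission's FA3 attention kernel may index groups differently.

```python
def kv_head_for(q_head, n_heads=9, n_kv_heads=3):
    """Map a query head to its shared KV head under GQA (contiguous groups).

    Hypothetical sketch of the 3:1 grouping; kernel-level indexing may vary.
    """
    group_size = n_heads // n_kv_heads   # 3 query heads per KV head
    return q_head // group_size

groups = [kv_head_for(h) for h in range(9)]  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Sharing each KV head across three query heads shrinks the KV projection weights and cache by 3x relative to full multi-head attention, which is part of how the extra model width fits the budget.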

What's different

  1. More params via aggressive quantization: 27M at int5 instead of 22M at int6, same artifact budget
  2. Early QAT at threshold 0.50: critical for int5 — standard 4% window is too short for 32-level adaptation
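A back-of-envelope check of point 1, assuming a naive bits-per-parameter count (real artifacts also carry quantization scales and container overhead, so these are rough figures, not the submission's exact accounting):

```python
def packed_mb(params, bits):
    """Rough packed weight size in MB, ignoring scales/metadata (sketch)."""
    return params * bits / 8 / 1e6

int6_22m = packed_mb(22e6, 6)   # ~16.5 MB at int6
int5_27m = packed_mb(27e6, 5)   # ~16.9 MB at int5 -> roughly the same budget
```

Dropping one bit per weight roughly cancels the 23% parameter increase, which is the trade the submission is betting on.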

Built on PR #315 (@jfprincz).


mohosy commented Mar 23, 2026

train larger, quantize harder is a sick concept honestly. the early qat activation at 50% threshold is interesting, most ppl barely give it any steps. what's the quant gap like compared to standard int6?

