
Record: Train Larger, Quantize Harder - 33.6M params + int5 GPTQ / (val_bpb: 1.1164) #576

Closed
cmcdnd wants to merge 1 commit into openai:main from cmcdnd:submission/train-larger-quantize-harder

Conversation


@cmcdnd cmcdnd commented Mar 23, 2026

Train Larger, Quantize Harder: 33.6M params quantized to int5 with full Hessian GPTQ, fitting in 15.6MB. Adds post-TTT temperature calibration (T=0.98), which corrects TTT-induced overconfidence for an additional -0.003 BPB, a technique not used in prior submissions.

Builds on my int5 QAT approach from PR #469 (first one with more params) and the 33.6M architecture from PR #545.

Architecture: 11L, 512d, MHA 8/8, MLP 3.5x (1792), BigramHash 8192, XSA all layers, LeakyReLU², VE128
Quantization: Int5 per-row GPTQ (clip_range=15) + Early QAT (threshold 0.5) + EMA 0.997 + 2% pruning
Eval: Score-first TTT at T=1.0 (AdamW, lr=1e-4, chunk=131K, last 2 blocks) → re-score at T=0.98
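For readers unfamiliar with the quantization step above, here is a minimal NumPy sketch of Hessian-aware GPTQ to int5 with a per-row scale. This is an illustrative simplification, not the PR's code: function and variable names are assumptions, the Hessian damping factor is a common default, and the full implementation (per the commits referencing this PR) uses a Cholesky factorization of the inverse Hessian rather than the direct inverse used here.

```python
import numpy as np

def gptq_int5_per_row(W, H, clip_range=15, damp=0.01):
    """Simplified GPTQ sketch: quantize W to int5 (levels -15..15) one
    column at a time, redistributing each column's quantization error
    onto the not-yet-quantized columns via the inverse Hessian.

    W: (rows, cols) weight matrix.
    H: (cols, cols) Hessian approximation X^T X from calibration data.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    # Per-row symmetric scale mapping the row's max magnitude to clip_range.
    scale = np.abs(W).max(axis=1, keepdims=True) / clip_range
    scale[scale == 0] = 1.0
    # Dampen the Hessian for numerical stability, then invert.
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.inv(Hd)
    Q = np.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale[:, 0]), -clip_range, clip_range)
        Q[:, j] = q
        # Normalized error for this column, spread over remaining columns.
        err = (w - q * scale[:, 0]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q.astype(np.int8), scale.astype(np.float32)
```

Per-row scaling plus error redistribution is what distinguishes GPTQ from plain round-to-nearest: columns quantized later absorb the damage done by earlier ones, weighted by the calibration Hessian.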

Score-first TTT systematically makes the model overconfident. A fixed temperature T=0.98 on post-TTT logits recovers ~0.003 BPB at zero training cost:

| Stage | Seed 1337 BPB | Delta |
|---|---|---|
| Post-quant sliding (s=64) | 1.1259 | |
| + TTT (T=1.0) | 1.1190 | -0.0069 |
| + T=0.98 re-score | 1.1157 | -0.0033 |
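The mechanism of the re-score is simple: divide the cached post-TTT logits by a fixed temperature before the log-softmax, then recompute bits-per-byte. A minimal sketch (names and shapes are illustrative assumptions, not the PR's code):

```python
import math
import numpy as np

def bits_per_byte(logits, targets, n_bytes, T=1.0):
    """Re-score cached logits at temperature T.

    logits: (N, vocab) post-TTT logits, targets: (N,) token ids,
    n_bytes: number of raw bytes covered by the N tokens.
    """
    z = logits / T
    # Numerically stable log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll_nats = -logp[np.arange(len(targets)), targets].sum()
    return nll_nats / (n_bytes * math.log(2))
```

Because only the final normalization changes, sweeping several candidate temperatures over the same cached logits adds no extra forward passes.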

Results

| Seed | Pre-TTT BPB | TTT BPB (T=1.0) | Final BPB (T=0.98) | val_loss | Artifact |
|---|---|---|---|---|---|
| 1337 | 1.1259 | 1.1190 | 1.1157 | 1.8856 | 15.89 MB |
| 42 | 1.1264 | 1.1196 | 1.1163 | 1.8863 | 15.30 MB |
| 7 | 1.1271 | 1.1204 | 1.1172 | 1.8863 | 15.58 MB |
| Mean | 1.1264 | 1.1197 | 1.1164 | 1.8861 | |

SOTA improvement: 1.8958 - 1.8861 = 0.0097 nats (threshold: 0.005, p << 0.01)

Reproduction

pip install --break-system-packages zstandard
export PATH=/usr/local/cuda/bin:$PATH
cd /tmp && git clone --depth 1 https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper && FLASH_ATTENTION_FORCE_BUILD=TRUE \
  FLASH_ATTENTION_DISABLE_SPLIT=TRUE FLASH_ATTENTION_DISABLE_PAGEDKV=TRUE \
  FLASH_ATTENTION_DISABLE_APPENDKV=TRUE FLASH_ATTENTION_DISABLE_LOCAL=TRUE \
  FLASH_ATTENTION_DISABLE_SOFTCAP=TRUE FLASH_ATTENTION_DISABLE_FP16=TRUE \
  FLASH_ATTENTION_DISABLE_FP8=TRUE FLASH_ATTENTION_DISABLE_VARLEN=TRUE \
  MAX_JOBS=8 pip install . --no-build-isolation --break-system-packages

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

Eval Time Budget (~380s, under 10 min limit)

  • Sliding window (s=64): ~81s
  • Score-first TTT at T=1.0 (131K chunks, 3 epochs): ~298s
  • Post-TTT re-score at T=0.98: ~81s

@cmcdnd cmcdnd changed the title Record: Train Larger, Quantize Harder - 33.6M params / int5 GPTQ / (val_bpb: 1.1164) Record: Train Larger, Quantize Harder - 33.6M params int5 GPTQ / (val_bpb: 1.1164) Mar 23, 2026
@cmcdnd cmcdnd changed the title Record: Train Larger, Quantize Harder - 33.6M params int5 GPTQ / (val_bpb: 1.1164) Record: Train Larger, Quantize Harder - 33.6M params + int5 GPTQ / (val_bpb: 1.1164) Mar 23, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After main eval, sweeps T={0.95,0.96,0.97,0.98,0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Builds on Run 2. Changes from PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31→15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb).
8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Contributor

It looks to me like this code uses training data at eval time due to the post-pruning calibration scheme, so I think this submission is invalid.

nishant-resolve-ai pushed a commit to nishant-resolve-ai/parameter-golf that referenced this pull request Mar 24, 2026
- Autoresearch loop (program.md, loop.sh, generate_next.py)
- Modal provider for 8xH100 training with checkpoint save/restore
- Experiment framework with preflight size checks
- eval_ttt.py for TTT evaluation against saved checkpoints
- train_gpt_improved.py: PR openai#569 base (VRL, GPTQ, LeakyReLU², pruning)
- train_gpt_576.py: PR openai#576 base (int5, 33.6M params, score-first TTT)
- train_gpt_sota.py: PR openai#573 base
- train_gpt_mlx_recurrent.py: depth recurrence experiments
- Benchmark scripts for local MLX A/B testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
…ensation

Implement GPTQ (Hessian-aware) quantization for int5 (31 levels, clip=15).
Uses Cholesky-based error redistribution across columns for minimal quant
damage. Calibrates on 256 training sequences.

Enables fitting 12L+ models within 16MB artifact limit.
Controlled by GPTQ_ENABLED=1 (default: off).

Based on PR openai#576's technique (1.1162 BPB with 33.6M int5 params).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
12L models:
- int6: 1.1139 BPB (17.56MB, over limit)
- int5 GPTQ: 1.1254 BPB (14.24MB, fits but +0.011 damage)
- int5 GPTQ aligned QAT: 1.1254 BPB (same, alignment didn't help)
- No bigram: 1.1153 BPB (16.53MB, still over)

11L int6 GPTQ: 1.1293 BPB (GPTQ hurts int6)

Key finding: int5 quantization damage is ~+0.012 BPB even with GPTQ.
Need PR openai#576's Soft-Round QAT (tanh-based) for better alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request Mar 28, 2026
Train larger (33.6M params, d=576, MLP 3.5x), quantize harder (int5 GPTQ).
Legal score-first TTT (AdamW, cosine LR, 3 epochs) + post-TTT temperature
calibration (T=0.98). 3-seed mean 1.1145 BPB (std 0.0003). Based on PR openai#576.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request Mar 28, 2026
Approach A (openai#569 int5 no TTT): 1.1317 — int5 penalty too high on d=512
Approach B (openai#576 d=576 int5 + legal s_0 TTT): 1.1188 — best legal result
Approach C (GEPA int5 + TTT): artifact over 16MB

Key lesson: TTT re-scoring is illegal (PR openai#991 closed for this).
Only s_0 cumulative first-pass score is legal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>