
Record: Train Larger, Quantize Harder - 33.6M params + int5 GPTQ / (val_bpb: 1.1164) #576

Closed
cmcdnd wants to merge 1 commit into openai:main from cmcdnd:submission/train-larger-quantize-harder

Conversation


@cmcdnd cmcdnd commented Mar 23, 2026

Train Larger, Quantize Harder: 33.6M params quantized to int5 with full Hessian GPTQ, fitting in 15.6MB. Adds post-TTT temperature calibration (T=0.98), which corrects TTT-induced overconfidence for an additional -0.003 BPB, a technique not used in prior submissions.

Builds on my int5 QAT approach from PR #469 (first one with more params) and the 33.6M architecture from PR #545.

Architecture: 11L, 512d, MHA 8/8, MLP 3.5x (1792), BigramHash 8192, XSA all layers, LeakyReLU², VE128
Quantization: Int5 per-row GPTQ (clip_range=15) + Early QAT (threshold 0.5) + EMA 0.997 + 2% pruning
Eval: Score-first TTT at T=1.0 (AdamW, lr=1e-4, chunk=131K, last 2 blocks) → re-score at T=0.98
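For readers unfamiliar with the quantization step above, here is a minimal NumPy sketch of Hessian-aware GPTQ to int5 with a per-row scale. This is an illustrative simplification, not the PR's code: function and variable names are assumptions, the Hessian damping factor is a common default, and the full implementation (per the commits referencing this PR) uses a Cholesky factorization of the inverse Hessian rather than the direct inverse used here.

```python
import numpy as np

def gptq_int5_per_row(W, H, clip_range=15, damp=0.01):
    """Simplified GPTQ sketch: quantize W to int5 (levels -15..15) one
    column at a time, redistributing each column's quantization error
    onto the not-yet-quantized columns via the inverse Hessian.

    W: (rows, cols) weight matrix.
    H: (cols, cols) Hessian approximation X^T X from calibration data.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    # Per-row symmetric scale mapping the row's max magnitude to clip_range.
    scale = np.abs(W).max(axis=1, keepdims=True) / clip_range
    scale[scale == 0] = 1.0
    # Dampen the Hessian for numerical stability, then invert.
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.inv(Hd)
    Q = np.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale[:, 0]), -clip_range, clip_range)
        Q[:, j] = q
        # Normalized error for this column, spread over remaining columns.
        err = (w - q * scale[:, 0]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q.astype(np.int8), scale.astype(np.float32)
```

Per-row scaling plus error redistribution is what distinguishes GPTQ from plain round-to-nearest: columns quantized later absorb the damage done by earlier ones, weighted by the calibration Hessian.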

Score-first TTT systematically makes the model overconfident. A fixed temperature T=0.98 on post-TTT logits recovers ~0.003 BPB at zero training cost:

| Stage | Seed 1337 BPB | Delta |
|---|---|---|
| Post-quant sliding (s=64) | 1.1259 | |
| + TTT (T=1.0) | 1.1190 | -0.0069 |
| + T=0.98 re-score | 1.1157 | -0.0033 |
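The mechanism of the re-score is simple: divide the cached post-TTT logits by a fixed temperature before the log-softmax, then recompute bits-per-byte. A minimal sketch (names and shapes are illustrative assumptions, not the PR's code):

```python
import math
import numpy as np

def bits_per_byte(logits, targets, n_bytes, T=1.0):
    """Re-score cached logits at temperature T.

    logits: (N, vocab) post-TTT logits, targets: (N,) token ids,
    n_bytes: number of raw bytes covered by the N tokens.
    """
    z = logits / T
    # Numerically stable log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll_nats = -logp[np.arange(len(targets)), targets].sum()
    return nll_nats / (n_bytes * math.log(2))
```

Because only the final normalization changes, sweeping several candidate temperatures over the same cached logits adds no extra forward passes.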

Results

| Seed | Pre-TTT BPB | TTT BPB (T=1.0) | Final BPB (T=0.98) | val_loss | Artifact |
|---|---|---|---|---|---|
| 1337 | 1.1259 | 1.1190 | 1.1157 | 1.8856 | 15.89 MB |
| 42 | 1.1264 | 1.1196 | 1.1163 | 1.8863 | 15.30 MB |
| 7 | 1.1271 | 1.1204 | 1.1172 | 1.8863 | 15.58 MB |
| Mean | 1.1264 | 1.1197 | 1.1164 | 1.8861 | |

SOTA improvement: 1.8958 - 1.8861 = 0.0097 nats (threshold: 0.005, p << 0.01)

Reproduction

pip install --break-system-packages zstandard
export PATH=/usr/local/cuda/bin:$PATH
cd /tmp && git clone --depth 1 https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper && FLASH_ATTENTION_FORCE_BUILD=TRUE \
  FLASH_ATTENTION_DISABLE_SPLIT=TRUE FLASH_ATTENTION_DISABLE_PAGEDKV=TRUE \
  FLASH_ATTENTION_DISABLE_APPENDKV=TRUE FLASH_ATTENTION_DISABLE_LOCAL=TRUE \
  FLASH_ATTENTION_DISABLE_SOFTCAP=TRUE FLASH_ATTENTION_DISABLE_FP16=TRUE \
  FLASH_ATTENTION_DISABLE_FP8=TRUE FLASH_ATTENTION_DISABLE_VARLEN=TRUE \
  MAX_JOBS=8 pip install . --no-build-isolation --break-system-packages

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

Eval Time Budget (~380s, under 10 min limit)

  • Sliding window (s=64): ~81s
  • Score-first TTT at T=1.0 (131K chunks, 3 epochs): ~298s
  • Post-TTT re-score at T=0.98: ~81s

@cmcdnd cmcdnd changed the title Record: Train Larger, Quantize Harder - 33.6M params / int5 GPTQ / (val_bpb: 1.1164) Record: Train Larger, Quantize Harder - 33.6M params int5 GPTQ / (val_bpb: 1.1164) Mar 23, 2026
@cmcdnd cmcdnd changed the title Record: Train Larger, Quantize Harder - 33.6M params int5 GPTQ / (val_bpb: 1.1164) Record: Train Larger, Quantize Harder - 33.6M params + int5 GPTQ / (val_bpb: 1.1164) Mar 23, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After main eval, sweeps T={0.95,0.96,0.97,0.98,0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 24, 2026
Builds on Run 2. Changes from PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31→15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb).
8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Contributor

It looks to me like this code uses training data at eval time due to the post-pruning calibration scheme, so I think this submission is invalid.

nishant-resolve-ai pushed a commit to nishant-resolve-ai/parameter-golf that referenced this pull request Mar 24, 2026
- Autoresearch loop (program.md, loop.sh, generate_next.py)
- Modal provider for 8xH100 training with checkpoint save/restore
- Experiment framework with preflight size checks
- eval_ttt.py for TTT evaluation against saved checkpoints
- train_gpt_improved.py: PR openai#569 base (VRL, GPTQ, LeakyReLU², pruning)
- train_gpt_576.py: PR openai#576 base (int5, 33.6M params, score-first TTT)
- train_gpt_sota.py: PR openai#573 base
- train_gpt_mlx_recurrent.py: depth recurrence experiments
- Benchmark scripts for local MLX A/B testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
…ensation

Implement GPTQ (Hessian-aware) quantization for int5 (31 levels, clip=15).
Uses Cholesky-based error redistribution across columns for minimal quant
damage. Calibrates on 256 training sequences.

Enables fitting 12L+ models within 16MB artifact limit.
Controlled by GPTQ_ENABLED=1 (default: off).

Based on PR openai#576's technique (1.1162 BPB with 33.6M int5 params).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
12L models:
- int6: 1.1139 BPB (17.56MB, over limit)
- int5 GPTQ: 1.1254 BPB (14.24MB, fits but +0.011 damage)
- int5 GPTQ aligned QAT: 1.1254 BPB (same, alignment didn't help)
- No bigram: 1.1153 BPB (16.53MB, still over)

11L int6 GPTQ: 1.1293 BPB (GPTQ hurts int6)

Key finding: int5 quantization damage is ~+0.012 BPB even with GPTQ.
Need PR openai#576's Soft-Round QAT (tanh-based) for better alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request Mar 28, 2026
Train larger (33.6M params, d=576, MLP 3.5x), quantize harder (int5 GPTQ).
Legal score-first TTT (AdamW, cosine LR, 3 epochs) + post-TTT temperature
calibration (T=0.98). 3-seed mean 1.1145 BPB (std 0.0003). Based on PR openai#576.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request Mar 28, 2026
Approach A (openai#569 int5 no TTT): 1.1317 — int5 penalty too high on d=512
Approach B (openai#576 d=576 int5 + legal s_0 TTT): 1.1188 — best legal result
Approach C (GEPA int5 + TTT): artifact over 16MB

Key lesson: TTT re-scoring is illegal (PR openai#991 closed for this).
Only s_0 cumulative first-pass score is legal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>