
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT — val_bpb 1.0896 (3-seed mean) #1326

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp4096-v6-ttt

Conversation

@aryanbhosale

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT

val_bpb = 1.0896 (3-seed mean, std 0.0008) | ~15.99 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | TTT BPB | TTT gain | Artifact (bytes) |
|------|-------------|---------|----------|-------------------|
| 42   | 1.0896      | 1.0889  | -0.0007  | 15,999,165        |
| 314  | 1.0915      | 1.0906  | -0.0010  | 15,974,112        |
| 999  | 1.0901      | 1.0894  | -0.0007  | 15,996,001        |
| Mean |             | 1.0896  | -0.0008  |                   |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0251 BPB.

Key Techniques

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 @clarkkev, PR #1285 @dexhunter
  2. Depth Recurrence (layers 4,5) — PR #1204 @msisovic, PR #1260 @dexhunter
  3. Parallel Residuals (from layer 7) — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
  4. MuonEq-R — arXiv:2603.28254, PR #1260 @dexhunter
  5. QK-Gain 5.0 — PR #1217 @bigbag (non-record)
  6. Legal Score-First TTT — compiled scoring under `torch.no_grad`. PR #461 @Christopher-Lee-McClendon (non-record)
  7. Full GPTQ int6 + Brotli + LZMA Compressed Wrapper (~25KB code)
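The depth-recurrence and parallel-residual wiring (techniques 2 and 3) can be sketched with a toy forward pass. The scalar "sublayers", layer count, and recurrence count here are illustrative stand-ins, not this PR's actual architecture — only the control flow (reusing layers 4,5 twice, and switching to parallel residual branches from layer 7) mirrors the description above:

```python
# Toy forward pass illustrating depth recurrence and parallel residuals.
# Real blocks are attention + MLP over tensors; scalars keep the sketch minimal.

def attn(x):   # stand-in for an attention sublayer
    return 0.1 * x

def mlp(x):    # stand-in for an MLP sublayer
    return 0.2 * x

N_LAYERS = 12
RECUR_LAYERS = {4, 5}     # these layers run twice, reusing the same weights
PARALLEL_START = 7        # from this layer on, residual branches run in parallel

def block_sequential(x):
    # standard wiring: attn then mlp, each branch sees the previous output
    x = x + attn(x)
    x = x + mlp(x)
    return x

def block_parallel(x):
    # parallel residuals: both branches read the SAME input, summed once
    return x + attn(x) + mlp(x)

def forward(x):
    for i in range(N_LAYERS):
        block = block_parallel if i >= PARALLEL_START else block_sequential
        reps = 2 if i in RECUR_LAYERS else 1   # depth recurrence: extra pass, zero extra params
        for _ in range(reps):
            x = block(x)
    return x

print(forward(1.0))
```

The point of the parallel form is that `attn` and `mlp` no longer depend on each other within a block, so they can be computed concurrently; depth recurrence buys effective depth without adding parameters to the artifact.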

Compliance

  • Legal score-first TTT (each token scored BEFORE weight updates)
  • No SLOT, no n-gram cache
  • GPTQ calibration within training budget
  • All four conditions from Issue #1017 (A Field Guide to Valid Submissions) satisfied
  • Total eval: ~400s (sliding ~100s + TTT ~300s), within 600s budget
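The score-first ordering in the first compliance bullet can be sketched with a toy adaptive model. The bigram-count "model" below is a placeholder for the real network (which would score under `torch.no_grad` and update with SGD); only the ordering is the point — every token is scored with the model state from *before* that token contributes to any update:

```python
import math
from collections import defaultdict

# Toy score-first test-time training: each token is scored BEFORE it updates
# the model, so no token's own bits benefit from having seen itself.

counts = defaultdict(lambda: defaultdict(int))  # stand-in "weights": bigram counts

def score(prev, tok, vocab=256, alpha=1.0):
    # bits for `tok` given `prev`, under add-alpha smoothing
    num = counts[prev][tok] + alpha
    den = sum(counts[prev].values()) + alpha * vocab
    return -math.log2(num / den)

def update(prev, tok):
    counts[prev][tok] += 1   # the "TTT step" for this toy model

def eval_bpb(tokens):
    total = 0.0
    for prev, tok in zip(tokens, tokens[1:]):
        total += score(prev, tok)   # score FIRST...
        update(prev, tok)           # ...then let the token adapt the model
    return total / (len(tokens) - 1)

print(eval_bpb([1, 2, 1, 2, 1, 2, 1, 2]))
```

On a repetitive sequence the per-token bits fall as the model adapts, which is where the TTT gain comes from; the legality constraint is purely the score-then-update order inside the loop.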

Reproduction

pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 TTT_FREEZE_BLOCKS=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
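The compressed-wrapper packaging from technique 7 can be sketched with the stdlib `lzma` codec. The real artifact reportedly layers Brotli and LZMA over a GPTQ int6 payload; the loader source and sizes below are made-up illustrations of the idea that the ~25KB of wrapper code is itself stored compressed inside the artifact:

```python
import lzma

# Toy compressed wrapper: loader code is stored LZMA-compressed inside the
# artifact and decompressed at load time, so it barely counts against size.
loader_src = "def load_model(path):\n    return open(path, 'rb').read()\n" * 50

blob = lzma.compress(loader_src.encode(), preset=9 | lzma.PRESET_EXTREME)
print(f"raw {len(loader_src)} bytes -> compressed {len(blob)} bytes")

# At eval time the wrapper decompresses (and would exec) the loader source.
restored = lzma.decompress(blob).decode()
assert restored == loader_src
```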

Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #461 @Christopher-Lee-McClendon

…egal TTT — val_bpb 1.0896 (3-seed mean)

SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R
+ QK-Gain 5.0 + legal score-first TTT + full GPTQ int6 + brotli.

3-seed mean: 1.0896 BPB, delta -0.0251 vs merged SOTA (PR openai#1019).
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Apr 4, 2026
Evidence from 4 independent configurations (PR openai#461, PR openai#601, PR openai#1326, and
my own experiments) showing GPTQ's compensatory weight structure is destroyed
by SGD-based test-time training.

Key finding: SGD TTT gives -0.0165 BPB on simple int6 but provides negligible
to negative improvement on GPTQ-quantized models (-0.0001 to +0.030 BPB).

Includes complete SGD TTT implementation (sgd_ttt_eval.py) following PR openai#461
protocol, and LoRA TTT implementation (clark_ttt_eval.py).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@himanshudongre

I independently confirmed that TTT provides negligible improvement on GPTQ-quantized models. My LoRA TTT (rank-8 on Q,V) on a GPTQ int6 Clark 11L model gave -0.0013 BPB — consistent with your finding of -0.0001 BPB here.

I wrote up a systematic analysis of why this happens: GPTQ's compensatory weight structure is destroyed by gradient-based updates. See PR #1341 for the full evidence table (4 configurations from 3 independent sources) and root cause analysis.

@aryanbhosale
Author

aryanbhosale commented Apr 4, 2026

@himanshudongre Thanks for the independent confirmation. My experience matches exactly — post-quant TTT on GPTQ models gives negligible or even negative returns because SGD disrupts the carefully calibrated quantization structure.

I observed the same pattern across multiple attempts:

  • Post-quant SGD TTT: -0.0008 BPB (barely measurable, and only with compiled scoring to avoid a forward-path bug)
  • Pre-quant TTT (adapting EMA weights before GPTQ): actually made final BPB worse by +0.002 because the adapted weights quantized poorly

The conclusion is clear: GPTQ's error compensation creates a fragile weight structure that gradient updates destroy. TTT and GPTQ are fundamentally at odds.
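The fragility claim can be illustrated with a toy quantization grid. This sketch shows only the simplest half of the story — a gradient step moves weights off the int6 grid, so the artifact can no longer be stored as int6 codes plus a scale; real GPTQ additionally spreads rounding error across later columns, which is the compensation structure the updates destroy. The scale, levels, and SGD step are invented for illustration:

```python
# Toy illustration of why gradient updates and a quantized weight grid are
# at odds: after one SGD step, the weight no longer lies on the int6 grid.

SCALE = 0.01          # per-tensor scale (real GPTQ uses per-group scales)
LEVELS = 2 ** 6       # int6 -> 64 representable levels

def quantize(w):
    q = max(-LEVELS // 2, min(LEVELS // 2 - 1, round(w / SCALE)))
    return q * SCALE

def on_grid(w):
    return abs(w - quantize(w)) < 1e-12

w = quantize(0.1234)          # a stored int6 weight, exactly on the grid
assert on_grid(w)

grad, lr = 0.7, 0.002
w_ttt = w - lr * grad         # one SGD TTT step
print(w, w_ttt, on_grid(w_ttt))
```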

Will check out your PR #1341 for the full analysis. This is a useful negative result for the community.
