
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT — val_bpb 1.0896 (3-seed mean) #1326

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp4096-v6-ttt

Conversation

@aryanbhosale

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT

val_bpb = 1.0896 (3-seed mean, std 0.0008) | ~15.99 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | TTT BPB | TTT gain | Artifact (bytes) |
|------|-------------|---------|----------|-------------------|
| 42   | 1.0896      | 1.0889  | -0.0007  | 15,999,165        |
| 314  | 1.0915      | 1.0906  | -0.0010  | 15,974,112        |
| 999  | 1.0901      | 1.0894  | -0.0007  | 15,996,001        |
| Mean |             | 1.0896  | -0.0008  |                   |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0251 BPB.

Key Techniques

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 @clarkkev, PR #1285 @dexhunter
  2. Depth Recurrence (layers 4,5) — PR #1204 @msisovic, PR #1260 @dexhunter
  3. Parallel Residuals (from layer 7) — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
  4. MuonEq-R — arXiv:2603.28254, PR #1260 @dexhunter
  5. QK-Gain 5.0 — PR #1217 @bigbag (non-record)
  6. Legal Score-First TTT — compiled scoring under `torch.no_grad`. PR #461 @Christopher-Lee-McClendon (non-record)
  7. Full GPTQ int6 + Brotli + LZMA Compressed Wrapper (~25KB code)
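The depth-recurrence and parallel-residual wiring (techniques 2 and 3) can be sketched with a toy forward pass. The scalar "sublayers", layer count, and recurrence count here are illustrative stand-ins, not this PR's actual architecture — only the control flow (reusing layers 4,5 twice, and switching to parallel residual branches from layer 7) mirrors the description above:

```python
# Toy forward pass illustrating depth recurrence and parallel residuals.
# Real blocks are attention + MLP over tensors; scalars keep the sketch minimal.

def attn(x):   # stand-in for an attention sublayer
    return 0.1 * x

def mlp(x):    # stand-in for an MLP sublayer
    return 0.2 * x

N_LAYERS = 12
RECUR_LAYERS = {4, 5}     # these layers run twice, reusing the same weights
PARALLEL_START = 7        # from this layer on, residual branches run in parallel

def block_sequential(x):
    # standard wiring: attn then mlp, each branch sees the previous output
    x = x + attn(x)
    x = x + mlp(x)
    return x

def block_parallel(x):
    # parallel residuals: both branches read the SAME input, summed once
    return x + attn(x) + mlp(x)

def forward(x):
    for i in range(N_LAYERS):
        block = block_parallel if i >= PARALLEL_START else block_sequential
        reps = 2 if i in RECUR_LAYERS else 1   # depth recurrence: extra pass, zero extra params
        for _ in range(reps):
            x = block(x)
    return x

print(forward(1.0))
```

The point of the parallel form is that `attn` and `mlp` no longer depend on each other within a block, so they can be computed concurrently; depth recurrence buys effective depth without adding parameters to the artifact.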

Compliance

  • Legal score-first TTT (each token scored BEFORE weight updates)
  • No SLOT, no n-gram cache
  • GPTQ calibration within training budget
  • All four conditions from Issue #1017 (A Field Guide to Valid Submissions) satisfied
  • Total eval: ~400s (sliding ~100s + TTT ~300s), within 600s budget
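The score-first ordering in the first compliance bullet can be sketched with a toy adaptive model. The bigram-count "model" below is a placeholder for the real network (which would score under `torch.no_grad` and update with SGD); only the ordering is the point — every token is scored with the model state from *before* that token contributes to any update:

```python
import math
from collections import defaultdict

# Toy score-first test-time training: each token is scored BEFORE it updates
# the model, so no token's own bits benefit from having seen itself.

counts = defaultdict(lambda: defaultdict(int))  # stand-in "weights": bigram counts

def score(prev, tok, vocab=256, alpha=1.0):
    # bits for `tok` given `prev`, under add-alpha smoothing
    num = counts[prev][tok] + alpha
    den = sum(counts[prev].values()) + alpha * vocab
    return -math.log2(num / den)

def update(prev, tok):
    counts[prev][tok] += 1   # the "TTT step" for this toy model

def eval_bpb(tokens):
    total = 0.0
    for prev, tok in zip(tokens, tokens[1:]):
        total += score(prev, tok)   # score FIRST...
        update(prev, tok)           # ...then let the token adapt the model
    return total / (len(tokens) - 1)

print(eval_bpb([1, 2, 1, 2, 1, 2, 1, 2]))
```

On a repetitive sequence the per-token bits fall as the model adapts, which is where the TTT gain comes from; the legality constraint is purely the score-then-update order inside the loop.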

Reproduction

pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 TTT_FREEZE_BLOCKS=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
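The compressed-wrapper packaging from technique 7 can be sketched with the stdlib `lzma` codec. The real artifact reportedly layers Brotli and LZMA over a GPTQ int6 payload; the loader source and sizes below are made-up illustrations of the idea that the ~25KB of wrapper code is itself stored compressed inside the artifact:

```python
import lzma

# Toy compressed wrapper: loader code is stored LZMA-compressed inside the
# artifact and decompressed at load time, so it barely counts against size.
loader_src = "def load_model(path):\n    return open(path, 'rb').read()\n" * 50

blob = lzma.compress(loader_src.encode(), preset=9 | lzma.PRESET_EXTREME)
print(f"raw {len(loader_src)} bytes -> compressed {len(blob)} bytes")

# At eval time the wrapper decompresses (and would exec) the loader source.
restored = lzma.decompress(blob).decode()
assert restored == loader_src
```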

Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #461 @Christopher-Lee-McClendon

…egal TTT — val_bpb 1.0896 (3-seed mean)

SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R
+ QK-Gain 5.0 + legal score-first TTT + full GPTQ int6 + brotli.

3-seed mean: 1.0896 BPB, delta -0.0251 vs merged SOTA (PR openai#1019).
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Apr 4, 2026
Evidence from 4 independent configurations (PR openai#461, PR openai#601, PR openai#1326, and
my own experiments) showing GPTQ's compensatory weight structure is destroyed
by SGD-based test-time training.

Key finding: SGD TTT gives -0.0165 BPB on simple int6 but provides negligible
to negative improvement on GPTQ-quantized models (-0.0001 to +0.030 BPB).

Includes complete SGD TTT implementation (sgd_ttt_eval.py) following PR openai#461
protocol, and LoRA TTT implementation (clark_ttt_eval.py).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@himanshudongre

I independently confirmed that TTT provides negligible improvement on GPTQ-quantized models. My LoRA TTT (rank-8 on Q,V) on a GPTQ int6 Clark 11L model gave -0.0013 BPB — consistent with your finding of -0.0001 BPB here.

I wrote up a systematic analysis of why this happens: GPTQ's compensatory weight structure is destroyed by gradient-based updates. See PR #1341 for the full evidence table (4 configurations from 3 independent sources) and root cause analysis.

@aryanbhosale
Author

aryanbhosale commented Apr 4, 2026

@himanshudongre Thanks for the independent confirmation. My experience matches exactly — post-quant TTT on GPTQ models gives negligible or even negative returns because SGD disrupts the carefully calibrated quantization structure.

I observed the same pattern across multiple attempts:

  • Post-quant SGD TTT: -0.0008 BPB (barely measurable, and only with compiled scoring to avoid a forward-path bug)
  • Pre-quant TTT (adapting EMA weights before GPTQ): actually made final BPB worse by +0.002 because the adapted weights quantized poorly

The conclusion is clear: GPTQ's error compensation creates a fragile weight structure that gradient updates destroy. TTT and GPTQ are fundamentally at odds.
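The fragility claim can be illustrated with a toy quantization grid. This sketch shows only the simplest half of the story — a gradient step moves weights off the int6 grid, so the artifact can no longer be stored as int6 codes plus a scale; real GPTQ additionally spreads rounding error across later columns, which is the compensation structure the updates destroy. The scale, levels, and SGD step are invented for illustration:

```python
# Toy illustration of why gradient updates and a quantized weight grid are
# at odds: after one SGD step, the weight no longer lies on the int6 grid.

SCALE = 0.01          # per-tensor scale (real GPTQ uses per-group scales)
LEVELS = 2 ** 6       # int6 -> 64 representable levels

def quantize(w):
    q = max(-LEVELS // 2, min(LEVELS // 2 - 1, round(w / SCALE)))
    return q * SCALE

def on_grid(w):
    return abs(w - quantize(w)) < 1e-12

w = quantize(0.1234)          # a stored int6 weight, exactly on the grid
assert on_grid(w)

grad, lr = 0.7, 0.002
w_ttt = w - lr * grad         # one SGD TTT step
print(w, w_ttt, on_grid(w_ttt))
```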

Will check out your PR #1341 for the full analysis. This is a useful negative result for the community.
