
Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean)#1550

Open
translatingthename wants to merge 1 commit into openai:main from translatingthename:submission/sp8192-prequant-ttt-1.0587

Conversation


@translatingthename translatingthename commented Apr 11, 2026

Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence

val_bpb = 1.0587 (3-seed mean, std 0.0004) | ~15.5 MB | 8xH100 SXM

Compliance

This is a non-record submission. The pre-quant TTT implementation (lines 2371-2458 of train_gpt.py) runs 6 AdamW epochs on the full validation token stream before GPTQ quantization, then scores the same tokens via sliding window. This violates Condition 3 of Issue #1017 (score-before-update) and is structurally identical to the pattern in closed PR #1376 and withdrawn PR #1485.

3-Seed Results

| Seed | Sliding BPB | Roundtrip BPB | Artifact (bytes) |
|------|-------------|---------------|------------------|
| 42   | 1.05840     | 1.06847       | 15,477,275       |
| 1337 | 1.05856     | 1.06904       | 15,439,370       |
| 2024 | 1.05912     | 1.06921       | 15,480,770       |
| Mean | 1.05869     | 1.06891       | 15,465,805       |
| Std  | 0.00038     | 0.00037       |                  |
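As a quick cross-check of the table, the mean and sample standard deviation of the sliding-BPB column can be recomputed with the stdlib:

```python
# Sanity check of the 3-seed table: recompute mean and sample std
# of the sliding BPB column.
from statistics import mean, stdev

sliding_bpb = {42: 1.05840, 1337: 1.05856, 2024: 1.05912}

m = mean(sliding_bpb.values())
s = stdev(sliding_bpb.values())  # sample (n-1) std, matching the reported 0.00038
print(f"mean={m:.5f} std={s:.5f}")
```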

Contributions

1. Quantifying the Condition 3 boundary

Two measured points bracket the illegal TTT contribution at roughly -0.044 BPB (post-EMA 1.1028 → post-GPTQ sliding 1.0587):

| Configuration | BPB | Measured? |
|---------------|-----|-----------|
| Post-EMA (no TTT, no GPTQ) | 1.1028 | Yes |
| Post-GPTQ sliding (illegal 6-epoch TTT) | 1.0587 | Yes |
| Post-GPTQ sliding (no TTT) | ~1.106 | Estimated |
| Post-GPTQ sliding (legal score-first TTT) | ~1.104 | Estimated from PR #1493 delta |
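The -0.044 bound and the 22x magnitude gap follow directly from the measured points; a minimal arithmetic check:

```python
# Deltas implied by the measured/estimated points in the table above.
post_ema         = 1.1028   # measured: no TTT, no GPTQ
illegal_ttt_gptq = 1.0587   # measured: 6-epoch pre-quant TTT, then GPTQ
legal_ttt_delta  = -0.002   # estimated from PR #1493's score-first TTT

illegal_ttt_delta = illegal_ttt_gptq - post_ema
print(f"illegal TTT contribution: {illegal_ttt_delta:+.4f} BPB")
print(f"illegal/legal magnitude ratio: {illegal_ttt_delta / legal_ttt_delta:.0f}x")
```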

For comparison, the legal score-first TTT in merged PR #1493 contributes approximately -0.002 BPB. This is not an apples-to-apples comparison — different optimizers, epoch counts, chunk sizes, and base models — but the order-of-magnitude gap illustrates why Condition 3 is load-bearing.

On the theoretical ceiling: Issue #1017 states: "Corpus-level TTT has a ceiling of approximately 0.0003 bits" — referring to the gain from closing the train-val distribution gap. However, the author also notes "a model that undertrained on the training distribution can still benefit." Legal TTT's -0.002 exceeds the 0.0003 distribution-gap ceiling because our 600s-capped model is undertrained — this is legitimate undertraining compensation, not memorization. Our illegal TTT's -0.044, however, is 22x larger than legal TTT on a similar architecture, a magnitude not explainable by undertraining compensation alone.

A per-epoch ablation (not performed) would strengthen this argument: linear scaling with epoch count = memorization signature; rapid saturation = generalization.
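The decision rule for that (unperformed) ablation can be made concrete. This is a hypothetical sketch, not code from this PR: given the BPB improvement contributed by each additional TTT epoch, near-constant increments indicate linear scaling (memorization signature), while rapidly shrinking increments indicate saturation (generalization signature).

```python
# Hypothetical classifier for the proposed per-epoch TTT ablation.
def scaling_signature(gains_per_epoch: list[float], ratio_threshold: float = 0.5) -> str:
    """gains_per_epoch[i] = BPB improvement contributed by epoch i+1 alone.

    If late epochs still deliver a large fraction of the first epoch's gain,
    total gain scales ~linearly with epoch count (memorization signature);
    if increments decay quickly, the gain saturates (generalization signature).
    """
    first, last = gains_per_epoch[0], gains_per_epoch[-1]
    return "linear (memorization)" if last / first > ratio_threshold else "saturating (generalization)"

# Illustrative curves only; no such per-epoch measurements exist in this PR.
print(scaling_signature([0.0070, 0.0071, 0.0069, 0.0072, 0.0070, 0.0069]))
print(scaling_signature([0.0015, 0.0004, 0.0001, 0.00003, 0.00001, 0.0]))
```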

2. Compiled TTT (2x speedup)

torch.compile(dynamic=False, fullgraph=True) reduces TTT from ~860s to ~426s. Safe in train mode with torch.autocast — no torch.inference_mode() tensor poisoning. Fresh model instance avoids rotary cache contamination. Applies equally to legal score-first TTT.

3. Artifact budget engineering

With SP8192, fitting under 16MB required component-level analysis:

  • BigramHash (2048×128): +109KB compressed, ~0.001 BPB at SP8192. Dropped.
  • VE dim 128→44: saves 340KB, loses ~0.001 BPB. Optimal via EV analysis (0% pruning risk, 39KB margin).
  • Aggressive ±1 pruning destroys quality: 18% pruning = +0.054 BPB. Eliminating pruning via component selection is critical.
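The selection logic behind those bullets can be sketched as a simple trade-off: each candidate option frees compressed bytes at some BPB cost, and options are taken cheapest-BPB-per-byte first until the artifact fits. The starting size below is hypothetical; the per-option numbers are the ones quoted above.

```python
# Component-selection sketch: free bytes at minimal BPB cost until under cap.
BUDGET = 16_000_000
artifact = 16_380_000  # hypothetical over-budget starting size, for illustration

# (compressed bytes freed, BPB cost), as quoted in this PR
options = {
    "shrink VE dim 128->44":      (340_000, 0.001),
    "drop BigramHash (2048x128)": (109_000, 0.001),
}

chosen = []
# Greedy: lowest BPB cost per byte freed first. This avoids +-1 pruning,
# which costs +0.054 BPB at 18% and is far worse per byte.
for name, (saved, bpb_cost) in sorted(options.items(), key=lambda kv: kv[1][1] / kv[1][0]):
    if artifact <= BUDGET:
        break
    artifact -= saved
    chosen.append(name)

print(chosen, artifact, artifact <= BUDGET)
```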

Compliance Statement

Violates Condition 3 of Issue #1017. Lines 2371-2458 run 6 AdamW epochs on full val stream before GPTQ. Same tokens scored afterward. No score-before-adapt discipline. Structurally identical to closed PR #1376 and withdrawn PR #1485.

Architecture

11L × 512d × 8H/4KV, MLP 4× (2048), LeakyReLU(0.5)², Partial RoPE (16/64), LN scale, tied embeddings, softcap=30. Depth recurrence L3-5 (14 virtual). Parallel residuals L7+. XSA all layers. VE dim=44 L9-10. SmearGate.
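The shorthand above unpacks into an explicit configuration; the field names here are illustrative, not the actual train_gpt.py flags or keys.

```python
# Restatement of the architecture shorthand. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int = 11
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4              # grouped-query attention: 8H/4KV
    d_mlp: int = 2048                # 4x expansion
    activation: str = "LeakyReLU(0.5)^2"
    head_dim: int = 64
    rope_dims: int = 16              # partial RoPE: 16 of 64 head dims rotated
    logit_softcap: float = 30.0
    tied_embeddings: bool = True
    recurrent_layers: tuple = (3, 4, 5)   # depth recurrence -> 14 virtual layers
    parallel_residual_from: int = 7
    ve_dim: int = 44                 # value embeddings, layers 9-10
    ve_layers: tuple = (9, 10)
    vocab_size: int = 8192           # SP8192 SentencePiece vocab

cfg = ModelConfig()
# 11 physical layers plus one extra pass through L3-5 = 14 virtual layers.
print(cfg.n_layers + len(cfg.recurrent_layers))
```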

Reproduction

pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
VOCAB_SIZE=8192 BIGRAM_VOCAB_SIZE=0 VE_DIM=44 TTT_ENABLED=1 SEED=42 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1394 @clarkkev, PR #1493 @bigbag, PR #1485 @ndokutovich, PR #1412 @Robby955, PR #1204 @msisovic, PR #1285 @dexhunter

Checklist

  • Folder under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py, 3 seed logs
  • All artifacts < 16,000,000 bytes, train < 600s
  • Non-record: Condition 3 violation documented

…e — val_bpb 1.0587 (3-seed mean)

Non-record submission. Pre-quant TTT violates Condition 3 of Issue openai#1017
(score-before-update). Submitted as technique study documenting:
- Condition 3 boundary quantification (illegal TTT -0.044 vs legal -0.002)
- Compiled TTT (torch.compile 2x speedup, applicable to legal TTT)
- Artifact budget engineering (VE dim optimization, pruning analysis)

3-seed mean sliding BPB: 1.05869 (std 0.00038)
All artifacts under 16,000,000 bytes. Zero pruning needed.
translatingthename force-pushed the submission/sp8192-prequant-ttt-1.0587 branch from 6028279 to d217bf5 on April 11, 2026 at 19:32.