
Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean)#1550

Open
translatingthename wants to merge 1 commit into openai:main from translatingthename:submission/sp8192-prequant-ttt-1.0587

Conversation


@translatingthename translatingthename commented Apr 11, 2026

Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence

val_bpb = 1.0587 (3-seed mean, std 0.0004) | ~15.5 MB | 8xH100 SXM

Compliance

This is a non-record submission. The pre-quant TTT implementation (lines 2371-2458 of train_gpt.py) runs 6 AdamW epochs on the full validation token stream before GPTQ quantization, then scores the same tokens via sliding window. This violates Condition 3 of Issue #1017 (score-before-update) and is structurally identical to the pattern in closed PR #1376 and withdrawn PR #1485.

3-Seed Results

| Seed | Sliding BPB | Roundtrip BPB | Artifact (bytes) |
|------|-------------|---------------|------------------|
| 42   | 1.05840     | 1.06847       | 15,477,275       |
| 1337 | 1.05856     | 1.06904       | 15,439,370       |
| 2024 | 1.05912     | 1.06921       | 15,480,770       |
| Mean | 1.05869     | 1.06891       | 15,465,805       |
| Std  | 0.00038     | 0.00037       |                  |
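As a quick cross-check of the table, the mean and sample standard deviation of the sliding-BPB column can be recomputed with the stdlib:

```python
# Sanity check of the 3-seed table: recompute mean and sample std
# of the sliding BPB column.
from statistics import mean, stdev

sliding_bpb = {42: 1.05840, 1337: 1.05856, 2024: 1.05912}

m = mean(sliding_bpb.values())
s = stdev(sliding_bpb.values())  # sample (n-1) std, matching the reported 0.00038
print(f"mean={m:.5f} std={s:.5f}")
```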

Contributions

1. Quantifying the Condition 3 boundary

Two measured points bracket the illegal TTT contribution at roughly -0.044 BPB (post-EMA 1.1028 → post-GPTQ sliding 1.0587):

| Configuration | BPB | Measured? |
|---------------|-----|-----------|
| Post-EMA (no TTT, no GPTQ) | 1.1028 | Yes |
| Post-GPTQ sliding (illegal 6-epoch TTT) | 1.0587 | Yes |
| Post-GPTQ sliding (no TTT) | ~1.106 | Estimated |
| Post-GPTQ sliding (legal score-first TTT) | ~1.104 | Estimated from PR #1493 delta |
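The -0.044 bound and the 22x magnitude gap follow directly from the measured points; a minimal arithmetic check:

```python
# Deltas implied by the measured/estimated points in the table above.
post_ema         = 1.1028   # measured: no TTT, no GPTQ
illegal_ttt_gptq = 1.0587   # measured: 6-epoch pre-quant TTT, then GPTQ
legal_ttt_delta  = -0.002   # estimated from PR #1493's score-first TTT

illegal_ttt_delta = illegal_ttt_gptq - post_ema
print(f"illegal TTT contribution: {illegal_ttt_delta:+.4f} BPB")
print(f"illegal/legal magnitude ratio: {illegal_ttt_delta / legal_ttt_delta:.0f}x")
```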

For comparison, the legal score-first TTT in merged PR #1493 contributes approximately -0.002 BPB. This is not an apples-to-apples comparison — different optimizers, epoch counts, chunk sizes, and base models — but the order-of-magnitude gap illustrates why Condition 3 is load-bearing.

On the theoretical ceiling: Issue #1017 states: "Corpus-level TTT has a ceiling of approximately 0.0003 bits" — referring to the gain from closing the train-val distribution gap. However, the author also notes "a model that undertrained on the training distribution can still benefit." Legal TTT's -0.002 exceeds the 0.0003 distribution-gap ceiling because our 600s-capped model is undertrained — this is legitimate undertraining compensation, not memorization. Our illegal TTT's -0.044, however, is 22x larger than legal TTT on a similar architecture, a magnitude not explainable by undertraining compensation alone.

A per-epoch ablation (not performed) would strengthen this argument: linear scaling with epoch count = memorization signature; rapid saturation = generalization.
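The decision rule for that (unperformed) ablation can be made concrete. This is a hypothetical sketch, not code from this PR: given the BPB improvement contributed by each additional TTT epoch, near-constant increments indicate linear scaling (memorization signature), while rapidly shrinking increments indicate saturation (generalization signature).

```python
# Hypothetical classifier for the proposed per-epoch TTT ablation.
def scaling_signature(gains_per_epoch: list[float], ratio_threshold: float = 0.5) -> str:
    """gains_per_epoch[i] = BPB improvement contributed by epoch i+1 alone.

    If late epochs still deliver a large fraction of the first epoch's gain,
    total gain scales ~linearly with epoch count (memorization signature);
    if increments decay quickly, the gain saturates (generalization signature).
    """
    first, last = gains_per_epoch[0], gains_per_epoch[-1]
    return "linear (memorization)" if last / first > ratio_threshold else "saturating (generalization)"

# Illustrative curves only; no such per-epoch measurements exist in this PR.
print(scaling_signature([0.0070, 0.0071, 0.0069, 0.0072, 0.0070, 0.0069]))
print(scaling_signature([0.0015, 0.0004, 0.0001, 0.00003, 0.00001, 0.0]))
```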

2. Compiled TTT (2x speedup)

torch.compile(dynamic=False, fullgraph=True) reduces TTT from ~860s to ~426s. Safe in train mode with torch.autocast — no torch.inference_mode() tensor poisoning. Fresh model instance avoids rotary cache contamination. Applies equally to legal score-first TTT.

3. Artifact budget engineering

With SP8192, fitting under 16MB required component-level analysis:

  • BigramHash (2048×128): +109KB compressed, ~0.001 BPB at SP8192. Dropped.
  • VE dim 128→44: saves 340KB, loses ~0.001 BPB. Optimal via EV analysis (0% pruning risk, 39KB margin).
  • Aggressive ±1 pruning destroys quality: 18% pruning = +0.054 BPB. Eliminating pruning via component selection is critical.
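The selection logic behind those bullets can be sketched as a simple trade-off: each candidate option frees compressed bytes at some BPB cost, and options are taken cheapest-BPB-per-byte first until the artifact fits. The starting size below is hypothetical; the per-option numbers are the ones quoted above.

```python
# Component-selection sketch: free bytes at minimal BPB cost until under cap.
BUDGET = 16_000_000
artifact = 16_380_000  # hypothetical over-budget starting size, for illustration

# (compressed bytes freed, BPB cost), as quoted in this PR
options = {
    "shrink VE dim 128->44":      (340_000, 0.001),
    "drop BigramHash (2048x128)": (109_000, 0.001),
}

chosen = []
# Greedy: lowest BPB cost per byte freed first. This avoids +-1 pruning,
# which costs +0.054 BPB at 18% and is far worse per byte.
for name, (saved, bpb_cost) in sorted(options.items(), key=lambda kv: kv[1][1] / kv[1][0]):
    if artifact <= BUDGET:
        break
    artifact -= saved
    chosen.append(name)

print(chosen, artifact, artifact <= BUDGET)
```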

Compliance Statement

Violates Condition 3 of Issue #1017. Lines 2371-2458 run 6 AdamW epochs on full val stream before GPTQ. Same tokens scored afterward. No score-before-adapt discipline. Structurally identical to closed PR #1376 and withdrawn PR #1485.

Architecture

11L × 512d × 8H/4KV, MLP 4× (2048), LeakyReLU(0.5)², Partial RoPE (16/64), LN scale, tied embeddings, softcap=30. Depth recurrence L3-5 (14 virtual). Parallel residuals L7+. XSA all layers. VE dim=44 L9-10. SmearGate.
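The shorthand above unpacks into an explicit configuration; the field names here are illustrative, not the actual train_gpt.py flags or keys.

```python
# Restatement of the architecture shorthand. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int = 11
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4              # grouped-query attention: 8H/4KV
    d_mlp: int = 2048                # 4x expansion
    activation: str = "LeakyReLU(0.5)^2"
    head_dim: int = 64
    rope_dims: int = 16              # partial RoPE: 16 of 64 head dims rotated
    logit_softcap: float = 30.0
    tied_embeddings: bool = True
    recurrent_layers: tuple = (3, 4, 5)   # depth recurrence -> 14 virtual layers
    parallel_residual_from: int = 7
    ve_dim: int = 44                 # value embeddings, layers 9-10
    ve_layers: tuple = (9, 10)
    vocab_size: int = 8192           # SP8192 SentencePiece vocab

cfg = ModelConfig()
# 11 physical layers plus one extra pass through L3-5 = 14 virtual layers.
print(cfg.n_layers + len(cfg.recurrent_layers))
```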

Reproduction

pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
VOCAB_SIZE=8192 BIGRAM_VOCAB_SIZE=0 VE_DIM=44 TTT_ENABLED=1 SEED=42 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1394 @clarkkev, PR #1493 @bigbag, PR #1485 @ndokutovich, PR #1412 @Robby955, PR #1204 @msisovic, PR #1285 @dexhunter

Checklist

  • Folder under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py, 3 seed logs
  • All artifacts < 16,000,000 bytes, train < 600s
  • Non-record: Condition 3 violation documented

…e — val_bpb 1.0587 (3-seed mean)

Non-record submission. Pre-quant TTT violates Condition 3 of Issue openai#1017
(score-before-update). Submitted as technique study documenting:
- Condition 3 boundary quantification (illegal TTT -0.044 vs legal -0.002)
- Compiled TTT (torch.compile 2x speedup, applicable to legal TTT)
- Artifact budget engineering (VE dim optimization, pruning analysis)

3-seed mean sliding BPB: 1.05869 (std 0.00038)
All artifacts under 16,000,000 bytes. Zero pruning needed.
translatingthename force-pushed the submission/sp8192-prequant-ttt-1.0587 branch from 6028279 to d217bf5 on April 11, 2026 at 19:32.