Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean) #1550
translatingthename wants to merge 1 commit into openai:main from
Non-record submission. Pre-quant TTT violates Condition 3 of Issue openai#1017 (score-before-update). Submitted as a technique study documenting:

- Condition 3 boundary quantification (illegal TTT -0.044 vs legal -0.002)
- Compiled TTT (torch.compile 2x speedup, applicable to legal TTT)
- Artifact budget engineering (VE dim optimization, pruning analysis)

3-seed mean sliding BPB: 1.05869 (std 0.00038). All artifacts under 16,000,000 bytes. Zero pruning needed.
Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence
val_bpb = 1.0587 (3-seed mean, std 0.0004) | ~15.5 MB | 8xH100 SXM
Compliance
This is a non-record submission. The pre-quant TTT implementation (lines 2371-2458 of
train_gpt.py) runs 6 AdamW epochs on the full validation token stream before GPTQ quantization, then scores the same tokens via sliding window. This violates Condition 3 of Issue #1017 (score-before-update) and is structurally identical to the pattern in closed PR #1376 and withdrawn PR #1485.

3-Seed Results
Contributions
1. Quantifying the Condition 3 boundary
Two measured points bound the illegal TTT contribution at -0.044 BPB (post-EMA 1.103 → post-GPTQ sliding 1.059):
For comparison, the legal score-first TTT in merged PR #1493 contributes approximately -0.002 BPB. This is not an apples-to-apples comparison — different optimizers, epoch counts, chunk sizes, and base models — but the order-of-magnitude gap illustrates why Condition 3 is load-bearing.
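To make the score-before-update distinction concrete, here is a toy sketch contrasting the two orderings. The "model" is an add-one-smoothed unigram counter, purely illustrative — it is not the PR's AdamW TTT — but the ordering of scoring and adaptation mirrors what Condition 3 regulates:

```python
import math
from collections import Counter

def score_then_update(chunks, vocab_size=256):
    """Legal (Condition 3) discipline: each chunk is scored with the
    model state from *before* the model has adapted on that chunk."""
    counts, total = Counter(), 0        # toy unigram "model"
    bits, n_tokens = 0.0, 0
    for chunk in chunks:
        # 1) score the chunk with the current (pre-update) state
        for tok in chunk:
            p = (counts[tok] + 1) / (total + vocab_size)
            bits += -math.log2(p)
            n_tokens += 1
        # 2) only afterwards adapt on the same chunk
        for tok in chunk:
            counts[tok] += 1
            total += 1
    return bits / n_tokens              # bits per token over the stream

def update_then_score(chunks, vocab_size=256, epochs=6):
    """Illegal variant: adapt on the full stream first (several epochs),
    then score the same tokens -- the pattern this PR flags."""
    counts, total = Counter(), 0
    for _ in range(epochs):
        for chunk in chunks:
            for tok in chunk:
                counts[tok] += 1
                total += 1
    bits, n_tokens = 0.0, 0
    for chunk in chunks:
        for tok in chunk:
            p = (counts[tok] + 1) / (total + vocab_size)
            bits += -math.log2(p)
            n_tokens += 1
    return bits / n_tokens
```

Even on this toy, the illegal ordering scores strictly lower bits per token on the same stream, because the model has already seen every scored token — the same asymmetry, at a much larger scale, behind the -0.044 vs -0.002 gap.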
On the theoretical ceiling: Issue #1017 states: "Corpus-level TTT has a ceiling of approximately 0.0003 bits" — referring to the gain from closing the train-val distribution gap. However, the author also notes "a model that undertrained on the training distribution can still benefit." Legal TTT's -0.002 exceeds the 0.0003 distribution-gap ceiling because our 600s-capped model is undertrained — this is legitimate undertraining compensation, not memorization. Our illegal TTT's -0.044, however, is 22x larger than legal TTT on a similar architecture, a magnitude not explainable by undertraining compensation alone.
A per-epoch ablation (not performed) would strengthen this argument: roughly linear scaling of the gain with epoch count would be a memorization signature, while rapid saturation would indicate generalization.
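The analysis of such an ablation could be sketched as follows. This is a hypothetical helper (the ablation was not run, and the threshold `tol` is illustrative), but it states the decision rule precisely:

```python
def ablation_signature(deltas, tol=0.25):
    """Classify a per-epoch TTT gain curve.

    deltas[i] = cumulative BPB improvement after i+1 TTT epochs.
    Near-linear growth with epoch count -> memorization signature;
    gains concentrated in the first epoch -> generalization
    (undertraining compensation). `tol` is an illustrative slack.
    """
    first, last = deltas[0], deltas[-1]
    linear_pred = first * len(deltas)       # pure linear scaling would give this
    if last >= (1 - tol) * linear_pred:
        return "memorization-like (near-linear scaling)"
    if last <= 2 * first:                   # gains mostly captured in epoch 1
        return "generalization-like (rapid saturation)"
    return "ambiguous"
```

For example, a curve that keeps growing each epoch (0.008, 0.016, ..., 0.044) classifies as memorization-like, while one that flattens after epoch 1 classifies as generalization-like.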
2. Compiled TTT (2x speedup)
torch.compile(dynamic=False, fullgraph=True) reduces TTT wall time from ~860s to ~426s. It is safe in train mode with torch.autocast — no torch.inference_mode() tensor poisoning. A fresh model instance avoids rotary cache contamination. The speedup applies equally to legal score-first TTT.

3. Artifact budget engineering
With SP8192, fitting under 16MB required component-level analysis:
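A minimal sketch of what such component-level accounting looks like — the component names and numbers below are made up for illustration, not the PR's actual breakdown:

```python
def budget_report(components, cap_bytes=16_000_000):
    """Sum per-component artifact sizes against the 16,000,000-byte cap.

    components: list of (name, n_params, bits_per_param) tuples,
    e.g. GPTQ-quantized weights at 4 bits per parameter.
    Returns (per-component byte rows, total bytes, remaining headroom).
    """
    rows, total = [], 0
    for name, n_params, bits in components:
        nbytes = n_params * bits // 8   # bits -> bytes for this component
        total += nbytes
        rows.append((name, nbytes))
    return rows, total, cap_bytes - total
```

Running this over a candidate breakdown immediately shows whether pruning is needed (negative headroom) or not, which is the kind of check behind the "zero pruning needed" claim.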
Compliance Statement
Violates Condition 3 of Issue #1017. Lines 2371-2458 run 6 AdamW epochs on full val stream before GPTQ. Same tokens scored afterward. No score-before-adapt discipline. Structurally identical to closed PR #1376 and withdrawn PR #1485.
Architecture
11L × 512d × 8H/4KV, MLP 4× (2048), LeakyReLU(0.5)², Partial RoPE (16/64), LN scale, tied embeddings, softcap=30. Depth recurrence L3-5 (14 virtual). Parallel residuals L7+. XSA all layers. VE dim=44 L9-10. SmearGate.
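The architecture line above can be restated as a config sketch. Field names here are hypothetical (they are not taken from the PR's train_gpt.py); the values are the ones stated above:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Illustrative config mirroring the stated architecture."""
    n_layer: int = 11          # physical layers
    n_virtual: int = 14        # with depth recurrence over L3-5
    d_model: int = 512
    n_head: int = 8            # query heads
    n_kv_head: int = 4         # GQA: 4 KV heads
    head_dim: int = 64
    d_mlp: int = 2048          # 4x expansion
    rope_dims: int = 16        # partial RoPE: 16 of 64 head dims
    softcap: float = 30.0
    ve_dim: int = 44           # value embeddings on L9-10
    tied_embeddings: bool = True
```

The internal consistency checks (MLP is 4x the model dim; heads times head dim recovers the model dim) fall out directly from these values.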
Reproduction
Credits
PR #1394 @clarkkev, PR #1493 @bigbag, PR #1485 @ndokutovich, PR #1412 @Robby955, PR #1204 @msisovic, PR #1285 @dexhunter
Checklist
records/track_10min_16mb/