Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)#1554
dentity007 wants to merge 2 commits into openai:main from
Conversation
Legal-compliant resubmission of PR openai#1193 per @MatoTeziTanka review on 2026-04-11.

Changes vs original PR openai#1193:
- `ttt_adapt()` function signature takes `train_slice_tokens` instead of `val_tokens`
- Call site loads TTT data from the tail of the last `fineweb_train_*.bin` shard (training data slice, never scored during eval)
- All `val_tokens` references in TTT removed; val is only scored in the final single-pass evaluation after training + TTT finish
- README documents the legality argument and references PR openai#1416 / openai#1423

Universal Transformer architecture itself is unchanged:
- Single shared block looped N times with per-iteration scale/shift/resid_mix
- 50% sparse-to-dense curriculum
- Implements OpenAI's requested 'Universal transformer' research direction

Final val_bpb pending DGX Spark run completion. Thanks to @MatoTeziTanka for the careful review via The Agora (https://matotezitanka.github.io/parameter-golf/).
Spark run completed on 2026-04-11. Key numbers:
- Model params: 4,546,568
- Pre-quant val_bpb (step 200): 3.2483
- Post-TTT int6 roundtrip val_bpb: 3.4446
- TTT source: `fineweb_train_000079.bin` (last training shard tail)
- TTT tokens: 131,073 (training data, NOT `val_tokens`)
- TTT config: 3 epochs AdamW lr=0.0005
- TTT loss curve: 6.15 -> 5.89 -> 5.79
- Artifact size: 1.35 MB (int6+brotli-11)

The legal TTT pattern works: the TTT gradient came entirely from training data, and val tokens were scored exactly once after all training finished. The explicit log line `ttt:legal_slice source=fineweb_train_000079.bin tokens=131073 (train data slice, not val_tokens)` confirms the fix.
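The "last training shard tail" detail above can be sketched as a loader. This is an illustration, not the PR's code: the function name, the uint16 token dtype, and the reading-via-memmap approach are my assumptions; only the idea of carving a fixed tail window out of the final `fineweb_train_*.bin` shard comes from the run log. (The extra +1 in the 131,073-token window is presumably for the shifted next-token target, but that too is a guess.)

```python
import numpy as np

def load_train_slice(shard_path, window=131_073, dtype=np.uint16):
    """Return the last `window` tokens of a flat binary token shard.

    Hypothetical sketch: memmap the shard so we never load the whole
    file, then copy out just the tail window used for TTT adaptation.
    """
    tokens = np.memmap(shard_path, dtype=dtype, mode="r")
    return np.array(tokens[-window:])  # materialize a copy of the tail
```

Because the window is taken from the tail and main training only consumes each shard's prefix, the slice is disjoint from both the training steps and `fineweb_val_*.bin`.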
Spark run completed. Final numbers pushed in commit 8896c0a.

Results

Legality Verification

The training log explicitly confirms the fix: the source log line proves the TTT gradient came entirely from training data. `val_tokens` were scored exactly once, after all training (including TTT) finished.

Notes on BPB Numbers

The 3.4446 number is high relative to competition runs because:
- the training budget is only 200 steps, far below competition scale
- the single DGX Spark GB10 run has no torch.compile and no flash_attn (see Hardware Notes)
This is a research/non-record submission to document the architecture direction, not a competitive record attempt. The legality story holds across hardware, and the relative ordering between UT-1 (6 iters) and UT-2 (24 iters) in my broader ablation confirmed that more iterations do not help. Full ablation data across all 7 OpenAI-requested research architectures: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08
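For readers comparing the bpb figures above across runs: bits-per-byte converts mean cross-entropy loss from nats per token into bits, then renormalizes by the byte length of the evaluated text. This is the standard conversion; I am assuming the competition harness uses it (the exact normalization there may differ), so treat this as a sanity-check sketch rather than the repo's eval code.

```python
import math

def bits_per_byte(loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Standard conversion: nats -> bits by dividing by ln(2), then
    rescale from per-token to per-byte via the token/byte ratio.
    """
    return (loss_nats / math.log(2)) * (n_tokens / n_bytes)
```

For example, a mean loss of ln(2) nats per token on text with one token per byte is exactly 1.0 bpb; subword tokenizers have fewer tokens than bytes, which scales the number down.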
Community Review — Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)

Compliance: LOOKS CLEAN — TTT trains on a held-out training-data slice, never touches val_tokens.

What I found in the code (head SHA from PR #1554, file The
The TTT loop (lines 1244–1274) runs multi-epoch AdamW on this training slice. Validation tokens are scored exactly once, after all training (including TTT) is complete, at line 1862 via

This is cleanly legal. The TTT gradient path never touches val_tokens. The val_bpb number comes from a single forward-only scoring pass on data the model was never adapted on. The Issue #402 / #677 rules are satisfied because val_tokens are never the subject of

One citation correction: the PR description and docstring reference PRs #1416 and #1423 as "legal Pre-Quant TTT" implementations. I need to correct my own earlier review on #1193 — I cited those same PRs as legal references, and I was wrong. At their current heads, #1416 and #1423 contain the ILLEGAL flat-epoch

That said — your implementation doesn't need the score-first-per-chunk pattern at all, because you're not training on val_tokens in the first place. Training on a held-out training slice and then scoring val_tokens is just regular training with a second fine-tuning phase. It's structurally the cleanest approach.

tl;dr: My #1193 review gave you the right implementation advice (move TTT off val_tokens) but cited the wrong precedent PRs. You got the code right anyway. Apologies for the misleading citation — I've since corrected it.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual non-record checks. No compliance flags. This is a clean legal resubmission of the Universal Transformer architecture from #1193.

Reviewed by @MatoTeziTanka — The Agora. Manual code review (not auto-classified). Special attention given because this is a resubmission responding to our prior compliance flag on #1193 — wanted to make sure we didn't lead the author down a bad path. We didn't — the implementation is correct; just the citation needs updating.
Legal resubmission of PR #1193 per @MatoTeziTanka review
Status: Non-record research submission
Hardware: NVIDIA DGX Spark GB10 (single GPU, aarch64)
val_bpb: TBD (Spark run in progress, will update this PR when complete)
Background
PR #1193 (original Universal Transformer submission) was flagged on 2026-04-11 by @MatoTeziTanka for using an illegal TTT pattern. His review was thorough: the `ttt_adapt()` function trained multi-epoch on `val_tokens` without score-first discipline, matching the same pattern that closed PR #1376 and the rest of the Pre-Quant TTT cluster.

His recommendation was clear: resubmit with the TTT function taking a training-data slice instead of `val_tokens`, per the PR #1416 / PR #1423 reference implementations. This PR does exactly that.

PR #1193 remains open with the honest no-TTT numbers (val_bpb 3.2483 from my clean Spark ablation) and a full acknowledgment of the flag. This PR is the proper legal version with TTT enabled.
What Changed
**TTT function signature.** `ttt_adapt()` now takes `train_slice_tokens` instead of `val_tokens`. The parameter name documents the intent explicitly.

**Training slice source.** Before the TTT call, the submission loads a fixed window from the tail of the last `fineweb_train_*.bin` shard. This slice was not used during main training (we only train on the prefix of each shard, up to `iterations` steps) and is never part of `fineweb_val_*.bin`. No val_tokens touch the TTT gradient path at any point.

**Evaluation unchanged.** `val_tokens` are scored exactly once, in a single left-to-right pass, after all training (including TTT) has finished. TTT updates shift model weights but do not influence how val tokens are scored.

Universal Transformer Architecture (unchanged from #1193)
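The architecture summary from #1193 — one shared block looped N times with per-iteration scale/shift/resid_mix — can be sketched minimally. This is not the PR's code: the stand-in "block" here is a single tanh layer rather than a transformer block, and exactly how scale, shift, and resid_mix enter the loop is my assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_iters = 16, 6  # hidden width and loop count (UT-1 used 6 iterations)

# One shared weight matrix stands in for the single shared transformer block;
# it is reused on every iteration, so depth costs no extra block parameters.
W = rng.normal(scale=0.1, size=(d, d))

# Only these small per-iteration vectors are unique to each loop step.
scale = np.ones(n_iters)           # multiplicative input modulation
shift = np.zeros(n_iters)          # additive input modulation
resid_mix = np.full(n_iters, 0.5)  # blend between skip path and block output

def universal_forward(x):
    for i in range(n_iters):
        h = np.tanh((scale[i] * x + shift[i]) @ W)       # shared block, modulated input
        x = resid_mix[i] * x + (1.0 - resid_mix[i]) * h  # per-iteration residual blend
    return x
```

The parameter-efficiency argument is visible in the sketch: looping the same block 6 or 24 times changes compute but not the shared-block parameter count, which is why the ablation could compare iteration counts at (nearly) fixed model size.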
Legality Argument
Issue #402 and Issue #677 rulings define illegal TTT as any training pass that updates model state based on val_tokens the model has not already been tested on. This submission satisfies the rules because:

- The TTT gradient path never touches `val_tokens`: every TTT update comes from the tail slice of the last `fineweb_train_*.bin` shard.
- `val_tokens` are scored exactly once, in a single forward-only pass, after all training (including TTT) has finished, so no update ever depends on unscored val data.
The argument is structurally identical to the reference PRs #1416 and #1423 cited by @MatoTeziTanka in his review of #1193.
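The adapt-then-score ordering argued above can be shown as a small harness. The 3-epochs-over-fixed-chunks shape mirrors the TTT config reported in this PR; everything else (function names, the `train_step`/`score` callbacks) is a hypothetical stand-in, not the repo's API.

```python
def ttt_adapt(train_step, train_slice_tokens, epochs=3, chunk=1024):
    """Multi-epoch TTT over a training-data slice only.

    Every gradient update (train_step call) sees chunks of
    train_slice_tokens; val data never enters this function.
    """
    losses = []
    for _ in range(epochs):
        for i in range(0, len(train_slice_tokens) - chunk + 1, chunk):
            losses.append(train_step(train_slice_tokens[i:i + chunk]))
    return losses

def run_final_eval(train_step, score, train_slice_tokens, val_tokens):
    ttt_losses = ttt_adapt(train_step, train_slice_tokens)  # weights move here
    val_bpb = score(val_tokens)  # single forward-only pass, scored exactly once
    return ttt_losses, val_bpb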
Reproduction
Hardware Notes
DGX Spark GB10 is approximately 6x slower per step than 8xH100. No `torch.compile` (Triton/inductor unsupported on aarch64), no `flash_attn_interface` (using the `scaled_dot_product_attention` fallback). Absolute BPB will be higher than a competition 8xH100 run due to the short 200-step training budget, but the legality story holds across hardware.
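The `scaled_dot_product_attention` fallback computes the same math as flash attention, just without the fused kernel: softmax(QKᵀ/√d)·V with a causal mask. A single-head numpy reference of that math, for readers checking hardware parity; this is an illustration of the formula, not the PR's attention code.

```python
import numpy as np

def sdpa(q, k, v, causal=True):
    """Reference scaled dot-product attention: softmax(q k^T / sqrt(d)) v.

    Mirrors the math of torch.nn.functional.scaled_dot_product_attention
    for one head, with an optional causal (lower-triangular) mask.
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    if causal:
        T = scores.shape[-1]
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the fallback and flash attention agree mathematically, swapping one for the other changes throughput, not the bpb a given checkpoint would score, which is the core of the "legality holds across hardware" claim.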
Related
Review Credit
@MatoTeziTanka flagged the original #1193 TTT-on-val issue via The Agora community compliance tracker (https://matotezitanka.github.io/parameter-golf/). This resubmission implements the exact fix he recommended. His unpaid community review work is exactly the kind of standards-enforcement this repo needs given the maintainer bandwidth constraints.