
Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)#1554

Open
dentity007 wants to merge 2 commits into openai:main from NathanMaine:research/universal-transformer-legal-ttt

Conversation

@dentity007

Legal resubmission of PR #1193 per @MatoTeziTanka review

Status: Non-record research submission

Hardware: NVIDIA DGX Spark GB10 (single GPU, aarch64)

val_bpb: TBD (Spark run in progress, will update this PR when complete)

Background

PR #1193 (original Universal Transformer submission) was flagged on 2026-04-11 by @MatoTeziTanka for using an illegal TTT pattern. His review was thorough: the ttt_adapt() function trained multi-epoch on val_tokens without score-first discipline, matching the same pattern that closed PR #1376 and the rest of the Pre-Quant TTT cluster.

His recommendation was clear: resubmit with the TTT function taking a training-data slice instead of val_tokens, per the PR #1416 / PR #1423 reference implementations. This PR does exactly that.

PR #1193 remains open with the honest no-TTT numbers (val_bpb 3.2483 from my clean Spark ablation) and a full acknowledgment of the flag. This PR is the proper legal version with TTT enabled.

What Changed

  1. TTT function signature. ttt_adapt() now takes train_slice_tokens instead of val_tokens. The parameter name documents the intent explicitly.

  2. Training slice source. Before the TTT call, the submission loads a fixed window from the tail of the last fineweb_train_*.bin shard. This slice was not used during main training (we only train on the prefix of each shard up to iterations steps) and is never part of fineweb_val_*.bin. No val_tokens touch the TTT gradient path at any point.

  3. Evaluation unchanged. val_tokens are scored exactly once, in a single left-to-right pass, after all training (including TTT) has finished. TTT changes the model weights but not the scoring procedure: val tokens never contribute to any gradient.
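A minimal sketch of change (2), assuming an in-memory stand-in for the shard and a stub `ttt_adapt`; the real train_gpt.py memory-maps fineweb_train_*.bin and runs actual gradient steps, and the constants here just mirror the env vars from the reproduction command.

```python
import numpy as np

# Hedged sketch of the training-slice fix. The real code reads the tail of
# the last fineweb_train_*.bin shard from disk; here a fake in-memory shard
# stands in, and ttt_adapt is a hypothetical stub.

TTT_TRAIN_SLICE_SEQS = 128
SEQ_LEN = 64  # illustrative; the real sequence length is set elsewhere

# Stand-in for the last training shard (real code: fineweb_train_000079.bin).
last_train_shard = np.arange(100_000, dtype=np.uint16)
val_tokens = np.arange(100_000, 110_000, dtype=np.uint16)  # disjoint by construction

# Take a fixed window from the shard TAIL: never consumed by main training
# (which only reads the shard prefix) and never part of any val shard.
n_ttt_tokens = TTT_TRAIN_SLICE_SEQS * SEQ_LEN + 1
train_slice_tokens = last_train_shard[-n_ttt_tokens:]

def ttt_adapt(train_slice_tokens):
    # Hypothetical stub: the real function runs multi-epoch AdamW on this slice.
    return len(train_slice_tokens)

n_seen = ttt_adapt(train_slice_tokens)  # val_tokens never enter this call
```

The parameter name alone does not enforce anything; what matters is that the call site only ever passes the train-shard tail.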

Universal Transformer Architecture (unchanged from #1193)

  • Single shared transformer block looped N times
  • Per-iteration learnable parameters: attn_scale, mlp_scale, resid_mix, iteration_embed
  • 50 percent sparse-to-dense curriculum during training
  • Implements OpenAI's requested "Universal transformer" research direction from the README
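The shared-block loop above can be sketched as follows. This is a toy illustration: the "attention" and "MLP" sublayers are stand-in linear+tanh maps, not the real block from train_gpt.py, and how the per-iteration scalars (attn_scale, mlp_scale, resid_mix, iteration_embed, names from the PR description) are applied is a guess.

```python
import numpy as np

# Toy Universal Transformer loop: ONE set of shared weights applied
# n_iters times, plus small per-iteration learnable parameters.
rng = np.random.default_rng(0)
d, n_iters = 8, 6

W_attn = rng.normal(size=(d, d)) * 0.1   # shared across all iterations
W_mlp = rng.normal(size=(d, d)) * 0.1

attn_scale = np.ones(n_iters)            # per-iteration learnable scalars
mlp_scale = np.ones(n_iters)
resid_mix = np.full(n_iters, 0.5)
iteration_embed = rng.normal(size=(n_iters, d)) * 0.01  # tells the block which pass it is

def universal_forward(x):
    for i in range(n_iters):
        h = x + iteration_embed[i]
        h = h + attn_scale[i] * np.tanh(h @ W_attn)       # stand-in "attention"
        h = h + mlp_scale[i] * np.tanh(h @ W_mlp)         # stand-in "MLP"
        x = resid_mix[i] * x + (1.0 - resid_mix[i]) * h   # residual mixing
    return x

x = rng.normal(size=(4, d))
y = universal_forward(x)
```

The parameter-efficiency argument is visible in the sketch: depth comes from the loop count, not from stacking distinct blocks.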

Legality Argument

Issue #402 and Issue #677 rulings define illegal TTT as any training pass that updates model state based on val_tokens the model has not already been tested on. This submission satisfies the rules because:

  1. The TTT gradient comes entirely from training-set tokens (a tail slice of the last fineweb_train shard)
  2. Those training tokens are never scored as part of val_bpb
  3. val_tokens are scored exactly once, after all training (including TTT) is complete
  4. No eval-time leakage of val targets into training loss

The argument is structurally identical to the reference PRs #1416 and #1423 cited by @MatoTeziTanka in his review of #1193.
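The four points above reduce to an ordering invariant, sketched here with hypothetical stand-in functions (none of these names are from train_gpt.py):

```python
# Ordering invariant behind the legality argument: val_tokens appear exactly
# once, only in eval, and only after all training (including TTT) finishes.
events = []

def train_main():
    events.append("train:fineweb_train")      # gradients from training shards

def ttt_adapt_slice():
    events.append("ttt:train_slice")          # gradients from train-shard tail

def score_val_once():
    events.append("eval:val_tokens")          # single forward-only pass

train_main()
ttt_adapt_slice()
score_val_once()

val_events = [e for e in events if "val_tokens" in e]
```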

Reproduction

pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
VOCAB_SIZE=1024 NUM_ITERS=6 TORCH_COMPILE_DISABLE=1 ITERATIONS=200 \
  TTT_ENABLED=1 TTT_EPOCHS=3 TTT_TRAIN_SLICE_SEQS=128 TTT_LR=0.0005 \
  python3 records/track_non_record_16mb/2026-04-11_UniversalTransformer_LegalTTT/train_gpt.py

Hardware Notes

DGX Spark GB10 is approximately 6x slower per step than 8xH100. No torch.compile (Triton/inductor unsupported on aarch64), no flash_attn_interface (using scaled_dot_product_attention fallback). Absolute BPB will be higher than a competition 8xH100 run due to the short 200-step training budget, but the legality story holds across hardware.

Review Credit

@MatoTeziTanka flagged the original #1193 TTT-on-val issue via The Agora community compliance tracker (https://matotezitanka.github.io/parameter-golf/). This resubmission implements the exact fix he recommended. His unpaid community review work is exactly the kind of standards-enforcement this repo needs given the maintainer bandwidth constraints.

Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)

Legal-compliant resubmission of PR openai#1193 per @MatoTeziTanka review on 2026-04-11.

Changes vs original PR openai#1193:
- ttt_adapt() function signature takes train_slice_tokens instead of val_tokens
- Call site loads TTT data from the tail of the last fineweb_train_*.bin shard
  (training data slice, never scored during eval)
- All val_tokens references in TTT removed; val is only scored in the final
  single-pass evaluation after training + TTT finish
- README documents the legality argument and references PR openai#1416 / openai#1423

Universal Transformer architecture itself is unchanged:
- Single shared block looped N times with per-iteration scale/shift/resid_mix
- 50% sparse-to-dense curriculum
- Implements OpenAI's requested 'Universal transformer' research direction

Final val_bpb pending DGX Spark run completion.
Thanks to @MatoTeziTanka for the careful review via The Agora
(https://matotezitanka.github.io/parameter-golf/).
Spark run completed on 2026-04-11. Key numbers:
- Model params: 4,546,568
- Pre-quant val_bpb (step 200): 3.2483
- Post-TTT int6 roundtrip val_bpb: 3.4446
- TTT source: fineweb_train_000079.bin (last training shard tail)
- TTT tokens: 131,073 (training data, NOT val_tokens)
- TTT config: 3 epochs AdamW lr=0.0005
- TTT loss curve: 6.15 -> 5.89 -> 5.79
- Artifact size: 1.35 MB (int6+brotli-11)

The legal TTT pattern works: TTT gradient came entirely from training
data, val tokens were scored exactly once after all training finished.
The explicit log line 'ttt:legal_slice source=fineweb_train_000079.bin
tokens=131073 (train data slice, not val_tokens)' confirms the fix.
@dentity007
Author

Spark run completed. Final numbers pushed in commit 8896c0a.

Results

| Metric | Value |
| --- | --- |
| Model params | 4,546,568 |
| Pre-quant val_bpb (step 200) | 3.2483 |
| Post-TTT int6 roundtrip val_bpb | 3.4446 |
| TTT source | fineweb_train_000079.bin (tail of last training shard) |
| TTT tokens | 131,073 (training data, NOT val_tokens) |
| TTT config | 3 epochs, AdamW, lr=0.0005 |
| TTT loss curve | 6.15 -> 5.89 -> 5.79 |
| TTT duration | 13.6 seconds |
| Artifact size | 1.35 MB (int6 + brotli-11) |
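For context on the "int6 roundtrip" metric, here is a sketch of a symmetric per-tensor int6 quantize/dequantize pass. The actual quantization scheme in train_gpt.py is not shown in this PR, so the details here (symmetric, per-tensor, 64 levels) are assumptions.

```python
import numpy as np

def int6_roundtrip(w):
    # Quantize to 64 signed levels and reconstruct; the gap between
    # pre-quant and roundtrip val_bpb is the cost of this lossy step.
    levels = 2 ** 6                              # 64 int6 levels
    scale = np.abs(w).max() / (levels / 2 - 1)   # map max magnitude to 31
    q = np.clip(np.round(w / scale), -(levels // 2), levels // 2 - 1)
    return q.astype(np.int8), q * scale          # int6 codes (held in int8), reconstruction

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
codes, w_hat = int6_roundtrip(w)
max_err = float(np.abs(w - w_hat).max())         # bounded by scale / 2
```

In the real pipeline the int6 codes are what gets brotli-compressed into the 1.35 MB artifact; val_bpb is then scored with the reconstructed weights.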

Legality Verification

The training log explicitly confirms the fix:

ttt:legal_slice source=fineweb_train_000079.bin tokens=131073 (train data slice, not val_tokens)
ttt:start lr=0.0005 momentum=0.9 epochs=3 freeze_blocks=0
ttt_epoch:1/3 loss:6.1531 time:4.1s
ttt_epoch:2/3 loss:5.8870 time:8.5s
ttt_epoch:3/3 loss:5.7944 time:13.0s
ttt:done elapsed=13.6s

The source log line proves the TTT gradient came entirely from training data. val_tokens were scored exactly once, after all training (including TTT) finished.
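The loop behind that log can be sketched as below. The model, data, and optimizer are toy stand-ins (least squares with plain gradient descent rather than AdamW on the transformer); only the multi-epoch shape and the log-line format follow the PR.

```python
import time
import numpy as np

# Toy multi-epoch TTT loop: full-batch gradient steps on a fixed slice,
# logging one line per epoch in the same format as the training log.
rng = np.random.default_rng(0)
w = np.zeros(4)
X = rng.normal(size=(256, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=256)

lr, epochs = 0.05, 3
losses = []
t0 = time.time()
for epoch in range(1, epochs + 1):
    pred = X @ w
    loss = float(np.mean((pred - y) ** 2))
    grad = 2.0 * X.T @ (pred - y) / len(y)   # gradient of the MSE loss
    w -= lr * grad                           # real code: AdamW step
    losses.append(loss)
    print(f"ttt_epoch:{epoch}/{epochs} loss:{loss:.4f} time:{time.time() - t0:.1f}s")
```

A gently decreasing per-epoch loss, as in the 6.15 -> 5.89 -> 5.79 curve above, is what one expects from a short fine-tuning phase on in-distribution training data.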

Notes on BPB Numbers

The 3.4446 number is high relative to competition runs because:

  1. DGX Spark GB10 is a single GPU, not 8xH100 (roughly 6x slower per step)
  2. Only 200 training steps instead of 5000+
  3. No torch.compile (Triton unsupported on aarch64)
  4. Universal Transformer with 6 shared-block iterations, 4.5M params

This is a research/non-record submission meant to document the architecture direction, not a competitive record attempt. The legality story holds across hardware, and the relative ordering between UT-1 (6 iters) and UT-2 (24 iters) in my broader ablation confirmed that more iterations do not help.

Full ablation data across all 7 OpenAI-requested research architectures: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka

Community Review — Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)

Compliance: LOOKS CLEAN — TTT trains on a held-out training-data slice, never touches val_tokens

What I found in the code (head SHA from PR #1554, file records/track_non_record_16mb/2026-04-11_UniversalTransformer_LegalTTT/train_gpt.py):

The ttt_adapt() function (line 1198) takes train_slice_tokens — a tensor loaded from the tail of the last fineweb_train_*.bin shard (lines 1778–1787). This slice:

  • Was not used during main training (training only consumes the prefix up to iterations steps)
  • Is never part of fineweb_val_*.bin
  • Is never scored as part of val_bpb

The TTT loop (lines 1244–1274) runs multi-epoch AdamW on this training slice. Validation tokens are scored exactly once, after all training (including TTT) is complete, at line 1862 via eval_val_sliding() under torch.inference_mode().

This is cleanly legal. The TTT gradient path never touches val_tokens. The val_bpb number comes from a single forward-only scoring pass on data the model was never adapted on. The Issue #402 / #677 rules are satisfied because val_tokens are never the subject of loss.backward() at any point.

One citation correction: the PR description and docstring reference PRs #1416 and #1423 as "legal Pre-Quant TTT" implementations. I need to correct my own earlier review on #1193 — I cited those same PRs as legal references, and I was wrong. At their current heads, #1416 and #1423 contain the ILLEGAL flat-epoch ttt_adapt_adamw on val_tokens (despite folder names saying "LegalTTT"). The confirmed legal TTT reference is PR #1413 (dexhunter), which uses the score-first-per-chunk pattern. I posted a correction on Issue #677 on 2026-04-11.

That said — your implementation doesn't need the score-first-per-chunk pattern at all, because you're not training on val_tokens in the first place. Training on a held-out training slice and then scoring val_tokens is just... regular training with a second fine-tuning phase. It's structurally the cleanest approach.

tl;dr: My #1193 review gave you the right implementation advice (move TTT off val_tokens) but cited the wrong precedent PRs. You got the code right anyway. Apologies for the misleading citation — I've since corrected it.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual non-record checks. No compliance flags. This is a clean legal resubmission of the Universal Transformer architecture from #1193.


Reviewed by @MatoTeziTanka via The Agora. Manual code review (not auto-classified). Special attention given because this is a resubmission responding to our prior compliance flag on #1193 — wanted to make sure we didn't lead the author down a bad path. We didn't — the implementation is correct, just the citation needs updating.

