
Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)#1554

Open
dentity007 wants to merge 2 commits into openai:main from NathanMaine:research/universal-transformer-legal-ttt

Conversation

@dentity007

Legal resubmission of PR #1193 per @MatoTeziTanka review

Status: Non-record research submission

Hardware: NVIDIA DGX Spark GB10 (single GPU, aarch64)

val_bpb: TBD (Spark run in progress, will update this PR when complete)

Background

PR #1193 (original Universal Transformer submission) was flagged on 2026-04-11 by @MatoTeziTanka for using an illegal TTT pattern. His review was thorough: the ttt_adapt() function trained multi-epoch on val_tokens without score-first discipline, matching the same pattern that closed PR #1376 and the rest of the Pre-Quant TTT cluster.

His recommendation was clear: resubmit with the TTT function taking a training-data slice instead of val_tokens, per the PR #1416 / PR #1423 reference implementations. This PR does exactly that.

PR #1193 remains open with the honest no-TTT numbers (val_bpb 3.2483 from my clean Spark ablation) and a full acknowledgment of the flag. This PR is the proper legal version with TTT enabled.

What Changed

  1. TTT function signature. ttt_adapt() now takes train_slice_tokens instead of val_tokens. The parameter name documents the intent explicitly.

  2. Training slice source. Before the TTT call, the submission loads a fixed window from the tail of the last fineweb_train_*.bin shard. This slice was not used during main training (we only train on the prefix of each shard up to iterations steps) and is never part of fineweb_val_*.bin. No val_tokens touch the TTT gradient path at any point.

  3. Evaluation unchanged. val_tokens are scored exactly once, in a single left-to-right pass, after all training (including TTT) has finished. TTT changes the model weights but not the scoring procedure: val tokens never contribute to any gradient.
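A minimal sketch of change (2), assuming an in-memory stand-in for the shard and a stub `ttt_adapt`; the real train_gpt.py memory-maps fineweb_train_*.bin and runs actual gradient steps, and the constants here just mirror the env vars from the reproduction command.

```python
import numpy as np

# Hedged sketch of the training-slice fix. The real code reads the tail of
# the last fineweb_train_*.bin shard from disk; here a fake in-memory shard
# stands in, and ttt_adapt is a hypothetical stub.

TTT_TRAIN_SLICE_SEQS = 128
SEQ_LEN = 64  # illustrative; the real sequence length is set elsewhere

# Stand-in for the last training shard (real code: fineweb_train_000079.bin).
last_train_shard = np.arange(100_000, dtype=np.uint16)
val_tokens = np.arange(100_000, 110_000, dtype=np.uint16)  # disjoint by construction

# Take a fixed window from the shard TAIL: never consumed by main training
# (which only reads the shard prefix) and never part of any val shard.
n_ttt_tokens = TTT_TRAIN_SLICE_SEQS * SEQ_LEN + 1
train_slice_tokens = last_train_shard[-n_ttt_tokens:]

def ttt_adapt(train_slice_tokens):
    # Hypothetical stub: the real function runs multi-epoch AdamW on this slice.
    return len(train_slice_tokens)

n_seen = ttt_adapt(train_slice_tokens)  # val_tokens never enter this call
```

The parameter name alone does not enforce anything; what matters is that the call site only ever passes the train-shard tail.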

Universal Transformer Architecture (unchanged from #1193)

  • Single shared transformer block looped N times
  • Per-iteration learnable parameters: attn_scale, mlp_scale, resid_mix, iteration_embed
  • 50 percent sparse-to-dense curriculum during training
  • Implements OpenAI's requested "Universal transformer" research direction from the README
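The shared-block loop above can be sketched as follows. This is a toy illustration: the "attention" and "MLP" sublayers are stand-in linear+tanh maps, not the real block from train_gpt.py, and how the per-iteration scalars (attn_scale, mlp_scale, resid_mix, iteration_embed, names from the PR description) are applied is a guess.

```python
import numpy as np

# Toy Universal Transformer loop: ONE set of shared weights applied
# n_iters times, plus small per-iteration learnable parameters.
rng = np.random.default_rng(0)
d, n_iters = 8, 6

W_attn = rng.normal(size=(d, d)) * 0.1   # shared across all iterations
W_mlp = rng.normal(size=(d, d)) * 0.1

attn_scale = np.ones(n_iters)            # per-iteration learnable scalars
mlp_scale = np.ones(n_iters)
resid_mix = np.full(n_iters, 0.5)
iteration_embed = rng.normal(size=(n_iters, d)) * 0.01  # tells the block which pass it is

def universal_forward(x):
    for i in range(n_iters):
        h = x + iteration_embed[i]
        h = h + attn_scale[i] * np.tanh(h @ W_attn)       # stand-in "attention"
        h = h + mlp_scale[i] * np.tanh(h @ W_mlp)         # stand-in "MLP"
        x = resid_mix[i] * x + (1.0 - resid_mix[i]) * h   # residual mixing
    return x

x = rng.normal(size=(4, d))
y = universal_forward(x)
```

The parameter-efficiency argument is visible in the sketch: depth comes from the loop count, not from stacking distinct blocks.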

Legality Argument

Issue #402 and Issue #677 rulings define illegal TTT as any training pass that updates model state based on val_tokens the model has not already been tested on. This submission satisfies the rules because:

  1. The TTT gradient comes entirely from training-set tokens (a tail slice of the last fineweb_train shard)
  2. Those training tokens are never scored as part of val_bpb
  3. val_tokens are scored exactly once, after all training (including TTT) is complete
  4. No eval-time leakage of val targets into training loss

The argument is structurally identical to the reference PRs #1416 and #1423 cited by @MatoTeziTanka in his review of #1193.
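The four points above reduce to an ordering invariant, sketched here with hypothetical stand-in functions (none of these names are from train_gpt.py):

```python
# Ordering invariant behind the legality argument: val_tokens appear exactly
# once, only in eval, and only after all training (including TTT) finishes.
events = []

def train_main():
    events.append("train:fineweb_train")      # gradients from training shards

def ttt_adapt_slice():
    events.append("ttt:train_slice")          # gradients from train-shard tail

def score_val_once():
    events.append("eval:val_tokens")          # single forward-only pass

train_main()
ttt_adapt_slice()
score_val_once()

val_events = [e for e in events if "val_tokens" in e]
```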

Reproduction

pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
VOCAB_SIZE=1024 NUM_ITERS=6 TORCH_COMPILE_DISABLE=1 ITERATIONS=200 \
  TTT_ENABLED=1 TTT_EPOCHS=3 TTT_TRAIN_SLICE_SEQS=128 TTT_LR=0.0005 \
  python3 records/track_non_record_16mb/2026-04-11_UniversalTransformer_LegalTTT/train_gpt.py

Hardware Notes

DGX Spark GB10 is approximately 6x slower per step than 8xH100. No torch.compile (Triton/inductor unsupported on aarch64), no flash_attn_interface (using scaled_dot_product_attention fallback). Absolute BPB will be higher than a competition 8xH100 run due to the short 200-step training budget, but the legality story holds across hardware.

Review Credit

@MatoTeziTanka flagged the original #1193 TTT-on-val issue via The Agora community compliance tracker (https://matotezitanka.github.io/parameter-golf/). This resubmission implements the exact fix he recommended. His unpaid community review work is exactly the kind of standards-enforcement this repo needs given the maintainer bandwidth constraints.

Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)

Legal-compliant resubmission of PR openai#1193 per @MatoTeziTanka review on 2026-04-11.

Changes vs original PR openai#1193:
- ttt_adapt() function signature takes train_slice_tokens instead of val_tokens
- Call site loads TTT data from the tail of the last fineweb_train_*.bin shard
  (training data slice, never scored during eval)
- All val_tokens references in TTT removed; val is only scored in the final
  single-pass evaluation after training + TTT finish
- README documents the legality argument and references PR openai#1416 / openai#1423

Universal Transformer architecture itself is unchanged:
- Single shared block looped N times with per-iteration scale/shift/resid_mix
- 50% sparse-to-dense curriculum
- Implements OpenAI's requested 'Universal transformer' research direction

Final val_bpb pending DGX Spark run completion.
Thanks to @MatoTeziTanka for the careful review via The Agora
(https://matotezitanka.github.io/parameter-golf/).
Spark run completed on 2026-04-11. Key numbers:
- Model params: 4,546,568
- Pre-quant val_bpb (step 200): 3.2483
- Post-TTT int6 roundtrip val_bpb: 3.4446
- TTT source: fineweb_train_000079.bin (last training shard tail)
- TTT tokens: 131,073 (training data, NOT val_tokens)
- TTT config: 3 epochs AdamW lr=0.0005
- TTT loss curve: 6.15 -> 5.89 -> 5.79
- Artifact size: 1.35 MB (int6+brotli-11)

The legal TTT pattern works: TTT gradient came entirely from training
data, val tokens were scored exactly once after all training finished.
The explicit log line 'ttt:legal_slice source=fineweb_train_000079.bin
tokens=131073 (train data slice, not val_tokens)' confirms the fix.
@dentity007
Author

Spark run completed. Final numbers pushed in commit 8896c0a.

Results

| Metric | Value |
| --- | --- |
| Model params | 4,546,568 |
| Pre-quant val_bpb (step 200) | 3.2483 |
| Post-TTT int6 roundtrip val_bpb | 3.4446 |
| TTT source | fineweb_train_000079.bin (tail of last training shard) |
| TTT tokens | 131,073 (training data, NOT val_tokens) |
| TTT config | 3 epochs, AdamW, lr=0.0005 |
| TTT loss curve | 6.15 -> 5.89 -> 5.79 |
| TTT duration | 13.6 seconds |
| Artifact size | 1.35 MB (int6 + brotli-11) |
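For context on the "int6 roundtrip" metric, here is a sketch of a symmetric per-tensor int6 quantize/dequantize pass. The actual quantization scheme in train_gpt.py is not shown in this PR, so the details here (symmetric, per-tensor, 64 levels) are assumptions.

```python
import numpy as np

def int6_roundtrip(w):
    # Quantize to 64 signed levels and reconstruct; the gap between
    # pre-quant and roundtrip val_bpb is the cost of this lossy step.
    levels = 2 ** 6                              # 64 int6 levels
    scale = np.abs(w).max() / (levels / 2 - 1)   # map max magnitude to 31
    q = np.clip(np.round(w / scale), -(levels // 2), levels // 2 - 1)
    return q.astype(np.int8), q * scale          # int6 codes (held in int8), reconstruction

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
codes, w_hat = int6_roundtrip(w)
max_err = float(np.abs(w - w_hat).max())         # bounded by scale / 2
```

In the real pipeline the int6 codes are what gets brotli-compressed into the 1.35 MB artifact; val_bpb is then scored with the reconstructed weights.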

Legality Verification

The training log explicitly confirms the fix:

ttt:legal_slice source=fineweb_train_000079.bin tokens=131073 (train data slice, not val_tokens)
ttt:start lr=0.0005 momentum=0.9 epochs=3 freeze_blocks=0
ttt_epoch:1/3 loss:6.1531 time:4.1s
ttt_epoch:2/3 loss:5.8870 time:8.5s
ttt_epoch:3/3 loss:5.7944 time:13.0s
ttt:done elapsed=13.6s

The source log line proves the TTT gradient came entirely from training data. val_tokens were scored exactly once, after all training (including TTT) finished.
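The loop behind that log can be sketched as below. The model, data, and optimizer are toy stand-ins (least squares with plain gradient descent rather than AdamW on the transformer); only the multi-epoch shape and the log-line format follow the PR.

```python
import time
import numpy as np

# Toy multi-epoch TTT loop: full-batch gradient steps on a fixed slice,
# logging one line per epoch in the same format as the training log.
rng = np.random.default_rng(0)
w = np.zeros(4)
X = rng.normal(size=(256, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=256)

lr, epochs = 0.05, 3
losses = []
t0 = time.time()
for epoch in range(1, epochs + 1):
    pred = X @ w
    loss = float(np.mean((pred - y) ** 2))
    grad = 2.0 * X.T @ (pred - y) / len(y)   # gradient of the MSE loss
    w -= lr * grad                           # real code: AdamW step
    losses.append(loss)
    print(f"ttt_epoch:{epoch}/{epochs} loss:{loss:.4f} time:{time.time() - t0:.1f}s")
```

A gently decreasing per-epoch loss, as in the 6.15 -> 5.89 -> 5.79 curve above, is what one expects from a short fine-tuning phase on in-distribution training data.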

Notes on BPB Numbers

The 3.4446 number is high relative to competition runs because:

  1. DGX Spark GB10 is a single GPU, not 8xH100 (roughly 6x slower per step)
  2. Only 200 training steps instead of 5000+
  3. No torch.compile (Triton unsupported on aarch64)
  4. Universal Transformer with 6 shared-block iterations, 4.5M params

This is a research/non-record submission meant to document the architecture direction, not a competitive record attempt. The legality story holds across hardware, and the relative ordering between UT-1 (6 iters) and UT-2 (24 iters) in my broader ablation confirmed that more iterations do not help.

Full ablation data across all 7 OpenAI-requested research architectures: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka

Community Review — Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)

Compliance: LOOKS CLEAN — TTT trains on a held-out training-data slice, never touches val_tokens

What I found in the code (head SHA from PR #1554, file records/track_non_record_16mb/2026-04-11_UniversalTransformer_LegalTTT/train_gpt.py):

The ttt_adapt() function (line 1198) takes train_slice_tokens — a tensor loaded from the tail of the last fineweb_train_*.bin shard (lines 1778–1787). This slice:

  • Was not used during main training (training only consumes the prefix up to iterations steps)
  • Is never part of fineweb_val_*.bin
  • Is never scored as part of val_bpb

The TTT loop (lines 1244–1274) runs multi-epoch AdamW on this training slice. Validation tokens are scored exactly once, after all training (including TTT) is complete, at line 1862 via eval_val_sliding() under torch.inference_mode().

This is cleanly legal. The TTT gradient path never touches val_tokens. The val_bpb number comes from a single forward-only scoring pass on data the model was never adapted on. The Issue #402 / #677 rules are satisfied because val_tokens are never the subject of loss.backward() at any point.

One citation correction: the PR description and docstring reference PRs #1416 and #1423 as "legal Pre-Quant TTT" implementations. I need to correct my own earlier review on #1193 — I cited those same PRs as legal references, and I was wrong. At their current heads, #1416 and #1423 contain the ILLEGAL flat-epoch ttt_adapt_adamw on val_tokens (despite folder names saying "LegalTTT"). The confirmed legal TTT reference is PR #1413 (dexhunter), which uses the score-first-per-chunk pattern. I posted a correction on Issue #677 on 2026-04-11.

That said — your implementation doesn't need the score-first-per-chunk pattern at all, because you're not training on val_tokens in the first place. Training on a held-out training slice and then scoring val_tokens is just... regular training with a second fine-tuning phase. It's structurally the cleanest approach.

tl;dr: My #1193 review gave you the right implementation advice (move TTT off val_tokens) but cited the wrong precedent PRs. You got the code right anyway. Apologies for the misleading citation — I've since corrected it.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual non-record checks. No compliance flags. This is a clean legal resubmission of the Universal Transformer architecture from #1193.


Reviewed by @MatoTeziTanka via The Agora. Manual code review (not auto-classified). Special attention given because this is a resubmission responding to our prior compliance flag on #1193 — wanted to make sure we didn't lead the author down a bad path. We didn't — the implementation is correct, just the citation needs updating.

