Summary
Several top-ranked "neural-only" submissions appear to use pre-eval test-time training (TTT), that is, training on validation tokens first and then scoring them, rather than the score-first TTT approach validated in PRs #461 and #549.
Issue #402 established that pre-eval TTT is equivalent to training on the validation data and is not permitted. Issue #677 further clarified that multi-epoch TTT where the final-epoch score is reported is invalid.
Specific concern: PR #672 (1.0781 BPB, 3-seed)
PR #672's code (lines ~1470-1535 in train_gpt.py) implements TTT as:
    # Train on ALL val tokens for 30 epochs
    for ep in range(ttt_epochs):  # ttt_epochs=30
        for bs in range(rank_start, rank_end - seq_len, ttt_batch * seq_len):
            local = val_tokens[bs:be].to(device=device, dtype=torch.int64)
            x = local[:n * seq_len].reshape(n, seq_len)
            y = local[1:n * seq_len + 1].reshape(n, seq_len)
            loss = eval_model(x, y)
            loss.backward()
            optimizer.step()
    # THEN score with sliding window (every token scored by model trained ON those tokens)
    sw_val_loss, sw_val_bpb = eval_val_sliding(...)

This trains on the entire validation set for 30 epochs, then scores ALL tokens with the adapted model. Every scored token was seen during training. This is train-then-score, not score-then-train.
Compare with the legal score-first approach (PR #549/SOTA):
    for each chunk:
        Phase 1: SCORE chunk under inference_mode()   # no gradients, no weight changes
        Phase 2: TRAIN on already-scored chunk        # legal: tokens already graded

Why this matters
The difference between these approaches is ~0.04 BPB (PR #672 at 1.078 vs score-first PRs at ~1.12). Pre-eval TTT accounts for almost all of the top "neural-only" leaderboard gains.
If pre-eval TTT is legal, the competition landscape changes fundamentally — everyone should switch to it. If it's illegal (per #402/#677), several top submissions need review.
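To make the ordering difference concrete, here is a minimal, self-contained sketch. It is not taken from any PR: a toy count-based model stands in for the neural network, and `UnigramModel`, `score_first`, and `train_then_score` are hypothetical names used only for illustration. Under these assumptions it shows that adapting on the evaluation tokens before scoring them yields a lower per-token loss than scoring each chunk before adapting on it, even though the model and data are identical.

```python
import math
from collections import Counter


class UnigramModel:
    """Toy adaptive model (Laplace-smoothed counts); a stand-in for a neural LM."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll_bits(self, tokens):
        # Sum of -log2 p(token) under the current counts; does not change the model.
        return sum(
            -math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )

    def train(self, tokens):
        # "Adaptation" step: update the counts with the given tokens.
        self.counts.update(tokens)
        self.total += len(tokens)


def score_first(tokens, vocab_size, chunk_len):
    """Score each chunk before the model is allowed to adapt on it."""
    model = UnigramModel(vocab_size)
    bits = 0.0
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        bits += model.nll_bits(chunk)  # Phase 1: chunk is still unseen here
        model.train(chunk)             # Phase 2: adapt on already-scored tokens
    return bits / len(tokens)


def train_then_score(tokens, vocab_size):
    """Adapt on all evaluation tokens first, then score those same tokens."""
    model = UnigramModel(vocab_size)
    model.train(tokens)
    return model.nll_bits(tokens) / len(tokens)


if __name__ == "__main__":
    data = list(b"abracadabra " * 20)
    print("score-first      bits/token:", score_first(data, 256, 16))
    print("train-then-score bits/token:", train_then_score(data, 256))
```

In the score-first loop every token is graded by a model state that has never seen it, so the reported number reflects prediction. In the train-then-score variant every graded token has already contributed to the model, which is the pattern questioned in this issue.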
Affected PRs (may need review)
- PR #672, "Record: 30ep Cosine TTT on LeakyReLU² stack (3-seed mean val_bpb=1.0781)": explicitly uses 30-epoch train-then-score
- Potentially others in the sub-1.10 range using similar TTT variants
Request
Could @valerio-oai or @0hq clarify whether this style of TTT (train on full val set, then score) is considered legal under the current rules? The existing rulings in #402 and #677 seem to prohibit it, but PR #672 has been open since March 25 without being flagged.
Not trying to get anyone's work invalidated — just want clarity so everyone competes under the same rules.