Legality check: Pre-eval TTT (train-then-score) in top neural submissions #1082

## Summary

Several top-ranked "neural-only" submissions appear to use **pre-eval TTT** — training on validation tokens first, then scoring them — rather than the **score-first TTT** approach validated in PRs #461 and #549.

Issue #402 established that pre-eval TTT is equivalent to training on the validation data and is not permitted. Issue #677 further clarified that multi-epoch TTT where the final-epoch score is reported is invalid.

## Specific concern: PR #672 (1.0781 BPB, 3-seed)

PR #672's code (lines ~1470-1535 in `train_gpt.py`) implements TTT as:

```python
# Train on ALL val tokens for 30 epochs
for ep in range(ttt_epochs):  # ttt_epochs=30
    for bs in range(rank_start, rank_end - seq_len, ttt_batch * seq_len):
        # (excerpt: the lines that define `be` and `n` are not shown here)
        local = val_tokens[bs:be].to(device=device, dtype=torch.int64)
        x = local[:n * seq_len].reshape(n, seq_len)       # inputs
        y = local[1:n * seq_len + 1].reshape(n, seq_len)  # targets, shifted by one token
        loss = eval_model(x, y)
        loss.backward()
        optimizer.step()

# THEN score with sliding window (every token scored by model trained ON those tokens)
sw_val_loss, sw_val_bpb = eval_val_sliding(...)
```

This trains on the entire validation set for 30 epochs, then scores ALL tokens with the adapted model. Every scored token was seen during training. This is **train-then-score**, not **score-then-train**.
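One way to make the ordering problem checkable is to log which validation positions each train or score call touches and then look for positions that were trained on before being scored. The sketch below is only an illustration of that idea: the event-log format and the name `find_leaked_tokens` are hypothetical and do not come from any PR or from the competition harness.

```python
def find_leaked_tokens(events):
    """Return validation positions that were trained on before being scored.

    `events` is a hypothetical log of (op, start, end) tuples in execution
    order, where op is "train" or "score" and `end` is exclusive.
    """
    trained = set()
    leaked = set()
    for op, start, end in events:
        span = range(start, end)
        if op == "train":
            trained.update(span)
        elif op == "score":
            # A scored position is leaked if the model already trained on it.
            leaked.update(p for p in span if p in trained)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return sorted(leaked)


# Train-then-score: 30 passes over positions 0..7, then one scoring pass.
pre_eval_log = [("train", 0, 8)] * 30 + [("score", 0, 8)]

# Score-first: each chunk is scored, then trained on.
score_first_log = [
    ("score", 0, 4), ("train", 0, 4),
    ("score", 4, 8), ("train", 4, 8),
]

print(find_leaked_tokens(pre_eval_log))     # every position is leaked
print(find_leaked_tokens(score_first_log))  # nothing is leaked
```

Under this check, the pattern described above flags every validation position, while a chunk-by-chunk score-then-train loop flags none.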

Compare with the legal score-first approach (PR #549/SOTA):

```python
for chunk in val_chunks:  # pseudocode; score() and train() are placeholders
    score(chunk)  # Phase 1: SCORE chunk under inference_mode(); no gradients, no weight changes
    train(chunk)  # Phase 2: TRAIN on already-scored chunk; legal, tokens already graded
```
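To see why the ordering matters, here is a minimal, self-contained sketch that runs both orderings on a toy model. It assumes a Laplace-smoothed unigram counter as a stand-in for the real network (so "training" is just a count update), and all names (`UnigramModel`, `score_first`, `train_then_score`) are hypothetical. It is not the code from PR #549 or PR #672.

```python
import math
from collections import Counter


class UnigramModel:
    """Toy stand-in for the real network: Laplace-smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def bits(self, token):
        # Negative log2-probability of one token under the current counts.
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log2(p)

    def train(self, tokens):
        # "Training" here is just a count update.
        self.counts.update(tokens)
        self.total += len(tokens)


def score_first(tokens, vocab_size, chunk_size):
    """Score each chunk with the current model, then adapt on that chunk."""
    model = UnigramModel(vocab_size)
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_bits += sum(model.bits(t) for t in chunk)  # Phase 1: score
        model.train(chunk)                               # Phase 2: adapt
    return total_bits / len(tokens)


def train_then_score(tokens, vocab_size):
    """Adapt on every token first, then score the same tokens."""
    model = UnigramModel(vocab_size)
    model.train(tokens)
    return sum(model.bits(t) for t in tokens) / len(tokens)


tokens = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 3, 0] * 4
legal = score_first(tokens, vocab_size=4, chunk_size=4)
leaky = train_then_score(tokens, vocab_size=4)
print(f"score-first: {legal:.3f} bits/token, train-then-score: {leaky:.3f} bits/token")
```

On this toy sequence the train-then-score number comes out lower, purely because the model has already seen the tokens it is being graded on. The size of the effect here says nothing about the size of the effect in the real submissions.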

## Why this matters

The gap between the two approaches is roughly 0.04 BPB (PR #672 at 1.0781 vs. score-first PRs at ~1.12). If that gap comes from pre-eval TTT, as it appears to, then pre-eval TTT accounts for almost all of the top "neural-only" leaderboard gains.

If pre-eval TTT is legal, the competition landscape changes fundamentally — everyone should switch to it. If it's illegal (per #402/#677), several top submissions need review.

## Affected PRs (may need review)

- **PR #672** (1.0781) — explicitly uses 30-epoch train-then-score
- Potentially others in the sub-1.10 range using similar TTT variants

## Request

Could @valerio-oai or @0hq clarify whether this style of TTT (train on full val set, then score) is considered legal under the current rules? The existing rulings in #402 and #677 seem to prohibit it, but PR #672 has been open since March 25 without being flagged.

Not trying to get anyone's work invalidated — just want clarity so everyone competes under the same rules.
