
Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)#713

Open
hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-25_10L_LoRA_TTT_Record

Conversation

@hypery11

Results

| Seed | Base val_bpb | TTT val_bpb |
| ---- | ------------ | ----------- |
| 42   | 1.1476       | 1.1160      |
| 1337 | 1.1540       | 1.1210      |
| 2024 | 1.1504       | 1.1170      |
| Mean | 1.1507       | 1.1180      |
| Std  | 0.0032       | 0.0026      |
- Artifact: 15.75 MB
- Train: 600 s on 8xH100 SXM
- TTT eval: ~496 s

Method

10-layer transformer (512d, 8/4 GQA, 3x MLP LeakyReLU(0.5)^2) with per-document batched LoRA test-time training.
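As a sanity check on the notation, here is a minimal sketch of the LeakyReLU(0.5)^2 activation, assuming it means the leaky-ReLU output squared (the usual reading of ReLU^2-style activations; the PR may define it differently):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared (elementwise)."""
    y = x if x >= 0 else slope * x
    return y * y

# positive inputs behave like ReLU^2; negative inputs are damped, then squared
assert leaky_relu_sq(2.0) == 4.0
assert leaky_relu_sq(-2.0) == 1.0   # (0.5 * -2)^2
```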

Rank-8 LoRA on the Q/V projections and the LM head. 64 documents are batched in parallel, with per-document adapter reset. Adam at lr=0.01, 256-token chunks, 3 epochs, scored on the final epoch. Mixed int5/int6 quantization + zstd level 22.
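Not the PR's code, but a minimal numpy sketch of the rank-8 LoRA forward path described above (names, init scales, and the `alpha` scaling are illustrative assumptions):

```python
import numpy as np

d, r = 512, 8
rng = np.random.default_rng(0)

W = rng.normal(0, 0.02, size=(d, d))   # frozen base projection (e.g. Q or V)
A = rng.normal(0, 0.02, size=(r, d))   # trainable LoRA "down" matrix
B = np.zeros((d, r))                   # trainable LoRA "up" matrix, zero-init

def lora_forward(x, W, A, B, alpha=2.0):
    # base path plus low-rank update; only A and B train at test time
    return x @ W.T + alpha * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))
# zero-init B makes the adapter a no-op, so TTT starts from the base model
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

# after any update, the weight delta alpha * B @ A has rank at most r = 8
B_trained = rng.normal(0, 0.02, size=(d, r))
assert np.linalg.matrix_rank(B_trained @ A) <= r
```

Per-document batching here would simply stack one (A, B) pair per document and run the 64 adapted forward passes in parallel.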

See README.md for full details.

3-seed validation: 1.1160 / 1.1210 / 1.1170 (std 0.0026)
Per-document rank-8 LoRA on Q/V/LM-head, batch-64, 3 epochs.
15.75MB artifact. Train 600s, eval 496s.
@dexhunter

Hi @hypery11 — interesting LoRA TTT approach with per-document batching.

I wanted to flag a potential score-first compliance concern. Looking at lora_ttt_eval() (line 1095), the scoring happens only on the final epoch:

for epoch in range(ttt_epochs):       # 3 epochs
    for ci in range(max_chunks):
        ...
        if epoch == ttt_epochs - 1:   # score only on epoch 3
            # accumulate loss_sum
        if needs_train:               # train on non-last chunks
            loss.backward()
            cur_opt.step()

This means when scoring on epoch 3, the LoRA weights have already been trained on the full document for 2 complete epochs. A token at position t in the document is scored using LoRA weights that were adapted on tokens including t itself (from epochs 1 and 2).

The README rule is: "you are only allowed to test-time train on validation set tokens you've already evaluated your model on."

In the standard score-first TTT pattern (PR #461/#549/#726), each chunk is scored BEFORE the model trains on it, and the score is final — no re-scoring after training. Here, scoring happens after training, which appears to be the adapt-then-score pattern that PR #518 was closed for.
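For contrast, the score-first pattern can be sketched as below; the `score` and `train` callables are hypothetical stand-ins, not the actual functions from those PRs:

```python
def score_first_ttt(chunks, score, train):
    """Chunked TTT where each chunk is scored BEFORE the model adapts to it."""
    total = 0.0
    for chunk in chunks:
        total += score(chunk)  # scored with weights untouched by this chunk
        train(chunk)           # only then train; the score is never revised
    return total

# toy demo: "training" shifts a bias that later chunks' scores can see,
# but each chunk's own score is taken before that shift
state = {"bias": 0.0}
def score(chunk):
    return chunk + state["bias"]
def train(chunk):
    state["bias"] += 1.0

total = score_first_ttt([1.0, 2.0, 3.0], score, train)
assert total == 1.0 + 3.0 + 5.0  # each score predates training on that chunk
```

The key property is the single pass: once a chunk is scored, training on it can only affect *later* chunks, never the score already recorded.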

For reference, PR #518 was closed by @valerio-oai because it "trains on the validation set by reporting the score on a doc after its weights have adapted to it."

Would you be able to clarify how this differs from the adapt-then-score pattern?

