Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds) #713
hypery11 wants to merge 1 commit into openai:main
Conversation
3-seed validation: 1.1160 / 1.1210 / 1.1170 (std 0.0026). Per-document rank-8 LoRA on Q/V/LM-head, batch 64, 3 epochs. 15.75 MB artifact. Train 600 s, eval 496 s.
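The reported mean and standard deviation can be checked directly from the three seed values (a quick sanity check, not the repo's evaluation code):

```python
import statistics

seed_bpb = [1.1160, 1.1210, 1.1170]
mean = statistics.mean(seed_bpb)   # 1.1180, matching the title
std = statistics.stdev(seed_bpb)   # sample std ~= 0.0026, matching the summary
```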
Hi @hypery11 — interesting LoRA TTT approach with per-document batching. I wanted to flag a potential score-first compliance concern. Looking at:

```python
for epoch in range(ttt_epochs):  # 3 epochs
    for ci in range(max_chunks):
        ...
        if epoch == ttt_epochs - 1:  # score only on epoch 3
            # accumulate loss_sum
        if needs_train:  # train on non-last chunks
            loss.backward()
            cur_opt.step()
```

This means that when scoring on epoch 3, the LoRA weights have already been trained on the full document for 2 complete epochs. A token at position t in the document is scored using LoRA weights that were adapted on tokens including t itself (from epochs 1 and 2).

The README rule is: "you are only allowed to test-time train on validation set tokens you've already evaluated your model on." In the standard score-first TTT pattern (PR #461/#549/#726), each chunk is scored BEFORE the model trains on it, and that score is final — no re-scoring after training. Here, scoring happens after training, which appears to be the adapt-then-score pattern that PR #518 was closed for: @valerio-oai closed it because it "trains on the validation set by reporting the score on a doc after its weights have adapted to it."

Would you be able to clarify how this differs from the adapt-then-score pattern?
Results
Method
10-layer transformer (512d, 8/4 GQA, 3x MLP LeakyReLU(0.5)^2) with per-document batched LoRA test-time training.
LoRA rank-8 on Q/V projections + LM head. 64 documents batched in parallel. Per-doc reset, Adam lr=0.01, 256-token chunks, 3 epochs, score on final epoch. Mixed int5/int6 quantization + zstd-22.
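The per-document batched LoRA setup can be sketched as follows. This is an illustration with NumPy, not the PR's implementation: a frozen 512-dim linear layer plus a separate rank-8 A/B factor pair for each of the 64 batched documents, with B initialized to zero so the adapter is a no-op at per-doc reset.

```python
import numpy as np

rank, d_model, n_docs = 8, 512, 64
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)  # frozen base weight
A = rng.standard_normal((n_docs, rank, d_model)) * 0.01          # per-document LoRA A
B = np.zeros((n_docs, d_model, rank))                            # per-document LoRA B, zero at reset


def lora_forward(x):
    """x: (n_docs, seq, d_model) -> frozen base output + per-doc low-rank update."""
    base = x @ W.T
    delta = np.einsum('dsk,drk->dsr', x, A)      # project each doc's activations to rank r
    delta = np.einsum('dsr,dmr->dsm', delta, B)  # map back to d_model with that doc's B
    return base + delta


x = rng.standard_normal((n_docs, 16, d_model))
y = lora_forward(x)
# With B zero-initialized, y equals the frozen base output exactly,
# so resetting A/B per document restores the unadapted model.
```

Only A and B (64 × rank-8 factors) would be trained at test time; the base W stays frozen, which is what keeps the shipped artifact small.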
See README.md for full details.