Skip to content

Commit 4fb6969

Browse files
committed
Replace LoRA TTT with 30ep cosine full-model TTT in 16L XSA-all submission
Swap score-first LoRA TTT for the simpler and more effective cosine TTT approach from PR openai#672 (1.0781 BPB): fine-tune all model weights on val data for 30 epochs with cosine LR decay and per-layer LR groups (3x MLP-out, 0.5x MLP-in), followed by sliding-window stride=64 eval.
1 parent 5825338 commit 4fb6969

2 files changed

Lines changed: 103 additions & 239 deletions

File tree

records/track_10min_16mb/2026-03-25_16L_XSAall_GPTQ_EMA_PartialRoPE_TTT/submission.json

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
"name": "Bharath",
33
"github": "Bharath-970",
44
"val_bpb": null,
5-
"notes": "16L + XSA-all (all layers share single KV set from layer 0) + Int4 nibble MLP QAT + Int6 Attn + GPTQ-lite + EMA decay=0.999 + Partial RoPE 25% + Bigram20480 + Trigram10240 + LeakyReLU(0.5)^2 + Score-First TTT LoRA (rank=8) + warmdown5000. KV savings from XSA-all fund 2 extra layers vs XSA6. Pending training run on 8xH100.",
5+
"notes": "16L + XSA-all (all layers share single KV set from layer 0) + Int4 nibble MLP QAT + Int6 Attn + GPTQ-lite + EMA decay=0.999 + Partial RoPE 25% + Bigram20480 + Trigram10240 + LeakyReLU(0.5)^2 + Cosine TTT 30ep (full-model AdamW on val, per-layer LR: 3x MLP-out, 0.5x MLP-in) + sliding-window stride=64 eval + warmdown5000. KV savings from XSA-all fund 2 extra layers vs XSA6. Pending training run on 8xH100.",
66
"techniques": [
77
"int4_nibble_mlp",
88
"qat_ste",
@@ -13,8 +13,8 @@
1313
"bigram_hash_20480",
1414
"trigram_hash_10240",
1515
"leaky_relu_squared",
16-
"ttt_lora_rank8",
17-
"score_first_ttt",
16+
"cosine_ttt_30ep",
17+
"sliding_window_eval_stride64",
1818
"smeargate",
1919
"muon_weight_decay",
2020
"u_net_skip"

0 commit comments

Comments
 (0)