Record: 11L Muon Legal TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean) #1148

Open
aamodbhatt wants to merge 1 commit into openai:main from aamodbhatt:record-muon-ttt-entropy-adaptive-v2

@aamodbhatt

Summary

Two TTT changes on top of the SOTA base stack (PR #399 + PR #414 + PR #461): Muon-style Newton-Schulz orthogonalized updates replace SGD in the TTT loop, and entropy-adaptive epoch selection spends more adaptation epochs on harder (higher-NLL) chunks. The 3-seed mean val_bpb is 1.1179, which beats the current SOTA of 1.1194.
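For readers unfamiliar with the Muon-style update, here is a minimal pure-Python sketch of a Newton-Schulz orthogonalization step. It is an illustration under stated assumptions, not the code in this PR's train_gpt.py: the quintic coefficients are the ones used in the public Muon reference implementation, and the helper names (`transpose`, `matmul`, `newton_schulz`) are hypothetical. Only `steps=3` is taken from the method notes.

```python
import math

# Quintic coefficients from the public Muon reference implementation.
# ASSUMPTION: the PR's train_gpt.py may use different values.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    q_cols = transpose(q)
    return [[sum(x * y for x, y in zip(row, col)) for col in q_cols] for row in p]


def newton_schulz(grad, steps=3, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient given as a list of rows."""
    a, b, c = NS_COEFFS
    x = [list(row) for row in grad]
    tall = len(x) > len(x[0])
    if tall:
        # Work in the wide orientation so x @ x^T is the smaller Gram matrix.
        x = transpose(x)
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * g + c * g2 for g, g2 in zip(g_row, g2_row)]
                for g_row, g2_row in zip(gram, gram2)]
        poly_x = matmul(poly, x)
        # x <- a*x + (b*G + c*G^2) @ x, with G = x @ x^T
        x = [[a * xv + pv for xv, pv in zip(x_row, p_row)]
             for x_row, p_row in zip(x, poly_x)]
    return transpose(x) if tall else x


# Hypothetical usage: orthogonalize a raw gradient before the TTT step.
grad = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
update = newton_schulz(grad, steps=3)
```

The iteration pushes all singular values of the gradient toward 1 without computing an SVD, so the update direction is kept while its per-direction scale is roughly equalized. With only 3 steps the result is approximately, not exactly, orthogonal.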

Run Results (3 seeds)

| Seed | legal_ttt_exact val_bpb | legal_ttt_exact val_loss | pre-quant val_bpb | train time | eval time (TTT) | artifact size |
|---|---|---|---|---|---|---|
| 1337 | 1.11765030 | 1.88710072 | 1.1366 | 599.1s | 477.9s | 15,944,410 bytes |
| 42 | 1.11812929 | 1.88790947 | 1.1371 | 599.1s | 485.3s | 15,873,826 bytes |
| 2025 | 1.11789934 | 1.88752121 | 1.1367 | 599.1s | 479.2s | 15,879,042 bytes |
| mean | 1.11789 | | | | | |
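The reported mean and spread can be re-derived from the three per-seed val_bpb values with the standard library. The variable names below are illustrative; the numbers are copied from the results above.

```python
import statistics

# legal_ttt_exact val_bpb per seed, copied from the run results
val_bpb = {1337: 1.11765030, 42: 1.11812929, 2025: 1.11789934}

mean_bpb = statistics.mean(val_bpb.values())
std_bpb = statistics.stdev(val_bpb.values())  # sample standard deviation
gap_to_sota = 1.1194 - mean_bpb               # 1.1194 is the SOTA quoted in the summary

print(f"mean={mean_bpb:.5f} std={std_bpb:.4f} gap={gap_to_sota:.4f}")
```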

Method Notes

  • NUM_LAYERS=11, BIGRAM_VOCAB_SIZE=1536, XSA_LAST_N=4
  • TTT_ENABLED=1, score-first path
  • TTT_MUON=1 — Newton-Schulz orthogonalized updates in TTT loop (NS steps=3)
  • TTT_ENTROPY_ADAPT=1 — entropy-adaptive 2/3/4 epochs per chunk (H_HIGH=2.1, H_LOW=1.75)
  • TTT_LR=0.002, TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768
  • NGRAM_EVAL_ENABLED=0
  • NGRAM_TWO_PASS_ENABLED=0
  • NGRAM_FULL_RESCORE=0
  • EMA_ENABLED=1, SWA_ENABLED=1, LATE_QAT=1, VE_ENABLED=1
  • WARMDOWN_ITERS=3500, MAX_WALLCLOCK_SECONDS=599
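The entropy-adaptive setting can be sketched as a small threshold rule. This is a guess at the shape of the logic, not the PR's code: the thresholds and the 2/3/4 epoch range come from the notes above, while the function names, the inclusive boundaries, and the way per-rank values are pooled are assumptions. The thresholds appear to be in nats per token, since they bracket the reported val_loss of about 1.89.

```python
H_LOW, H_HIGH = 1.75, 2.1  # H_LOW / H_HIGH from the method notes


def global_chunk_nll(rank_nll_sums, rank_token_counts):
    """Mean NLL over all ranks for one chunk.

    ASSUMPTION: stands in for an all-reduce of per-rank NLL sums and token
    counts, so every rank picks the same epoch count for the chunk.
    """
    return sum(rank_nll_sums) / sum(rank_token_counts)


def select_epochs(chunk_nll, base_epochs=3):
    """Map chunk difficulty to 2, 3, or 4 TTT epochs (TTT_EPOCHS=3 is the base)."""
    if chunk_nll >= H_HIGH:
        return base_epochs + 1  # hard chunk: spend more adaptation budget
    if chunk_nll <= H_LOW:
        return base_epochs - 1  # easy chunk: spend less
    return base_epochs


# Hypothetical usage with 8 ranks, each scoring 4096 tokens of a 32768-token chunk.
nll = global_chunk_nll([9000.0] * 8, [4096] * 8)
epochs = select_epochs(nll)
```

Deciding from the pooled NLL rather than a per-rank value keeps all GPUs running the same number of epochs on a chunk, which avoids ranks diverging in a distributed TTT loop.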

Submission Checklist

  • One folder under records/track_10min_16mb/
  • Included README.md, submission.json, train_gpt.py, and train logs (3 seeds)
  • Training <= 600s
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No tokenizer/dataset modifications
  • Score-first TTT (SCORE under inference_mode before TRAIN on same chunk)
  • No n-gram, no two-pass, no external data lookup
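The score-first requirement is about ordering: each chunk must be scored before any update has seen it. The toy sketch below shows that ordering with a hypothetical count-based unigram model standing in for the real network. In the actual submission the scoring pass runs under inference_mode and the adaptation step is gradient-based, neither of which is reproduced here.

```python
import math


class ToyUnigram:
    """Hypothetical stand-in model: add-one smoothed token counts."""

    def __init__(self, vocab_size):
        self.counts = [1.0] * vocab_size

    def nll(self, tokens):
        total = sum(self.counts)
        return sum(-math.log(self.counts[t] / total) for t in tokens)

    def adapt(self, tokens):
        for t in tokens:
            self.counts[t] += 1.0


def score_first_eval(model, chunks):
    """Return mean NLL per token, scoring each chunk before adapting on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += model.nll(chunk)  # SCORE: model has not seen this chunk yet
        total_tokens += len(chunk)
        model.adapt(chunk)             # TRAIN: only after the score is recorded
    return total_nll / total_tokens


# Hypothetical usage: the second chunk benefits from adaptation on the first,
# but the first chunk is scored by the unadapted model.
mean_nll = score_first_eval(ToyUnigram(2), [[0, 0, 0, 0], [0, 0, 0, 0]])
```

Swapping the two marked lines would let the model train on tokens before scoring them, which is the leak the checklist item rules out.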

…179 (3-seed mean)

Two novel TTT innovations: (1) Muon-style Newton-Schulz orthogonalized updates
replace SGD in the TTT loop; (2) entropy-adaptive 2/3/4 epochs per chunk based
on globally-synced chunk NLL. 3-seed mean 1.1179, std 0.0002. All under 16MB/600s.
aamodbhatt changed the title from "Record: 11L Muon TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean)" to "Record: 11L Muon TTT + Legal Score-First TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean)" on Mar 30, 2026
aamodbhatt changed the title from "Record: 11L Muon TTT + Legal Score-First TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean)" to "Record: 11L Muon Legal TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean)" on Mar 30, 2026