# Legal Score-First TTT + Parallel Muon + Parameter Banking

**val_bpb: 1.1218** (legal TTT, 2-seed mean; 3rd seed in progress) | **~15.8 MB** | 8×H100 SXM, 600s training + 400s TTT eval

## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | Artifact size |
|------|----------|-------|-------------|------------------|----------|---------------|
| 1337 | 82.3ms | 7,278 | 1.1234 | **1.1213** | -0.0021 | 15,841,722 B |
| 42 | 82.4ms | 7,265 | 1.1242 | **1.1222** | -0.0020 | pending |
| 2025 | running | | | | | |
| **Mean** | **82.3ms** | **~7,272** | **1.1238** | **1.1218** | **-0.0021** | **~15.8 MB** |

## Legal TTT Protocol (from PR #461)

Every validation token is **scored BEFORE any weight update** that could use it:

```
for each 32K-token chunk of val data:
  Phase 1 — SCORE: sliding-window eval under torch.inference_mode()
    Record per-token NLL. This is the official score.
  Phase 2 — TRAIN: SGD(lr=0.002, momentum=0.9) for 3 epochs
    Freeze first 2 blocks. Grad clip 1.0. Cosine LR decay.
  Model adapts, improving predictions for FUTURE chunks only.
```

Scoring under `inference_mode()` guarantees that no gradients are computed and no weights change while a chunk is being scored, and the score-then-train ordering makes the protocol strictly causal: no token's recorded NLL is ever influenced by an update that has seen that token.
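The two-phase loop above can be sketched in PyTorch on a toy model. All names and sizes here are illustrative stand-ins, not the repo's actual code; freezing the embedding mimics the "freeze first 2 blocks" rule at miniature scale:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, chunk_len = 50, 16, 64

# Toy stand-in for the real model; the frozen embedding mirrors
# the freezing of the first 2 transformer blocks.
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
model[0].weight.requires_grad_(False)

tokens = torch.randint(0, vocab, (4 * chunk_len + 1,))  # fake val stream
opt = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)

nlls = []
for start in range(0, len(tokens) - 1, chunk_len):
    x = tokens[start:start + chunk_len]
    y = tokens[start + 1:start + chunk_len + 1]
    # Phase 1 (SCORE): record per-token NLL before any update sees the chunk.
    with torch.inference_mode():
        nlls.append(F.cross_entropy(model(x), y, reduction="none"))
    # Phase 2 (TRAIN): adapt on the scored chunk; only future chunks benefit.
    for _ in range(3):  # TTT_EPOCHS
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

official_nll = torch.cat(nlls)  # the official score, fixed during Phase 1
```

The key property: `official_nll` for each chunk is materialized before `opt.step()` ever runs on that chunk's tokens.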

### TTT Hyperparameters

| Parameter | Value |
|-----------|-------|
| Chunk size | 32,768 tokens |
| Optimizer | SGD + momentum(0.9) |
| Learning rate | 0.002 (cosine decay across chunks) |
| Epochs per chunk | 3 |
| Frozen blocks | First 2 of 11 |
| Gradient clip | 1.0 |
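The per-chunk cosine schedule from the table can be sketched as below. This assumes the rate decays to zero by the final chunk; the actual floor is not stated here:

```python
import math

def ttt_chunk_lr(chunk_idx: int, n_chunks: int, base_lr: float = 0.002) -> float:
    # Cosine decay of the TTT learning rate across val chunks.
    # Assumes decay to zero at the last chunk (floor is an assumption).
    progress = chunk_idx / max(n_chunks - 1, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

lrs = [ttt_chunk_lr(i, n_chunks=10) for i in range(10)]
```

Each chunk's 3 epochs would then run at `lrs[chunk_idx]`, starting from the base 0.002.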

## Training Architecture

Built on PR #414's stack with Parameter Banking + the Parallel Muon optimizer:

- 11 layers, 512d, 8 heads / 4 KV heads, MLP 3× (relu²)
- XSA on last 4 layers, partial RoPE (16/64 dims), LN Scale
- SmearGate, BigramHash(2048), VE128 on layers 9-10
- EMA(0.997) + tight SWA (every 50 steps)
- GPTQ-lite int6 quantization + lzma compression
- **Parameter Banking**: 4 contiguous 3D banks replace 66 `nn.Linear` weights
- **Parallel Muon**: no DDP for banks; post-backward reduce-scatter → local Newton-Schulz → all-gather
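The local step at the heart of Parallel Muon is a Newton-Schulz orthogonalization of each rank's gradient shard. A minimal sketch follows; the quintic coefficients are the ones popularized by Muon implementations (treat them as an assumption here), and the distributed wrapping is shown only as comments:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz iteration pushing G's singular values toward 1,
    i.e. an approximate orthogonalization of the gradient matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # common Muon coefficients (assumption)
    X = G / (G.norm() + eps)           # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Parallel Muon over a parameter bank (no DDP on the banks):
#   1. post-backward: reduce-scatter the bank gradient across the 8 ranks
#   2. each rank applies momentum + newton_schulz to its local shard
#   3. all-gather the orthogonalized shards back into the full bank
torch.manual_seed(0)
G = torch.randn(32, 64)   # one rank's local shard (illustrative shape)
update = newton_schulz(G)
```

Because each rank orthogonalizes only its own shard, the expensive matrix iterations are parallelized across GPUs instead of duplicated, which is the point of dropping DDP for the banks.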

## Run Command

```bash
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=2 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- **TTT recipe**: PR #461 by @anantdgoel — legal score-first TTT with SGD+momentum, selective freezing
- **Base model**: PR #414 by @signalrush — GPTQ-lite, VE128, Tight SWA, warmdown=3500
- **Optimizer**: Parameter Banking + Parallel Muon (arXiv:2511.07464)