Changes from all commits
70 commits
5d3c230
Add Mixture of Softmax (MoS) with low-rank option for softmax bottleneck
Mar 20, 2026
9d61c17
Add one-shot RunPod setup and pilot run script
Mar 20, 2026
c8de91e
Simplify pilot script: MoS only, skip redundant baseline
Mar 20, 2026
68520fe
Fix setup script: remove clone step, assume already in repo
Mar 20, 2026
01f071a
Add HF_TOKEN to setup script for faster dataset downloads
Mar 20, 2026
183dfa3
Record: MoS K=2 R=64 pilot — val_bpb=1.3932 (1xH100, 10min)
Mar 20, 2026
2440ea8
Add 1-hour MoS validation script (targeting PR#111 baseline)
Mar 20, 2026
0773ee0
Make 1h script survive terminal disconnects via nohup
User123331 Mar 20, 2026
69d4a7d
Add vanilla baseline 10-min script for 1xH100 comparison
User123331 Mar 20, 2026
0a32f9d
Add research artifacts: technique encyclopedia, combination matrix, s…
User123331 Mar 20, 2026
fd4f106
Integrate SOTA stack (thwu1's 1.1428 bpb) + custom tokenizer pipeline
User123331 Mar 21, 2026
34ff363
Add train_tokenizer_only.sh for focused tokenizer training
User123331 Mar 21, 2026
db62d70
Add RunPod SOTA launcher
Mar 22, 2026
32c694d
Add MoS + SOTA technique stack for competitive testing
Mar 22, 2026
5a0fa0b
Run training with nohup to survive terminal disconnects
Mar 22, 2026
a5a6391
Fix FA3 build: use --no-build-isolation so setup.py can find torch
Mar 22, 2026
73ba4f7
Add keep-alive heartbeat to prevent RunPod pod termination
Mar 22, 2026
34519d8
Fix FA3 build: clear stale build dir, fix variable scoping
Mar 22, 2026
b2e9c10
Add sentencepiece and numpy to deps check
Mar 22, 2026
d7aa8c4
Fix FA3 install: mkdir flash_attn_3 before pip editable install
Mar 22, 2026
6003c55
Add hyperbolic.ai setup scripts for 8x H100
Mar 23, 2026
b2f4d3e
Update quickstart to use pre-compiled FA3 .so
Mar 23, 2026
b34ba1e
Fix data paths for hyperbolic setup
Mar 23, 2026
9ba2ec4
Fix: use ~/golf instead of /workspace for hyperbolic
Mar 23, 2026
a77f8a7
Fix: use $HOME instead of /workspace for FA3 build
Mar 23, 2026
057a844
Add --break-system-packages for externally-managed environments
Mar 23, 2026
c88f9a2
Fix: use SCRIPT_DIR instead of hardcoded golf path
Mar 23, 2026
bd347d9
Fix: use user site-packages instead of system
Mar 23, 2026
61a9b21
Build FA3 from source (pre-compiled .so not in repo)
Mar 23, 2026
5ebbc36
Add DISABLE_COMPILE option to fix torch.compile/inductor issues
Mar 23, 2026
43c0e5a
Remove nohup wait - use tmux for persistence instead
Mar 23, 2026
fedf2e2
Add 10L_qk4_wd1500_20260401_094047.log
User123331 Apr 14, 2026
bb203ce
Add 10L_qk4_wd2000_20260401_081938.log
User123331 Apr 14, 2026
5f8de23
Add 11L_mlp35_20260401_074042.log
User123331 Apr 14, 2026
b74224d
Add 11L_qk4_20260401_070337.log
User123331 Apr 14, 2026
7399518
Add 11L_qk4_20260401_070342.log
User123331 Apr 14, 2026
2dc88de
Add 11L_qk4_wd1000_20260401_101532.log
User123331 Apr 14, 2026
ba55b12
Add 11L_qk4_wd1000_tied005_20260401_154931.log
User123331 Apr 14, 2026
699e579
Add 11L_qk4_wd1200_fa3_20260401_172016.log
User123331 Apr 14, 2026
9e9c2bb
Add 11L_qk4_wd1200_fa3_20260401_183913.log
User123331 Apr 14, 2026
fe152d3
Add 11L_qk4_wd1500_20260401_085426.log
User123331 Apr 14, 2026
3bf9631
Add 11L_qk4_wd1500_fa3_20260401_171958.log
User123331 Apr 14, 2026
3920103
Add 11L_qk4_wd1500_fa3_20260401_180210.log
User123331 Apr 14, 2026
33d39ce
Add 11L_qk4_wd1500_fa3_20260401_180428.log
User123331 Apr 14, 2026
8649391
Add 11L_qk4_wd1500_mlr025_20260401_105230.log
User123331 Apr 14, 2026
a3fd722
Add 11L_qk4_wd1500_swa25_20260401_143529.log
User123331 Apr 14, 2026
90013e5
Add 11L_qk4_wd1500_tied005_20260401_124435.log
User123331 Apr 14, 2026
5c6c028
Add 11L_qk4_wd1500_tied005_mlr025_20260401_151230.log
User123331 Apr 14, 2026
0304084
Add 11L_qk6_wd1500_20260401_132133.log
User123331 Apr 14, 2026
128da10
Add 11L_qk8_wd1500_20260401_135834.log
User123331 Apr 14, 2026
4f382e7
Add 11L_wd3500_20260401_081928.log
User123331 Apr 14, 2026
5cced43
Add 11layers_20260331_233309.log
User123331 Apr 14, 2026
00e7491
Add 12L_qk4_wd1200_20260401_120458.log
User123331 Apr 14, 2026
a37adf3
Add baseline_10L_20260331_214036.log
User123331 Apr 14, 2026
9b04933
Add baseline_786k_20260331_222116.log
User123331 Apr 14, 2026
7e95406
Add depth_recurrence.log
User123331 Apr 14, 2026
cd14860
Add free_wins.log
User123331 Apr 14, 2026
f6d9ee3
Add int6_qat.log
User123331 Apr 14, 2026
2f899a6
Add mega_xsa_ema_fa3_20260401_162629.log
User123331 Apr 14, 2026
488ac38
Add mega_xsa_ema_fa3_20260401_164231.log
User123331 Apr 14, 2026
1a5a1ba
Add mega_xsa_ema_fa3_20260401_173311.log
User123331 Apr 14, 2026
05171ce
Add mlp35_20260401_033845.log
User123331 Apr 14, 2026
02dcc3e
Add naive_baseline_9L_mlp2_seq1024_20260401_113000.log
User123331 Apr 14, 2026
e1c2993
Add ngram_cache.log
User123331 Apr 14, 2026
14885fd
Add parallel_residuals.log
User123331 Apr 14, 2026
4d9e4e2
Add score_first_ttt.log
User123331 Apr 14, 2026
9c1c23f
Add sp8192_full_stack.log
User123331 Apr 14, 2026
2476ce7
Add warmdown3500_20260401_062854.log
User123331 Apr 14, 2026
29b5b20
Add xsa_ema.log
User123331 Apr 14, 2026
b26e41b
Add qk_gain_20260401_001427.log
User123331 Apr 14, 2026
Empty file added 1x H100 SXM5 Logs/free_wins.log
Empty file added 1x H100 SXM5 Logs/int6_qat.log
709 changes: 709 additions & 0 deletions 1x H100 SXM5 Logs/qk_gain_20260401_001427.log
Empty file added 1x H100 SXM5 Logs/xsa_ema.log
156 changes: 156 additions & 0 deletions Graphs/PLAN_beat_SOTA.md
# Plan: Beat SOTA (1.1428 bpb)

**Date**: 2026-03-21
**Current SOTA**: 1.1428 (thwu1, PR #180)
**Emerging**: 1.1303 (PR #254), 1.1307 (PR #265) — not yet on leaderboard
**Our target**: < 1.13 bpb

---

## Strategy: Combine proven techniques nobody has stacked together yet

The key insight from analyzing all PRs: **no single submission combines ALL the best techniques**. Each top entry uses a subset. We stack them all.

---

## The Stack

### Layer 1: Base Architecture (from thwu1 #180)
- 10-11 layers, dim=512, 8 heads, 4 KV heads (GQA)
- MLP 3x (hidden=1536), ReLU-squared
- U-Net skip connections
- Tied embeddings (FP16 passthrough)
- Logit softcap=30
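
A minimal config sketch of this base architecture; the field names are illustrative, not thwu1's actual code:
```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    # Layer 1 settings from thwu1 #180 (names are hypothetical)
    n_layers: int = 11            # 10-11 layers
    dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4           # GQA: two query heads share each KV head
    mlp_hidden: int = 1536        # 3x expansion, ReLU-squared activation
    unet_skips: bool = True       # U-Net style long skip connections
    tied_embeddings: bool = True  # FP16 passthrough of input embedding as output head
    logit_softcap: float = 30.0   # logits -> softcap * tanh(logits / softcap)
```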

### Layer 2: Quantization (from thwu1 #180)
- Int5 for MLP weights (saves ~1.86MB for extra layer/features)
- Int6 for attention weights
- zstd-22 compression
- 3% magnitude pruning post-training (better compression)
- WD=0.04 for quantization robustness
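
A minimal sketch of the symmetric Int5/Int6 weight quantization above, assuming per-tensor scales and round-to-nearest (the PR #180 implementation may differ, e.g. per-channel scales):
```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Per-tensor symmetric quantization to `bits` (5 for MLP, 6 for attention weights)."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale                                   # int codes get zstd-22 compressed downstream

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```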

### Layer 3: Input Augmentation (from thwu1 #180 + #265)
- BigramHash(10240) buckets, dim=128, projected to 512
- SmearGate (proven compatible, +0.005-0.008)
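
A sketch of the bigram-hash input augmentation; only the bucket count and dims come from the list above, the hash function and module wiring are assumptions:
```python
import torch
import torch.nn as nn

class BigramHashEmbed(nn.Module):
    """Hash (prev_token, token) pairs into buckets and add a learned embedding."""
    def __init__(self, n_buckets: int = 10240, hash_dim: int = 128, model_dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.embed = nn.Embedding(n_buckets, hash_dim)
        self.proj = nn.Linear(hash_dim, model_dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, T) long
        prev = torch.roll(tokens, 1, dims=1)
        prev[:, 0] = 0                                          # no bigram at position 0
        bucket = (prev * 31337 + tokens) % self.n_buckets       # cheap hash; constant is arbitrary
        return self.proj(self.embed(bucket))                    # added to the token embedding
```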

### Layer 4: Training Optimization (best of all PRs)
- Muon: lr=0.02, WD=0.04, momentum warmup 0.92→0.99 over 1500 steps (from #265; sketch below)
- SWA: start_frac=0.4, every=50 steps (from thwu1)
- OrthoInit + muP scaling
- Warmdown=3000, warmup=20, grad_clip=0.3
- Seq2048, batch=524K tokens (from #236 — more gradient updates)
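
A sketch of the Muon momentum warmup schedule referenced above; how it is wired into the optimizer is an assumption:
```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Linearly ramp Muon momentum from 0.92 to 0.99 over the first 1500 steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

# hypothetical usage inside the training loop:
# for group in muon_optimizer.param_groups:
#     group["momentum"] = muon_momentum(step)
```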

### Layer 5: Speed (from #265 + modded-nanogpt)
- FlashAttention 3 (Hopper native) — ~5% faster steps
- Fused Linear+ReLU^2 Triton kernel — ~10% MLP speedup
- torch.compile mode="max-autotune"

### Layer 6: Eval-Time (from #265 + #267)
- Sliding window eval (stride=64; sketch below)
- Partial XSA on last 3 layers (from #265, +0.002 bpb, only 2ms/step)
- Causal TTT: SGD on val chunks after scoring (from #267, +0.003 bpb)
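
A sketch of the sliding-window eval referenced above: each window scores only its last `stride` targets, so every token gets up to `seq_len` tokens of left context (scoring details are assumptions):
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_loss(model, tokens, seq_len=2048, stride=64):
    """tokens: (N,) long tensor of ids. Returns mean next-token loss (convert to bpb downstream)."""
    losses = []
    n_targets = tokens.numel() - 1
    for start in range(0, n_targets, stride):
        n_new = min(stride, n_targets - start)             # score targets start+1 .. start+n_new
        end = start + n_new + 1
        window = tokens[max(0, end - 1 - seq_len): end]    # at most seq_len tokens of context
        logits = model(window[:-1].unsqueeze(0))           # (1, T, vocab)
        loss = F.cross_entropy(logits[0], window[1:], reduction="none")
        losses.append(loss[-n_new:])                       # only the new positions count
    return torch.cat(losses).mean()
```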

### Layer 7: Free Training Signal
- MTP auxiliary head (predict t+2, t+3) — discarded at save, zero artifact cost
- From PR #88 — provides gradient enrichment during training
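
A sketch of the MTP auxiliary head: extra linear heads predict t+2 and t+3 during training and are simply not saved; the loss weighting is an assumption:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPAux(nn.Module):
    """Auxiliary multi-token-prediction heads; excluded from the saved artifact."""
    def __init__(self, dim: int = 512, vocab: int = 1024, offsets=(2, 3), weight: float = 0.1):
        super().__init__()
        self.offsets, self.weight = offsets, weight
        self.heads = nn.ModuleList(nn.Linear(dim, vocab, bias=False) for _ in offsets)

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, dim), tokens: (B, T). Position i predicts token i+k.
        loss = hidden.new_zeros(())
        for head, k in zip(self.heads, self.offsets):
            logits = head(hidden[:, :-k])                            # (B, T-k, vocab)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tokens[:, k:].reshape(-1))
        return self.weight * loss
```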

---

## Expected Impact Breakdown

| Technique | bpb gain over baseline | Source |
|-----------|----------------------|--------|
| Int5/6 + MLP3x + 10L | ~0.08 | thwu1 baseline |
| BigramHash(10240) | ~0.01 | thwu1 |
| SmearGate | ~0.006 | PR #162 |
| SWA | ~0.005 | thwu1 |
| OrthoInit + muP | ~0.004 | PR #198 |
| Sliding Window | ~0.03 | All top PRs |
| Seq2048 | ~0.015 | PR #198 |
| Smaller batch (524K) | ~0.003 | PR #236 |
| FA3 + fused kernels (more steps) | ~0.005 | PR #265 |
| Partial XSA (last 3 layers) | ~0.002 | PR #265 |
| Causal TTT | ~0.003 | PR #267 |
| MTP auxiliary | ~0.002 | PR #88 |
| **Total from 1.2244 baseline** | **~0.165** | |
| **Projected bpb** | **~1.06-1.10** | |

Conservative estimate: **1.10-1.12 bpb** (not everything stacks perfectly).

---

## Implementation Phases

### Phase 1: Fork SOTA code (~2 hours)
- Take thwu1's train_gpt.py from PR #180 as base
- Verify it reproduces 1.1428 on 8xH100 (10 min run, ~$3)
- This becomes our baseline to improve upon

### Phase 2: Add proven extras (~3 hours)
- Add SmearGate (if not already in thwu1's code)
- Add Muon momentum warmup (0.92→0.99)
- Switch to batch=524K
- Add FlashAttention 3
- Test on 1xH100 for quick validation

### Phase 3: Add novel techniques (~4 hours)
- Implement Partial XSA on last 3 layers (from PR #265)
- Add MTP auxiliary head (from PR #88)
- Add fused Triton kernels (Linear+ReLU^2, softcapped CE)
- Test on 1xH100

### Phase 4: Eval-time optimization (~2 hours)
- Implement Causal TTT (SGD, 3 epochs per chunk)
- Tune TTT hyperparameters (lr, momentum, epochs)
- Test on 1xH100

### Phase 5: Record attempt (~$20)
- Full run on 8xH100, 10 min
- Submit to record track
- If < 1.13 → PR to openai/parameter-golf

---

## Compute Budget

| Phase | Hardware | Time | Cost |
|-------|----------|------|------|
| Phase 1 | 8xH100 | 15 min | ~$5 |
| Phase 2 | 1xH100 | 30 min | ~$2 |
| Phase 3 | 1xH100 | 1 hour | ~$4 |
| Phase 4 | 1xH100 | 30 min | ~$2 |
| Phase 5 | 8xH100 | 15 min | ~$5 |
| Buffer | — | — | ~$5 |
| **Total** | | | **~$23** |

---

## What Makes This Novel

Nobody has combined ALL of these:
1. Int5/Int6 mixed quant + 10-11L (thwu1)
2. + Partial XSA (PR #265, brand new technique)
3. + MTP auxiliary training (PR #88, free signal)
4. + Causal TTT (PR #267)
5. + FA3 + fused Triton kernels (modded-nanogpt)
6. + Optimized batch size (PR #236)

Each top PR uses 3-4 of these. We use all 6+.

---

## Risk Assessment

| Risk | Mitigation |
|------|-----------|
| Techniques don't stack as expected | Phase-by-phase testing on 1xH100 |
| XSA + TTT conflict | Test independently first |
| Int5 fragile with new techniques | Fall back to Int6 if quant degrades |
| Compute budget overrun | 1xH100 validation before 8xH100 record |
| FA3 install issues on RunPod | FA3 may already be in the template; fall back to FA2 |

---

## Immediate Next Step

Pull thwu1's code from PR #180 and start Phase 1.
99 changes: 99 additions & 0 deletions Graphs/speed_optimizations.md
# Speed Optimizations: Triton Kernels & Libraries

**Goal**: More training steps in the same wallclock = better bpb

---

## Priority 1: FlashAttention 3 (~5% step time reduction)

**What**: H100-optimized attention using Hopper async TMA + warp specialization
**Speedup**: 1.5-2x over FA2 in attention forward, ~5% overall step time
**Integration**: Drop-in replacement
```python
from flash_attn_interface import flash_attn_func as flash_attn_3_func
```
**Status**: Proven — PRs #198 and #164 use this. Only external library in top submissions.
**Install**: build FA3 from source in the flash-attention repo's `hopper/` directory (editable pip install with `--no-build-isolation` so `setup.py` can find torch); a pre-compiled `.so` is not shipped in this repo.
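
A hedged integration sketch; note that the return signature of `flash_attn_func` has varied across FA3 builds, so check `flash_attn_interface` in the installed version:
```python
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func as flash_attn_3_func
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v):
    """q, k, v: (B, T, n_heads, head_dim); k/v may have fewer heads (GQA)."""
    if HAS_FA3:
        out = flash_attn_3_func(q, k, v, causal=True)
        # Some FA3 builds return (out, softmax_lse); others return out alone.
        return out[0] if isinstance(out, tuple) else out
    # Fallback: SDPA expects (B, n_heads, T, head_dim); repeat k/v heads first if using GQA.
    return F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    ).transpose(1, 2)
```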

---

## Priority 2: Fused Linear+ReLU^2 Triton Kernel (~5-15% MLP speedup)

**What**: Fuses CastedLinear + relu().square() into one Triton kernel
**Source**: modded-nanogpt `triton_kernels.FusedLinearReLUSquareFunction`
**Why it helps**: Eliminates intermediate tensor materialization in MLP (which is 3x expanded)
**Integration**: Copy Triton kernel, replace MLP forward pass
**Status**: Used in modded-nanogpt speedrun, not yet in any Parameter Golf PR
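
For reference, a sketch of the unfused MLP path this kernel replaces:
```python
import torch.nn.functional as F

def mlp_forward_unfused(x, w_in, w_out):
    # x: (B, T, 512); w_in: (1536, 512); w_out: (512, 1536)
    h = F.linear(x, w_in)       # pre-activation: a full (B, T, 1536) tensor
    h = F.relu(h).square()      # ReLU^2 activation: another full pass over that tensor
    return F.linear(h, w_out)
# The fused Triton kernel computes the first linear and relu().square() in one
# launch, so the pre-activation never round-trips through HBM.
```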

---

## Priority 3: Fused Softcapped Cross-Entropy (~2-5% loss speedup)

**What**: Fuses logit_softcap + cross_entropy into one Triton kernel
**Source**: modded-nanogpt `triton_kernels.FusedSoftcappedCrossEntropy`
**Why it helps**: Avoids materializing softcapped logits tensor
**Integration**: Copy Triton kernel, replace loss computation
**Note**: Only applies to non-MoS path (MoS uses nll_loss on log-probs)
**Status**: Used in modded-nanogpt speedrun, not yet in any Parameter Golf PR
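
For reference, a sketch of the unfused computation the kernel replaces:
```python
import torch
import torch.nn.functional as F

def softcapped_ce_unfused(logits, targets, cap: float = 30.0):
    # logits: (N, vocab); targets: (N,)
    capped = cap * torch.tanh(logits / cap)   # materializes a second logits-sized tensor
    return F.cross_entropy(capped, targets)   # the fused kernel does both in one pass
```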

---

## Priority 4: torch.compile Tuning (0-5% overall)

```python
# Current
torch.compile(model, dynamic=False, fullgraph=True)

# Try
torch.compile(model, dynamic=False, fullgraph=True, mode="max-autotune")
```

Also set env var:
```bash
export PYTORCH_ALLOC_CONF="expandable_segments:True"
```

---

## Priority 5: Gradient Checkpointing (enables larger batch/seq)

**What**: Recompute activations in backward pass instead of storing them
**Benefit**: 50-70% activation memory reduction, enables seq=2048 or larger batch on 1xH100
**Cost**: ~20-33% more compute (5-10% wall-clock in practice)
**When to use**: If moving to seq=2048+ on 1xH100
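
A sketch of wrapping each block with activation checkpointing; the block/forward structure is an assumption:
```python
import torch.utils.checkpoint as ckpt

def forward_blocks(blocks, x, use_checkpoint: bool = True):
    for block in blocks:
        if use_checkpoint:
            # Recompute this block's activations in backward instead of storing them.
            x = ckpt.checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x
```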

---

## Priority 6: Custom Triton MoS Kernel (if MoS proves useful)

**What**: Fuse log_softmax over K components + logsumexp mixture into one kernel
**Expected**: Reduce MoS overhead from ~5ms to ~2-3ms per step
**Effort**: ~50-100 lines of Triton, based on fused softmax tutorial
**Note**: The bigger bottleneck is the K einsum matmuls, not the softmax
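
For reference, a sketch of the unfused MoS log-prob computation such a kernel would replace; shapes and names are assumptions:
```python
import torch
import torch.nn.functional as F

def mos_log_probs(h, mix_logits, heads):
    # h: (B, T, dim); mix_logits: (B, T, K); heads: list of K output projections
    log_pi = F.log_softmax(mix_logits, dim=-1)                       # mixture weights
    comp = torch.stack([F.log_softmax(head(h), dim=-1)               # K matmuls: the real cost
                        for head in heads], dim=-2)                  # (B, T, K, vocab)
    return torch.logsumexp(log_pi.unsqueeze(-1) + comp, dim=-2)      # (B, T, vocab)
```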

---

## NOT Worth It at Our Scale

| Technique | Why Skip |
|-----------|----------|
| FP8 training (torchao) | dim=512 matrices too small, overhead > benefit |
| Fused RMSNorm | torch.compile already fuses it |
| Apex FusedAdam | Already using fused=True, marginal gain |
| Liger FusedCE | Logit tensor tiny at vocab=1024 |
| bitsandbytes 8-bit optimizer | Model too small to benefit |

---

## Impact Estimate

| Optimization | Step Time Reduction | Extra Steps in 10min | bpb Impact |
|-------------|--------------------|--------------------|------------|
| FA3 | ~5% | +1000 steps | ~0.005 bpb |
| Fused MLP | ~10% | +2000 steps | ~0.008 bpb |
| Fused CE | ~3% | +600 steps | ~0.002 bpb |
| max-autotune | ~2% | +400 steps | ~0.001 bpb |
| **Combined** | **~20%** | **+4000 steps** | **~0.015 bpb** |

At current ~500ms/step on 1xH100, a 20% reduction gives ~400ms/step: roughly 1,200 steps in a 10-min run become ~1,500.
On 8xH100 at ~27ms/step, 20% gives ~22ms/step: ~27,300 steps vs ~22,200.