Changes from all commits
70 commits
5d3c230
Add Mixture of Softmax (MoS) with low-rank option for softmax bottleneck
Mar 20, 2026
9d61c17
Add one-shot RunPod setup and pilot run script
Mar 20, 2026
c8de91e
Simplify pilot script: MoS only, skip redundant baseline
Mar 20, 2026
68520fe
Fix setup script: remove clone step, assume already in repo
Mar 20, 2026
01f071a
Add HF_TOKEN to setup script for faster dataset downloads
Mar 20, 2026
183dfa3
Record: MoS K=2 R=64 pilot — val_bpb=1.3932 (1xH100, 10min)
Mar 20, 2026
2440ea8
Add 1-hour MoS validation script (targeting PR#111 baseline)
Mar 20, 2026
0773ee0
Make 1h script survive terminal disconnects via nohup
User123331 Mar 20, 2026
69d4a7d
Add vanilla baseline 10-min script for 1xH100 comparison
User123331 Mar 20, 2026
0a32f9d
Add research artifacts: technique encyclopedia, combination matrix, s…
User123331 Mar 20, 2026
fd4f106
Integrate SOTA stack (thwu1's 1.1428 bpb) + custom tokenizer pipeline
User123331 Mar 21, 2026
34ff363
Add train_tokenizer_only.sh for focused tokenizer training
User123331 Mar 21, 2026
db62d70
Add RunPod SOTA launcher
Mar 22, 2026
32c694d
Add MoS + SOTA technique stack for competitive testing
Mar 22, 2026
5a0fa0b
Run training with nohup to survive terminal disconnects
Mar 22, 2026
a5a6391
Fix FA3 build: use --no-build-isolation so setup.py can find torch
Mar 22, 2026
73ba4f7
Add keep-alive heartbeat to prevent RunPod pod termination
Mar 22, 2026
34519d8
Fix FA3 build: clear stale build dir, fix variable scoping
Mar 22, 2026
b2e9c10
Add sentencepiece and numpy to deps check
Mar 22, 2026
d7aa8c4
Fix FA3 install: mkdir flash_attn_3 before pip editable install
Mar 22, 2026
6003c55
Add hyperbolic.ai setup scripts for 8x H100
Mar 23, 2026
b2f4d3e
Update quickstart to use pre-compiled FA3 .so
Mar 23, 2026
b34ba1e
Fix data paths for hyperbolic setup
Mar 23, 2026
9ba2ec4
Fix: use ~/golf instead of /workspace for hyperbolic
Mar 23, 2026
a77f8a7
Fix: use $HOME instead of /workspace for FA3 build
Mar 23, 2026
057a844
Add --break-system-packages for externally-managed environments
Mar 23, 2026
c88f9a2
Fix: use SCRIPT_DIR instead of hardcoded golf path
Mar 23, 2026
bd347d9
Fix: use user site-packages instead of system
Mar 23, 2026
61a9b21
Build FA3 from source (pre-compiled .so not in repo)
Mar 23, 2026
5ebbc36
Add DISABLE_COMPILE option to fix torch.compile/inductor issues
Mar 23, 2026
43c0e5a
Remove nohup wait - use tmux for persistence instead
Mar 23, 2026
fedf2e2
Add 10L_qk4_wd1500_20260401_094047.log
User123331 Apr 14, 2026
bb203ce
Add 10L_qk4_wd2000_20260401_081938.log
User123331 Apr 14, 2026
5f8de23
Add 11L_mlp35_20260401_074042.log
User123331 Apr 14, 2026
b74224d
Add 11L_qk4_20260401_070337.log
User123331 Apr 14, 2026
7399518
Add 11L_qk4_20260401_070342.log
User123331 Apr 14, 2026
2dc88de
Add 11L_qk4_wd1000_20260401_101532.log
User123331 Apr 14, 2026
ba55b12
Add 11L_qk4_wd1000_tied005_20260401_154931.log
User123331 Apr 14, 2026
699e579
Add 11L_qk4_wd1200_fa3_20260401_172016.log
User123331 Apr 14, 2026
9e9c2bb
Add 11L_qk4_wd1200_fa3_20260401_183913.log
User123331 Apr 14, 2026
fe152d3
Add 11L_qk4_wd1500_20260401_085426.log
User123331 Apr 14, 2026
3bf9631
Add 11L_qk4_wd1500_fa3_20260401_171958.log
User123331 Apr 14, 2026
3920103
Add 11L_qk4_wd1500_fa3_20260401_180210.log
User123331 Apr 14, 2026
33d39ce
Add 11L_qk4_wd1500_fa3_20260401_180428.log
User123331 Apr 14, 2026
8649391
Add 11L_qk4_wd1500_mlr025_20260401_105230.log
User123331 Apr 14, 2026
a3fd722
Add 11L_qk4_wd1500_swa25_20260401_143529.log
User123331 Apr 14, 2026
90013e5
Add 11L_qk4_wd1500_tied005_20260401_124435.log
User123331 Apr 14, 2026
5c6c028
Add 11L_qk4_wd1500_tied005_mlr025_20260401_151230.log
User123331 Apr 14, 2026
0304084
Add 11L_qk6_wd1500_20260401_132133.log
User123331 Apr 14, 2026
128da10
Add 11L_qk8_wd1500_20260401_135834.log
User123331 Apr 14, 2026
4f382e7
Add 11L_wd3500_20260401_081928.log
User123331 Apr 14, 2026
5cced43
Add 11layers_20260331_233309.log
User123331 Apr 14, 2026
00e7491
Add 12L_qk4_wd1200_20260401_120458.log
User123331 Apr 14, 2026
a37adf3
Add baseline_10L_20260331_214036.log
User123331 Apr 14, 2026
9b04933
Add baseline_786k_20260331_222116.log
User123331 Apr 14, 2026
7e95406
Add depth_recurrence.log
User123331 Apr 14, 2026
cd14860
Add free_wins.log
User123331 Apr 14, 2026
f6d9ee3
Add int6_qat.log
User123331 Apr 14, 2026
2f899a6
Add mega_xsa_ema_fa3_20260401_162629.log
User123331 Apr 14, 2026
488ac38
Add mega_xsa_ema_fa3_20260401_164231.log
User123331 Apr 14, 2026
1a5a1ba
Add mega_xsa_ema_fa3_20260401_173311.log
User123331 Apr 14, 2026
05171ce
Add mlp35_20260401_033845.log
User123331 Apr 14, 2026
02dcc3e
Add naive_baseline_9L_mlp2_seq1024_20260401_113000.log
User123331 Apr 14, 2026
e1c2993
Add ngram_cache.log
User123331 Apr 14, 2026
14885fd
Add parallel_residuals.log
User123331 Apr 14, 2026
4d9e4e2
Add score_first_ttt.log
User123331 Apr 14, 2026
9c1c23f
Add sp8192_full_stack.log
User123331 Apr 14, 2026
2476ce7
Add warmdown3500_20260401_062854.log
User123331 Apr 14, 2026
29b5b20
Add xsa_ema.log
User123331 Apr 14, 2026
b26e41b
Add qk_gain_20260401_001427.log
User123331 Apr 14, 2026
Empty file added 1x H100 SXM5 Logs/free_wins.log
Empty file added 1x H100 SXM5 Logs/int6_qat.log
709 changes: 709 additions & 0 deletions 1x H100 SXM5 Logs/qk_gain_20260401_001427.log
Empty file added 1x H100 SXM5 Logs/xsa_ema.log
156 changes: 156 additions & 0 deletions Graphs/PLAN_beat_SOTA.md
# Plan: Beat SOTA (1.1428 bpb)

**Date**: 2026-03-21
**Current SOTA**: 1.1428 (thwu1, PR #180)
**Emerging**: 1.1303 (PR #254), 1.1307 (PR #265) — not yet on leaderboard
**Our target**: < 1.13 bpb

---

## Strategy: Combine proven techniques nobody has stacked together yet

The key insight from analyzing all PRs: **no single submission combines ALL the best techniques**. Each top entry uses a subset. We stack them all.

---

## The Stack

### Layer 1: Base Architecture (from thwu1 #180)
- 10-11 layers, dim=512, 8 heads, 4 KV heads (GQA)
- MLP 3x (hidden=1536), ReLU-squared
- U-Net skip connections
- Tied embeddings (FP16 passthrough)
- Logit softcap=30
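
A minimal config sketch of this base architecture; the field names are illustrative, not thwu1's actual code:
```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    # Layer 1 settings from thwu1 #180 (names are hypothetical)
    n_layers: int = 11            # 10-11 layers
    dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4           # GQA: two query heads share each KV head
    mlp_hidden: int = 1536        # 3x expansion, ReLU-squared activation
    unet_skips: bool = True       # U-Net style long skip connections
    tied_embeddings: bool = True  # FP16 passthrough of input embedding as output head
    logit_softcap: float = 30.0   # logits -> softcap * tanh(logits / softcap)
```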

### Layer 2: Quantization (from thwu1 #180)
- Int5 for MLP weights (saves ~1.86MB for extra layer/features)
- Int6 for attention weights
- zstd-22 compression
- 3% magnitude pruning post-training (better compression)
- WD=0.04 for quantization robustness
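
A minimal sketch of the symmetric Int5/Int6 weight quantization above, assuming per-tensor scales and round-to-nearest (the PR #180 implementation may differ, e.g. per-channel scales):
```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Per-tensor symmetric quantization to `bits` (5 for MLP, 6 for attention weights)."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale                                   # int codes get zstd-22 compressed downstream

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```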

### Layer 3: Input Augmentation (from thwu1 #180 + #265)
- BigramHash(10240) buckets, dim=128, projected to 512
- SmearGate (proven compatible, +0.005-0.008)
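
A sketch of the bigram-hash input augmentation; only the bucket count and dims come from the list above, the hash function and module wiring are assumptions:
```python
import torch
import torch.nn as nn

class BigramHashEmbed(nn.Module):
    """Hash (prev_token, token) pairs into buckets and add a learned embedding."""
    def __init__(self, n_buckets: int = 10240, hash_dim: int = 128, model_dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.embed = nn.Embedding(n_buckets, hash_dim)
        self.proj = nn.Linear(hash_dim, model_dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, T) long
        prev = torch.roll(tokens, 1, dims=1)
        prev[:, 0] = 0                                          # no bigram at position 0
        bucket = (prev * 31337 + tokens) % self.n_buckets       # cheap hash; constant is arbitrary
        return self.proj(self.embed(bucket))                    # added to the token embedding
```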

### Layer 4: Training Optimization (best of all PRs)
- Muon: lr=0.02, WD=0.04, momentum warmup 0.92→0.99 over 1500 steps (from #265; sketch below)
- SWA: start_frac=0.4, every=50 steps (from thwu1)
- OrthoInit + muP scaling
- Warmdown=3000, warmup=20, grad_clip=0.3
- Seq2048, batch=524K tokens (from #236 — more gradient updates)
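
A sketch of the Muon momentum warmup schedule referenced above; how it is wired into the optimizer is an assumption:
```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Linearly ramp Muon momentum from 0.92 to 0.99 over the first 1500 steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

# hypothetical usage inside the training loop:
# for group in muon_optimizer.param_groups:
#     group["momentum"] = muon_momentum(step)
```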

### Layer 5: Speed (from #265 + modded-nanogpt)
- FlashAttention 3 (Hopper native) — ~5% faster steps
- Fused Linear+ReLU^2 Triton kernel — ~10% MLP speedup
- torch.compile mode="max-autotune"

### Layer 6: Eval-Time (from #265 + #267)
- Sliding window eval (stride=64; sketch below)
- Partial XSA on last 3 layers (from #265, +0.002 bpb, only 2ms/step)
- Causal TTT: SGD on val chunks after scoring (from #267, +0.003 bpb)
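
A sketch of the sliding-window eval referenced above: each window scores only its last `stride` targets, so every token gets up to `seq_len` tokens of left context (scoring details are assumptions):
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_loss(model, tokens, seq_len=2048, stride=64):
    """tokens: (N,) long tensor of ids. Returns mean next-token loss (convert to bpb downstream)."""
    losses = []
    n_targets = tokens.numel() - 1
    for start in range(0, n_targets, stride):
        n_new = min(stride, n_targets - start)             # score targets start+1 .. start+n_new
        end = start + n_new + 1
        window = tokens[max(0, end - 1 - seq_len): end]    # at most seq_len tokens of context
        logits = model(window[:-1].unsqueeze(0))           # (1, T, vocab)
        loss = F.cross_entropy(logits[0], window[1:], reduction="none")
        losses.append(loss[-n_new:])                       # only the new positions count
    return torch.cat(losses).mean()
```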

### Layer 7: Free Training Signal
- MTP auxiliary head (predict t+2, t+3) — discarded at save, zero artifact cost
- From PR #88 — provides gradient enrichment during training
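
A sketch of the MTP auxiliary head: extra linear heads predict t+2 and t+3 during training and are simply not saved; the loss weighting is an assumption:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPAux(nn.Module):
    """Auxiliary multi-token-prediction heads; excluded from the saved artifact."""
    def __init__(self, dim: int = 512, vocab: int = 1024, offsets=(2, 3), weight: float = 0.1):
        super().__init__()
        self.offsets, self.weight = offsets, weight
        self.heads = nn.ModuleList(nn.Linear(dim, vocab, bias=False) for _ in offsets)

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, dim), tokens: (B, T). Position i predicts token i+k.
        loss = hidden.new_zeros(())
        for head, k in zip(self.heads, self.offsets):
            logits = head(hidden[:, :-k])                            # (B, T-k, vocab)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tokens[:, k:].reshape(-1))
        return self.weight * loss
```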

---

## Expected Impact Breakdown

| Technique | bpb gain over baseline | Source |
|-----------|----------------------|--------|
| Int5/6 + MLP3x + 10L | ~0.08 | thwu1 baseline |
| BigramHash(10240) | ~0.01 | thwu1 |
| SmearGate | ~0.006 | PR #162 |
| SWA | ~0.005 | thwu1 |
| OrthoInit + muP | ~0.004 | PR #198 |
| Sliding Window | ~0.03 | All top PRs |
| Seq2048 | ~0.015 | PR #198 |
| Smaller batch (524K) | ~0.003 | PR #236 |
| FA3 + fused kernels (more steps) | ~0.005 | PR #265 |
| Partial XSA (last 3 layers) | ~0.002 | PR #265 |
| Causal TTT | ~0.003 | PR #267 |
| MTP auxiliary | ~0.002 | PR #88 |
| **Total from 1.2244 baseline** | **~0.165** | |
| **Projected bpb** | **~1.06-1.10** | |

Conservative estimate: **1.10-1.12 bpb** (not everything stacks perfectly).

---

## Implementation Phases

### Phase 1: Fork SOTA code (~2 hours)
- Take thwu1's train_gpt.py from PR #180 as base
- Verify it reproduces 1.1428 on 8xH100 (10 min run, ~$3)
- This becomes our baseline to improve upon

### Phase 2: Add proven extras (~3 hours)
- Add SmearGate (if not already in thwu1's code)
- Add Muon momentum warmup (0.92→0.99)
- Switch to batch=524K
- Add FlashAttention 3
- Test on 1xH100 for quick validation

### Phase 3: Add novel techniques (~4 hours)
- Implement Partial XSA on last 3 layers (from PR #265)
- Add MTP auxiliary head (from PR #88)
- Add fused Triton kernels (Linear+ReLU^2, softcapped CE)
- Test on 1xH100

### Phase 4: Eval-time optimization (~2 hours)
- Implement Causal TTT (SGD, 3 epochs per chunk)
- Tune TTT hyperparameters (lr, momentum, epochs)
- Test on 1xH100

### Phase 5: Record attempt (~$20)
- Full run on 8xH100, 10 min
- Submit to record track
- If < 1.13 → PR to openai/parameter-golf

---

## Compute Budget

| Phase | Hardware | Time | Cost |
|-------|----------|------|------|
| Phase 1 | 8xH100 | 15 min | ~$5 |
| Phase 2 | 1xH100 | 30 min | ~$2 |
| Phase 3 | 1xH100 | 1 hour | ~$4 |
| Phase 4 | 1xH100 | 30 min | ~$2 |
| Phase 5 | 8xH100 | 15 min | ~$5 |
| Buffer | — | — | ~$5 |
| **Total** | | | **~$23** |

---

## What Makes This Novel

Nobody has combined ALL of these:
1. Int5/Int6 mixed quant + 10-11L (thwu1)
2. + Partial XSA (PR #265, brand new technique)
3. + MTP auxiliary training (PR #88, free signal)
4. + Causal TTT (PR #267)
5. + FA3 + fused Triton kernels (modded-nanogpt)
6. + Optimized batch size (PR #236)

Each top PR uses 3-4 of these. We use all 6+.

---

## Risk Assessment

| Risk | Mitigation |
|------|-----------|
| Techniques don't stack as expected | Phase-by-phase testing on 1xH100 |
| XSA + TTT conflict | Test independently first |
| Int5 fragile with new techniques | Fall back to Int6 if quant degrades |
| Compute budget overrun | 1xH100 validation before 8xH100 record |
| FA3 install issues on RunPod | FA3 may already be in the template; fall back to FA2 |

---

## Immediate Next Step

Pull thwu1's code from PR #180 and start Phase 1.
99 changes: 99 additions & 0 deletions Graphs/speed_optimizations.md
# Speed Optimizations: Triton Kernels & Libraries

**Goal**: More training steps in the same wallclock = better bpb

---

## Priority 1: FlashAttention 3 (~5% step time reduction)

**What**: H100-optimized attention using Hopper async TMA + warp specialization
**Speedup**: 1.5-2x over FA2 in attention forward, ~5% overall step time
**Integration**: Drop-in replacement
```python
from flash_attn_interface import flash_attn_func as flash_attn_3_func
```
**Status**: Proven — PRs #198 and #164 use this. Only external library in top submissions.
**Install**: build FA3 from source in the flash-attention repo's `hopper/` directory (editable pip install with `--no-build-isolation` so `setup.py` can find torch); a pre-compiled `.so` is not shipped in this repo.
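
A hedged integration sketch; note that the return signature of `flash_attn_func` has varied across FA3 builds, so check `flash_attn_interface` in the installed version:
```python
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func as flash_attn_3_func
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v):
    """q, k, v: (B, T, n_heads, head_dim); k/v may have fewer heads (GQA)."""
    if HAS_FA3:
        out = flash_attn_3_func(q, k, v, causal=True)
        # Some FA3 builds return (out, softmax_lse); others return out alone.
        return out[0] if isinstance(out, tuple) else out
    # Fallback: SDPA expects (B, n_heads, T, head_dim); repeat k/v heads first if using GQA.
    return F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    ).transpose(1, 2)
```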

---

## Priority 2: Fused Linear+ReLU^2 Triton Kernel (~5-15% MLP speedup)

**What**: Fuses CastedLinear + relu().square() into one Triton kernel
**Source**: modded-nanogpt `triton_kernels.FusedLinearReLUSquareFunction`
**Why it helps**: Eliminates intermediate tensor materialization in MLP (which is 3x expanded)
**Integration**: Copy Triton kernel, replace MLP forward pass
**Status**: Used in modded-nanogpt speedrun, not yet in any Parameter Golf PR
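
For reference, a sketch of the unfused MLP path this kernel replaces:
```python
import torch.nn.functional as F

def mlp_forward_unfused(x, w_in, w_out):
    # x: (B, T, 512); w_in: (1536, 512); w_out: (512, 1536)
    h = F.linear(x, w_in)       # pre-activation: a full (B, T, 1536) tensor
    h = F.relu(h).square()      # ReLU^2 activation: another full pass over that tensor
    return F.linear(h, w_out)
# The fused Triton kernel computes the first linear and relu().square() in one
# launch, so the pre-activation never round-trips through HBM.
```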

---

## Priority 3: Fused Softcapped Cross-Entropy (~2-5% loss speedup)

**What**: Fuses logit_softcap + cross_entropy into one Triton kernel
**Source**: modded-nanogpt `triton_kernels.FusedSoftcappedCrossEntropy`
**Why it helps**: Avoids materializing softcapped logits tensor
**Integration**: Copy Triton kernel, replace loss computation
**Note**: Only applies to non-MoS path (MoS uses nll_loss on log-probs)
**Status**: Used in modded-nanogpt speedrun, not yet in any Parameter Golf PR
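
For reference, a sketch of the unfused computation the kernel replaces:
```python
import torch
import torch.nn.functional as F

def softcapped_ce_unfused(logits, targets, cap: float = 30.0):
    # logits: (N, vocab); targets: (N,)
    capped = cap * torch.tanh(logits / cap)   # materializes a second logits-sized tensor
    return F.cross_entropy(capped, targets)   # the fused kernel does both in one pass
```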

---

## Priority 4: torch.compile Tuning (0-5% overall)

```python
# Current
torch.compile(model, dynamic=False, fullgraph=True)

# Try
torch.compile(model, dynamic=False, fullgraph=True, mode="max-autotune")
```

Also set env var:
```bash
export PYTORCH_ALLOC_CONF="expandable_segments:True"
```

---

## Priority 5: Gradient Checkpointing (enables larger batch/seq)

**What**: Recompute activations in backward pass instead of storing them
**Benefit**: 50-70% activation memory reduction, enables seq=2048 or larger batch on 1xH100
**Cost**: ~20-33% more compute (5-10% wall-clock in practice)
**When to use**: If moving to seq=2048+ on 1xH100
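
A sketch of wrapping each block with activation checkpointing; the block/forward structure is an assumption:
```python
import torch.utils.checkpoint as ckpt

def forward_blocks(blocks, x, use_checkpoint: bool = True):
    for block in blocks:
        if use_checkpoint:
            # Recompute this block's activations in backward instead of storing them.
            x = ckpt.checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x
```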

---

## Priority 6: Custom Triton MoS Kernel (if MoS proves useful)

**What**: Fuse log_softmax over K components + logsumexp mixture into one kernel
**Expected**: Reduce MoS overhead from ~5ms to ~2-3ms per step
**Effort**: ~50-100 lines of Triton, based on fused softmax tutorial
**Note**: The bigger bottleneck is the K einsum matmuls, not the softmax
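
For reference, a sketch of the unfused MoS log-prob computation such a kernel would replace; shapes and names are assumptions:
```python
import torch
import torch.nn.functional as F

def mos_log_probs(h, mix_logits, heads):
    # h: (B, T, dim); mix_logits: (B, T, K); heads: list of K output projections
    log_pi = F.log_softmax(mix_logits, dim=-1)                       # mixture weights
    comp = torch.stack([F.log_softmax(head(h), dim=-1)               # K matmuls: the real cost
                        for head in heads], dim=-2)                  # (B, T, K, vocab)
    return torch.logsumexp(log_pi.unsqueeze(-1) + comp, dim=-2)      # (B, T, vocab)
```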

---

## NOT Worth It at Our Scale

| Technique | Why Skip |
|-----------|----------|
| FP8 training (torchao) | dim=512 matrices too small, overhead > benefit |
| Fused RMSNorm | torch.compile already fuses it |
| Apex FusedAdam | Already using fused=True, marginal gain |
| Liger FusedCE | Logit tensor tiny at vocab=1024 |
| bitsandbytes 8-bit optimizer | Model too small to benefit |

---

## Impact Estimate

| Optimization | Step Time Reduction | Extra Steps in 10min | bpb Impact |
|-------------|--------------------|--------------------|------------|
| FA3 | ~5% | +1000 steps | ~0.005 bpb |
| Fused MLP | ~10% | +2000 steps | ~0.008 bpb |
| Fused CE | ~3% | +600 steps | ~0.002 bpb |
| max-autotune | ~2% | +400 steps | ~0.001 bpb |
| **Combined** | **~20%** | **+4000 steps** | **~0.015 bpb** |

At current ~500ms/step on 1xH100, a 20% reduction gives ~400ms/step: roughly 1,200 steps in a 10-min run become ~1,500.
On 8xH100 at ~27ms/step, 20% gives ~22ms/step: ~27,300 steps vs ~22,200.