# Experiment 4: Depth Recurrence — Design Spec

## Goal

Beat merged SOTA (1.1147 BPB) using depth recurrence. Target: ~1.09 BPB.

## Strategy

Adopt PR #1421's proven script (1.0925 BPB, 3-seed mean) as our base. Optionally add BigramHash. This is low-risk because the script is already validated at competition scale.

## Why PR #1421 Over Porting Recurrence Into Our Script

Our SP1024 SOTA (1.1147) is already 0.022 BPB behind PR #1421 (1.0925). The gap comes from 6+ independent improvements (SP4096, MuonEq-R, skip gates, parallel residuals, QK-Gain 5.0, WD 0.09, EMA 0.9965) — not just recurrence. Porting all of these into our parameter-bank architecture would be high effort and high risk. Using the proven script directly is the pragmatic choice.

## Architecture (PR #1421, verbatim)

- 11 physical layers, 512d, 8 heads, 4 KV heads (GQA)
- Depth recurrence: layers 4,5 repeat once (13 virtual layers: `[0,1,2,3,4,5,4,5,6,7,8,9,10]`; see the sketch after this list)
- Recurrence activates at step 3000 (~55% through training)
- Skip gates: learnable sigmoid gating on U-Net skip connections
- Parallel residuals: layers 7+ run attention and MLP in parallel lanes, merged via learnable scalar
- SP4096 tokenizer (SentencePiece 4096 BPE)
- MuonEq-R: row normalization before Newton-Schulz orthogonalization
- Value Embedding: dim=128, layers 9,10
- QK-Gain: learnable per-head Q scaling, init=5.0
- Tied embeddings, logit softcap=30.0, partial RoPE (16/64 dims)
- XSA on all 11 layers
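
The recurrence itself is just an index schedule over shared blocks, as sketched below. This is a minimal illustration of the mechanism, not PR #1421's code: `Block` is a stand-in (the real blocks carry attention, skip gates, parallel residuals, etc.), and the step-threshold plumbing is simplified.

```python
import torch.nn as nn

class Block(nn.Module):
    """Stand-in block; PR #1421's blocks are far richer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class RecurrentStack(nn.Module):
    FLAT = list(range(11))                              # 11 physical layers, no repeats
    LOOPED = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]   # layers 4,5 repeat once -> 13 virtual layers

    def __init__(self, dim=512, recur_start=3000):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(11)])
        self.recur_start = recur_start                  # recurrence switches on ~55% through training

    def forward(self, x, step):
        # Plain-Python loop over an index list (kept out of torch.compile's
        # traced region, per the compile-stall note under Risk Mitigation).
        order = self.LOOPED if step >= self.recur_start else self.FLAT
        for i in order:
            x = self.blocks[i](x)  # repeated indices reuse the same weights
        return x
```

Because the repeated passes share weights, the extra depth adds compute but no parameters, so the 16 MB artifact budget is untouched.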

## Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Muon LR | 0.02 |
| Muon momentum | 0.99 |
| Muon WD | 0.09 |
| Muon backend steps | 5 |
| Embed LR | 0.6 |
| Embed WD | 0.09 |
| Head LR | 0.008 |
| Scalar LR | 0.02 |
| Scalar WD | 0.02 |
| Grad clip | 0.3 |
| Batch size | 786,432 tokens/step |
| Seq len | 2048 |
| Warmdown fraction | 0.667 |
| EMA decay | 0.9965 |
| Warmup steps | 20 |
| Wallclock cap | 600s (590s effective) |
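
The warmup and warmdown entries imply a trapezoidal LR shape. A minimal sketch, assuming linear ramps and decay to zero (the script's exact endpoints are not confirmed here):

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_frac: float = 0.667) -> float:
    """Trapezoid: linear warmup, flat plateau, linear decay over the final
    66.7% of training. Shape inferred from the table above; an assumption."""
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_steps:
        return max(steps_left / warmdown_steps, 0.0)
    return 1.0
```

Note that with recurrence activating at step 3000 (~55% of training), essentially all looped-depth training happens inside the warmdown region.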

## Quantization

- GPTQ int6, percdamp=0.05, 64 calibration batches
- 10s reserved for GPTQ at end of training
- Selective pruning of ~290K lowest-error values
- Brotli compression
- Expected artifact: ~15.95 MB
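
Only the final packing stage is simple enough to sketch here. Assuming the GPTQ and pruning stages have already produced a serialized byte payload (the name below is hypothetical), the Brotli step plus budget check looks like:

```python
import brotli  # installed on-pod via pip (see On-Pod Setup)

BUDGET_MB = 16.0

def pack_artifact(raw_payload: bytes) -> bytes:
    """Compress the serialized int6 weights and metadata, then enforce the
    size budget. Sketch only; the serialization format is PR #1421's."""
    packed = brotli.compress(raw_payload, quality=11)  # max quality; runs once, off the training clock
    size_mb = len(packed) / (1024 * 1024)
    assert size_mb <= BUDGET_MB, f"{size_mb:.2f} MB exceeds the {BUDGET_MB} MB budget"
    print(f"artifact: {size_mb:.2f} MB (expected ~15.95 MB)")
    return packed
```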

## Enhancement: BigramHash (Optional)

One modification on top of the proven base (a minimal sketch of the idea follows this list):
- BigramHash with 1536 buckets, dim 112 (sized smaller than our SOTA's 3072 to leave room for the SP4096 vocab table)
- Requires porting the BigramHash and SmearGate classes from our SOTA script into PR #1421's script, plus wiring them into the GPT.__init__ and forward methods. This is ~60 lines of code.
- Decision rule: run seed 1337 vanilla FIRST to confirm reproduction, then run seed 1337 with BigramHash. If the artifact exceeds 16 MB or BPB regresses, strip it.
- Rationale: PR #363 found BigramHash neutral on heavy looping (3x3), but PR #1421's minimal recurrence (2 layers, 1 extra pass) is much closer to flat, so BigramHash may still help.
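
For orientation, here is a minimal sketch of the BigramHash idea: hash each (previous token, token) pair into a small bucket table and add the looked-up embedding to the token stream. Illustrative only; port the real BigramHash and SmearGate classes verbatim, since the hash constant, projection, and wiring below are hypothetical simplifications.

```python
import torch
import torch.nn as nn

class BigramHashSketch(nn.Module):
    def __init__(self, n_buckets: int = 1536, dim: int = 112, d_model: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)          # 1536 x 112 ~= 172K params
        self.proj = nn.Linear(dim, d_model, bias=False)  # hypothetical mix into the residual stream

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids. Pair each token with its predecessor.
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0                                   # no predecessor at position 0
        h = (prev * 1000003 + idx) % self.n_buckets      # cheap multiplicative hash into buckets
        return self.proj(self.emb(h))                    # (B, T, d_model)
```

At roughly 172K table parameters plus the projection, the addition is small, but the 16 MB check in the decision rule above still applies after quantization.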

## RunPod Execution

### Pod Setup
```bash
runpodctl pod create \
  --template-id y5cejece4j \
  --gpu-id "NVIDIA H100 80GB HBM3" \
  --gpu-count 8 \
  --name "pgolf-exp4-recurrence" \
  --cloud-type SECURE
```

### On-Pod Setup
```bash
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
pip install --break-system-packages zstandard brotli
python3 data/cached_challenge_fineweb.py --variant sp4096
```
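
Before the first run, it is worth sanity-checking the shard count, since the reproduction-failure risk listed under Risk Mitigation is usually a partial download. A sketch with a hypothetical output path (match it to wherever `cached_challenge_fineweb.py` actually writes):

```python
from pathlib import Path

# Hypothetical location/glob; adjust to the downloader's real output layout.
shards = sorted(Path("data").rglob("*sp4096*.bin"))
print(f"found {len(shards)} shard(s)")
assert len(shards) > 1, "single shard found: download likely incomplete"
```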

### Run Sequence

1. **Run 1** (seed 1337): Vanilla PR #1421 script. Verify ~1.0925 BPB reproduction.
2. **Run 2** (seed 1337): With BigramHash added. Compare to Run 1 (decision rule sketched below).
3. **Runs 3-4** (seeds 42, 2024): Best config from above, 2 more seeds for 3-seed submission.
4. Stop pod immediately.

Each run: ~10 min training + ~5 min eval = ~15 min. Total: ~60 min. Cost: ~$22.
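
The Run 1 vs Run 2 comparison reduces to the decision rule from the BigramHash section; as a sketch, with BPB and artifact size read off the run logs:

```python
def pick_config(vanilla_bpb: float, bigram_bpb: float, bigram_mb: float) -> str:
    """Keep BigramHash only if it fits the 16 MB budget and strictly
    improves BPB; otherwise strip it and submit vanilla (per the decision rule)."""
    if bigram_mb > 16.0 or bigram_bpb >= vanilla_bpb:
        return "vanilla"
    return "bigramhash"

# Example: pick_config(1.0925, 1.0890, 15.98) -> "bigramhash"
```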

### Script Transfer
- Extract `train_gpt.py` from PR #1421 diff locally, save to `experiments/exp4_train_gpt.py`
- SCP to pod: `scp -i ~/.runpod/ssh/RunPod-Key-Go -P <port> experiments/exp4_train_gpt.py root@<ip>:/workspace/parameter-golf/train_gpt.py`

## Submission Structure

```
records/track_10min_16mb/2026-04-06_DepthRecurrence_EMA0.9965/
  README.md
  submission.json
  train_gpt.py
  train_seed1337.log
  train_seed42.log
  train_seed2024.log
```

PR from `AbhayAnandUCSD/parameter-golf` fork to `openai/parameter-golf`.

## Expected Results

| Scenario | Expected BPB | Delta vs SOTA (1.1147) |
|----------|-------------|------------------------|
| Vanilla reproduction | ~1.093 | -0.022 |
| With BigramHash | ~1.088-1.093 | -0.022 to -0.027 |
| Worst case | ~1.10 | -0.015 |

## Risk Mitigation

| Risk | Mitigation |
|------|-----------|
| SP4096 data download fails | Modify script for SP1024 (change vocab_size, paths). Lose ~0.01 BPB. |
| BigramHash breaks 16MB budget | Strip it, run vanilla. Already proven at 1.0925. |
| Recurrence compile stall | Forward loop is explicit Python, not traced by torch.compile. Already handled in PR #1421. |
| Pod unavailable | Try community cloud. If unavailable, wait and retry. |
| Reproduction fails (>1.10 BPB) | Check data shard count (must be all shards, not 1). Verify SP4096 data downloaded correctly. |

## Key Lessons From Failed Approach (PR #363)

These informed our strategy but do NOT apply to PR #1421's minimal recurrence:
- Heavy looping (3x3, 2x5) causes 22% step count loss and quantization compounding
- PR #1421 avoids both: only 2 layers repeat once (minimal overhead), activated late at step 3000
- Noisy QAT was critical for heavy looping but unnecessary for minimal recurrence with GPTQ int6
- BigramHash was neutral on heavy loops but may help with minimal recurrence (untested)