Commit eb8d687

Add design spec for Experiment 4: Depth Recurrence
Adopt PR openai#1421's proven depth recurrence script (1.0925 BPB) as base, with optional BigramHash enhancement. Target ~1.09 BPB to beat merged SOTA (1.1147).
# Experiment 4: Depth Recurrence — Design Spec

## Goal

Beat merged SOTA (1.1147 BPB) using depth recurrence. Target: ~1.09 BPB.

## Strategy

Adopt PR #1421's proven script (1.0925 BPB, 3-seed mean) as our base. Optionally add BigramHash. This is low-risk because the script is already validated at competition scale.

## Why PR #1421 Over Porting Recurrence Into Our Script

Our SP1024 SOTA (1.1147) is already 0.022 BPB behind PR #1421 (1.0925). The gap comes from 6+ independent improvements (SP4096, MuonEq-R, skip gates, parallel residuals, QK-Gain 5.0, WD 0.09, EMA 0.9965) — not just recurrence. Porting all of these into our parameter-bank architecture would be high effort and high risk. Using the proven script directly is the pragmatic choice.

## Architecture (PR #1421, verbatim)

- 11 physical layers, 512d, 8 heads, 4 KV heads (GQA)
- Depth recurrence: layers 4,5 repeat once (13 virtual layers: `[0,1,2,3,4,5,4,5,6,7,8,9,10]`; see the sketch after this list)
- Recurrence activates at step 3000 (~55% through training)
- Skip gates: learnable sigmoid gating on U-Net skip connections
- Parallel residuals: layers 7+ run attention and MLP in parallel lanes, merged via learnable scalar
- SP4096 tokenizer (SentencePiece 4096 BPE)
- MuonEq-R: row normalization before Newton-Schulz orthogonalization
- Value Embedding: dim=128, layers 9,10
- QK-Gain: learnable per-head Q scaling, init=5.0
- Tied embeddings, logit softcap=30.0, partial RoPE (16/64 dims)
- XSA on all 11 layers
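
Purely as a reading aid, here is a minimal sketch of how the virtual-layer schedule and the step-3000 activation in the list above could be wired. `TinyRecurrentStack`, its stand-in blocks, and the `step` argument are illustrative assumptions, not PR #1421's code; skip gates, parallel residuals, and attention details are omitted.

```python
import torch
import torch.nn as nn

# Illustrative only: 11 physical blocks, with 4 and 5 revisited once after step 3000.
FLAT_ORDER = list(range(11))                                 # [0, 1, ..., 10]
RECURRENT_ORDER = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]   # 13 virtual layers
RECURRENCE_START_STEP = 3000

class TinyRecurrentStack(nn.Module):
    """Hypothetical stand-in for the transformer trunk (blocks only)."""

    def __init__(self, n_layers: int = 11, dim: int = 512):
        super().__init__()
        # Stand-in blocks; the real script uses full attention+MLP blocks.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)) for _ in range(n_layers)
        )

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        # Flat before step 3000; afterwards blocks 4 and 5 are reused once,
        # so 11 physical layers yield 13 virtual layers with zero extra parameters.
        order = RECURRENT_ORDER if step >= RECURRENCE_START_STEP else FLAT_ORDER
        for layer_idx in order:                 # plain Python loop over physical indices
            x = x + self.blocks[layer_idx](x)   # residual application of the (possibly reused) block
        return x

# Smoke test of the two schedules.
model = TinyRecurrentStack()
h = torch.randn(2, 16, 512)
_ = model(h, step=100)    # flat: 11 block applications
_ = model(h, step=4000)   # recurrent: 13 block applications
```

The point of the schedule is that recurrence costs zero extra parameters: once step 3000 is reached, the same layer-4 and layer-5 weights are simply applied a second time per forward pass.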

## Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Muon LR | 0.02 |
| Muon momentum | 0.99 |
| Muon WD | 0.09 |
| Muon backend steps | 5 |
| Embed LR | 0.6 |
| Embed WD | 0.09 |
| Head LR | 0.008 |
| Scalar LR | 0.02 |
| Scalar WD | 0.02 |
| Grad clip | 0.3 |
| Batch size | 786,432 tokens/step |
| Seq len | 2048 |
| Warmdown fraction | 0.667 |
| EMA decay | 0.9965 |
| Warmup steps | 20 |
| Wallclock cap | 600s (590s effective) |
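
To make the schedule rows concrete, here is a small sketch of a trapezoidal LR multiplier, assuming "warmdown fraction 0.667" means the final two-thirds of steps decay linearly to zero. The exact shape and the step count used by the real script are assumptions; the example step count is only chosen to be consistent with step 3000 landing ~55% through training.

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_frac: float = 0.667) -> float:
    """Trapezoidal schedule sketch: linear warmup, flat plateau, linear warmdown to 0.

    Assumes 'warmdown fraction 0.667' means the last 66.7% of steps decay linearly;
    the actual shape in the training script may differ.
    """
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmup_steps:            # 20-step linear warmup
        return (step + 1) / warmup_steps
    if step < warmdown_start:          # constant plateau
        return 1.0
    remaining = total_steps - step     # linear decay over the final warmdown_frac of steps
    return max(remaining / (total_steps - warmdown_start), 0.0)

# Hypothetical ~5500-step run (consistent with step 3000 being ~55% of training):
print([round(lr_multiplier(s, 5500), 3) for s in (0, 19, 1000, 1831, 4000, 5499)])
```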

## Quantization

- GPTQ int6, percdamp=0.05, 64 calibration batches
- 10s reserved for GPTQ at end of training
- Selective pruning of ~290K lowest-error values
- Brotli compression
- Expected artifact: ~15.95 MB
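
This is not the script's artifact code, but a hedged sketch of the last two bullets: packing unsigned int6 codes (4 codes into 3 bytes) and Brotli-compressing the result. The packing layout, function name, and quality setting are assumptions; GPTQ itself and the selective pruning step are not shown.

```python
import brotli            # pip install brotli
import numpy as np

def pack_int6(values: np.ndarray) -> bytes:
    """Pack unsigned 6-bit codes (0..63) into bytes: every 4 codes -> 3 bytes."""
    assert values.ndim == 1 and values.min() >= 0 and values.max() < 64
    pad = (-len(values)) % 4                      # pad to a multiple of 4 codes
    v = np.concatenate([values, np.zeros(pad, dtype=values.dtype)]).astype(np.uint32)
    quads = v.reshape(-1, 4)
    # 4 x 6 bits = 24 bits = 3 bytes per quad
    word = (quads[:, 0] << 18) | (quads[:, 1] << 12) | (quads[:, 2] << 6) | quads[:, 3]
    out = np.stack([(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF], axis=1)
    return out.astype(np.uint8).tobytes()

# Hypothetical example: 1M quantized codes -> packed (750,000 bytes) -> Brotli.
codes = np.random.randint(0, 64, size=1_000_000, dtype=np.uint8)
packed = pack_int6(codes)
artifact = brotli.compress(packed, quality=11)    # max quality, as for a final artifact
print(len(packed), len(artifact))
```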

## Enhancement: BigramHash (Optional)

One modification on top of the proven base:

- BigramHash with 1536 buckets, dim 112 (sized smaller than our SOTA's 3072 to leave room for the SP4096 vocab table)
- Requires porting the BigramHash and SmearGate classes from our SOTA script into PR #1421's script, plus wiring them into the GPT `__init__` and `forward` methods. This is ~60 lines of code (see the sketch after this list).
- Decision rule: run seed 1337 vanilla FIRST to confirm reproduction, then run seed 1337 with BigramHash. If the artifact exceeds 16 MB or BPB regresses, strip it.
- Rationale: PR #363 found BigramHash neutral on heavy looping (3x3), but PR #1421's minimal recurrence (2 layers, 1 extra pass) is much closer to flat, so BigramHash may still help.
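
The real BigramHash and SmearGate classes live in our SOTA script; the sketch below only illustrates the idea under stated assumptions: hash each (previous token, current token) pair into one of 1536 buckets, look up a dim-112 embedding, and project it up to the 512-d residual stream. The hash constant, the projection, and the wiring comment are hypothetical, not the actual implementation.

```python
import torch
import torch.nn as nn

class BigramHashSketch(nn.Module):
    """Illustrative-only stand-in for the BigramHash enhancement (1536 buckets, dim 112)."""

    def __init__(self, n_buckets: int = 1536, bigram_dim: int = 112, model_dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, model_dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64 token ids
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                                      # no previous token at position 0
        # cheap multiplicative hash of the bigram into [0, n_buckets); constant is arbitrary
        bucket = (prev * 1000003 + tokens) % self.n_buckets
        return self.proj(self.table(bucket))                # (batch, seq, model_dim)

# Hypothetical wiring inside GPT.forward: x = self.embed(tokens) + self.bigram_hash(tokens)
tokens = torch.randint(0, 4096, (2, 32))
print(BigramHashSketch()(tokens).shape)                     # torch.Size([2, 32, 512])
```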

## RunPod Execution

### Pod Setup

```bash
runpodctl pod create \
  --template-id y5cejece4j \
  --gpu-id "NVIDIA H100 80GB HBM3" \
  --gpu-count 8 \
  --name "pgolf-exp4-recurrence" \
  --cloud-type SECURE
```

### On-Pod Setup

```bash
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
pip install --break-system-packages zstandard brotli
python3 data/cached_challenge_fineweb.py --variant sp4096
```

### Run Sequence

1. **Run 1** (seed 1337): Vanilla PR #1421 script. Verify ~1.0925 BPB reproduction.
2. **Run 2** (seed 1337): With BigramHash added. Compare to Run 1.
3. **Runs 3-4** (seeds 42, 2024): Best config from above, 2 more seeds for the 3-seed submission.
4. Stop the pod immediately.

Each run: ~10 min training + ~5 min eval = ~15 min. Total: ~60 min. Cost: ~$22.

### Script Transfer

- Extract `train_gpt.py` from the PR #1421 diff locally, save to `experiments/exp4_train_gpt.py`
- SCP to pod: `scp -i ~/.runpod/ssh/RunPod-Key-Go -P <port> experiments/exp4_train_gpt.py root@<ip>:/workspace/parameter-golf/train_gpt.py`

## Submission Structure

```
records/track_10min_16mb/2026-04-06_DepthRecurrence_EMA0.9965/
  README.md
  submission.json
  train_gpt.py
  train_seed1337.log
  train_seed42.log
  train_seed2024.log
```

PR from the `AbhayAnandUCSD/parameter-golf` fork to `openai/parameter-golf`.

## Expected Results

| Scenario | Expected BPB | Delta vs SOTA (1.1147) |
|----------|--------------|------------------------|
| Vanilla reproduction | ~1.093 | -0.022 |
| With BigramHash | ~1.088-1.093 | -0.027 to -0.022 |
| Worst case | ~1.10 | -0.015 |

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| SP4096 data download fails | Modify script for SP1024 (change vocab_size, paths). Lose ~0.01 BPB. |
| BigramHash breaks 16MB budget | Strip it, run vanilla. Already proven at 1.0925. |
| Recurrence compile stall | Forward loop is explicit Python, not traced by torch.compile. Already handled in PR #1421. |
| Pod unavailable | Try community cloud. If unavailable, wait and retry. |
| Reproduction fails (>1.10 BPB) | Check data shard count (must be all shards, not 1). Verify SP4096 data downloaded correctly. |
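
To make the "recurrence compile stall" row concrete, here is a sketch of the general pattern: compile blocks individually and keep the repeat loop in plain eager Python, so the reused layer indices are never baked into a traced graph. Whether PR #1421 compiles per block exactly like this is an assumption; the table only states that the loop itself is explicit Python.

```python
import torch
import torch.nn as nn

dim = 512
blocks = nn.ModuleList(nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)) for _ in range(11))

# Compile each block on its own (PyTorch 2.x); the virtual-layer loop below stays in
# eager Python, so reusing blocks 4 and 5 does not force a retrace of the whole model.
compiled_blocks = [torch.compile(b) for b in blocks]

def trunk(x: torch.Tensor, order: list[int]) -> torch.Tensor:
    for i in order:                       # explicit Python loop, never traced as a whole
        x = x + compiled_blocks[i](x)
    return x

x = torch.randn(2, 16, dim)
_ = trunk(x, [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10])
```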

## Key Lessons From Failed Approach (PR #363)

These informed our strategy but do NOT apply to PR #1421's minimal recurrence:

- Heavy looping (3x3, 2x5) causes a 22% step-count loss and compounding quantization error
- PR #1421 avoids both: only 2 layers repeat once (minimal overhead), activated late at step 3000
- Noisy QAT was critical for heavy looping but unnecessary for minimal recurrence with GPTQ int6
- BigramHash was neutral on heavy loops but may help with minimal recurrence (untested)
