
Non-record: 1x H100 SXM5 Explorations #1608

Open

User123331 wants to merge 70 commits into openai:main from User123331:main

Conversation


User123331 commented Apr 14, 2026

Experiment Logs Score Sheet

Hardware: RunPod 1× NVIDIA H100 SXM5 80GB (Hopper SM90)
Dataset: FineWeb 10B tokens · SentencePiece BPE · seq_len=2048
Eval: Sliding window, stride=64, bits-per-byte (bpb) on val split (see the sketch below)
Budget: 600s wallclock per run (1×H100)
Files (.py scripts + logs): Google Drive
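
For readers who want to reproduce the numbers: a minimal sketch of the eval described above (sliding-window next-token loss with stride 64, converted to bits-per-byte). The exact script in the Drive folder may differ; `model`, `val_tokens`, and `bytes_per_token` are stand-in names.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, val_tokens, bytes_per_token, seq_len=2048, stride=64):
    """Sliding-window eval: each window scores only its last `stride` tokens,
    so every scored token sees (close to) the full seq_len of left context."""
    model.eval()
    nll_sum, tokens_scored = 0.0, 0
    for start in range(0, val_tokens.numel() - seq_len - 1, stride):
        window = val_tokens[start : start + seq_len + 1].unsqueeze(0)  # (1, seq_len+1)
        x, y = window[:, :-1], window[:, 1:]
        logits = model(x)                                              # (1, seq_len, vocab)
        nll_sum += F.cross_entropy(logits[0, -stride:], y[0, -stride:],
                                   reduction="sum").item()
        tokens_scored += stride
    nats_per_token = nll_sum / tokens_scored
    return nats_per_token / (math.log(2) * bytes_per_token)           # bits per byte
```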


Full Experiment Ledger

Sorted by val bpb ascending (best first). Δbpb computed against reference run 11L_qk4_wd1500 (1.2450).

| # | Run | Val bpb | Δbpb | Steps | Layers [enc/dec] | KV Heads | MLP Act | MLP Mult | Attention | RoPE | QK Gain | Warmdown | Regularization | Bigram Hash | Quant | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | parallel_residuals | 1.2219 | −0.023 | 2029 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | Asymmetric parallel attn+MLP from L7 (α_mlp=0.05, untied 2nd MLP) |
| 2 | ngram_cache | 1.2289 | −0.016 | 2268 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 2816×160 | INT5/INT6/brotli | Knuth hash, normal init |
| 3 | baseline_fa3_build | 1.2293 | −0.016 | 2265 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | LN Scale on norm input, per-optimizer WD (muon=0.095, adam=0.02, embed=0.085) |
| 4 | int6_qat | 1.2304 | −0.015 | 2258 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | GPTQ-lite+INT6+zstd-22 | Per-row 5-percentile MSE clip search |
| 5 | depth_recurrence | 1.2339 | −0.011 | 2041 | 11 [5/6] → 13 virt | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | L4–L5 looped ×2 shared weights at frac=0.50, 294ms/step |
| 6 | 11L_qk4_wd1500 | 1.2450 | REF | 2074 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | Reference |
| 7 | 11L_qk4_wd1500_swa25 | 1.2453 | +0.000 | 2062 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.25,50) | 10240×128 | INT5/INT6/brotli | Earlier SWA start |
| 8 | 11L_qk8_wd1500 | 1.2453 | +0.000 | 2069 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 8.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | QK gain=8.0 |
| 9 | 11L_qk6_wd1500 | 1.2460 | +0.001 | 2064 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 6.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | QK gain=6.0 |
| 10 | 11L_qk4_wd1000 | 1.2464 | +0.001 | 2066 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 11 | 11L_qk4_wd1500_mlr025 | 1.2464 | +0.001 | 2060 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | matrix_lr=0.025 |
| 12 | 10L_qk4_wd1500 | 1.2467 | +0.002 | 2282 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 13 | 12L_qk4_wd1200 | 1.2478 | +0.003 | 1889 | 12 [6/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1200 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 30.2M params |
| 14 | 10L_qk4_wd2000 | 1.2490 | +0.004 | 2273 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 2000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 15 | 11L_qk4_wd1500_tied005_mlr025 | 1.2493 | +0.004 | 2054 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | tied_lr=0.05, mlr=0.025 |
| 16 | 11L_qk4_wd1500_tied005 | 1.2496 | +0.005 | 2071 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | tied_embed_lr=0.05 |
| 17 | sp8192_full_stack | 1.2562 | +0.011 | 1580 | 11 [5/6] → 16 virt | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | EMA(0.9965) | 4096×64 | SDClip+INT6/brotli | SP8192 vocab, L3–5 recurrence ×3, projection XSA all layers, symmetric parallel resid L7, 380ms/step |
| 18 | 11L_qk4_wd1500_fa3 | 1.2593 | +0.014 | 1763 | 11 [5/6] | 4 | ReLU² | 3.0 | FA3 (autograd) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/zlib | Best FA3-only |
| 19 | qk_gain | 1.2625 | +0.018 | 2268 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | env override tuning |
| 20 | naive_baseline_9L_mlp2_seq1024 | 1.2660 | +0.021 | 2272 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 9L, mlp3.0, seq2048 |
| 21 | mlp35 | 1.2662 | +0.021 | 2185 | 10 [5/5] | 4 | ReLU² | 3.5 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 28.1M params |
| 22 | baseline_10L | 1.2666 | +0.022 | 2271 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 23 | 11L_qk4_070342 | 1.2686 | +0.024 | 2064 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | Rerun of crashed #10 |
| 24 | 11layers | 1.2694 | +0.024 | 2057 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | +1 layer (5 enc, 6 dec, 5 skip) |
| 25 | 11L_mlp35 | 1.2740 | +0.029 | 1973 | 11 [5/6] | 4 | ReLU² | 3.5 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 30.8M params |
| 26 | wd3500 | 1.2741 | +0.029 | 2275 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | Longer warmdown hurt |
| 27 | baseline_786k | 1.5187 | +0.274 | 887 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | batch=786k, undertrained |
| 28 | mega_164231 | 1.8872 | +0.642 | 1854 | 10 [5/5] | 2 | LeakyReLU(0.5)² | 3.0 | FA3+XSA | Partial 16 | 1.5 | 3000 | EMA(0.997) | 2048×64 | INT5/INT6/brotli | Fully-stacked test build |
| – | xsa_ema | 0.9051 | INVALID | 1777 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Partial 16 | 5.25 | frac=0.72 | EMA(0.997) | 4096×64 | INT6/brotli | Causal leakage bug, XSA mean-sub L7–L10, LN Scale on attn output |
| – | 11L_qk4_wd1200_fa3 | ~1.285 | +0.040 | 1468 | 11 [5/6] | 4 | ReLU² | 3.0 | FA3 (autograd) | Full 64 | 4.0 | 1200 | SWA(0.4,50) | 10240×128 | INT5/INT6/zlib | Partial, no final eval |
| – | score_first_ttt | – | – | 2255 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | TTT: 3-epoch SGD(lr=0.002, mom=0.9) per 32k chunk, crashed on eval |

Ablation Summary: Layer-Type Impact on Val bpb

Attention Mechanism

| Method | Avg bpb | Δbpb | Verdict |
|---|---|---|---|
| SDPA(Flash) via `F.scaled_dot_product_attention` | 1.245–1.274 | – | Baseline |
| FA3 (primary) / SDPA (fallback) | 1.222–1.234 | −0.016 | Better (confounded with other v2 changes) |
| FA3 raw op (no backward) | crash | – | Backward instability |
| FA3 autograd.Function (fwd+bwd) | 1.259–1.285 | +0.014 | Faster but converges worse |
| FA3 + XSA mean-sub (pre-mask) | 1.887 | +0.642 | Causal leak |
| FA3/SDPA + XSA mean-sub (pre-mask) | 0.905 | INVALID | Causal leak |
| FA3 + projection-based XSA (all layers) | 1.2562 | +0.011 | Correct but step-starved |
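The two "Causal leak" rows came from subtracting a sequence-wide mean from the attention path before the causal mask was applied, so every position saw statistics of future tokens. A minimal sketch of the bug and a leak-free cumulative-mean variant (the real XSA code is more involved; names here are illustrative):

```python
import torch

def mean_sub_pre_mask(x):
    # BUGGY: the mean is taken over the full sequence (dim=1), so position t
    # depends on tokens > t; with teacher forcing this leaks future information
    # and produces impossibly low eval bpb (the 0.905 "INVALID" run).
    return x - x.mean(dim=1, keepdim=True)

def mean_sub_causal(x):
    # Leak-free alternative: subtract the running mean of positions <= t only.
    csum = x.cumsum(dim=1)
    counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
    return x - csum / counts
```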

MLP Activation

| Method | Avg bpb | Δbpb | Verdict |
|---|---|---|---|
| ReLU²: `proj(ReLU(fc(x))²)` | 1.245–1.274 | – | Baseline |
| LeakyReLU(0.5)²: `proj(LeakyReLU(fc(x), 0.5)²)` | 1.222–1.234 | −0.016 | Better |
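A minimal sketch of the two MLP variants (squared ReLU baseline vs. the squared LeakyReLU(0.5) used in the v2 runs). The module name and the `bias=False` choice are illustrative; the width multiplier comes from the ledger.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredActMLP(nn.Module):
    """Squared-activation MLP: negative_slope=0.0 gives ReLU², 0.5 gives LeakyReLU(0.5)²."""
    def __init__(self, dim, mult=4.0, negative_slope=0.0):
        super().__init__()
        hidden = int(dim * mult)
        self.fc = nn.Linear(dim, hidden, bias=False)
        self.proj = nn.Linear(hidden, dim, bias=False)
        self.negative_slope = negative_slope

    def forward(self, x):
        x = F.leaky_relu(self.fc(x), self.negative_slope)  # slope 0.0 == plain ReLU
        return self.proj(x ** 2)                           # square, then project back
```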

RoPE Coverage

| Method | Avg bpb | Δbpb | Verdict |
|---|---|---|---|
| Full-dim RoPE (all 64 head dims) | 1.222–1.274 | – | Baseline |
| Partial RoPE (first 16 of 64 dims) | 0.905–1.887 | confounded | Always paired with XSA/bad configs |
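"Partial 16" means rotary embeddings are applied to only the first 16 of the 64 head dims, leaving the remaining dims position-agnostic. A hedged sketch of that split (the helper name and cos/sin layout are assumptions, not the repo's actual API):

```python
import torch

def apply_rope(x, cos, sin, rot_dim=64):
    """x: (B, H, T, D). Rotate the first rot_dim dims; pass the rest through unchanged.
    cos, sin: (T, rot_dim // 2), precomputed for rot_dim // 2 frequencies."""
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

# Full-dim RoPE:  q = apply_rope(q, cos64, sin64, rot_dim=64)
# Partial RoPE:   q = apply_rope(q, cos16, sin16, rot_dim=16)  # only the first 16 dims rotate
```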

Depth & Width

| Method | Val bpb | Δbpb | Steps | Verdict |
|---|---|---|---|---|
| 10L [5/5], MLP 2.0 | 1.2660 | +0.021 | 2272 | |
| 10L [5/5], MLP 3.0 | 1.247–1.267 | +0.002 | 2273 | |
| 10L [5/5], MLP 3.5 | 1.2662 | +0.021 | 2185 | |
| 11L [5/6], MLP 3.0 | 1.245–1.269 | REF | 2069 | Best zone |
| 11L [5/6], MLP 3.5 | 1.2740 | +0.029 | 1973 | MLP 3.5 hurts on 11L |
| 11L [5/6], MLP 4.0 | 1.222–1.234 | −0.016 | 2265 | Best with LeakyReLU |
| 12L [6/6], MLP 3.0 | 1.2478 | +0.003 | 1889 | −185 steps |

KV Heads (GQA Ratio)

| KV Heads | GQA Ratio | Val bpb | Δbpb | Verdict |
|---|---|---|---|---|
| 2 | 4:1 | 1.887 | +0.642 | Too sparse (confounded) |
| 4 | 2:1 | 1.222–1.274 | – | All runs |
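With a 2:1 GQA ratio, each pair of query heads shares one KV head. A minimal sketch of how that is commonly wired through SDPA; the head counts below are illustrative (head dim 64 matches the ledger):

```python
import torch
import torch.nn.functional as F

def gqa_sdpa(q, k, v):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv.
    Expand KV heads so each group of Hq // Hkv query heads shares one KV head."""
    groups = q.size(1) // k.size(1)
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads, 4 KV heads (2:1 GQA), head dim 64
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)
out = gqa_sdpa(q, k, v)  # (1, 8, 128, 64)
```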

Regularization

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| SWA (start=0.4, every=50) | 1.245–1.274 | – | Baseline |
| SWA (start=0.25, every=50) | 1.2453 | +0.000 | Neutral |
| EMA (decay=0.997) | 0.905–1.887 | – | Confounded |
| EMA (decay=0.9965) | 1.2562 | +0.011 | Step-starved |
| Weight decay=0.04 (Muon hardcoded 0.04, AdamW uses hyperparam) | 1.245 | – | |
| Per-optimizer WD (muon=0.095, adam=0.02, embed=0.085) | 1.229 | −0.016 | Better |
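SWA(start, every) here reads as stochastic weight averaging: after `start` fraction of the run, a uniform running average of the weights is updated every `every` steps and used for eval. A minimal sketch under that reading (the repo's implementation may differ in detail):

```python
import copy
import torch

class SWA:
    """Uniform running average of weights, enabled after `start_frac` of training
    and updated every `every` optimizer steps (e.g. SWA(0.4, 50) in the tables)."""
    def __init__(self, start_frac=0.4, every=50):
        self.start_frac, self.every = start_frac, every
        self.avg_model, self.n_avg = None, 0

    @torch.no_grad()
    def maybe_update(self, model, step, total_steps):
        if step < self.start_frac * total_steps or step % self.every != 0:
            return
        if self.avg_model is None:
            self.avg_model = copy.deepcopy(model).eval()
            self.n_avg = 1
            return
        self.n_avg += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            p_avg.mul_(1 - 1 / self.n_avg).add_(p, alpha=1 / self.n_avg)

# At eval time, score swa.avg_model (fall back to the live model while it is still None).
```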

Depth Recurrence

| Method | Val bpb | Δbpb | Steps | ms/step | Verdict |
|---|---|---|---|---|---|
| None | 1.2293 | – | 2265 | 265 | |
| L4–L5 loop ×2 at frac=0.50 | 1.2339 | +0.005 | 2041 | 294 | −224 steps for +29 ms/step |
| L3–L5 loop ×3 at frac=0.35 | 1.2562 | +0.027 | 1580 | 380 | Severe step starvation |
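A hedged sketch of the depth-recurrence trick as I read it from the ledger: blocks 4–5 are re-run a second time with the same (shared) weights once training passes 50% of the budget, adding "virtual" depth at the cost of extra time per step. The gating condition and block indices come from the table; the control flow is illustrative.

```python
def forward_blocks(blocks, x, step, total_steps,
                   loop_start=4, loop_end=5, extra_passes=1, enable_frac=0.50):
    """Run the block stack once; after `enable_frac` of training, re-run blocks
    [loop_start, loop_end] `extra_passes` additional times with shared weights."""
    recurrence_on = step >= enable_frac * total_steps
    for i, block in enumerate(blocks):
        x = block(x)
        if recurrence_on and i == loop_end:
            for _ in range(extra_passes):               # ×2 total passes over L4–L5
                for j in range(loop_start, loop_end + 1):
                    x = blocks[j](x)
    return x
```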

Parallel Residuals

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| Sequential (all layers) | 1.229 | – | |
| Parallel attn+MLP from L7 (untied MLP, α_mlp=0.05) | 1.2219 | −0.007 | Best bpb |
| Parallel attn+MLP from L7 (symmetric) | 1.2562 | +0.027 | Step-starved |
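A hedged sketch of the winning asymmetric variant: from layer 7 onward, a second, untied MLP runs in parallel with attention and is mixed in with a small coefficient (α_mlp=0.05 from the ledger), while the regular MLP still runs sequentially. The block structure and names are my reconstruction, not the exact code.

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, attn, mlp, parallel_mlp=None, alpha_mlp=0.05):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp = attn, mlp
        self.parallel_mlp = parallel_mlp      # untied second MLP, only on layers >= 7
        self.alpha_mlp = alpha_mlp

    def forward(self, x):
        h = self.norm1(x)
        a = self.attn(h)
        if self.parallel_mlp is not None:     # asymmetric parallel branch
            a = a + self.alpha_mlp * self.parallel_mlp(h)
        x = x + a
        return x + self.mlp(self.norm2(x))    # the usual sequential MLP still runs
```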

Embedding & Hash

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| BigramHash 10240×128 (xor hash, zero-init) | 1.245 | – | |
| BigramHash 2048×64 (xor hash, zero-init) | 1.887 | – | Confounded |
| BigramHash 4096×64 (xor hash, zero-init) | 1.222–1.234 | – | |
| BigramHash 2816×160 (Knuth hash, normal-init) | 1.2289 | −0.000 | Neutral, submittable |
| Tied embed_lr=0.03 | 1.229 | – | |
| Tied embed_lr=0.05 | 1.249 | +0.005 | Worse |
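BigramHash maps each (previous token, current token) pair into a fixed-size embedding table via a cheap hash; the 2816×160 run swapped the xor hash for a Knuth-style multiplicative hash and normal init. A minimal sketch, with the hash constants and mixing shift as assumptions:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, table_size=2816, dim=160, knuth=True, zero_init=False):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        if zero_init:
            nn.init.zeros_(self.table.weight)    # xor-hash runs used zero init
        self.table_size, self.knuth = table_size, knuth

    def forward(self, ids):                       # ids: (B, T) token ids
        prev = torch.roll(ids, 1, dims=1)
        prev[:, 0] = 0                            # no previous token at position 0
        if self.knuth:                            # Knuth multiplicative hash (constant is an assumption)
            key = (prev.long() * 2654435761 + ids.long()) % self.table_size
        else:                                     # xor-style hash (shift amount is an assumption)
            key = ((prev.long() << 16) ^ ids.long()) % self.table_size
        return self.table(key)                    # (B, T, dim) bigram feature added to token embeddings
```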

Quantization & Compression

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| INT5(MLP)+INT6(Attn)+3% prune+brotli-11 | 1.2293 | – | |
| GPTQ-lite+INT6+zstd-22 | 1.2304 | +0.001 | Neutral |
| SDClip+INT6+brotli-11 | 1.2562 | +0.027 | |
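The Quant column describes how checkpoints are shrunk to fit the artifact budget: low-bit quantization of the weights plus an entropy coder. A rough sketch of symmetric per-row INT6 quantization followed by brotli, as one plausible reading of "INT6+brotli-11" (clip search, pruning, and the INT5 MLP path are omitted):

```python
import brotli
import torch

def quantize_int6_rowwise(w: torch.Tensor):
    """Symmetric per-row quantization to the 6-bit signed range [-31, 31]."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 31.0
    q = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return q, scale

def compress_weight(w: torch.Tensor) -> bytes:
    q, scale = quantize_int6_rowwise(w)
    payload = q.numpy().tobytes() + scale.to(torch.float16).numpy().tobytes()
    return brotli.compress(payload, quality=11)   # brotli level 11, as in the ledger

blob = compress_weight(torch.randn(256, 256))
print(len(blob), "bytes after INT6 + brotli")
```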

Test-Time Training (TTT)

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| Score-first TTT: SGD(lr=0.002, mom=0.9), 3 epochs, 32k chunk | ~1.249 | +0.020 | LR too aggressive, degrades the model |
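"Score-first" TTT here reads as: score each evaluation chunk with the current weights first, then take a few SGD steps on that same chunk before moving on, so adaptation never sees a token before it has been scored. A hedged sketch of that loop (the crash and the "lr too aggressive" verdict refer to the real run; this only shows the control flow):

```python
import copy
import torch
import torch.nn.functional as F

def ttt_eval(model, chunks, lr=0.002, momentum=0.9, epochs=3):
    """chunks: iterable of (x, y) token tensors of ~32k tokens each."""
    work = copy.deepcopy(model)                    # never touch the submitted weights
    opt = torch.optim.SGD(work.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tok = 0.0, 0
    for x, y in chunks:
        with torch.no_grad():                      # 1) score the chunk first
            logits = work(x)
            total_nll += F.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1), reduction="sum").item()
            total_tok += y.numel()
        for _ in range(epochs):                    # 2) then adapt on the same chunk
            opt.zero_grad(set_to_none=True)
            out = work(x)
            loss = F.cross_entropy(out.view(-1, out.size(-1)), y.view(-1))
            loss.backward()
            opt.step()
    return total_nll / total_tok                   # nats/token; convert to bpb as usual
```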

Personal Tests Leaderboard

| Rank | Run | Val bpb | Δ vs Ref | Key Delta |
|---|---|---|---|---|
| 1 | parallel_residuals | 1.2219 | −0.023 | Parallel attn+MLP from L7 (untied) |
| 2 | ngram_cache | 1.2289 | −0.016 | BigramHash 2816×160, Knuth hash |
| 3 | baseline_fa3_build | 1.2293 | −0.016 | LeakyReLU(0.5)², QK=5.25, per-optimizer WD, MLP 4.0 |
| 4 | int6_qat | 1.2304 | −0.015 | GPTQ-lite+zstd-22 |
| 5 | depth_recurrence | 1.2339 | −0.011 | L4–L5 recurrence ×2 |
| 6 | 11L_qk4_wd1500 | 1.2450 | REF | ReLU², QK=4.0, MLP 3.0, SWA |
| 7 | 11L_qk4_wd1500_swa25 | 1.2453 | +0.000 | SWA start 0.25 |
| 8 | 11L_qk8_wd1500 | 1.2453 | +0.000 | QK gain=8.0 |
| 9 | 11L_qk6_wd1500 | 1.2460 | +0.001 | QK gain=6.0 |
| 10 | 11L_qk4_wd1000 | 1.2464 | +0.001 | Warmdown 1000 |
| 11 | 11L_qk4_wd1500_mlr025 | 1.2464 | +0.001 | matrix_lr=0.025 |
| 12 | 10L_qk4_wd1500 | 1.2467 | +0.002 | 10 layers |
| 13 | 12L_qk4_wd1200 | 1.2478 | +0.003 | 12 layers |
| 14 | 10L_qk4_wd2000 | 1.2490 | +0.004 | 10L, wd=2000 |
| 15 | sp8192_full_stack | 1.2562 | +0.011 | Full PR#1493 clone, step-starved |
| 16 | 11L_qk4_wd1500_fa3 | 1.2593 | +0.014 | FA3 autograd |
| – | xsa_ema | 0.9051 | INVALID | Causal leakage bug |
| – | score_first_ttt | ~1.249 | FAILED | SGD lr too aggressive |

Billy Endson and others added 30 commits March 21, 2026 02:34
Fix critical bugs: MoS params now included in optimizer groups,
use NLL loss (not cross_entropy) since MoS returns log-probs,
skip logit softcap for MoS path, re-normalize after LoRA correction.
Low-rank factorization (MOS_RANK=64) keeps artifact under 16MB budget.

Enable via: USE_MOS=1 MOS_K=2 MOS_RANK=64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
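
For context on the MoS commits in this series: Mixture-of-Softmaxes replaces the single output softmax with K softmaxes mixed by a learned prior, which is why the head returns log-probabilities and must be trained with `F.nll_loss` rather than `F.cross_entropy` (and why the logit softcap is skipped on the MoS path). A sketch of one plausible K=2, rank-64 shape; the actual layout in `train_gpt_mos_sota.py` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, d_model, vocab_size, k=2, rank=64):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(d_model, k)                 # mixture weights per position
        self.proj = nn.Linear(d_model, k * rank)           # low-rank context per component
        self.unembed = nn.Linear(rank, vocab_size, bias=False)

    def forward(self, h):                                  # h: (B, T, d_model)
        B, T, _ = h.shape
        ctx = torch.tanh(self.proj(h)).view(B, T, self.k, -1)
        comp_logp = F.log_softmax(self.unembed(ctx), dim=-1)        # (B, T, K, V)
        mix_logp = F.log_softmax(self.prior(h), dim=-1).unsqueeze(-1)
        return torch.logsumexp(mix_logp + comp_logp, dim=2)         # log-probs, (B, T, V)

# Because the head already returns log-probs, train with NLL, not cross-entropy:
# loss = F.nll_loss(log_probs.view(-1, vocab_size), targets.view(-1))
```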
Clones fork, downloads dataset, runs baseline vs MoS K=2 rank=64
A/B comparison (10 min each on 1x H100).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Baseline bpb already known from prior runs (~1.2244).
Saves 10 min of GPU time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First MoS pilot run. 1113 steps on 1xH100 SXM, 12.8MB artifact.
Loss still dropping at wallclock cap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1hr run with MoS K=2 R=64 + WARMDOWN_ITERS=100 on 1xH100.
Target: beat vanilla baseline val_bpb=1.2540 from PR#111.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Training now runs in background — safe to close terminal.
Monitor with: tail -f /workspace/mos_1h_log.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…peed optimizations, SOTA plan

- techniques_encyclopedia.md: 39 techniques catalog with bpb impacts and PR references
- combination_matrix.md: Compatibility matrix (++/+/~/−) with stacking recommendations
- speed_optimizations.md: Triton/FA3/fused kernels research for throughput gains
- PLAN_beat_SOTA.md: Phase-by-phase implementation plan targeting <1.13 bpb

MoS rejected after experiments showed +0.057 bpb worse than baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace train_gpt.py with thwu1's openai#1 implementation:
  - 10 layers, 3x MLP, BigramHash(10240), SmearGate
  - Mixed int5/int6 quantization, SWA, sliding eval
  - zstd-22 compression, magnitude pruning

- Add custom tokenizer training pipeline:
  - run_custom_tokenizer_pipeline.sh: all-in-one script
  - data/train_tokenizer.py: SentencePiece trainer

- Add run scripts:
  - run_competitive.sh: SOTA stack with default tokenizer
  - run_competitive_custom_tok.sh: SOTA stack with custom tokenizer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mixture of Softmax (K=2) output layer integrated with full SOTA technique
stack: 11L Int6 + XSA4 + Partial RoPE + LN Scale + Tight SWA + VE128 +
U-Net skips + Late QAT + SmearGate + BigramHash + FA3.

- train_gpt_mos_sota.py: MoS class, FA3 soft fallback, nll_loss branch
- run_mos_sota.sh: MODE=baseline|mos|smoke, auto FA3 selective build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pings nvidia-smi every 60s in background to keep pod active during
FA3 build and other CPU-only phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
train_gpt_mos_sota.py imports sentencepiece as spm at the top level;
without it the script exits immediately on import. numpy is also used
directly. Both are now checked and installed before training starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pip copies the compiled .so into flash_attn_3/ relative to the hopper
dir, but that subdir doesn't exist after a fresh clone. All kernels
compiled successfully; only the final copy step was failing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default to disabled for stability on fresh environments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>