
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)#1334

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp4096-no-slot-v4

Conversation

@aryanbhosale

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0

val_bpb = 1.0897 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM

Track A — Fixed Predictor (No eval-time adaptation)

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 42   | 1.0894      | 15,999,533       |
| 314  | 1.0898      | 15,992,752       |
| 999  | 1.0899      | 15,988,473       |
| Mean | 1.0897      |                  |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0250 BPB.

Key Techniques

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 (@clarkkev, "Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)") and PR #1285 (@dexhunter, "Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)")
  2. Depth Recurrence (layers 4, 5) — 13 virtual layers from 11 physical. PR #1204 (@msisovic, "Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, −0.0072 vs PR #1179, −0.0143 vs merged SOTA") and PR #1260 (@dexhunter, "Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)")
  3. Parallel Residuals (from layer 7) — separate attention/MLP lanes. PR #1204 (@msisovic) and PR #1289 (@MatoTeziTanka, "Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)")
  4. MuonEq-R — arXiv:2603.28254. PR #1260 (@dexhunter)
  5. QK-Gain 5.0 — PR #1217 (@bigbag, "Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)")
  6. Full GPTQ int6 + Brotli + LZMA compressed wrapper (~24 KB of code)
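Techniques 2 and 3 can be sketched in a few lines. This is a minimal, hypothetical illustration: the function names and the exact repeat pattern of the recurred layers are assumptions, not taken from the PR's code.

```python
# Hypothetical sketch of depth recurrence: the recurred pair (layers 4, 5)
# is run a second time, so 11 physical blocks yield 13 virtual forward
# passes. The PR does not specify the repeat pattern; looping the pair
# once, right after its first pass, is one plausible schedule.
def virtual_layer_schedule(n_physical=11, recur_layers=(4, 5)):
    schedule = list(range(n_physical))
    insert_at = max(recur_layers) + 1
    return schedule[:insert_at] + list(recur_layers) + schedule[insert_at:]

print(virtual_layer_schedule())
# [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]  (13 virtual layers)

def parallel_residual_block(x, attn, mlp):
    # Parallel residuals: attention and MLP read the same input and their
    # outputs are summed, instead of the usual sequential
    #   x = x + attn(x); x = x + mlp(x)
    return x + attn(x) + mlp(x)
```

The parallel form lets both sublayers of a block run from one shared input, which is the "separate attn/MLP lanes" structure the list above describes.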

Compliance

  • No TTT — no weight updates during evaluation
  • No SLOT — no eval-time delta optimization
  • No n-gram cache — no eval-time statistics
  • No eval-time adaptation of any kind — model weights completely frozen
  • GPTQ calibration within training budget
  • Standard autoregressive sliding-window eval (stride=64)
  • All four conditions from Issue #1017 ("A Field Guide to Valid Submissions") satisfied

Reproduction

```shell
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
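The compressed-artifact wrapper (technique 6) can be sketched as a pack/unpack pair. The header layout and function names are hypothetical; the actual artifact uses Brotli for the weights (hence `pip install brotli` above), but stdlib `lzma` stands in here so the sketch is dependency-free.

```python
import lzma
import struct

def pack_artifact(weight_bytes: bytes, loader_code: str) -> bytes:
    # The PR compresses quantized weights with Brotli and ships the loader
    # code compressed as well; here both stages use stdlib lzma as a
    # stand-in. A tiny length header lets the unpacker split the blob.
    w = lzma.compress(weight_bytes, preset=9)
    c = lzma.compress(loader_code.encode("utf-8"), preset=9)
    return struct.pack("<QQ", len(w), len(c)) + w + c

def unpack_artifact(blob: bytes) -> tuple[bytes, str]:
    wlen, clen = struct.unpack("<QQ", blob[:16])
    w = lzma.decompress(blob[16:16 + wlen])
    c = lzma.decompress(blob[16 + wlen:16 + wlen + clen]).decode("utf-8")
    return w, c
```

Round-tripping `unpack_artifact(pack_artifact(weights, code))` returns the original weights and loader source; the artifact size counted for the leaderboard is the packed blob plus the ~24 KB of wrapper code.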

Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee

…al_bpb 1.0897 (3-seed mean)

Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 4, 2026
… Parallel Residuals path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score TTT)
- N-gram PRs openai#727/openai#741 CLOSED (illegal); openai#758/openai#731 open but same risk
- Merged SOTA unchanged at 1.1147
- New high-EV targets: PR openai#1351 (Discriminative TTT, 1.0807) and PR openai#1334
  (SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R, 1.0897)
- SLOT still unruled in Issue openai#140 — blocked until @valerio-oai rules
- CLAUDE.md updated to v8.0 with corrected strategy and Session 5 lessons

https://claude.ai/code/session_01X5rVjJpYyqm8DuWTNy2gkt
yuyeon added a commit to yuyeon/parameter-golf that referenced this pull request Apr 4, 2026
Comprehensive analysis of current leaderboard state (Apr 4, 2026):
- Non-SLOT frontier at 1.0897 BPB (PR openai#1334)
- Pre-quant TTT adds -0.009 BPB (PR openai#1351, 1.0807 BPB)
- Causal SLOT adds -0.088 BPB (PR openai#1350, 1.0046 BPB)
- GPTQ+TTT incompatibility confirmed post-quant, works pre-quant
- FiLM gap analysis: ~0.05-0.09 BPB behind frontier
- Three strategic paths identified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chandra447 added a commit to chandra447/parameter-golf that referenced this pull request Apr 4, 2026
Hypothesis: Polar Express 4-step minimax NS on top of full PR openai#1334 stack
Expected delta: ~-0.001 to -0.002 BPB from 1.0897 baseline

Key changes vs PR openai#1334:
- Polar Express Newton-Schulz (4-step minimax coefficients, arXiv:2505.16932)
- MATRIX_LR=0.022 (validated for WD=0.090)
- MUON_WD=0.090 (PR openai#1285/1334 optimal for 2-layer recurrence)
- NoPE explicitly disabled (nope_every_n=0) after critique
- Trackio experiment tracking added

Stack: SP4096 vocab + MLP 4x + WD=0.090 + MuonEq-R + QK-Gain 5.0 +
       Depth recurrence L4-5 (step 3000) + Parallel residuals L7+ + Brotli
yuyeon added a commit to yuyeon/parameter-golf that referenced this pull request Apr 5, 2026
v2 (focal+warmstart+clamp) gives identical 1.2658 BPB to v1 L-BFGS.
L-BFGS converges too fast for these tricks to matter.

Competitiveness analysis:
- FiLM beats SOTA by -0.095 BPB on 1×H100
- Extrapolated 8×H100: ~1.00-1.05 BPB
- Should beat non-SLOT frontier (PR openai#1334: 1.09)
- Uncertain vs causal SLOT frontier (PR openai#1350: 1.00)
  because our causal SLOT gives -0.035 vs their -0.087

8×H100 test is worth running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yuyeon added a commit to yuyeon/parameter-golf that referenced this pull request Apr 5, 2026
…lysis

Novel ideas explored (Bitter Lesson aligned):
- GDN hybrid: KILLED — FA3 is 3-16x faster than GDN on H100
- ACT transformer: KILLED — no training speedup (all iters must run for gradients)
  - 3x5 (512d): 517ms/step, 1.893 BPB vs baseline 331ms/step, 1.722 BPB
  - 3x5 (768d): 923ms/step, ~2.08 BPB — wider doesn't help
- Root cause: ACT only helps when computation can actually be skipped during training

Competition frontier analysis:
- Legal record frontier: 1.005 BPB (PR openai#1350, L-BFGS causal SLOT)
- Clean base frontier: 1.0897 BPB (PR openai#1334, SP4096+DepthRecur+MuonEq-R)
- SLOT adds -0.087 BPB on top of base

Remaining novel ideas to test: parallel SLOT beams, amortized SLOT,
learned weight compression, progressive depth training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
X-Abhishek-X added a commit to X-Abhishek-X/parameter-golf that referenced this pull request Apr 5, 2026
Applies Cautious Muon (arXiv:2411.16085) to mask Muon optimizer updates
where Newton-Schulz direction disagrees with raw gradient sign.
Built on PR openai#1334 base with SP4096, depth recurrence, parallel residuals,
MuonEq-R, QK-Gain 5.0, GPTQ INT6 + Brotli.

3-seed mean: 1.1604 bpb (seeds 42, 314, 999)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 5, 2026
 primary path

- CRITICAL: PR openai#1351 (Discriminative TTT, 1.0807) self-closed by author on
  2026-04-05 — pre-quant AdamW TTT ruled as pre-eval adaptation on val data.
  Removed pre-quant TTT from technique table and plan.
- Updated strategy to PR openai#1334 (Depth Recur + Parallel Residuals + MuonEq-R,
  1.0897) as primary architecture target — zero legality flags.
- Logged new PRs: openai#1379 (0.4162, n-gram mixer), openai#1376 (0.7094, SLOT-24 +
  pre-quant TTT), openai#1364 (1.1025, pre-quant TTT at risk), openai#1370 (1.003, GDN).
- SLOT and pre-quant TTT both blocked; discriminative TTT post-quant still legal.
- Updated CLAUDE.md Competition Strategy + Technique Reference + Lessons (v9.0).

https://claude.ai/code/session_01RTLvTuYBp9YMtudwrY8mYM
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 5, 2026
…0.024 late)

- LeakyReLU negative_slope 0.5 -> 0.9 (Issue openai#140 sweep evidence)
- Split-LR: layers 0-5 at 0.020, layers 6-10 at 0.024 (PR openai#1179)
- WD=0.090 and Brotli-11 already in openai#1334 base (no change needed)
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026