Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ — val_bpb 1.1104 (3-seed mean)#1290
Open
aryanbhosale wants to merge 1 commit into openai:main from
Conversation
11L with depth recurrence (layers 4,5 repeated) + MuonEq-R optimizer + Full Hessian GPTQ with AR self-generated calibration on the PR openai#1019 stack. 3-seed mean: 1.1104 BPB / 1.8748 nats. Delta vs PR openai#1019: −0.0074 nats (Welch t=−7.73).
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 3, 2026
Port depth recurrence from PR openai#1290 and parallel residuals from PR openai#1296.
- Depth recurrence: layers 3,4 repeated in forward pass via virtual layer mapping
- Parallel residuals: attn+mlp computed in parallel from layer 6 onward
- Configurable via RECUR_LAYERS, RECUR_START_STEP, PARALLEL_START_LAYER env vars
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 3, 2026
Ports parallel residuals from PR openai#1296 to openai#1290 base:
- Block.__init__ accepts parallel flag
- Block.forward() computes attn+mlp in parallel when parallel=True
- GPT.__init__ passes parallel_start_layer to Block constructors
- Layers 7-10 run parallel, layers 0-6 sequential (default PARALLEL_START_LAYER=7)
- Both base_model and eval_model wired up
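The sequential-vs-parallel distinction in `Block.forward()` can be illustrated with simple linear stand-ins for the attention and MLP sublayers (all names below are illustrative; the repo's actual Block carries norms, weights, and the `parallel_start_layer` wiring described above):

```python
import numpy as np

def attn(x):
    # Stand-in for the self-attention sublayer (here just a linear scale).
    return 0.5 * x

def mlp(x):
    # Stand-in for the MLP sublayer.
    return 0.25 * x

def block_forward(x, parallel=False):
    """Residual block: sequential (default) vs. parallel sublayers."""
    if parallel:
        # Parallel residual: attn and mlp both read the *same* input,
        # so their outputs can be computed concurrently.
        return x + attn(x) + mlp(x)
    # Sequential residual: mlp sees the attention-updated stream.
    h = x + attn(x)
    return h + mlp(h)

x = np.ones(3)
print(block_forward(x, parallel=False))  # -> [1.875 1.875 1.875]
print(block_forward(x, parallel=True))   # -> [1.75 1.75 1.75]
```

With these stand-ins the two orderings give different outputs, which is the point: parallel residuals trade a small change in what the MLP sees for the ability to overlap the two sublayer computations.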
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 4, 2026
Multi-resolution training:
- seq_len=512 for first 70% of wallclock (configurable via MULTIRES_SWITCH_FRAC)
- Switch to seq_len=2048 for remaining 30%
- Exploits ~2x faster steps at short seq for more total steps
- torch.compile recompiles once per shape change (~30s overhead)

Corrected openai#1290 env var defaults to match their run command:
- BIGRAM_VOCAB_SIZE: 2048 -> 3072
- BIGRAM_DIM: 128 -> 112
- WARMDOWN_ITERS: 3500 -> 4000
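The switching rule above reduces to a one-line schedule keyed on wallclock progress; a minimal sketch, with `switch_frac` standing in for the MULTIRES_SWITCH_FRAC env var (function name is illustrative):

```python
def seq_len_for_progress(frac_done, switch_frac=0.7,
                         short_len=512, long_len=2048):
    """Multi-resolution curriculum: short sequences for the first
    switch_frac of the wallclock budget, long sequences afterwards.
    frac_done is elapsed_time / total_budget in [0, 1]."""
    return short_len if frac_done < switch_frac else long_len

print(seq_len_for_progress(0.3))   # -> 512
print(seq_len_for_progress(0.85))  # -> 2048
```

The training loop would call this once per step and rebuild its batches when the returned length changes; as the commit notes, each shape change triggers one torch.compile recompilation.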
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 4, 2026
- QK_GAIN_INIT: 1.5 -> 5.0 (matches openai#1296 proven config)
- WARMDOWN_ITERS: already 4000 (matches openai#1290 run command)
- MULTIRES_ENABLED: 1 -> 0 (multi-res failed: only 1.13x speedup)
- BIGRAM: revert to 2048x128 (3072x112 exceeded 16MB artifact limit)
Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ
val_bpb = 1.1104 (3-seed mean, std 0.0009) | ~15.97 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 1.87481 nats. Delta: −0.00737 nats. Clears the 0.005-nat threshold (Welch t=−7.73, df=2.59).
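The Welch statistic cited above can be recomputed from per-seed summary statistics. A minimal sketch of the formula; the baseline's per-seed standard deviation is not listed here, so the example call uses hypothetical numbers rather than reproducing t=−7.73:

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with unequal variances."""
    va, vb = std_a**2 / n_a, std_b**2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb)**2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))
    return t, df

# Hypothetical example: two 3-seed runs with equal spread.
t, df = welch_t(1.0, 0.1, 3, 1.2, 0.1, 3)
print(round(t, 3), round(df, 1))  # -> -2.449 4.0
```

A negative t with |t| well above the critical value at the given df is what justifies calling the −0.00737-nat delta significant.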
Key Changes from PR #1019
1. Depth Recurrence (layers 4,5 repeated)
Layers 4 and 5 (the U-Net hinge point) execute twice during the forward pass using the same physical parameter banks, creating a virtual 13-layer network from an 11-layer parameter budget at zero extra parameters. Recurrence activates at step 3000, after the model has learned basic representations.
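The virtual-layer mapping can be sketched as a schedule of physical layer indices; function and parameter names below are illustrative (not the repo's actual API), with `recur_layers` and `recur_start_step` mirroring the RECUR_LAYERS / RECUR_START_STEP env vars mentioned in the ported commits:

```python
def virtual_layer_schedule(n_layers, recur_layers, step, recur_start_step):
    """Return the sequence of physical layer indices to execute.

    Before recur_start_step this is the plain 0..n_layers-1 pass; after
    it, each index in recur_layers is visited twice in place, reusing
    the same weights (zero extra parameter bytes)."""
    if step < recur_start_step:
        return list(range(n_layers))
    schedule = []
    for i in range(n_layers):
        schedule.append(i)
        if i in recur_layers:
            schedule.append(i)  # second pass through the same weights
    return schedule

# 11 physical layers, layers 4 and 5 repeated -> 13 virtual layers:
print(virtual_layer_schedule(11, {4, 5}, step=5000, recur_start_step=3000))
# -> [0, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10]
```

The forward pass would then iterate this schedule and index into a single list of layer modules, so repeated entries reuse parameters rather than allocating new ones.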
Lineage: PR #1204 by @msisovic (concept), PR #1260 by @dexhunter (implementation).
2. MuonEq-R (Row-Normalized Muon)
Row-normalizes gradient matrices before Newton-Schulz orthogonalization, equalizing row norms for better-conditioned optimization. Zero additional bytes. Source: arXiv:2603.28254, PR #1260 by @dexhunter.
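A minimal NumPy sketch of the two steps, assuming the standard quintic Newton-Schulz coefficients used by Muon; the PR's actual MuonEq-R implementation (momentum handling, per-shape dispatch) may differ:

```python
import numpy as np

def row_normalize(g, eps=1e-8):
    """MuonEq-R preconditioning: scale each row of the gradient matrix
    to unit L2 norm (eps guards all-zero rows)."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / (norms + eps)

def newton_schulz_orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration toward the nearest semi-orthogonal
    matrix, with the coefficients used by the Muon optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 8))
update = newton_schulz_orthogonalize(row_normalize(grad))
# update @ update.T is approximately the identity (singular values
# are pushed toward 1, though not exactly orthogonalized).
```

Row normalization equalizes the scale each row contributes before orthogonalization, which is the conditioning benefit the PR description claims; since it is a pure transform of the gradient, it adds no optimizer state and hence no bytes.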
3. Base stack from PR #1019 (unchanged)
AR self-generated Full Hessian GPTQ, XSA all 11 layers, BigramHash 3072×112, LeakyReLU(0.5)², selective ±1 pruning, LZMA preset=9.
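One plausible reading of the LeakyReLU(0.5)² activation in that list is a leaky ReLU with negative slope 0.5 followed by an elementwise square; note the square discards sign on the negative branch, and the repo's implementation may instead preserve it:

```python
import numpy as np

def leaky_relu_half_squared(x):
    """LeakyReLU with negative slope 0.5, then squared elementwise
    (one reading of the LeakyReLU(0.5)^2 notation above)."""
    y = np.where(x >= 0.0, x, 0.5 * x)
    return y * y

print(leaky_relu_half_squared(np.array([-2.0, 0.0, 3.0])))  # -> [1. 0. 9.]
```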
Compliance
Reproduction
Credits