Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ — val_bpb 1.1104 (3-seed mean)#1290

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/depth-recurrence-muoneqr

Conversation

@aryanbhosale

Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ

val_bpb = 1.1104 (3-seed mean, std 0.0009) | ~15.97 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Sliding bpb | val_loss (nats) | Artifact (bytes) |
|---|---|---|---|---|---|
| 42 | 96.7 ms | 6,204 | 1.1105 | 1.8751 | 15,974,737 |
| 314 | 96.7 ms | 6,205 | 1.1094 | 1.8731 | 15,972,993 |
| 999 | 96.7 ms | 6,205 | 1.1112 | 1.8762 | 15,969,221 |
| **Mean** | 96.7 ms | 6,205 | 1.1104 | 1.8748 | |

SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 1.87481 nats. Delta: −0.00737 nats. Clears the 0.005-nat threshold (Welch t=−7.73, df=2.59).

Key Changes from PR #1019

1. Depth Recurrence (layers 4,5 repeated)

Layers 4 and 5 (the U-Net hinge point) execute twice during the forward pass using the same physical parameter banks, creating a virtual 13-layer network from an 11-layer parameter budget at zero extra parameter cost. Recurrence activates at step 3000, after the model has learned basic representations.

Lineage: PR #1204 by @msisovic (concept), PR #1260 by @dexhunter (implementation).
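The recurrence schedule can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual `train_gpt.py` wiring; the class and argument names here are hypothetical:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Depth recurrence sketch: selected layers run twice in the forward
    pass, reusing the same parameter banks (zero extra parameters)."""
    def __init__(self, blocks, recur_layers=(4, 5)):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.recur_layers = set(recur_layers)

    def forward(self, x, step, recur_start_step=3000):
        # Build the virtual layer schedule: once recurrence is active,
        # each recurrent physical layer appears twice.
        schedule = []
        for i in range(len(self.blocks)):
            schedule.append(i)
            if i in self.recur_layers and step >= recur_start_step:
                schedule.append(i)
        for i in schedule:        # 11 physical layers -> 13 virtual layers
            x = self.blocks[i](x)
        return x
```

With 11 blocks and `recur_layers=(4, 5)`, the schedule has 11 entries before step 3000 and 13 after, matching the "virtual 13-layer network from an 11-layer budget" description.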

2. MuonEq-R (Row-Normalized Muon)

MuonEq-R row-normalizes gradient matrices before Newton-Schulz orthogonalization, equalizing row norms for better-conditioned optimization. Zero additional bytes. Source: arXiv:2603.28254, PR #1260 by @dexhunter.
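A minimal sketch of the preprocessing step, assuming the standard quintic Newton-Schulz iteration used by Muon (function names are illustrative; the PR's actual optimizer code may differ):

```python
import torch

def row_normalize(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """MuonEq-R preprocessing sketch: scale each row of the gradient
    matrix to unit norm so every row contributes equally to the
    subsequent orthogonalization."""
    norms = grad.norm(dim=1, keepdim=True)
    return grad / (norms + eps)

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration (the form commonly used by Muon)
    that approximately orthogonalizes the input matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # scale so spectral norm <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The update for a weight matrix would then be built from `newton_schulz_orthogonalize(row_normalize(grad))` instead of operating on the raw gradient.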

3. Base stack from PR #1019 (unchanged)

AR self-generated Full Hessian GPTQ, XSA all 11 layers, BigramHash 3072×112, LeakyReLU(0.5)², selective ±1 pruning, LZMA preset=9.

Compliance

  • No TTT, no SLOT, no n-gram cache, no eval-time adaptation
  • AR self-generated GPTQ calibration (no external data during quantization)
  • All seeds within 600s training, <16MB artifact
  • Fully legal under all four conditions (Issue #1017, "A Field Guide to Valid Submissions")

Reproduction

```
SEED=42 BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
RECUR_LAYERS=4,5 RECUR_START_STEP=3000 TARGET_MB=15.9 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

…04 (3-seed mean)

11L with depth recurrence (layers 4,5 repeated) + MuonEq-R optimizer
+ Full Hessian GPTQ with AR self-generated calibration on the PR openai#1019 stack.

3-seed mean: 1.1104 BPB / 1.8748 nats
Delta vs PR openai#1019: -0.0074 nats (Welch t=-7.73)
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
Port depth recurrence from PR openai#1290 and parallel residuals from PR openai#1296.
- Depth recurrence: layers 3,4 repeated in forward pass via virtual layer mapping
- Parallel residuals: attn+mlp computed in parallel from layer 6 onward
- Configurable via RECUR_LAYERS, RECUR_START_STEP, PARALLEL_START_LAYER env vars
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
Ports parallel residuals from PR openai#1296 to openai#1290 base:
- Block.__init__ accepts parallel flag
- Block.forward() computes attn+mlp in parallel when parallel=True
- GPT.__init__ passes parallel_start_layer to Block constructors
- Layers 7-10 run parallel, layers 0-6 sequential (default PARALLEL_START_LAYER=7)
- Both base_model and eval_model wired up
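The parallel-residual block described in that commit can be sketched as below. This is an illustration of the structure only, with placeholder `attn` and `mlp` submodules, not the fork's actual `Block` implementation:

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block supporting both residual layouts:
    parallel  -> x + attn(x) + mlp(x)   (one residual join)
    sequential-> x + attn(x), then + mlp(...)  (two joins)."""
    def __init__(self, attn: nn.Module, mlp: nn.Module, parallel: bool = False):
        super().__init__()
        self.attn = attn
        self.mlp = mlp
        self.parallel = parallel

    def forward(self, x):
        if self.parallel:
            # attn and mlp both read the same input; outputs are summed
            return x + self.attn(x) + self.mlp(x)
        x = x + self.attn(x)            # standard sequential layout
        return x + self.mlp(x)
```

Per the commit message, later layers (7-10 by default) would be constructed with `parallel=True` while earlier layers stay sequential.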
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
Multi-resolution training:
- seq_len=512 for first 70% of wallclock (configurable via MULTIRES_SWITCH_FRAC)
- Switch to seq_len=2048 for remaining 30%
- Exploits ~2x faster steps at short seq for more total steps
- torch.compile recompiles once per shape change (~30s overhead)

Corrected openai#1290 env var defaults to match their run command:
- BIGRAM_VOCAB_SIZE: 2048 -> 3072
- BIGRAM_DIM: 128 -> 112
- WARMDOWN_ITERS: 3500 -> 4000
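The multi-resolution schedule from that commit reduces to a simple wallclock-fraction switch. A minimal sketch, with hypothetical names (the fork's actual training loop will differ):

```python
def seq_len_for(elapsed_frac: float,
                switch_frac: float = 0.70,
                short: int = 512,
                long: int = 2048) -> int:
    """Multi-resolution schedule sketch: train at the short sequence
    length until `switch_frac` of the wallclock budget has elapsed,
    then switch to the long length for the remainder."""
    return short if elapsed_frac < switch_frac else long
```

As the later commit notes, this gave only a ~1.13x speedup in practice and was disabled.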
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
- QK_GAIN_INIT: 1.5 -> 5.0 (matches openai#1296 proven config)
- WARMDOWN_ITERS: already 4000 (matches openai#1290 run command)
- MULTIRES_ENABLED: 1 -> 0 (multi-res failed: only 1.13x speedup)
- BIGRAM: revert to 2048x128 (3072x112 exceeded 16MB artifact limit)
