Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ — val_bpb 1.1104 (3-seed mean)#1290
Open
aryanbhosale wants to merge 1 commit into openai:main from
Conversation
11L with depth recurrence (layers 4,5 repeated) + MuonEq-R optimizer + Full Hessian GPTQ with AR self-generated calibration on the PR openai#1019 stack. 3-seed mean: 1.1104 BPB / 1.8748 nats. Delta vs PR openai#1019: −0.0074 nats (Welch t=−7.73).
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 3, 2026
Port depth recurrence from PR openai#1290 and parallel residuals from PR openai#1296.
- Depth recurrence: layers 3,4 repeated in forward pass via virtual layer mapping
- Parallel residuals: attn+mlp computed in parallel from layer 6 onward
- Configurable via RECUR_LAYERS, RECUR_START_STEP, PARALLEL_START_LAYER env vars
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 3, 2026
Ports parallel residuals from PR openai#1296 to openai#1290 base:
- Block.__init__ accepts parallel flag
- Block.forward() computes attn+mlp in parallel when parallel=True
- GPT.__init__ passes parallel_start_layer to Block constructors
- Layers 7-10 run parallel, layers 0-6 sequential (default PARALLEL_START_LAYER=7)
- Both base_model and eval_model wired up
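The sequential-vs-parallel distinction in `Block.forward()` can be illustrated with simple linear stand-ins for the attention and MLP sublayers (all names below are illustrative; the repo's actual Block carries norms, weights, and the `parallel_start_layer` wiring described above):

```python
import numpy as np

def attn(x):
    # Stand-in for the self-attention sublayer (here just a linear scale).
    return 0.5 * x

def mlp(x):
    # Stand-in for the MLP sublayer.
    return 0.25 * x

def block_forward(x, parallel=False):
    """Residual block: sequential (default) vs. parallel sublayers."""
    if parallel:
        # Parallel residual: attn and mlp both read the *same* input,
        # so their outputs can be computed concurrently.
        return x + attn(x) + mlp(x)
    # Sequential residual: mlp sees the attention-updated stream.
    h = x + attn(x)
    return h + mlp(h)

x = np.ones(3)
print(block_forward(x, parallel=False))  # -> [1.875 1.875 1.875]
print(block_forward(x, parallel=True))   # -> [1.75 1.75 1.75]
```

With these stand-ins the two orderings give different outputs, which is the point: parallel residuals trade a small change in what the MLP sees for the ability to overlap the two sublayer computations.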
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 4, 2026
Multi-resolution training:
- seq_len=512 for first 70% of wallclock (configurable via MULTIRES_SWITCH_FRAC)
- Switch to seq_len=2048 for remaining 30%
- Exploits ~2x faster steps at short seq for more total steps
- torch.compile recompiles once per shape change (~30s overhead)

Corrected openai#1290 env var defaults to match their run command:
- BIGRAM_VOCAB_SIZE: 2048 -> 3072
- BIGRAM_DIM: 128 -> 112
- WARMDOWN_ITERS: 3500 -> 4000
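The switching rule above reduces to a one-line schedule keyed on wallclock progress; a minimal sketch, with `switch_frac` standing in for the MULTIRES_SWITCH_FRAC env var (function name is illustrative):

```python
def seq_len_for_progress(frac_done, switch_frac=0.7,
                         short_len=512, long_len=2048):
    """Multi-resolution curriculum: short sequences for the first
    switch_frac of the wallclock budget, long sequences afterwards.
    frac_done is elapsed_time / total_budget in [0, 1]."""
    return short_len if frac_done < switch_frac else long_len

print(seq_len_for_progress(0.3))   # -> 512
print(seq_len_for_progress(0.85))  # -> 2048
```

The training loop would call this once per step and rebuild its batches when the returned length changes; as the commit notes, each shape change triggers one torch.compile recompilation.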
resouer pushed a commit to resouer/parameter-golf that referenced this pull request on Apr 4, 2026
- QK_GAIN_INIT: 1.5 -> 5.0 (matches openai#1296 proven config)
- WARMDOWN_ITERS: already 4000 (matches openai#1290 run command)
- MULTIRES_ENABLED: 1 -> 0 (multi-res failed: only 1.13x speedup)
- BIGRAM: revert to 2048x128 (3072x112 exceeded 16MB artifact limit)
Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ
val_bpb = 1.1104 (3-seed mean, std 0.0009) | ~15.97 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 1.87481 nats. Delta: −0.00737 nats. Clears the 0.005-nat threshold (Welch t=−7.73, df=2.59).
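The Welch statistic cited above can be recomputed from per-seed summary statistics. A minimal sketch of the formula; the baseline's per-seed standard deviation is not listed here, so the example call uses hypothetical numbers rather than reproducing t=−7.73:

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with unequal variances."""
    va, vb = std_a**2 / n_a, std_b**2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb)**2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))
    return t, df

# Hypothetical example: two 3-seed runs with equal spread.
t, df = welch_t(1.0, 0.1, 3, 1.2, 0.1, 3)
print(round(t, 3), round(df, 1))  # -> -2.449 4.0
```

A negative t with |t| well above the critical value at the given df is what justifies calling the −0.00737-nat delta significant.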
Key Changes from PR #1019
1. Depth Recurrence (layers 4,5 repeated)
Layers 4 and 5 (the U-Net hinge point) execute twice during the forward pass using the same physical parameter banks, creating a virtual 13-layer network from an 11-layer parameter budget at zero extra parameters. Recurrence activates at step 3000, after the model has learned basic representations.
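The virtual-layer mapping can be sketched as a schedule of physical layer indices; function and parameter names below are illustrative (not the repo's actual API), with `recur_layers` and `recur_start_step` mirroring the RECUR_LAYERS / RECUR_START_STEP env vars mentioned in the ported commits:

```python
def virtual_layer_schedule(n_layers, recur_layers, step, recur_start_step):
    """Return the sequence of physical layer indices to execute.

    Before recur_start_step this is the plain 0..n_layers-1 pass; after
    it, each index in recur_layers is visited twice in place, reusing
    the same weights (zero extra parameter bytes)."""
    if step < recur_start_step:
        return list(range(n_layers))
    schedule = []
    for i in range(n_layers):
        schedule.append(i)
        if i in recur_layers:
            schedule.append(i)  # second pass through the same weights
    return schedule

# 11 physical layers, layers 4 and 5 repeated -> 13 virtual layers:
print(virtual_layer_schedule(11, {4, 5}, step=5000, recur_start_step=3000))
# -> [0, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10]
```

The forward pass would then iterate this schedule and index into a single list of layer modules, so repeated entries reuse parameters rather than allocating new ones.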
Lineage: PR #1204 by @msisovic (concept), PR #1260 by @dexhunter (implementation).
2. MuonEq-R (Row-Normalized Muon)
Row-normalizes gradient matrices before Newton-Schulz orthogonalization, equalizing row norms for better-conditioned optimization. Zero additional bytes. Source: arXiv:2603.28254, PR #1260 by @dexhunter.
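A minimal NumPy sketch of the two steps, assuming the standard quintic Newton-Schulz coefficients used by Muon; the PR's actual MuonEq-R implementation (momentum handling, per-shape dispatch) may differ:

```python
import numpy as np

def row_normalize(g, eps=1e-8):
    """MuonEq-R preconditioning: scale each row of the gradient matrix
    to unit L2 norm (eps guards all-zero rows)."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / (norms + eps)

def newton_schulz_orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration toward the nearest semi-orthogonal
    matrix, with the coefficients used by the Muon optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 8))
update = newton_schulz_orthogonalize(row_normalize(grad))
# update @ update.T is approximately the identity (singular values
# are pushed toward 1, though not exactly orthogonalized).
```

Row normalization equalizes the scale each row contributes before orthogonalization, which is the conditioning benefit the PR description claims; since it is a pure transform of the gradient, it adds no optimizer state and hence no bytes.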
3. Base stack from PR #1019 (unchanged)
AR self-generated Full Hessian GPTQ, XSA all 11 layers, BigramHash 3072×112, LeakyReLU(0.5)², selective ±1 pruning, LZMA preset=9.
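One plausible reading of the LeakyReLU(0.5)² activation in that list is a leaky ReLU with negative slope 0.5 followed by an elementwise square; note the square discards sign on the negative branch, and the repo's implementation may instead preserve it:

```python
import numpy as np

def leaky_relu_half_squared(x):
    """LeakyReLU with negative slope 0.5, then squared elementwise
    (one reading of the LeakyReLU(0.5)^2 notation above)."""
    y = np.where(x >= 0.0, x, 0.5 * x)
    return y * y

print(leaky_relu_half_squared(np.array([-2.0, 0.0, 3.0])))  # -> [1. 0. 9.]
```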
Compliance
Reproduction
Credits