
Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) #1260

Open

dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-mixedquant

Conversation

@dexhunter
Contributor

Summary

Key Innovations

  1. MuonEq-R — row-normalizes gradient matrices before Newton-Schulz orthogonalization. Zero-byte cost, ~0.001 BPB improvement (see the sketch after this list).
  2. Depth Recurrence — layers 4 and 5 are repeated with fully shared MLP weights (zero extra parameters). ~0.003 BPB improvement.
  3. Mixed Int5/Int6 GPTQ — Hessian sensitivity ranking assigns 60 layers int6 and 6 layers int5 for the best size/quality tradeoff.
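
A minimal sketch of the MuonEq-R step, assuming the usual Muon-style PyTorch setting; `ns5_orthogonalize` is a stand-in for the repo's Newton-Schulz routine, with the coefficients from the public Muon implementation:

```python
import torch

def ns5_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration (coefficients from the public Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muoneq_r(grad: torch.Tensor) -> torch.Tensor:
    # MuonEq-R: equilibrate each row to unit L2 norm, then orthogonalize.
    # row_norm[i] = sqrt(sum_j G[i, j]^2)
    row_norm = grad.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return ns5_orthogonalize(grad / row_norm)
```

The row pass costs one reduction per matrix and changes nothing about the stored weights, which is why the artifact size is unaffected.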

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|-------------|-----------------|------------------|
| 1337 | 5,541 | 106.5   | 1.0939      | 2.51667         | 15,933,457       |
| 42   | 5,530 | 106.7   | 1.0922      | 2.51279         | 15,981,324       |
| 0    | 5,543 | 106.5   | 1.0927      | 2.51394         | 15,960,050       |
| Mean | 5,538 | 106.6   | 1.0929      | 2.51447         | 15,958,277       |
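
As a sanity check on the table, bits-per-byte is just the per-token loss converted to bits and divided by the dataset's bytes-per-token ratio; the ~3.32 bytes/token below is inferred from these numbers, not stated anywhere in the PR:

```python
import math

val_loss_nats = 2.51447          # 3-seed mean, nats per token (table above)
bytes_per_token = 3.32           # implied ratio; assumption, not reported in the PR
val_bpb = val_loss_nats / (math.log(2) * bytes_per_token)
print(f"{val_bpb:.4f}")          # ~1.0927, consistent with the reported 1.0929
```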

Changes from PR #1218

|                    | PR #1218    | This PR             |
|--------------------|-------------|---------------------|
| val_bpb            | 1.09785     | 1.09290 (-0.00495)  |
| val_loss           | ~2.526 nats | 2.514 nats (-0.011) |
| Optimizer          | Muon        | MuonEq-R            |
| Depth recurrence   | None        | Layers 4,5          |
| Mixed quantization | No          | 60 int6 + 6 int5    |
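
A minimal sketch of the depth-recurrence row above, assuming a standard transformer block list. The PR shares the MLP weights of layers 4 and 5 specifically; this sketch re-runs whole blocks for brevity, so treat it as the shape of the idea rather than the exact module layout:

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    # Re-runs blocks 4 and 5 after their first pass. The repeated passes use
    # the same module objects, so the checkpoint stores zero extra parameters.
    def __init__(self, blocks: nn.ModuleList, recur: tuple[int, int] = (4, 5)):
        super().__init__()
        self.blocks = blocks
        depth = list(range(len(blocks)))
        # Execution order: 0..5, then 4, 5 again, then 6.. — sharing is by reference.
        self.order = depth[: recur[1] + 1] + list(recur) + depth[recur[1] + 1 :]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i in self.order:
            x = self.blocks[i](x)
        return x
```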

Credits

Built on PR #1218 by @clarkkev (the 4096-vocab, high-WD stack).

Test plan

  • 3-seed verification (1337, 42, 0) — all pass the artifact, time, and score checks
  • All seeds under 16,000,000 bytes
  • Train < 600s, eval < 600s
  • No TTT, no SLOT, no forbidden techniques
  • Rule checker passed (log + script)

…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:
- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
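
The "Hessian sensitivity ranking" in the bullets above amounts to scoring each quantizable matrix and giving the most sensitive ones the extra bit. A minimal sketch, assuming per-layer scores (e.g. derived from the GPTQ Hessian diagonal) have already been computed; the function and scoring are illustrative, not the PR's code:

```python
def assign_bits(sensitivity: dict[str, float], n_int6: int = 60) -> dict[str, int]:
    # Most sensitive layers keep int6; the remaining layers drop to int5,
    # trading a little quality for artifact bytes under the 16 MB cap.
    ranked = sorted(sensitivity, key=sensitivity.__getitem__, reverse=True)
    return {name: (6 if rank < n_int6 else 5) for rank, name in enumerate(ranked)}
```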
@mikeapedia

Great submission @dexhunter! Did you happen to test muon column norm or row+column norm? I found R+C worked the best with the smaller vocab and I am wondering if that holds here as well.
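
For reference on the question above, the three variants under discussion differ only in which axis gets equilibrated before Newton-Schulz; a sketch (the names are ad hoc, not from the PR):

```python
import torch

def row_norm(G: torch.Tensor) -> torch.Tensor:        # "R": MuonEq-R
    return G / G.norm(dim=1, keepdim=True).clamp_min(1e-8)

def col_norm(G: torch.Tensor) -> torch.Tensor:        # "C": column variant
    return G / G.norm(dim=0, keepdim=True).clamp_min(1e-8)

def row_col_norm(G: torch.Tensor) -> torch.Tensor:    # "R+C": rows, then columns
    return col_norm(row_norm(G))
```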

HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 2, 2026
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Omrigotlieb added a commit to Omrigotlieb/parameter-golf that referenced this pull request Apr 3, 2026
Row-normalize the gradient update before Newton-Schulz orthogonalization.
From PR openai#1260: ~0.001 BPB free improvement, zero extra parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 3, 2026
… (3-seed mean)

Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer)
with a smaller mini runner (21,396 bytes) that creates enough headroom.

3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7)
All seeds under 16MB (max: 15,996,591 bytes)
No TTT, no SLOT, no eval-time adaptation.

Techniques: MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP),
61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression.

Built on PR openai#1218 by @clarkkev.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 3, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…on for Muon optimizer

From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with
Lightweight Equilibration" (Mar 2026). Used in 40+ openai/parameter-golf
PRs, top record PR openai#1260 = val_bpb 1.0929 (3-seed mean).

Inserts row normalization between Patch 17 Mousse block and Newton-Schulz:

  row_norm[i] = sqrt(sum_j G[i,j]^2)
  G[i,j] = G[i,j] / row_norm[i]

Distinct from Mousse: Mousse is row+col (G/||row||/||col||), MuonEq-R is
row-only (G/||row||). They can stack independently. Gated by USE_MUONEQ_R=1,
falls back gracefully when unset.

4 MR experiments queued for validation:
  MR0_alone, MR1_plus_leaky_ng, MR2_seed42, MR3_mousse_plus_muoneqr

This is the second optimizer-side patch in two fires. Both patches are visible in our train_loss metric, so they can be validated on the cheap GPU loop without H100 escalation. If either lands within the champion noise band (3.27-3.30), it is a defensible ship for the final stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
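
The `USE_MUONEQ_R=1` gate mentioned in that commit could be as small as the following; the flag name is from the commit message, the rest is a sketch:

```python
import os
import torch

USE_MUONEQ_R = os.environ.get("USE_MUONEQ_R") == "1"

def preprocess_grad(G: torch.Tensor) -> torch.Tensor:
    # Row equilibration ahead of Newton-Schulz; graceful no-op when unset.
    if USE_MUONEQ_R:
        G = G / G.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return G
```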
SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request Apr 9, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
Ishan-Sinha123 pushed a commit to Ishan-Sinha123/parameter-golf that referenced this pull request Apr 10, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
PapaFranku4647 pushed a commit to PapaFranku4647/parameter-golf-lucas-bryant that referenced this pull request Apr 11, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.

MatoTeziTanka commented Apr 11, 2026

Community Review — Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

SyntaxError: f-string: expecting '}' (line 563)

A few common patterns I've seen produce this class of error in the 2026-04-11 sweep: f-string syntax that only newer interpreters accept, most often reusing the string's own quote character inside the braces or putting a backslash in the expression part (both legal on Python 3.12 via PEP 701, both parse errors on 3.10).

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: f-string: expecting '}' (line 563). Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.
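
For anyone chasing the same failure class: the actual line 563 isn't quoted in this thread, so the snippet below is only an illustration of the usual 3.12-vs-3.10 pattern (PEP 701 quote reuse), not the PR's code. `python3 -m py_compile train_gpt.py` exits non-zero on any such parse error:

```python
cfg = {"lr": 3e-4}

# Parses on Python 3.12+ (PEP 701) but is a SyntaxError on 3.10/3.11,
# because the f-string reuses its own quote character inside the braces:
#   print(f"lr={cfg["lr"]}")

# 3.10-safe spelling:
print(f"lr={cfg['lr']}")
```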

123-code pushed a commit to 123-code/parameter-golf that referenced this pull request Apr 19, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.