Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)#1260

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-mixedquant

@dexhunter

Summary

Key Innovations

  1. MuonEq-R: row-normalizes each gradient matrix before Newton-Schulz orthogonalization. Zero-byte cost, ~0.001 BPB improvement.
  2. Depth Recurrence: layers 4 and 5 are repeated with fully shared MLP weights (zero extra parameters). ~0.003 BPB improvement.
  3. Mixed Int5/Int6 GPTQ: layers are ranked by Hessian sensitivity and quantized as 60 int6 + 6 int5 layers for the best size/quality tradeoff.
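
To make the first item concrete, here is a minimal NumPy sketch of row normalization followed by a quintic Newton-Schulz iteration. It is an illustration under assumptions, not the code from this PR: the function names are hypothetical, the coefficients are the ones commonly published for Muon, and the exact placement of the normalization (before or after momentum, the epsilon used) is not stated in the description.

```python
import numpy as np

# Quintic Newton-Schulz coefficients commonly published for Muon.
# Assumption: this PR may use different values.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def row_normalize(grad, eps=1e-7):
    """Scale every row of a 2-D gradient matrix to unit L2 norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)


def newton_schulz5(x, steps=5, eps=1e-7):
    """Approximately orthogonalize x with a quintic Newton-Schulz iteration."""
    a, b, c = NS_COEFFS
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        # Work in the wide orientation so x @ x.T is the smaller Gram matrix.
        x = x.T
    # Dividing by the Frobenius norm keeps all singular values at or below 1.
    x = x / (np.linalg.norm(x) + eps)
    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    return x.T if transposed else x


def muoneq_r_update(grad):
    """Hypothetical helper: row-normalize, then orthogonalize."""
    return newton_schulz5(row_normalize(grad))


rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 16))
update = muoneq_r_update(grad)
singular_values = np.linalg.svd(update, compute_uv=False)
print(singular_values.round(2))
```

The iteration does not converge to exact orthogonality; it pushes the singular values into a band around 1, which is why the sketch prints them rather than comparing against an identity matrix.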

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|---|---|---|---|---|---|
| 1337 | 5,541 | 106.5 | 1.0939 | 2.51667 | 15,933,457 |
| 42 | 5,530 | 106.7 | 1.0922 | 2.51279 | 15,981,324 |
| 0 | 5,543 | 106.5 | 1.0927 | 2.51394 | 15,960,050 |
| Mean | 5,538 | 106.6 | 1.0929 | 2.51447 | 15,958,277 |
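
The Mean row can be recomputed from the three per-seed rows. The short script below does that with the standard library; the dictionary keys are illustrative names chosen here, and the numbers are copied from the results above.

```python
from statistics import mean

# Per-seed results copied from the results listing; key names are illustrative.
runs = {
    1337: {"steps": 5541, "ms_per_step": 106.5, "bpb": 1.0939,
           "val_loss": 2.51667, "artifact_bytes": 15_933_457},
    42: {"steps": 5530, "ms_per_step": 106.7, "bpb": 1.0922,
         "val_loss": 2.51279, "artifact_bytes": 15_981_324},
    0: {"steps": 5543, "ms_per_step": 106.5, "bpb": 1.0927,
        "val_loss": 2.51394, "artifact_bytes": 15_960_050},
}


def column_mean(key):
    """Average one column across all seeds."""
    return mean(run[key] for run in runs.values())


summary = {
    "steps": round(column_mean("steps")),
    "ms_per_step": round(column_mean("ms_per_step"), 1),
    "bpb": round(column_mean("bpb"), 4),
    "val_loss": round(column_mean("val_loss"), 5),
    "artifact_bytes": round(column_mean("artifact_bytes")),
}
print(summary)
```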

Changes from PR #1218

| | PR #1218 | This |
|---|---|---|
| val_bpb | 1.09785 | 1.09290 (-0.00495) |
| val_loss | ~2.526 nats | 2.514 nats (-0.011) |
| Optimizer | Muon | MuonEq-R |
| Depth recurrence | None | Layers 4,5 |
| Mixed quantization | No | 60 int6 + 6 int5 |
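
The 60 int6 + 6 int5 split can be pictured as a ranking problem: score each layer, then give the lowest bit width to the layers that tolerate it best. The sketch below assumes that the six least sensitive layers receive int5; the description only says the split comes from a Hessian sensitivity ranking, so that direction, the function names, the random scores, and the per-layer parameter count are all hypothetical stand-ins.

```python
import random


def assign_bit_widths(sensitivity, num_low=6, low_bits=5, high_bits=6):
    """Give low_bits to the num_low least sensitive layers, high_bits to the rest.

    Assumption: lower sensitivity means the layer tolerates coarser quantization.
    """
    ranked = sorted(sensitivity, key=sensitivity.get)  # least sensitive first
    low = set(ranked[:num_low])
    return {name: (low_bits if name in low else high_bits) for name in sensitivity}


def packed_bytes(bits, num_params):
    """Size of the quantized weights alone, ignoring scales and other metadata."""
    total_bits = sum(bits[name] * num_params[name] for name in bits)
    return (total_bits + 7) // 8


# Placeholder scores; a real run would derive them from per-layer Hessian statistics.
rng = random.Random(0)
sensitivity = {f"layer_{i}": rng.random() for i in range(66)}
num_params = {name: 1000 for name in sensitivity}  # hypothetical layer size

bits = assign_bit_widths(sensitivity)
mixed_size = packed_bytes(bits, num_params)
all_int6_size = packed_bytes({name: 6 for name in sensitivity}, num_params)
print(mixed_size, all_int6_size)
```

With equal-sized layers the saving is small per layer, so in practice the choice of which layers drop to int5 would also depend on how many parameters each one holds.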

Credits

Test plan

  • 3-seed verification (1337, 42, 0): every seed passes the artifact-size, time, and score checks
  • All seeds under 16,000,000 bytes
  • Train < 600s, eval < 600s
  • No TTT, no SLOT, no forbidden techniques
  • Rule checker passed (log + script)
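
The size and training-time items in the plan can be sanity-checked from the reported numbers. This is a rough sketch, not the rule checker mentioned above: it assumes training time is approximately steps multiplied by ms/step, which ignores any startup or logging overhead, and the constant and function names are made up for illustration.

```python
LIMIT_BYTES = 16_000_000  # artifact size limit from the test plan
LIMIT_SECONDS = 600       # training time limit from the test plan

# seed -> (steps, ms per step, artifact bytes), copied from the results.
runs = {
    1337: (5541, 106.5, 15_933_457),
    42: (5530, 106.7, 15_981_324),
    0: (5543, 106.5, 15_960_050),
}


def check_run(steps, ms_per_step, artifact_bytes):
    """Return estimated training seconds and pass/fail flags for one seed."""
    train_seconds = steps * ms_per_step / 1000.0  # assumption: no other overhead
    return {
        "train_seconds": train_seconds,
        "size_ok": artifact_bytes < LIMIT_BYTES,
        "time_ok": train_seconds < LIMIT_SECONDS,
    }


report = {seed: check_run(*values) for seed, values in runs.items()}
for seed, result in report.items():
    print(seed, round(result["train_seconds"], 1), result["size_ok"], result["time_ok"])
```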

…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:
- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
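
Depth recurrence on layers 4,5 can be read as an execution schedule that visits the same blocks more than once. The sketch below is one possible interpretation: it assumes the pair is run back to back twice (4, 5, 4, 5) and that the whole block is reused, whereas the description only states that MLP weights are shared. The layer count and all names are hypothetical.

```python
def build_schedule(num_layers, recurrent=(4, 5), passes=2):
    """Return the order in which layer indices are executed.

    Assumption: the recurrent layers are contiguous and repeated as a group.
    """
    first, last = recurrent[0], recurrent[-1]
    schedule = []
    for index in range(num_layers):
        if index == first:
            schedule.extend(list(recurrent) * passes)
        elif first < index <= last:
            continue  # already emitted as part of the repeated group
        else:
            schedule.append(index)
    return schedule


schedule = build_schedule(num_layers=8)  # hypothetical 8-layer model
effective_depth = len(schedule)          # blocks executed per forward pass
unique_layers = len(set(schedule))       # blocks that actually hold parameters
print(schedule, effective_depth, unique_layers)
```

Because the repeated entries point at the same layer indices, the effective depth grows while the parameter count, and therefore the artifact size, stays the same.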
@mikeapedia

Great submission @dexhunter! Did you happen to test muon column norm or row+column norm? I found R+C worked the best with the smaller vocab and I am wondering if that holds here as well.
