Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean) #1285

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-wd090-allint6


Summary

  • val_bpb = 1.0912 (3-seed mean, std 0.0009) | 2.5106 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
  • WD-quantization synergy: higher weight decay (0.090) compresses 5% better, allowing ALL 66 layers at int6
  • All seeds under 16MB with 32K+ margins
  • No SLOT, no TTT, no eval-time adaptation, fully legal

Key Innovation: WD-Quantization Synergy

Higher WD (0.090 vs 0.085) → smaller weights → 5% better brotli compression → enough headroom for ALL 66 layers at int6 precision. The gain in quantization quality outweighs the BPB cost of the higher WD:

| Config | WD | N_INT6 | Artifact | val_bpb (seed 42) |
|---|---|---|---|---|
| PR #1260 | 0.085 | 60 | 15,981K | 1.09217 |
| PR #1279 | 0.085 | 61 | 15,997K | 1.09170 |
| This | 0.090 | 66 | 15,967K | 1.09057 |
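The "smaller weights compress better" step can be illustrated with a toy model. The sketch below is not the PR's GPTQ pipeline: it uses plain round-to-nearest onto a fixed int6 grid, Gaussian stand-in weights, and empirical entropy as a proxy for what an entropy coder such as brotli can reach. The step size, the 0.95 shrink factor, and all function names are hypothetical. The effect also depends on the grid being fixed: if the scale were set from each tensor's own maximum, a uniform shrink would cancel out.

```python
import math
import random
from collections import Counter

INT6_MAX = 31  # symmetric int6 grid: integers in [-31, 31]

def quantize_int6(weights, step):
    """Round-to-nearest onto a fixed int6 grid (plain RTN, not GPTQ)."""
    return [max(-INT6_MAX, min(INT6_MAX, round(w / step))) for w in weights]

def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # stand-in weights
step = 0.125  # hypothetical fixed step, about 8 grid points per unit std

low_wd = quantize_int6(base, step)                        # wider distribution
high_wd = quantize_int6([0.95 * w for w in base], step)   # arbitrary toy shrink

h_low = entropy_bits(low_wd)
h_high = entropy_bits(high_wd)
print(f"bits/weight: wider {h_low:.3f}, narrower {h_high:.3f}")
```

Under these assumptions the narrower distribution needs fewer bits per weight, which is the direction of the effect described above. The toy makes no claim about the 5% figure itself.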

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|---|---|---|---|---|---|
| 42 | 5,540 | 106.5 | 1.0906 | 2.50910 | 15,967,483 |
| 0 | 5,536 | 106.6 | 1.0908 | 2.50973 | 15,962,242 |
| 1337 | 5,538 | 106.6 | 1.0923 | 2.51309 | 15,959,253 |
| Mean | 5,538 | 106.6 | 1.0912 | 2.51064 | 15,962,993 |
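The Mean row and the reported std can be re-derived from the per-seed values. This is a minimal check using only the numbers in the table; it assumes the reported std of 0.0009 is the sample standard deviation (n − 1 denominator), which is what reproduces it.

```python
import statistics

# Per-seed values copied from the results table (seeds 42, 0, 1337).
bpb = [1.0906, 1.0908, 1.0923]
val_loss = [2.50910, 2.50973, 2.51309]
artifact_bytes = [15_967_483, 15_962_242, 15_959_253]

mean_bpb = statistics.mean(bpb)
std_bpb = statistics.stdev(bpb)  # sample std, n-1 denominator (assumed)
mean_loss = statistics.mean(val_loss)
mean_artifact = statistics.mean(artifact_bytes)

print(f"mean BPB      {mean_bpb:.4f}")
print(f"std BPB       {std_bpb:.4f}")
print(f"mean val_loss {mean_loss:.5f}")
print(f"mean artifact {round(mean_artifact):,}")
```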

Changes from PR #1218

| | PR #1218 | This |
|---|---|---|
| val_bpb | 1.09785 | 1.09124 (-0.00661) |
| Weight decay | 0.085 | 0.090 |
| Optimizer | Muon | MuonEq-R |
| Depth recurrence | None | Layers 4,5 repeated |
| Quantization | Mixed | All int6 (66/66) |
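One way to read "Layers 4,5 repeated" is as an execution schedule in which the same two blocks, with shared weights, run more than once. The sketch below is hypothetical throughout: the function name, the 8-layer depth, the repeat count of 2, and the position of the repeat are illustrative assumptions, not taken from this PR's code.

```python
def recurrent_schedule(num_layers, repeated, repeats=2):
    """Return the order in which layer indices execute.

    `repeated` is a contiguous run of layer indices whose weights are
    reused; the run executes `repeats` times in total (assumed layout).
    """
    first, last = repeated[0], repeated[-1]
    schedule = []
    i = 0
    while i < num_layers:
        if i == first:
            schedule.extend(list(repeated) * repeats)  # e.g. 4, 5, 4, 5
            i = last + 1
        else:
            schedule.append(i)
            i += 1
    return schedule

# Hypothetical 8-layer stack with layers 4 and 5 run twice.
order = recurrent_schedule(8, (4, 5))
print(order)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

Because the repeated entries reuse existing weights, a schedule like this adds compute per step without adding parameters to the stored artifact.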

Credits

Test plan

  • 3-seed verification (42, 0, 1337) — all pass
  • All under 16MB (min margin: 32,517 bytes)
  • Fourth seed also tested (seed 7 fits at 15,970,676 bytes)
  • No TTT, no SLOT
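The size check in this plan is simple arithmetic. The sketch below assumes the 16MB limit means 16,000,000 bytes (decimal), which is the value that reproduces the stated minimum margin of 32,517; the variable names are illustrative.

```python
LIMIT_BYTES = 16_000_000  # assumed decimal 16MB; matches the stated margin

# Artifact sizes for the three verification seeds, from the results table.
artifacts = {42: 15_967_483, 0: 15_962_242, 1337: 15_959_253}

margins = {seed: LIMIT_BYTES - size for seed, size in artifacts.items()}
min_margin = min(margins.values())
all_fit = all(m > 0 for m in margins.values())

# Extra seed mentioned in the test plan.
seed7_fits = 15_970_676 < LIMIT_BYTES

for seed, margin in margins.items():
    print(f"seed {seed}: margin {margin:,} bytes")
print(f"min margin {min_margin:,} bytes, all fit: {all_fit}, seed 7 fits: {seed7_fits}")
```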

….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves on PR openai#1260 (1.0929) by 0.0017 BPB.