
Record: CLASE-Quant adaptive layer quantization (val_bpb=1.1914)#309

Open
NewyorkDev wants to merge 1 commit into openai:main from NewyorkDev:submission/clase-quant-adaptive-layer-quant

Conversation

@NewyorkDev

Summary

CLASE-Quant: Adaptive Layer-Sensitive Quantization + Extended Context Training

  • Mean val_bpb: 1.1914 (3 seeds on 8xH100 SXM, 10 min wallclock)
  • Artifact: ~11.5 MB (well under 16 MB limit)
  • Improvement over baseline: -0.033 val_bpb

Novel Techniques

1. CLASE-inspired Adaptive Per-Layer Quantization

Not all transformer layers are equally sensitive to quantization. Inspired by the CLASE Technique (HDXspeed, March 2026), we apply non-uniform quantization:

  • Boundary layers (blocks 0, 1, N-2, N-1): int8 — highest sensitivity
  • Middle layers (blocks 2-7): int6 — skip connections provide redundancy
  • Tied embeddings: fp16 passthrough — dual input/output role
  • Saves ~15% model size vs uniform int8 while preserving accuracy
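The bit assignment and quantization step above can be sketched as follows. This is a minimal illustration, not the submission's actual code: `layer_bits` and `quantize_symmetric` are hypothetical names, and the quantizer shown is simple symmetric per-tensor rounding.

```python
def layer_bits(block_idx: int, n_blocks: int) -> int:
    """CLASE-style bit-width per transformer block."""
    # Boundary blocks (0, 1, N-2, N-1): int8 -- highest sensitivity.
    if block_idx < 2 or block_idx >= n_blocks - 2:
        return 8
    # Middle blocks: int6 -- skip connections provide redundancy.
    return 6

def quantize_symmetric(weights, bits):
    """Symmetric per-tensor quantization of a flat list of floats
    to signed `bits`-bit integers. Returns (int codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale
```

With 10 blocks, this keeps blocks 0-1 and 8-9 at int8 and blocks 2-7 at int6; the ~15% size saving then follows from 6/8 of the bits on the middle layers. Tied embeddings would simply bypass `quantize_symmetric` and stay fp16.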

2. Ramping Weight Decay

Weight decay increases from 0.02 to 0.08 during warmdown (cosine schedule), progressively compressing weight distributions for cleaner post-training quantization.
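A half-cosine ramp matching the numbers above can be written like this (an assumed sketch; the function name and the `warmdown_start`/`total_steps` parameterization are illustrative, not taken from the PR's code):

```python
import math

def ramped_weight_decay(step, warmdown_start, total_steps,
                        wd_min=0.02, wd_max=0.08):
    """Cosine ramp of weight decay from wd_min to wd_max over warmdown."""
    if step < warmdown_start:
        return wd_min  # constant weight decay before warmdown
    # progress through the warmdown phase, in [0, 1]
    t = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    # half-cosine easing from 0 to 1
    return wd_min + (wd_max - wd_min) * 0.5 * (1.0 - math.cos(math.pi * t))
```

At the start of warmdown this returns 0.02, at the midpoint 0.05, and at the final step 0.08, tightening weight distributions just before quantization.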

3. Extended Context + Sliding Window

Train at 2048 sequence length with tuned hyperparameters (LR 0.03, momentum 0.97, batch 393K); evaluate with a stride-64 sliding window.
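The sliding-window evaluation can be sketched as span bookkeeping: each window scores only its last `stride` tokens, with the preceding tokens serving as context, so every token is scored exactly once with near-full left context. This is an assumed illustration (`sliding_window_spans` is a hypothetical helper, not the submission's code):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, score_end) triples covering
    n_tokens. Tokens in [score_start, score_end) are scored; tokens in
    [ctx_start, score_start) are context only."""
    spans = []
    pos = 0
    while pos < n_tokens:
        ctx_start = max(0, pos + stride - window)  # left edge of the window
        score_end = min(pos + stride, n_tokens)    # score the next `stride` tokens
        spans.append((ctx_start, pos, score_end))
        pos = score_end
    return spans
```

The per-window forward passes cost roughly `window / stride` times more compute than a single non-overlapping pass, which is why stride-64 evaluation is used only at eval time, not during training.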

Results (8xH100 SXM, RunPod Secure Cloud)

| Seed | val_loss | val_bpb | Steps | ms/step | Artifact |
|------|----------|---------|-------|---------|----------|
| 1337 | 2.01365 | 1.19260 | 10854 | 55.27 | 11.46 MB |
| 42   | 2.00862 | 1.18962 | 12761 | 47.01 | 11.76 MB |
| 7    | 2.01284 | 1.19212 | 13205 | 45.43 | 11.50 MB |
| **Mean** | 2.01170 | 1.19144 | | | |

Submission Checklist

  • README.md with detailed explanation
  • submission.json with seed results and metadata
  • train_gpt.py (self-contained, runs from records folder)
  • 3 training logs (seeds 1337, 42, 7) on 8xH100 SXM
  • All runs complete within 600s wallclock
  • Artifact under 16 MB (11.5 MB)
  • No external downloads during evaluation

Acknowledgments

Built with Claude (Anthropic) as AI pair programmer. Builds on techniques from notapplica, Matthew Li, samacqua, Spokane Way, Nan Liu, and Renier Velazco.

…al_bpb=1.1914)

Novel techniques:
- CLASE-inspired adaptive per-layer quantization (int8 boundary, int6 middle)
- Ramping weight decay during warmdown (0.02->0.08)
- 2048 training seq len + sliding window eval (stride=64)

3 seeds on 8xH100 SXM: 1.1926, 1.1896, 1.1921 (mean 1.1914)
Artifact: ~11.5 MB (well under 16 MB limit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 21, 2026
Compresses weight distributions during warmdown for cleaner
post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB).
QAT still enabled alongside.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 21, 2026
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
