
Record: CLASE-Quant adaptive layer quantization (val_bpb=1.1914)#309

Open
NewyorkDev wants to merge 1 commit into openai:main from NewyorkDev:submission/clase-quant-adaptive-layer-quant

Conversation

@NewyorkDev

Summary

CLASE-Quant: Adaptive Layer-Sensitive Quantization + Extended Context Training

  • Mean val_bpb: 1.1914 (3 seeds on 8xH100 SXM, 10 min wallclock)
  • Artifact: ~11.5 MB (well under 16 MB limit)
  • Improvement over baseline: -0.033 val_bpb

Novel Techniques

1. CLASE-inspired Adaptive Per-Layer Quantization

Not all transformer layers are equally sensitive to quantization. Inspired by the CLASE Technique (HDXspeed, March 2026), we apply non-uniform quantization:

  • Boundary layers (blocks 0, 1, N-2, N-1): int8 — highest sensitivity
  • Middle layers (blocks 2-7): int6 — skip connections provide redundancy
  • Tied embeddings: fp16 passthrough — dual input/output role
  • Saves ~15% model size vs uniform int8 while preserving accuracy
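The bit assignment and quantization step above can be sketched as follows. This is a minimal illustration, not the submission's actual code: `layer_bits` and `quantize_symmetric` are hypothetical names, and the quantizer shown is simple symmetric per-tensor rounding.

```python
def layer_bits(block_idx: int, n_blocks: int) -> int:
    """CLASE-style bit-width per transformer block."""
    # Boundary blocks (0, 1, N-2, N-1): int8 -- highest sensitivity.
    if block_idx < 2 or block_idx >= n_blocks - 2:
        return 8
    # Middle blocks: int6 -- skip connections provide redundancy.
    return 6

def quantize_symmetric(weights, bits):
    """Symmetric per-tensor quantization of a flat list of floats
    to signed `bits`-bit integers. Returns (int codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale
```

With 10 blocks, this keeps blocks 0-1 and 8-9 at int8 and blocks 2-7 at int6; the ~15% size saving then follows from 6/8 of the bits on the middle layers. Tied embeddings would simply bypass `quantize_symmetric` and stay fp16.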

2. Ramping Weight Decay

Weight decay increases from 0.02 to 0.08 during warmdown (cosine schedule), progressively compressing weight distributions for cleaner post-training quantization.
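A half-cosine ramp matching the numbers above can be written like this (an assumed sketch; the function name and the `warmdown_start`/`total_steps` parameterization are illustrative, not taken from the PR's code):

```python
import math

def ramped_weight_decay(step, warmdown_start, total_steps,
                        wd_min=0.02, wd_max=0.08):
    """Cosine ramp of weight decay from wd_min to wd_max over warmdown."""
    if step < warmdown_start:
        return wd_min  # constant weight decay before warmdown
    # progress through the warmdown phase, in [0, 1]
    t = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    # half-cosine easing from 0 to 1
    return wd_min + (wd_max - wd_min) * 0.5 * (1.0 - math.cos(math.pi * t))
```

At the start of warmdown this returns 0.02, at the midpoint 0.05, and at the final step 0.08, tightening weight distributions just before quantization.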

3. Extended Context + Sliding Window

Train at 2048 sequence length with tuned hyperparameters (LR 0.03, momentum 0.97, batch 393K); evaluate with a stride-64 sliding window.
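The sliding-window evaluation can be sketched as span bookkeeping: each window scores only its last `stride` tokens, with the preceding tokens serving as context, so every token is scored exactly once with near-full left context. This is an assumed illustration (`sliding_window_spans` is a hypothetical helper, not the submission's code):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, score_end) triples covering
    n_tokens. Tokens in [score_start, score_end) are scored; tokens in
    [ctx_start, score_start) are context only."""
    spans = []
    pos = 0
    while pos < n_tokens:
        ctx_start = max(0, pos + stride - window)  # left edge of the window
        score_end = min(pos + stride, n_tokens)    # score the next `stride` tokens
        spans.append((ctx_start, pos, score_end))
        pos = score_end
    return spans
```

The per-window forward passes cost roughly `window / stride` times more compute than a single non-overlapping pass, which is why stride-64 evaluation is used only at eval time, not during training.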

Results (8xH100 SXM, RunPod Secure Cloud)

| Seed | val_loss | val_bpb | Steps | ms/step | Artifact |
|------|----------|---------|-------|---------|----------|
| 1337 | 2.01365 | 1.19260 | 10854 | 55.27 | 11.46 MB |
| 42   | 2.00862 | 1.18962 | 12761 | 47.01 | 11.76 MB |
| 7    | 2.01284 | 1.19212 | 13205 | 45.43 | 11.50 MB |
| **Mean** | 2.01170 | 1.19144 | | | |

Submission Checklist

  • README.md with detailed explanation
  • submission.json with seed results and metadata
  • train_gpt.py (self-contained, runs from records folder)
  • 3 training logs (seeds 1337, 42, 7) on 8xH100 SXM
  • All runs complete within 600s wallclock
  • Artifact under 16 MB (11.5 MB)
  • No external downloads during evaluation

Acknowledgments

Built with Claude (Anthropic) as AI pair programmer. Builds on techniques from notapplica, Matthew Li, samacqua, Spokane Way, Nan Liu, and Renier Velazco.

…al_bpb=1.1914)

Novel techniques:
- CLASE-inspired adaptive per-layer quantization (int8 boundary, int6 middle)
- Ramping weight decay during warmdown (0.02->0.08)
- 2048 training seq len + sliding window eval (stride=64)

3 seeds on 8xH100 SXM: 1.1926, 1.1896, 1.1921 (mean 1.1914)
Artifact: ~11.5 MB (well under 16 MB limit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 21, 2026
Compresses weight distributions during warmdown for cleaner
post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB).
QAT still enabled alongside.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 21, 2026
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
