
Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #534

Closed

rarce wants to merge 42 commits into openai:main from rarce:sota-review

Conversation


@rarce rarce commented Mar 23, 2026

Summary

val_bpb: 1.1804 (post-quant, single seed) | 15.95 MB artifact | 8×H100 SXM, 615s

Non-record submission documenting a systematic combination of the PR #374 frontier techniques with MLP width optimization and GPTQ-lite quantization.

Key Techniques

| Technique | Source | Impact |
|---|---|---|
| Partial RoPE (16/64 dims) | PR #315 | Position-free 75% of head dims |
| LN Scale (1/sqrt(i+1)) | PR #315 | Damps deeper layers |
| XSA on last 4 layers | PR #265, #287 | GQA-aware self-value debiasing |
| Shared VE128 (layers 9,10) | PR #374 | Value embedding injection |
| Tight SWA (scale<0.2) | PR #374 | Zero-penalty weight averaging |
| Late QAT (lr_scale<0.1) | PR #297 | Avoids Muon momentum corruption |
| GPTQ-lite (clip search) | PR #379 | Per-tensor optimal clip ratio |
| MLP hidden=1408 | Novel | Faster steps → more training in 10min |
| Int6 layers 1-9 + int8 0,10 | Reference | Mixed precision quantization |
| zstd-22 | Standard | ~35% better than zlib |

Novel Contribution

MLP hidden=1408 vs 1536: the narrower MLP fits under the 16MB artifact cap while enabling 33% more training steps (137ms vs 178ms/step). The extra ~1000 steps more than compensate for the reduced per-step capacity:

  • MLP 1536: 3061 steps, val_bpb 1.1958, 18MB (over limit)
  • MLP 1408: 4071 steps, val_bpb 1.1804, 15.95MB (under limit)
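The width-vs-steps tradeoff above follows directly from fixed-wallclock arithmetic. A minimal sketch (the ~558s training budget is an assumed figure for illustration, backed out from the reported step counts):

```python
def steps_in_budget(budget_s: float, step_ms: float) -> int:
    """How many optimizer steps fit in a fixed wallclock budget."""
    return int(budget_s * 1000 // step_ms)

budget_s = 558  # assumed training portion of the 615s run
wide = steps_in_budget(budget_s, 178)    # MLP hidden=1536
narrow = steps_in_budget(budget_s, 137)  # MLP hidden=1408
```

With these numbers the narrow variant gets roughly 30% more steps, matching the 3061 → 4071 jump reported in the ablation.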

Metrics

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1770 |
| Post-quant val_bpb | 1.1804 |
| Quant gap | +0.0034 |
| Steps | 4,071 @ 137ms/step |
| Parameters | 25,224,291 |
| Artifact | 15,949,473 bytes |

Test plan

  • Artifact under 16MB (15.95MB)
  • Trains in 615s on 8×H100 SXM
  • Post-quant roundtrip verified
  • train_gpt.py compiles and runs from records/ folder
  • Train log included
  • Multi-seed validation (single seed, budget constrained)

rarce and others added 30 commits March 20, 2026 11:00
git-subtree-dir: modded-nanogpt
git-subtree-split: 1cafa69148f326110381c74ae458aca859f3a881
git-subtree-dir: slowrun
git-subtree-split: 767782103f3bddcc75508f1e5860585f60371df2
…challenges

- docs/README.md: Problem definition, architecture, leaderboard analysis, key insights
- docs/nanogpt-speedrun.md: 77-record progression analysis, 12 techniques with transferability assessment
- docs/nanogpt-slowrun.md: 27-record analysis across 3 tracks, regularization and data-efficiency techniques
- CLAUDE.md: Project guidance for Claude Code
- papers/2001.08361.pdf: Neural Scaling Laws reference paper
- .gitignore: Add .claude/settings.local.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive analysis of techniques applicable to our ~22M param model:
- Depth vs width debate (MobileLLM vs Depth Delusion paper)
- Parameter sharing and depth recurrence (RingFormer, MobileLLM-LS)
- Current PG frontier: PRs pushing to val_bpb ~1.13 with int6+QAT+SWA
- Quantization: int6, QAT/STE, BitNet ternary, mixed precision
- Optimizer advances: NorMuon, Polar Express, ROOT, CANS
- MTP not recommended at sub-1B scale
- Prioritized action plan across 3 tiers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add "Analisis de Envios" (Submissions Analysis) section contrasting merged SOTA (1.1748) vs PR frontier (~1.1318)
- Document consensus stack: int6+QAT+SWA+11L+MLP3x+SmearGate
- Add "Contraste: Investigacion vs Practica" (Research vs Practice) section with validated/surprise/pending findings
- Replace speculative "Espacio de Exploracion" (Exploration Space) section with evidence-based 3-tier R&D priorities
- Add "No Recomendado" (Not Recommended) section (MTP, byte-level, MoE, dropout) with justifications
- Update quantization pipeline to reflect int6+zstd frontier
- Expand references with internal docs, papers, and community links

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add baseline/SOTA/frontier val_bpb scores at the top
- Add Knowledge Base section pointing to docs/ analysis files
- Document reference subtrees (modded-nanogpt, slowrun)
- Add Frontier Technique Stack table (int6+QAT+SWA+11L+MLP3x consensus)
- Update artifact description to reflect int6+zstd frontier

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Based on frontier PR analysis (docs/README.md), applies the competitive
consensus stack to the MLX training script:

Architecture:
- 11 layers (up from 9) — sweet spot for 16MB budget under int6
- MLP 3× expansion (hidden=1536) — optimal SwiGLU ratio ~2.7×
- Seq length 2048 (up from 1024) — 2× training context
- SmearGate — learned gate blending current + previous token embeddings
- Logit softcap 20 (down from 30) — better stability per speedrun findings

Training:
- Muon weight decay 0.038 — keeps weights small for quantization
- Int6 QAT with STE — fake 6-bit quantization during training
- SWA (Stochastic Weight Averaging) — averages weights during warmdown
  with 0.7/0.3 final/SWA blend (optimal from slowrun research)
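The SWA scheme above reduces to a running uniform average plus a final blend. A scalar-per-tensor sketch (real code operates on full weight tensors; function names are illustrative):

```python
def swa_update(swa, weights, n_averaged):
    """Fold the current weights into a running uniform average.
    Scalars stand in for tensors for clarity."""
    for k in swa:
        swa[k] += (weights[k] - swa[k]) / (n_averaged + 1)
    return n_averaged + 1

def blend(final, swa, final_frac=0.7):
    """The 0.7/0.3 final/SWA blend applied at the end of warmdown."""
    return {k: final_frac * final[k] + (1 - final_frac) * swa[k]
            for k in final}
```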

Quantization:
- FP16 embedding passthrough — never quantize the tied embedding layer

All features are env-var configurable (QAT_ENABLED, SWA_ENABLED,
MUON_WEIGHT_DECAY, etc.) and can be disabled for A/B testing.
Script stays under 1500-line cap at 1202 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int6 quantization stores weights as multiples of 4 in int8 containers,
giving 64 levels (6-bit effective precision). zlib compresses the reduced
entropy ~35% better than full int8, freeing artifact space for more params.

- QUANT_BITS env var (default 6) controls quantization precision
- Step=4 rounding in quantize_float_array for int6 mode
- Dequantization correctly handles quant_step division
- Format tag tracks quant precision (int6_clean_per_row_v1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors all consensus stack changes from train_gpt_mlx.py to the CUDA
training script for RunPod deployment:

Architecture:
- 11 layers (NUM_LAYERS=11), MLP 3× (MLP_MULT=3), seq 2048
- SmearGate module (zero-init, learned per-dim gate)
- Logit softcap 20 (LOGIT_SOFTCAP=20)

Training:
- Muon weight decay 0.038 (MUON_WEIGHT_DECAY)
- Int6 QAT with STE (QAT_ENABLED=1, QAT_BITS=6)
- SWA during warmdown (SWA_ENABLED=1, SWA_EVERY=50, blend 0.7/0.3)

Quantization:
- Int6 step=4 rounding (QUANT_BITS=6) — 64 levels in int8 container
- FP16 embedding passthrough (FP16_EMBED_PASSTHROUGH=1)
- quant_step metadata for correct dequantization

All features env-var configurable. 1460 lines (under 1500 cap).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pure bash script using RunPod REST API v1 for pod lifecycle management:
- create: provision 1x or 8x H100 with Parameter Golf template
- setup: clone repo + download FineWeb data on pod
- run: launch training with configurable RUN_ID and extra env vars
- fetch: download logs and artifacts to local machine
- stop/terminate: manage pod lifecycle
- ssh: direct SSH connection to pod

Tracks pod ID in .runpod_pod_id for multi-command workflows.
Uses RUNPOD_API_KEY from .env file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add our consensus stack config (26.5M params, 4.3MB artifact)
- Add RunPod automation commands and scripts/runpod.sh reference
- Update architecture section to reflect consensus stack in both scripts
- Add consensus stack feature table with env vars and implementation status
- Add experiment logs summary table
- Update MLX command for seq_len=2048 compatibility
- .gitignore: add .env to prevent API key leaks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
torch.load with weights_only=True (PyTorch default since 2.6) cannot
deserialize the quant_step int stored in the quant dict, causing it
to fall back to quant_step=1 and producing wrong dequantized weights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The reference int6 implementation (2026-03-19_10L_MixedPrecision record)
quantizes to int8 first, then rounds values to nearest multiple of step
as a post-processing step. Dequantization is standard int8 (q * scale).

Our previous approach changed the quantization scale (dividing by 31
instead of 127) and required special quant_step metadata for
dequantization. This caused:
1. 4× scale mismatch producing +0.075 BPB degradation
2. Needed weights_only=False for torch.load (non-standard)
3. Incompatible with baseline dequantization

Now both scripts match the reference:
- quantize_float_tensor: standard int8 (scale = clip/127)
- Post-hoc: round int8 values to multiples of 4 (64 levels)
- dequantize: standard q.float() * scale (no quant_step division)
- Removed weights_only=False from torch.load
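The reference scheme described above can be sketched per-value (the real code is vectorized over tensors; the clamp to ±124 is an assumption to keep snapped codes inside the int8 container):

```python
def quantize_int6(values, clip):
    """Reference-style int6: standard int8 quantization with
    scale = clip/127, then snap codes to the nearest multiple of 4
    (64 effective levels). Dequantization stays plain q * scale,
    so no quant_step metadata is needed."""
    scale = clip / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    q = [max(-124, min(124, 4 * round(c / 4))) for c in q]  # post-hoc step=4
    return q, scale

def dequantize(q, scale):
    """Standard int8 dequantization, unchanged by the step-4 rounding."""
    return [c * scale for c in q]
```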

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove startSsh (not a valid REST API field)
- Fix JSON paths: API returns flat object, not nested .pod
- SSH info from .publicIp + .portMappings["22"] (not .runtime.ports)
- Auto-load .env file in script (set -a/+a)
- Add .runpod_pod_id to .gitignore

Tested: create, status, list, stop, terminate all working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The reference record (10L_MixedPrecision) applies int6 step=4 rounding
only to middle layers (3-6), keeping first/last layers at full int8.
Applying int6 to ALL layers caused +0.075 BPB degradation.

Changes:
- INT6_LAYERS env var (default "3,4,5,6,7") selects which layers get int6
- INT6_STEP env var (default 4) controls rounding step
- First/last layers stay int8 (256 levels) for input/output quality
- Middle layers use int6 (64 levels) for better compression
- Reference achieved only +0.0018 BPB degradation with this approach

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Parameter Golf template creates an empty /workspace/parameter-golf
directory. The setup script now detects dirs without .git and removes
them before cloning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three techniques from the top PRs (openai#265, openai#287, openai#297):

1. XSA (Exclusive Self Attention) on last 3 layers (XSA_LAST_N=3):
   Removes self-value bias via orthogonal projection (arXiv:2603.09078).
   GQA-aware: uses reshape+broadcast instead of repeat_interleave.
   Zero new parameters, ~2ms/step overhead.

2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0):
   Exponential moving average updated every step during warmdown.
   Smoother weight averaging, better generalization/compression.

3. Late QAT (QAT_LATE_FRAC=0.85):
   QAT activates at 85% of wallclock to avoid Muon momentum corruption.
   LR halved when QAT activates (per PR openai#297 finding).

Trimmed comments to stay under 1500-line cap (1457 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
approx_training_time_ms was referenced before assignment in the
Late QAT block. Moved the check after step increment where the
variable is computed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- deploy: all-in-one create+setup+run
- tail: quick log check (10 lines)
- done: fetch artifacts + terminate pod

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docs/review_records_track_10min_16mb.md: Analysis of all 17 merged records
  with architecture, convergence curves, ablations, and novel contributions
- docs/review_pr_records_track_10min_16mb.md: Analysis of top 20 open PRs
  with techniques, scores, and novel ideas not yet in merged records
- docs/original_model.md: "RingGolf" proposal — depth recurrence with
  per-loop low-rank adapters, dim=576, gradient-guided mixed quant,
  backout connection. 16 effective layers from 8 unique blocks.
  4-phase execution plan with fallback strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces flat num_layers with prelude/core/coda architecture:
- Prelude (2 unique blocks): input processing, stored for U-Net skips
- Core (4 shared blocks × 3 loops = 12 effective layers): depth recurrence
- Coda (2 unique blocks): output processing, consumes prelude skips

Effective depth: 2 + 12 + 2 = 16 layers from only 8 unique blocks.
Model dim increased to 576 (from 512) using saved parameter budget.
Loop iteration embeddings tell core blocks which pass they're on.

Config: PRELUDE_LAYERS=2 CORE_LAYERS=4 CODA_LAYERS=2 RECURRENCE_LOOPS=3
Set RECURRENCE_LOOPS=1 to disable recurrence (8 effective layers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Issues fixed:
1. Removed self.blocks ModuleList that duplicated prelude/core/coda
   params in state_dict (32MB -> ~15MB artifact). Now uses @property.
2. Unrolled core recurrence loop statically (3 explicit calls) for
   torch.compile fullgraph compatibility and better optimization.
3. Fixed int6 quantization to target "core." keys instead of "blocks."
4. Added loop_embeds to scalar optimizer params.
5. Factored _run_core() helper for the unrolled loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rarce and others added 12 commits March 22, 2026 20:02
Full SOTA reproduction stack with novel additions:

Architecture:
- Partial RoPE (16/64 dims) — position-free attention on 75% of dims
- LN Scale (1/sqrt(layer+1)) — damp deeper layers
- XSA on last 4 layers — GQA-aware orthogonal self-value debiasing
- Shared Value Embedding (dim=128, layers 9,10) — 1 table, per-layer scales
- SmearGate, BigramHash (existing)
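Partial RoPE from the list above rotates only the first 16 of 64 head dims, leaving the rest position-free. A scalar sketch using the common half-split pairing (the actual kernel is vectorized over batch and heads, and the frequency base is an assumption):

```python
import math

def partial_rope(x, pos, rot_dims=16):
    """Apply rotary embedding to the first `rot_dims` entries of a
    64-dim head vector; the remaining dims carry no positional signal."""
    half = rot_dims // 2
    out = list(x)
    for i in range(half):
        theta = pos / (10000.0 ** (2.0 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```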

Training:
- Tight SWA (scale<0.2) — only average last ~600 steps, zero penalty
- Late QAT (existing)
- Muon WD=0.038, logit softcap=30

Post-training:
- GPTQ-lite: per-tensor clip ratio search (5 candidates) minimizing
  reconstruction error. Zero training cost.
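The GPTQ-lite clip search reduces to trying a few clip ratios and keeping the one with the lowest roundtrip error. A sketch (candidate ratios here are illustrative, not the PR's exact grid):

```python
def deq_roundtrip(values, clip):
    """Quantize to int8 with the given clip, then dequantize."""
    scale = clip / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def best_clip(values, ratios=(0.6, 0.7, 0.8, 0.9, 1.0)):
    """Pick the clip (as a ratio of the tensor absmax) that minimizes
    squared reconstruction error. Zero training cost: it only needs
    the final weights."""
    absmax = max(abs(v) for v in values) or 1.0
    def err(clip):
        return sum((v - d) ** 2
                   for v, d in zip(values, deq_roundtrip(values, clip)))
    return min((r * absmax for r in ratios), key=err)
```

Clipping below the absmax trades outlier error for finer resolution on the bulk of the weights; the search decides per tensor which side wins.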

Eval-time (NOVEL):
- PPM-C context mixer: order-2 per-document n-gram model mixed with
  neural log-probs at alpha=0.95. Zero artifact cost, ~60 LOC.
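The mixing step of the eval-time idea can be sketched with a plain count model. The full PPM-C escape/backoff logic is more involved, and `ContextMixer` is a hypothetical name; only the alpha-blend is shown:

```python
from collections import defaultdict

class ContextMixer:
    """Order-2 per-document count model blended with the neural
    model's probabilities at a fixed alpha."""
    def __init__(self, alpha=0.95):
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, ctx, token):
        """Record a (two-token context, next token) occurrence."""
        self.counts[ctx][token] += 1

    def mix(self, ctx, token, p_neural, vocab_size):
        """Blend: alpha * neural + (1 - alpha) * n-gram.
        Falls back to uniform when the context is unseen."""
        bucket = self.counts.get(ctx)
        total = sum(bucket.values()) if bucket else 0
        p_ngram = bucket[token] / total if total else 1.0 / vocab_size
        return self.alpha * p_neural + (1 - self.alpha) * p_ngram
```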

1325 lines (under 1500 cap).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
From clean upstream base, added:
- Hyperparams: 11L, MLP3x, seq2048, batch786K, Muon 0.99, WD=0.04,
  Partial RoPE, LN Scale, XSA, VE, Tight SWA, Late QAT, GPTQ-lite
- Modules: SmearGate, SharedValueEmbedding, fake_quantize, CastedLinear+QAT
- Partial RoPE in Rotary + apply_rotary_emb

TODO: CausalSelfAttention (XSA+VE), Block (LN Scale), GPT (wire all),
  Muon WD, training loop (SWA, Late QAT, EMA), quantization (int6, GPTQ)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean rewrite from upstream base with full SOTA stack:

Architecture: 11L, MLP3x, SmearGate, SharedVE128 (layers 9,10),
Partial RoPE (16/64 dims), LN Scale (1/sqrt(i+1)), XSA4 (GQA-aware),
U-Net skips, logit softcap 30, tied embeddings.

Training: Muon lr=0.025 momentum=0.99 WD=0.04, batch 786K, seq 2048,
warmdown 3000, grad_clip 0.3. Late QAT (STE int6 when lr_scale<0.1).
Tight SWA (scale<0.2, every 50 steps, uniform average).

Quantization: GPTQ-lite (5-point clip search per tensor), int6 step=4
on middle layers (3-7), FP16 embedding passthrough.

GPT class simplified to take Hyperparameters directly.
1172 lines (under 1500 cap).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SharedValueEmbedding and ve_scales only participate in forward for
layers 9,10. Other layers' ve_scales get no gradient, causing DDP
rebuild_buckets error. find_unused_parameters=True resolves this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
zstd level 22 provides ~35% better compression than zlib-9 on int6 data.
This should bring our 16.66 MB artifact under the 16 MB cap.
Requires: pip install zstandard
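The swap is a one-line change at the compression call site. A sketch of the zlib-9 baseline being replaced (the zstd call shown in the comment uses the third-party `zstandard` package's actual API):

```python
import zlib

def zlib9_size(blob: bytes) -> int:
    """Size of the zlib level-9 baseline used before this change."""
    return len(zlib.compress(blob, 9))

# The zstd path replaces it with:
#   import zstandard
#   packed = zstandard.ZstdCompressor(level=22).compress(blob)
# On low-entropy int6 codes (multiples of 4), the PR reports ~35%
# smaller output than zlib-9.
```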

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- val_bpb 1.1958 post-quant (competitive, needs sliding eval)
- Degradation +0.0007 (best ever, Tight SWA + GPTQ-lite work)
- Artifact 18 MB > 16 MB limit — needs int6-all + pruning
- Full convergence curve and experiment log table
- Next step: fix artifact size, add sliding eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous run: 18 MB artifact with int6 only on middle layers.
Fix: apply int6 step=4 rounding to ALL block and VE weights (not just
layers 3-7). Additionally prune smallest 10% of weights to zero for
better zstd compression. PR openai#389 validates this approach (~500KB savings).

Expected: 18 MB → ~15 MB (int6-all saves ~1.5 MB, pruning saves ~500KB).
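The pruning half of this change amounts to a magnitude threshold. A scalar-list sketch (the real code operates on weight tensors; ties at the threshold may zero slightly more than `frac`):

```python
def prune_smallest(weights, frac=0.10):
    """Zero the smallest `frac` of weights by magnitude so the
    quantized tensor has longer zero runs for zstd to exploit."""
    n = int(len(weights) * frac)
    if n == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[n - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```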

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces aggressive int6-all + 10% pruning with targeted approach:
- MLP_HIDDEN=1408 (vs 1536): saves ~1.44M params (~1MB compressed)
  Following PR openai#332 which uses 1408 for its 12-layer model
- Int6 on layers 1-9, keep layer 0 and 10 at int8 (input/output quality)
- No magnitude pruning (preserves model quality)

Expected artifact: ~15.5 MB (down from 18 MB)
MLP_HIDDEN env var overrides mlp_mult*dim when > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match CUDA changes: MLP_HIDDEN=1408 default, int6 rounding on
layers 1-9 (keep 0 and 10 at int8). MLX smoke confirms 25.2M params
and 5.14 MB artifact (down from 6.7 MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Usage: ./scripts/runpod.sh create 8 COMMUNITY
COMMUNITY = spot instances (cheaper, may be preempted)
SECURE = on-demand (default, guaranteed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Submission combining PR openai#374 frontier techniques with MLP width
optimization and GPTQ-lite clip search:

- 11L/512d, MLP hidden=1408, 25.2M params
- Partial RoPE (16/64), LN Scale, XSA4, Shared VE128
- Tight SWA (scale<0.2), Late QAT (lr_scale<0.1)
- GPTQ-lite per-tensor clip search (5 candidates)
- Int6 layers 1-9 + int8 layers 0,10 + FP16 embed
- zstd-22 compression → 15.95MB artifact
- 4071 steps @ 137ms/step on 8×H100 SXM

val_bpb: 1.1804 (single seed 1337)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add requirements.txt (zstandard dependency)
- Fix DATA_PATH/TOKENIZER_PATH to use relative paths from records/
- Clarify non-record status and novel contribution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rarce commented Mar 23, 2026

Closing: PR included unrelated files. Will resubmit with clean branch.

@rarce rarce closed this Mar 23, 2026