
Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248)#315

Open
jfprincz wants to merge 1 commit into openai:main from jfprincz:submission/11l-partialrope-lateqat-1.1248

Conversation

@jfprincz

Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248)

val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | PR #198 | PR #287 | This | Delta vs #287 |
|---|---|---|---|---|---|---|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | 1.1248 (s64) | -0.0023 |
| Layers | 9 | 9 | 11 | 11 | 11 | |
| Params | 21.8M | 22.4M | 26.8M | 26.8M | 26.8M | |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | 15.5 MB | 15.6 MB | +0.1 MB |

Three new techniques on top of PR #287's 11-layer stack.

Key additions over PR #287

| Change | Impact |
|---|---|
| Partial RoPE (16 of 64 dims) | Apply rotary embeddings to only 25% of head dimensions. Remaining dims use position-free attention, improving generalization. Zero new parameters. |
| LN Scale | RMSNorm outputs scaled by 1/sqrt(layer_idx+1). Damps deeper layers' contributions, stabilizing training. Zero new parameters. |
| Late QAT | STE int6 fake-quantization enabled only in the final ~4% of training (lr_scale < 0.1). Cuts int6 degradation by 3x with no cost to pre-quant quality. Inspired by arXiv:2505.14302. |
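All three changes are small enough to sketch. A minimal NumPy illustration of the ideas (function names, shapes, and the `base`/`eps` constants are assumptions for illustration, not the PR's actual train_gpt.py code):

```python
import numpy as np

def apply_partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of each head's dims; the rest
    pass through untouched (position-free). x: (seq, n_heads, head_dim)."""
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)           # (half,)
    theta = np.outer(pos, inv_freq)                        # (seq, half)
    cos = np.cos(theta)[:, None, :]                        # broadcast over heads
    sin = np.sin(theta)[:, None, :]
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)

def rmsnorm_ln_scale(x, weight, layer_idx, eps=1e-6):
    """RMSNorm with the output damped by 1/sqrt(layer_idx + 1)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight / np.sqrt(layer_idx + 1)

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake-quant: 31 levels per sign.
    In a real training loop this would be applied as w + (q - w).detach()
    so gradients pass straight through (STE), and only once lr_scale < 0.1."""
    scale = np.abs(w).max() / 31.0
    return np.clip(np.round(w / scale), -31.0, 31.0) * scale
```

Two quick sanity properties fall out of the sketch: dims beyond `rope_dims` are bit-identical to the input, and position 0 rotates by zero.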

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
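Of the carry-forward pieces, the EMA is the simplest to state precisely. A hedged sketch of the update (the parameter-dict form is an assumption; the actual implementation may update tensors in place):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per optimizer step: ema <- decay*ema + (1-decay)*current.
    With decay=0.997 the effective averaging horizon is roughly
    1/(1-decay) ~= 333 steps; the EMA copy is what gets evaluated/exported."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```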

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1418 |
| Int6 roundtrip val_bpb | 1.1485 |
| Int6 sliding val_bpb (s64) | 1.1248 |
| Steps completed (600s cap) | 7,051 |
| Step time | 85 ms |
| Model params | 26,829,913 |
| Artifact size | 15,612,308 bytes |
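For context on the headline number: "sliding, s64" means the evaluation window advances 64 tokens at a time and only the newly exposed tokens are charged, so almost every token is scored with near-full left context. A sketch of that loop (assuming byte-level tokens; `nll_fn` is a hypothetical stand-in for a model call returning per-token NLL in nats):

```python
import numpy as np

def sliding_val_bpb(tokens, nll_fn, seq_len=2048, stride=64):
    """Slide a seq_len window by `stride`; each step charges only the
    tokens not already scored, which see up to seq_len - stride context."""
    total_nll, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + seq_len, len(tokens))
        per_tok = nll_fn(tokens[begin:end])   # per-token NLL in nats
        n_new = end - prev_end                # tokens not yet scored
        total_nll += per_tok[-n_new:].sum()
        prev_end = end
        if end == len(tokens):
            break
    # bits per byte, assuming one token per byte
    return total_nll / (len(tokens) * np.log(2))
```

A model emitting exactly ln(2) nats per byte scores 1.0 bpb regardless of stride, which makes a handy self-check for the loop.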

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact (bytes) |
|---|---|---|---|
| 2025 | 7,051 | 1.1248 | 15,612,308 |
| 42 | 7,061 | 1.1250 | 15,528,666 |
| 1337 | 7,063 | 1.1253 | 15,639,340 |

Mean: 1.1250 | Range (max - min): 0.0005 | Submitted: seed 2025

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 QAT_THRESHOLD=0.1 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@himanalot

yes! great job this is sort of where i went too

bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new
baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT,
1.1248 BPB). Previous script preserved as previous_train_gpt.py.
Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner

Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315
train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB).
Update run script to use PR openai#315 config for both baseline and experiment.
