
Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248)#315

Open
jfprincz wants to merge 1 commit into openai:main from jfprincz:submission/11l-partialrope-lateqat-1.1248

Conversation

@jfprincz

Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248)

val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | PR #198 | PR #287 | This | Delta vs #287 |
|---|---|---|---|---|---|---|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | 1.1248 (s64) | -0.0023 |
| Layers | 9 | 9 | 11 | 11 | 11 | |
| Params | 21.8M | 22.4M | 26.8M | 26.8M | 26.8M | |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | 15.5 MB | 15.6 MB | +0.1 MB |

Three new techniques on top of PR #287's 11-layer stack.

Key additions over PR #287

| Change | Impact |
|---|---|
| Partial RoPE (16 of 64 dims) | Apply rotary embeddings to only 25% of head dimensions. Remaining dims use position-free attention, improving generalization. Zero new parameters. |
| LN Scale | RMSNorm outputs scaled by 1/sqrt(layer_idx+1). Damps deeper layers' contributions, stabilizing training. Zero new parameters. |
| Late QAT | STE int6 fake-quantization enabled only in the final ~4% of training (lr_scale < 0.1). Cuts int6 degradation by 3x with no cost to pre-quant quality. Inspired by arXiv:2505.14302. |
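All three changes are small enough to sketch. A minimal NumPy illustration of the ideas (function names, shapes, and the `base`/`eps` constants are assumptions for illustration, not the PR's actual train_gpt.py code):

```python
import numpy as np

def apply_partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of each head's dims; the rest
    pass through untouched (position-free). x: (seq, n_heads, head_dim)."""
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)           # (half,)
    theta = np.outer(pos, inv_freq)                        # (seq, half)
    cos = np.cos(theta)[:, None, :]                        # broadcast over heads
    sin = np.sin(theta)[:, None, :]
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)

def rmsnorm_ln_scale(x, weight, layer_idx, eps=1e-6):
    """RMSNorm with the output damped by 1/sqrt(layer_idx + 1)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight / np.sqrt(layer_idx + 1)

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake-quant: 31 levels per sign.
    In a real training loop this would be applied as w + (q - w).detach()
    so gradients pass straight through (STE), and only once lr_scale < 0.1."""
    scale = np.abs(w).max() / 31.0
    return np.clip(np.round(w / scale), -31.0, 31.0) * scale
```

Two quick sanity properties fall out of the sketch: dims beyond `rope_dims` are bit-identical to the input, and position 0 rotates by zero.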

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
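Of the carry-forward pieces, the EMA is the simplest to state precisely. A hedged sketch of the update (the parameter-dict form is an assumption; the actual implementation may update tensors in place):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per optimizer step: ema <- decay*ema + (1-decay)*current.
    With decay=0.997 the effective averaging horizon is roughly
    1/(1-decay) ~= 333 steps; the EMA copy is what gets evaluated/exported."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```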

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1418 |
| Int6 roundtrip val_bpb | 1.1485 |
| Int6 sliding val_bpb (s64) | 1.1248 |
| Steps completed (600s cap) | 7,051 |
| Step time | 85 ms |
| Model params | 26,829,913 |
| Artifact size | 15,612,308 bytes |
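For context on the headline number: "sliding, s64" means the evaluation window advances 64 tokens at a time and only the newly exposed tokens are charged, so almost every token is scored with near-full left context. A sketch of that loop (assuming byte-level tokens; `nll_fn` is a hypothetical stand-in for a model call returning per-token NLL in nats):

```python
import numpy as np

def sliding_val_bpb(tokens, nll_fn, seq_len=2048, stride=64):
    """Slide a seq_len window by `stride`; each step charges only the
    tokens not already scored, which see up to seq_len - stride context."""
    total_nll, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + seq_len, len(tokens))
        per_tok = nll_fn(tokens[begin:end])   # per-token NLL in nats
        n_new = end - prev_end                # tokens not yet scored
        total_nll += per_tok[-n_new:].sum()
        prev_end = end
        if end == len(tokens):
            break
    # bits per byte, assuming one token per byte
    return total_nll / (len(tokens) * np.log(2))
```

A model emitting exactly ln(2) nats per byte scores 1.0 bpb regardless of stride, which makes a handy self-check for the loop.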

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact (bytes) |
|---|---|---|---|
| 2025 | 7,051 | 1.1248 | 15,612,308 |
| 42 | 7,061 | 1.1250 | 15,528,666 |
| 1337 | 7,063 | 1.1253 | 15,639,340 |

Mean: 1.1250 | Range (max - min): 0.0005 | Submitted: seed 2025

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 QAT_THRESHOLD=0.1 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@himanalot

yes! great job this is sort of where i went too

bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new
baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT,
1.1248 BPB). Previous script preserved as previous_train_gpt.py.
Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner

Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315
train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB).
Update run script to use PR openai#315 config for both baseline and experiment.
