XSA-All 11L + LeakyReLU(0.75)² + Aggressive Legal TTT → 1.1219 BPB by teddyoweh · Pull Request #1092 · openai/parameter-golf

teddyoweh · 2026-03-29T19:29:24Z

Results

val_bpb: 1.1219 | Artifact: 15,916,230 bytes (15.92 MB) | 8×H100 SXM

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT bpb	TTT gain	TTT time	Artifact
1337	93.97ms	6,173	1.1252	1.1219	-0.0033	464s	15,916,230

What's New

Three independently validated improvements on top of the PR #414 + PR #399 stack:

1. XSA on All 11 Layers (`XSA_LAST_N=11`)

Extending eXtended Self-Attention (XSA) from the last 4 layers to all 11 yields -0.0007 BPB. The gain outweighs the ~4% slower step time (93.97ms vs ~90ms).
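To make the setting concrete, here is a minimal sketch of which blocks `XSA_LAST_N` would select, assuming the value counts blocks from the end of an 11-layer stack (as the name suggests). The helper names `xsa_layer_indices` and `relative_slowdown` are illustrative and are not taken from `train_gpt.py`.

```python
def xsa_layer_indices(num_layers: int, xsa_last_n: int) -> list[int]:
    """Indices of the blocks that would use XSA when it is enabled on the last N layers."""
    n = max(0, min(xsa_last_n, num_layers))  # clamp to a valid range
    return list(range(num_layers - n, num_layers))


def relative_slowdown(new_ms: float, old_ms: float) -> float:
    """Fractional increase in step time when going from old_ms to new_ms."""
    return new_ms / old_ms - 1.0


print(xsa_layer_indices(11, 4))    # [7, 8, 9, 10]
print(xsa_layer_indices(11, 11))   # all 11 blocks, indices 0 through 10
print(f"{relative_slowdown(93.97, 90.0):.1%}")  # 4.4%
```

The last line reproduces the "~4% slower" figure from the step times quoted above.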

2. LeakyReLU(0.75)²

Uses a higher negative slope than the current SOTA (0.75 vs 0.5). In PR #977's ablation, 0.75 is strictly better than 0.5 for the int6 stack. The larger slope preserves more gradient flow through the MLP for negative pre-activations.

x = F.leaky_relu(self.fc(x), negative_slope=0.75).square()
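The gradient-flow claim can be checked by hand. The scalar sketch below (plain Python, not the PyTorch code from the script) evaluates the squared leaky ReLU and its derivative on the negative side. For x < 0 the activation is (slope·x)², so its derivative is 2·slope²·x, and the slope enters squared.

```python
def leaky_relu_sq(x: float, slope: float) -> float:
    """Scalar version of leaky_relu(x, slope) ** 2."""
    y = x if x >= 0.0 else slope * x
    return y * y


def leaky_relu_sq_grad(x: float, slope: float) -> float:
    """Derivative of leaky_relu_sq with respect to x."""
    return 2.0 * x if x >= 0.0 else 2.0 * slope * slope * x


x = -1.0
for slope in (0.5, 0.75):
    print(slope, leaky_relu_sq(x, slope), leaky_relu_sq_grad(x, slope))
# 0.5 0.25 -0.5
# 0.75 0.5625 -1.125
```

At the same negative input, slope 0.75 passes 2.25× the gradient magnitude of slope 0.5 (0.75² / 0.5²), while the positive side is unchanged.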

3. Aggressive Legal TTT (lr=0.03)

Score-first TTT using PR #461's legal framework with a 15× higher learning rate (0.03 vs 0.002). This delivers a -0.0033 BPB improvement (vs -0.0025 in SOTA). All blocks are unfrozen; training uses SGD with momentum 0.9, 3 epochs per chunk, and cosine LR decay.

Each chunk is scored under torch.inference_mode(), so scoring is stateless: weights are updated only AFTER the chunk has been scored.
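The score-first ordering can be sketched with a toy stand-in. This is not the script's TTT loop: the "model" is a single scalar `w` trained with MSE, and the cosine schedule is assumed to decay across chunks, which the description does not specify. Only the ordering (score, then adapt) and the optimizer settings (SGD, momentum 0.9, 3 epochs, lr 0.03) mirror the text above.

```python
import math


def cosine_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine decay from base_lr at step 0 towards 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def score_first_ttt(chunks, w=0.0, base_lr=0.03, momentum=0.9, epochs=3):
    """Toy model: predict every value in a chunk with one scalar w (MSE loss)."""
    velocity = 0.0
    scores = []
    for i, chunk in enumerate(chunks):
        # 1) Score the chunk with the weights exactly as they are now.
        scores.append(sum((w - x) ** 2 for x in chunk) / len(chunk))
        # 2) Only afterwards adapt on the chunk that was just scored.
        lr = cosine_lr(base_lr, i, len(chunks))  # assumed: decay over chunks
        for _ in range(epochs):
            grad = sum(2.0 * (w - x) for x in chunk) / len(chunk)
            velocity = momentum * velocity + grad  # SGD with momentum
            w -= lr * velocity
    return scores, w


chunks = [[1.0] * 8 for _ in range(5)]
scores, final_w = score_first_ttt(chunks)
print(scores[0])              # 1.0: the first chunk is scored before any update
print(scores[1] < scores[0])  # True: later chunks benefit from earlier updates
```

Because each score is recorded before the update for that chunk, changing `base_lr` cannot alter the score of the first chunk, which is the property that makes the procedure legal.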

FA3 Fallback

Script includes automatic fallback from Flash Attention 3 to PyTorch SDPA:

try:
    from flash_attn_interface import flash_attn_func as flash_attn_3_func
    _HAS_FA3 = True
except ImportError:
    _HAS_FA3 = False

Our run used SDPA (93.97ms/step → 6,173 steps). With FA3 (~84ms/step, or roughly 6,900 steps in the same 580s training budget), we would expect BPB in the 1.119x range.
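The step count follows from the wall-clock budget divided by the average step time. A small sketch, assuming the 580s budget from `MAX_WALLCLOCK_SECONDS` and ignoring warmup or logging overhead; `steps_in_budget` is an illustrative helper, not part of the script:

```python
def steps_in_budget(wallclock_s: float, step_ms: float) -> int:
    """Whole training steps that fit in wallclock_s at an average of step_ms per step."""
    return int(wallclock_s * 1000.0 / step_ms)


print(steps_in_budget(580, 93.97))  # 6172
```

This lands within one step of the reported 6,173, so the same formula can be used to estimate the effect of a faster attention kernel.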

Timing

Phase	Time
Training	580s
Eval (Legal TTT sliding)	464s
Total	< 20 min

Run Command

BIGRAM_VOCAB_SIZE=2048 TRIGRAM_VOCAB_SIZE=0 \
XSA_LAST_N=11 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.03 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=580 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
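The command configures the run entirely through environment variables. A minimal sketch of how such variables could be parsed is shown below. It assumes the script reads them from `os.environ`; the function `load_config`, the dictionary keys, and all default values are placeholders rather than the defaults in `train_gpt.py`.

```python
import os
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Parse a few of the run-command variables; defaults are placeholders."""
    return {
        "xsa_last_n": int(env.get("XSA_LAST_N", "4")),
        "ttt_enabled": env.get("TTT_ENABLED", "0") == "1",
        "ttt_lr": float(env.get("TTT_LR", "0.002")),
        "ttt_epochs": int(env.get("TTT_EPOCHS", "3")),
        "ttt_chunk_tokens": int(env.get("TTT_CHUNK_TOKENS", "32768")),
        # Comma-separated layer list such as "9,10"; empty string means none.
        "ve_layers": [int(s) for s in env.get("VE_LAYERS", "").split(",") if s],
        "max_wallclock_seconds": float(env.get("MAX_WALLCLOCK_SECONDS", "600")),
    }


cfg = load_config({"XSA_LAST_N": "11", "TTT_ENABLED": "1", "TTT_LR": "0.03",
                   "VE_LAYERS": "9,10", "MAX_WALLCLOCK_SECONDS": "580"})
print(cfg["xsa_last_n"], cfg["ttt_lr"], cfg["ve_layers"])  # 11 0.03 [9, 10]
```

Passing a plain dictionary instead of `os.environ` keeps the parsing easy to test without touching the real environment.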

Credits

LeakyReLU²: PR #493 "Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)" (@parinzee), PR #518 "Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)" (@sofiabod)
LeakyReLU(0.75): PR #977 "LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean)" (@awilliea)
XSA: PR #414 "Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)" (@signalrush)
TTT recipe: PR #461 "Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)" (@Christopher-Lee-McClendon)
Parameter Banking + Parallel Muon: PR #399 "Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)" (@abaybektursun)
Base model: PR #414 "Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)" (@signalrush)

Christopher-Lee-McClendon · 2026-03-31T15:10:24Z

Excellent combination of tweaks that synergize with more aggressive TTT. I'm surprised that the 15x learning rate was better, nice finding!

teddyoweh added 3 commits March 29, 2026 15:27

Add submission README.md

4a90c3b

Add submission.json

f649b89

Add train_gpt.py submission script

86f53b2

teddyoweh wants to merge 3 commits into openai:main from teddyoweh:submission/xsa11-leakyrelu075-legalttt
