Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean) #505
JoeProAI wants to merge 2 commits into openai:main
Conversation
Non-TTT submission: 3-seed mean 1.11807945 (std 0.000836)
- Seed 42: 1.11885136
- Seed 123: 1.11691774
- Seed 7: 1.11846924
Architecture: 11L SwiGLU (Star-ReLU), U-Net skip gates, XSA4, VE128 (layers 9-10), BigramHash 8192, EMA 0.997, seq_len=2048, batch=786K, warmdown=3500
Key finding: seq_len 2048 yields -0.008 bpb vs 1024. No test-time training.
Star-ReLU: learnable per-channel scale+bias on relu² activation (MetaFormer). Same architecture as PR openai#505's "SwiGLU" (2 weight matrices, not a gated MLP). Zero step-time overhead, ~34K params (66KB fp16).
TrigramHash: 3-token XOR-hash embedding extending BigramHash to trigram context. 4096 buckets, 32-dim, ~147K params (108KB int6). Independent contribution.
BigramHash doubled to 4096 buckets (from 2048) for fewer collisions.
All features env-var controlled and default ON. Artifact headroom: ~466KB remaining (well within the 16MB cap).
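The two modules described above can be sketched in PyTorch. This is a minimal sketch, not the PR's actual code: the module names, the hash mixing constants, and the roll-based construction of the trigram context are assumptions.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Star-ReLU (MetaFormer): learnable per-channel scale and bias
    on a squared-ReLU activation. Adds only 2*dim params per layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x).square() + self.bias

class TrigramHash(nn.Module):
    """XOR-hash embedding over the last three token ids.
    Bucket count, dim, and the odd mixing constants are assumptions."""
    def __init__(self, buckets: int = 4096, dim: int = 32):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (B, T). Shift right to get the two previous tokens.
        prev1 = torch.roll(ids, 1, dims=1); prev1[:, 0] = 0
        prev2 = torch.roll(ids, 2, dims=1); prev2[:, :2] = 0
        # Multiply by distinct odd constants before XOR so permuted
        # trigrams don't trivially collide into the same bucket.
        h = (ids * 0x9E3779B1) ^ (prev1 * 0x85EBCA77) ^ (prev2 * 0xC2B2AE3D)
        return self.emb(h % self.buckets)
```

At init, Star-ReLU is exactly relu², so it can only help if the learned scale/bias find a better operating point.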
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
Sigmoid skip gates (PR openai#505): replace additive skip connections with a sigmoid-gated blend: x = gate*x + (1-gate)*scaled_skip. Learned per-dim gates init to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).
Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get DECODER_LR_MULT=2.0 applied on top of the per-layer LR splits. Combined with the quant-damage LR, the decoder proj gets 1.5*2 = 3x base LR.
Both features are env-var controlled and default ON.
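The gated skip blend can be written as a tiny module. A minimal sketch, assuming a per-dimension logit initialized to zero so the gate starts at 0.5 (the class name is ours, not the PR's):

```python
import torch
import torch.nn as nn

class SigmoidSkipGate(nn.Module):
    """U-Net style skip blend: x = g*x + (1-g)*skip with a learned
    per-dimension gate. Logits init to 0, so sigmoid(0)=0.5 makes the
    module start as a plain average of the two paths."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)
        return g * x + (1.0 - g) * skip
```

Because the blend is convex, the gate can only interpolate between the residual and the skip path; it cannot amplify either one above its input scale.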
I ran this and got a ~20 MB artifact. Maybe something's different on my machine?
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024 → 2048, batch 524K → 786K (PR #505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into the attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0 → 0.3, EMA 0.9985 → 0.997, warmdown 6000 → 3500
- All illegal TTT remains purged. Score-first only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt. Those techniques are from PR openai#505, which has a different architecture (kv8, h1792); they don't transfer to our kv4/h1536 setup.
Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3
Our unique stack remains: VR + GA + Star-ReLU + per-layer LR + GradQuant + TrigramHash
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792 config produces ~20MB artifacts. The real non-TTT target is openai#445 at 1.1236.
Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)² (-0.003 BPB, zero params).
LeakyReLU(0.5)²: zero extra params, proven -0.003 BPB vs ReLU². Addresses the dead-neuron problem. LEAKY_RELU=1 env var.
run_no_ttt_best.sh: run3 base + three free lunches:
- MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
- LeakyReLU(0.5)² (zero params, -0.003 BPB)
- QAT=1 (run5 proved negative quant gap)
Drops sigmoid gates and decoder 2x LR (run6 showed they hurt). The real target is openai#445 at 1.1236 (not openai#505, which doesn't fit in 16MB).
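One plausible form of the LeakyReLU(0.5)² activation is sketched below. Whether the square is sign-preserving is an implementation detail the commits don't specify; we assume the sign-preserving variant here (`y * |y|`) so the function stays monotone while still passing gradient for negative inputs, unlike relu² which is flat there.

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """Sign-preserving squared LeakyReLU (assumed variant):
    positive inputs map to x^2, negative inputs map to -(slope*x)^2,
    so the gradient is nonzero everywhere and dead neurons can recover."""
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y.abs()
```

At slope=0 this degenerates to the ReLU² baseline, so the slope is the only knob being changed.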
Please add more detail to your PR: train logs and submission.json.
Hey all -- himanalot is right. We ran validation but miscalculated the artifact size before submitting -- we thought we were under 16MB, but it turned out we weren't. Apologies to cocohearts and anyone who spent time on this because of that mistake. Working on a corrected submission now. Will update with a valid artifact, train logs, and submission.json once it's clean.
…alid

- New artifact: int5 per-row quantization + zstd-22, 14,549,183 bytes (13.88 MB)
- val_bpb: 1.14802544 (float: 1.1519, quant degradation: 0.0039)
- Architecture: GEPA 11L, LeakyReLU(0.5)², XSA-all, mlp_hidden=1792
- Training: 7870 steps, 600s wallclock, matrix_lr=0.025, warmdown=6000
- Hardware: 8xH100
- Added submission.json and train logs per cocohearts' request
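The per-row int5 scheme can be sketched in a few lines of numpy. This is a minimal sketch matching the description above; the function names and the fp16 per-row scale storage are assumptions, and the zstd-22 compression of the packed codes is omitted.

```python
import numpy as np

def quantize_int5_per_row(w: np.ndarray):
    """Symmetric per-row int5 quantization: each row gets its own scale
    so one outlier row doesn't inflate the error of every other row.
    Codes land in the signed 5-bit range [-15, 15]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale.astype(np.float16)                # codes + fp16 scales

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)
```

The round-trip error per element is bounded by about half a quantization step (scale/2), which is consistent with the small 0.0039 bpb quant degradation reported.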
Update (March 25, 2026): Corrected submission pushed to branch.
Architecture unchanged (GEPA 11L). Key training changes from the prior submission: matrix_lr=0.025, warmdown=6000, 7870 steps within the 600s budget (see the commit message above).
Loss curve: 1.3603 → 1.2987 → 1.2669 → 1.2460 → 1.2270 → 1.2066 → 1.1780 → 1.1519 (float, 600s) → 1.14802544 (post-quant int5+zstd22)
Superseded by PR #861 — corrected valid submission at 1.1326 BPB, 15.51 MB. Closing this one. |
- Submission train_gpt.py with all 32 techniques from the execution plan, each gated by environment variables (disabled by default)
- Optuna-based search framework with a validate mode (per-technique smoke test) and a search mode (TPE over the joint technique + model-size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn, PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization (variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time (TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
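The "gated by environment variables" pattern the commit describes can be sketched as a one-line helper. This is illustrative only: the helper name and the example flag names below mirror the convention used elsewhere in the thread (e.g. LEAKY_RELU=1, SIGMOID_SKIP_GATES=1) but are assumptions about the actual code.

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Read a 0/1 environment-variable gate so every technique can be
    toggled independently from the shell without editing train_gpt.py."""
    return os.environ.get(name, "1" if default else "0") == "1"

# Hypothetical gates following the PR's naming convention:
USE_STAR_RELU = env_flag("STAR_RELU", default=False)
USE_TRIGRAM_HASH = env_flag("TRIGRAM_HASH", default=False)
```

Keeping every gate off by default lets the Optuna search flip flags per trial while the baseline run stays reproducible.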
SwiGLU + VE128 + U-Net Skip Gates (No TTT)
val_bpb (3-seed mean): 1.11807945 (std: 0.000836, best: 1.11691774)
Architecture
11-layer transformer:
- SwiGLU-style MLP with Star-ReLU activation (2 weight matrices, not a gated MLP)
- U-Net skip gates
- XSA4 attention
- Value Embeddings: VE128 on layers 9-10
- BigramHash embedding, 8192 buckets
Training Configuration
- seq_len=2048, batch=786K tokens
- EMA 0.997, warmdown=3500
Key Ablation: Sequence Length
seq_len 2048 yields -0.008 bpb vs 1024.
Provenance
All architectural components were discovered through systematic ablation search and Codex-guided exploration. Value Embeddings adapted from community work. No test-time training.
Disclosure
This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).
Compute
~18 min/seed on 8xH100 SXM (Modal). Three seeds for verification.