
Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean)#505

Closed
JoeProAI wants to merge 2 commits into openai:main from JoeProAI:joeproai/swiglu-ve128-nottt-1.1181

Conversation


@JoeProAI JoeProAI commented Mar 23, 2026

SwiGLU + VE128 + U-Net Skip Gates (No TTT)

val_bpb (3-seed mean): 1.11807945 (std: 0.000836, best: 1.11691774)

Seed  val_bpb
42    1.11885136
123   1.11691774
7     1.11846924

Architecture

11-layer transformer:

  • SwiGLU FFN with Star-ReLU activation (hidden=1792)
  • U-Net Skip Gates: 5 encoder, 6 decoder with learned gating
  • XSA4: Extended Self-Attention in last 4 layers
  • Value Embeddings (VE128): 128-dim shared embedding, per-layer scales (layers 9-10)
  • BigramHash: 8192 buckets, 128-dim
  • EMA (decay=0.997)
  • Partial RoPE (16 dims)
  • LN Scale, Late QAT@0.15
  • Int6 + GPTQ-lite + zstd-22
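A minimal NumPy sketch of the FFN block listed above, assuming a plain two-matrix projection around the Star-ReLU activation (only the hidden width of 1792 comes from this PR; the model width, the initialization, and the function names are illustrative):

```python
import numpy as np

def star_relu(x, scale, bias):
    # Star-ReLU (MetaFormer): learnable per-channel scale and bias
    # applied to a squared-ReLU activation.
    return scale * np.maximum(x, 0.0) ** 2 + bias

def ffn(x, w_in, w_out, scale, bias):
    # Two-matrix feed-forward block: project up to the hidden width
    # (1792 in this PR), apply Star-ReLU, project back down.
    return star_relu(x @ w_in, scale, bias) @ w_out

rng = np.random.default_rng(0)
d_model, hidden = 64, 1792          # d_model here is illustrative
x = rng.standard_normal((4, d_model))
w_in = rng.standard_normal((d_model, hidden)) * 0.02
w_out = rng.standard_normal((hidden, d_model)) * 0.02
scale = np.ones(hidden)             # init so the block starts as plain relu^2
bias = np.zeros(hidden)
y = ffn(x, w_in, w_out, scale, bias)
print(y.shape)
```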

Training Configuration

  • Sequence length: 2048 (key finding: ~0.009 bpb improvement over seq_len=1024)
  • Batch tokens: 786,432
  • Warmdown: 3,500 steps
  • 8xH100 SXM (Modal), ~18 min/seed

Key Ablation: Sequence Length

Config                    val_bpb
Full arch, seq_len=1024   1.12670
Full arch, seq_len=2048   1.11808
Improvement              -0.00862

Provenance

All architectural components were discovered through systematic ablation search and Codex-guided exploration. Value Embeddings were adapted from community submissions. No test-time training.

Disclosure

This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).

Compute

~18 min/seed on 8xH100 SXM (Modal). Three seeds for verification.

Non-TTT submission: 3-seed mean 1.11807945 (std 0.000836)
  Seed 42:  1.11885136
  Seed 123: 1.11691774
  Seed 7:   1.11846924

Architecture: 11L SwiGLU (Star-ReLU), U-Net skip gates, XSA4,
VE128 (layers 9-10), BigramHash 8192, EMA 0.997,
seq_len=2048, batch=786K, warmdown=3500

Key finding: seq_len 2048 yields -0.008 bpb vs 1024.
No test-time training.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Star-ReLU: learnable per-channel scale+bias on relu² activation (MetaFormer).
Same architecture as PR openai#505's "SwiGLU" — 2 weight matrices, not gated MLP.
Zero step time overhead, ~34K params (66KB fp16).

TrigramHash: 3-token xor hash embedding extending BigramHash to trigram context.
4096 buckets, 32-dim, ~147K params (108KB int6). Independent contribution.

BigramHash doubled to 4096 buckets (from 2048) to reduce collisions.

All features env-var controlled and default ON.
Artifact headroom: ~466KB remaining (well within 16MB cap).
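The TrigramHash idea above can be sketched in plain Python. The multiplier constants and the exact mixing scheme here are hypothetical; only the 3-token XOR hash and the 4096-bucket count come from the commit message:

```python
def trigram_hash(tokens, n_buckets=4096):
    # Map each position to a bucket by XOR-mixing the last three token
    # ids. The multipliers are arbitrary odd constants (illustrative,
    # not the commit's actual values).
    buckets = []
    for i in range(len(tokens)):
        t0 = tokens[i]
        t1 = tokens[i - 1] if i >= 1 else 0
        t2 = tokens[i - 2] if i >= 2 else 0
        h = (t0 * 0x9E3779B1) ^ (t1 * 0x85EBCA77) ^ (t2 * 0xC2B2AE3D)
        buckets.append(h % n_buckets)
    return buckets

ids = trigram_hash([17, 42, 99, 42, 17])
print(ids)
```

Each bucket index would then select a 32-dim embedding row that is added to the token representation, extending the BigramHash lookup to trigram context.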
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Sigmoid skip gates (PR openai#505): replace additive skip connections with
sigmoid-gated blend: x = gate*x + (1-gate)*scaled_skip. Learned per-dim
gates init to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).

Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get
DECODER_LR_MULT=2.0 applied on top of the per-layer lr splits.
Combined with quant-damage lr: decoder proj gets 1.5*2=3x base lr.

Both features are env-var controlled and default ON.
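The gated blend described in the commit message can be sketched as follows (NumPy; the per-dimension gate parameterization follows the formula above, while the shapes and the skip scaling factor are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, skip, gate_logits, skip_scale=1.0):
    # Sigmoid-gated skip blend: x = g*x + (1-g)*scaled_skip,
    # with one learned gate logit per model dimension.
    g = sigmoid(gate_logits)
    return g * x + (1.0 - g) * (skip_scale * skip)

d = 8
x = np.ones(d)
skip = np.full(d, 3.0)
gate_logits = np.zeros(d)  # init at sigmoid(0) = 0.5, an even blend
print(gated_skip(x, skip, gate_logits))  # every entry is 0.5*1 + 0.5*3
```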

himanalot commented Mar 23, 2026

I ran this and got a ~20 MB artifact, but maybe something's different on my machine?

newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024→2048, batch 524K→786K (PR #505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0→0.3, EMA 0.9985→0.997, warmdown 6000→3500
- All illegal TTT remains purged. Score-first only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
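A sketch of the LeakyReLU(0.5)² activation mentioned above, assuming the convention of squaring the LeakyReLU output; unlike ReLU², the negative branch keeps a nonzero gradient:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # Squared LeakyReLU: identical to relu^2 for x >= 0, while negative
    # inputs map through (slope*x)^2, so their gradient is nonzero.
    return np.where(x >= 0.0, x, slope * x) ** 2

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu_sq(x))  # [1. 0. 9.]
```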
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt.
Those techniques are from PR openai#505 which has a different architecture (kv8, h1792).
They don't transfer to our kv4/h1536 setup.

Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3
Our unique stack remains: VR + GA + Star-ReLU + per-layer lr + GradQuant + TrigramHash
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792
config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236.

Novel technique analysis: DG Attention (differential values), BitNet b1.58
(ternary weights + depth recurrence), arithmetic coding (replaces zstd-22),
LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2.
Addresses dead neuron problem. LEAKY_RELU=1 env var.

run_no_ttt_best.sh: run3 base + three free lunches:
  - MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
  - LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
  - QAT=1 (run5 proved negative quant gap)

Drops sigmoid gates and decoder 2x LR (run6 showed they hurt).
Real target is openai#445 at 1.1236 (not openai#505 which doesn't fit 16MB).
@cocohearts (Collaborator)

Please add more detail to your PR: comments, train logs, and submission.json.

@JoeProAI (Author)

Hey all -- himanalot is right. We ran validation but miscalculated the artifact size before submitting: we thought we were under 16 MB, but we weren't. Apologies to cocohearts and anyone who spent time on this because of that mistake.

Working on a corrected submission now. Will update with a valid artifact, train logs, and submission.json once it's clean.

…alid

- New artifact: int5 per-row quantization + zstd-22, 14,549,183 bytes (13.88 MB)
- val_bpb: 1.14802544 (float: 1.1519, quant degradation: 0.0039)
- Architecture: GEPA 11L, LeakyReLU(0.5)^2, XSA-all, mlp_hidden=1792
- Training: 7870 steps, 600s wallclock, matrix_lr=0.025, warmdown=6000
- Hardware: 8xH100
- Added submission.json and train logs per cocohearts request
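A minimal sketch of the per-row int5 quantization step listed above, under two assumptions: symmetric levels in [-15, 15] and a per-row max-abs scale. The PR's exact scheme and the zstd-22 packing are not reproduced here:

```python
import numpy as np

def quantize_per_row_int5(w):
    # Each row gets its own scale so its values map onto the 5-bit
    # symmetric grid [-15, 15]; stored as int8 here for simplicity.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs == 0.0, 1.0, max_abs / 15.0)
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16))
q, scale = quantize_per_row_int5(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, float(err))
```

The per-row scale bounds the rounding error of each row by half its own step size, which is what makes per-row schemes gentler than a single global scale.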
@JoeProAI (Author)

Update (March 25, 2026): Corrected submission pushed to branch.

  • New artifact: 1.14802544 val_bpb, 13.88 MB (valid, under 16 MB limit)
  • Previous submission had an artifact size error (~20 MB). Apologies for that.
  • Added submission.json and train_logs/wave17_train_log.txt per @cocohearts request

Architecture unchanged (GEPA 11L). Key training changes from prior submission:

  • matrix_lr corrected to 0.025 (was 0.040)
  • warmdown 6000 steps (quant-friendly decay)
  • LeakyReLU(0.5)^2 activation
  • XSA on all 11 layers

Loss curve: 1.3603 → 1.2987 → 1.2669 → 1.2460 → 1.2270 → 1.2066 → 1.1780 → 1.1519 (float, 600s) → 1.14802544 (post-quant int5+zstd22)

@JoeProAI JoeProAI closed this Mar 26, 2026
@JoeProAI (Author)

Superseded by PR #861 — corrected valid submission at 1.1326 BPB, 15.51 MB. Closing this one.

MichaelMcCulloch pushed a commit to MichaelMcCulloch/parameter-golf that referenced this pull request Mar 29, 2026
- Submission train_gpt.py with all 32 techniques from the execution plan,
  each gated by environment variables (disabled by default)
- Optuna-based search framework with validate mode (per-technique smoke test)
  and search mode (TPE over joint technique + model size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn,
PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization
(variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time
(TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
