Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean) #505
JoeProAI wants to merge 2 commits into openai:main
Conversation
Non-TTT submission: 3-seed mean 1.11807945 (std 0.000836)
- Seed 42: 1.11885136
- Seed 123: 1.11691774
- Seed 7: 1.11846924
Architecture: 11L SwiGLU (Star-ReLU), U-Net skip gates, XSA4, VE128 (layers 9-10), BigramHash 8192, EMA 0.997, seq_len=2048, batch=786K, warmdown=3500
Key finding: seq_len 2048 yields -0.008 bpb vs 1024. No test-time training.
Star-ReLU: learnable per-channel scale+bias on relu² activation (MetaFormer). Same architecture as PR openai#505's "SwiGLU" (2 weight matrices, not a gated MLP). Zero step-time overhead, ~34K params (66KB fp16).
TrigramHash: 3-token XOR-hash embedding extending BigramHash to trigram context. 4096 buckets, 32-dim, ~147K params (108KB int6). Independent contribution.
BigramHash doubled to 4096 buckets (from 2048) for fewer collisions.
All features env-var controlled and default ON. Artifact headroom: ~466KB remaining (well within the 16MB cap).
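The two modules described above can be sketched in PyTorch. This is a minimal sketch, not the PR's actual code: the module names, the hash mixing constants, and the roll-based construction of the trigram context are assumptions.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Star-ReLU (MetaFormer): learnable per-channel scale and bias
    on a squared-ReLU activation. Adds only 2*dim params per layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x).square() + self.bias

class TrigramHash(nn.Module):
    """XOR-hash embedding over the last three token ids.
    Bucket count, dim, and the odd mixing constants are assumptions."""
    def __init__(self, buckets: int = 4096, dim: int = 32):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (B, T). Shift right to get the two previous tokens.
        prev1 = torch.roll(ids, 1, dims=1); prev1[:, 0] = 0
        prev2 = torch.roll(ids, 2, dims=1); prev2[:, :2] = 0
        # Multiply by distinct odd constants before XOR so permuted
        # trigrams don't trivially collide into the same bucket.
        h = (ids * 0x9E3779B1) ^ (prev1 * 0x85EBCA77) ^ (prev2 * 0xC2B2AE3D)
        return self.emb(h % self.buckets)
```

At init, Star-ReLU is exactly relu², so it can only help if the learned scale/bias find a better operating point.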
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
Sigmoid skip gates (PR openai#505): replace additive skip connections with a sigmoid-gated blend: x = gate*x + (1-gate)*scaled_skip. Learned per-dim gates init to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).
Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get DECODER_LR_MULT=2.0 applied on top of the per-layer LR splits. Combined with the quant-damage LR, the decoder proj gets 1.5*2 = 3x base LR.
Both features are env-var controlled and default ON.
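The gated skip blend can be written as a tiny module. A minimal sketch, assuming a per-dimension logit initialized to zero so the gate starts at 0.5 (the class name is ours, not the PR's):

```python
import torch
import torch.nn as nn

class SigmoidSkipGate(nn.Module):
    """U-Net style skip blend: x = g*x + (1-g)*skip with a learned
    per-dimension gate. Logits init to 0, so sigmoid(0)=0.5 makes the
    module start as a plain average of the two paths."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)
        return g * x + (1.0 - g) * skip
```

Because the blend is convex, the gate can only interpolate between the residual and the skip path; it cannot amplify either one above its input scale.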
I ran this and got a ~20 MB artifact. Maybe something's different on my machine?
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024 → 2048, batch 524K → 786K (PR #505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into the attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0 → 0.3, EMA 0.9985 → 0.997, warmdown 6000 → 3500
- All illegal TTT remains purged. Score-first only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt. Those techniques are from PR openai#505, which has a different architecture (kv8, h1792); they don't transfer to our kv4/h1536 setup.
Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3
Our unique stack remains: VR + GA + Star-ReLU + per-layer LR + GradQuant + TrigramHash
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792 config produces ~20MB artifacts. The real non-TTT target is openai#445 at 1.1236.
Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)² (-0.003 BPB, zero params).
LeakyReLU(0.5)²: zero extra params, proven -0.003 BPB vs ReLU². Addresses the dead-neuron problem. LEAKY_RELU=1 env var.
run_no_ttt_best.sh: run3 base + three free lunches:
- MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
- LeakyReLU(0.5)² (zero params, -0.003 BPB)
- QAT=1 (run5 proved negative quant gap)
Drops sigmoid gates and decoder 2x LR (run6 showed they hurt). The real target is openai#445 at 1.1236 (not openai#505, which doesn't fit in 16MB).
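One plausible form of the LeakyReLU(0.5)² activation is sketched below. Whether the square is sign-preserving is an implementation detail the commits don't specify; we assume the sign-preserving variant here (`y * |y|`) so the function stays monotone while still passing gradient for negative inputs, unlike relu² which is flat there.

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """Sign-preserving squared LeakyReLU (assumed variant):
    positive inputs map to x^2, negative inputs map to -(slope*x)^2,
    so the gradient is nonzero everywhere and dead neurons can recover."""
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y.abs()
```

At slope=0 this degenerates to the ReLU² baseline, so the slope is the only knob being changed.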
Please add more detail to your PR: train logs and submission.json.
Hey all -- himanalot is right. We ran validation but miscalculated the artifact size before submitting -- we thought we were under 16MB, but it turned out we weren't. Apologies to cocohearts and anyone who spent time on this because of that mistake. Working on a corrected submission now. Will update with a valid artifact, train logs, and submission.json once it's clean.
…alid

- New artifact: int5 per-row quantization + zstd-22, 14,549,183 bytes (13.88 MB)
- val_bpb: 1.14802544 (float: 1.1519, quant degradation: 0.0039)
- Architecture: GEPA 11L, LeakyReLU(0.5)², XSA-all, mlp_hidden=1792
- Training: 7870 steps, 600s wallclock, matrix_lr=0.025, warmdown=6000
- Hardware: 8xH100
- Added submission.json and train logs per cocohearts' request
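The per-row int5 scheme can be sketched in a few lines of numpy. This is a minimal sketch matching the description above; the function names and the fp16 per-row scale storage are assumptions, and the zstd-22 compression of the packed codes is omitted.

```python
import numpy as np

def quantize_int5_per_row(w: np.ndarray):
    """Symmetric per-row int5 quantization: each row gets its own scale
    so one outlier row doesn't inflate the error of every other row.
    Codes land in the signed 5-bit range [-15, 15]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale.astype(np.float16)                # codes + fp16 scales

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)
```

The round-trip error per element is bounded by about half a quantization step (scale/2), which is consistent with the small 0.0039 bpb quant degradation reported.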
Update (March 25, 2026): Corrected submission pushed to branch.
Architecture unchanged (GEPA 11L). Key training changes from the prior submission: matrix_lr=0.025, warmdown=6000, 7870 steps within the 600s budget (see the commit message above).
Loss curve: 1.3603 → 1.2987 → 1.2669 → 1.2460 → 1.2270 → 1.2066 → 1.1780 → 1.1519 (float, 600s) → 1.14802544 (post-quant int5+zstd22)
Superseded by PR #861 — corrected valid submission at 1.1326 BPB, 15.51 MB. Closing this one. |
- Submission train_gpt.py with all 32 techniques from the execution plan, each gated by environment variables (disabled by default)
- Optuna-based search framework with a validate mode (per-technique smoke test) and a search mode (TPE over the joint technique + model-size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn, PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization (variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time (TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
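The "gated by environment variables" pattern the commit describes can be sketched as a one-line helper. This is illustrative only: the helper name and the example flag names below mirror the convention used elsewhere in the thread (e.g. LEAKY_RELU=1, SIGMOID_SKIP_GATES=1) but are assumptions about the actual code.

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Read a 0/1 environment-variable gate so every technique can be
    toggled independently from the shell without editing train_gpt.py."""
    return os.environ.get(name, "1" if default else "0") == "1"

# Hypothetical gates following the PR's naming convention:
USE_STAR_RELU = env_flag("STAR_RELU", default=False)
USE_TRIGRAM_HASH = env_flag("TRIGRAM_HASH", default=False)
```

Keeping every gate off by default lets the Optuna search flip flags per trial while the baseline run stays reproducible.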
SwiGLU + VE128 + U-Net Skip Gates (No TTT)
val_bpb (3-seed mean): 1.11807945 (std: 0.000836, best: 1.11691774)
Architecture
11-layer transformer:
- SwiGLU-style MLP with Star-ReLU activation (2 weight matrices, not a gated MLP)
- U-Net skip gates
- XSA4 attention
- Value Embeddings: VE128 on layers 9-10
- BigramHash embedding, 8192 buckets
Training Configuration
- seq_len=2048, batch=786K tokens
- EMA 0.997, warmdown=3500
Key Ablation: Sequence Length
seq_len 2048 yields -0.008 bpb vs 1024.
Provenance
All architectural components were discovered through systematic ablation search and Codex-guided exploration. Value Embeddings adapted from community work. No test-time training.
Disclosure
This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).
Compute
~18 min/seed on 8xH100 SXM (Modal). Three seeds for verification.