Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #657
anthony-maio wants to merge 20 commits into openai:main
Conversation
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from the PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K-token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce the proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
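The score-first protocol above can be sketched in plain Python. This is an illustrative toy (a scalar quadratic objective and hypothetical function names, not the actual train_gpt.py TTT path, which adapts transformer blocks and scores under inference_mode); it shows the key legality point: each chunk is scored with the current weights BEFORE any update on that chunk.

```python
import math

# Hypothetical score-first TTT sketch. SGD + momentum 0.9, lr=0.002 with cosine
# decay, 3 epochs per chunk, matching the commit message's protocol.
def ttt_score_first(chunks, loss_fn, grad_fn, w, lr0=0.002, momentum=0.9, epochs=3):
    scores, v, step = [], 0.0, 0
    total = len(chunks) * epochs
    for chunk in chunks:
        scores.append(loss_fn(w, chunk))  # score BEFORE training on this chunk
        for _ in range(epochs):           # then adapt on it
            lr = lr0 * 0.5 * (1 + math.cos(math.pi * step / total))  # cosine decay
            v = momentum * v - lr * grad_fn(w, chunk)
            w = w + v
            step += 1
    return scores, w

# Toy objective standing in for per-chunk bpb: loss = (w - target)^2
loss = lambda w, t: (w - t) ** 2
grad = lambda w, t: 2 * (w - t)
scores, w = ttt_score_first([1.0, 1.0, 1.0], loss, grad, w=0.0)
# scores[0] is measured with the untouched weights; later chunks benefit
# from adaptation on earlier ones, so scores decrease across chunks.
```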
RunPod parameter-golf template doesn't have flash-attn pre-installed. Falls back to F.scaled_dot_product_attention with GQA expansion. Slower (~120ms vs 84ms) but functional for testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Hopper interface is at flash_attn.flash_attn_interface, not flash_attn_interface (top-level). Added to the import chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…RIDE P1: TTT was running on the pre-quantization base_model instead of the int6 round-tripped eval_model. This overstated TTT gains since the artifact model has quantization noise. Now matches PR openai#473's approach. P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now honors the configured stride so TTT results stay consistent with the sliding window eval path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Saves layer 0's raw V output and blends it into all subsequent layers via learned sigmoid gates (initialized at -1.5 ≈ 18% mixing). PR openai#569 achieves 1.1175 with VRL+LeakyReLU²+Full GPTQ (no TTT). VRL is orthogonal to our existing VE128 (shared value embedding). Enabled by default (VRL_ENABLED=1). Gate adds 1 scalar param per layer (10 params total for 11L). Zero compute overhead beyond the gated blend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
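The gated blend above is cheap enough to sketch in full. A minimal stand-in (class and method names are illustrative, not from train_gpt.py), showing why a -1.5 logit init yields roughly 18% mixing of layer 0's V:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical VRL gate: one learned scalar per layer, initialized at -1.5,
# so sigmoid(-1.5) ≈ 0.182 -> each layer starts by mixing ~18% of layer 0's V.
class VRLGate:
    def __init__(self, init=-1.5):
        self.logit = init  # the single learned parameter for this layer

    def blend(self, v_layer, v0):
        g = sigmoid(self.logit)
        # element-wise: (1 - g) * this layer's V + g * layer 0's saved V
        return [(1 - g) * a + g * b for a, b in zip(v_layer, v0)]

gate = VRLGate()
out = gate.blend([1.0, 1.0], [0.0, 2.0])  # toy V vectors
```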
Saves 19.6KB of code size toward fitting under the 16MB artifact limit. Model binary (16.08MB) + code (60KB) = 16.14MB, still 139KB over. Next step: reduce model size via tighter GPTQ-lite or a smaller MLP. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1472 % 128 == 0, perfect for H100 tensor cores. Saves ~720K params (65K per block × 11), ~324KB compressed. Should bring artifact from 16.16MB to ~15.8MB, under the 16MB cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 42 produced a 16.72MB artifact (over the 16MB limit) despite seed 1337 fitting at 15.45MB. Two changes to ensure all seeds fit:
- bigram_vocab_size 2048→1536: saves ~192KB (64K fewer hash buckets)
- Control tensors (attn_scale, mlp_scale, resid_mix, etc.) stored as fp16 instead of fp32: saves ~57KB. The dequant path already handles the fp16→fp32 upcast.

Combined savings of ~250KB should keep all seeds under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key insight from council: PR openai#549 uses lzma, not zstd. lzma is stdlib (no pip install needed!) and compresses 2-5% tighter on quantized weights. This recovers ~300-800KB of headroom, enough to restore:
- mlp_mult: 2.875 → 3.0 (recovers ~0.001-0.002 bpb)
- bigram_vocab_size: 1536 → 2048 (recovers ~0.001 bpb)
- lzma.compress(data, preset=6) replaces zstd-22/zlib

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
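The compressor swap is a one-liner against the stdlib. A minimal round-trip sketch (the payload here is a synthetic stand-in for packed int6 weights, not the real artifact):

```python
import lzma

# Stdlib lzma replaces zstd-22/zlib in the artifact export path, as in the
# commit message: lzma.compress(data, preset=6). Quantized weights are
# low-entropy byte streams, which LZMA's larger match window exploits well.
payload = bytes(range(32)) * 4096  # stand-in for packed quantized weights

packed = lzma.compress(payload, preset=6)
restored = lzma.decompress(packed)  # exact round trip at eval/load time
```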
Seed 1337: 1.1234 bpb, 15.89MB artifact, 87.1ms/step, 6889 steps. Seeds 42 and 2025 running — logs will be added when complete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new 10min/16MB record entry implementing LeakyReLU(0.5)² + Value Residual Learning (VRL) and switching artifact compression to stdlib lzma, along with a companion “reproduce #414” entry for comparison.
Changes:
- New record training script and metadata for 2026-03-24_LeakyReLU2_VRL_LZMA (VRL + LeakyReLU² + lzma-compressed int6 artifact).
- Added README documenting the approach and reproduction steps.
- Added a 2026-03-23_Reproduce414_LegalTTT folder with a matching training script and placeholder submission metadata.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/train_gpt.py | Implements VRL gates and lzma-compressed int6 export + eval/roundtrip. |
| records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/submission.json | Captures record metadata (val_bpb, sizes, blurb). |
| records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/README.md | Describes innovations and provides reproduction instructions. |
| records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/train_gpt.py | Baseline/repro script (same core training/export pipeline). |
| records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/submission.json | Placeholder metadata for the repro entry (currently null fields). |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8ee0dc3b3
Seed 1337: 1.1234 bpb, 15.89MB
Seed 42: 1.1225 bpb, 15.88MB
Seed 2025: 1.1228 bpb, 15.89MB
Mean: 1.1229 (std 0.0005)

All 3 artifacts under 16MB. All logs attached. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove misleading 'int8+zlib' log labels; use the correct int6+compressor labels
- Remove unused late_k_layers variable in quantization
- Fill submission.json null fields with actual values
- All changes applied in both the 2026-03-23 and 2026-03-24 record folders

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Layer 0's V output is blended 50/50 into all subsequent layers' V. Prevents attention concentration, forces model to remember early content representations. Zero extra params, minimal speed cost. Proven in competition PR openai#657 (1.1229 BPB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PR should only add records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/. Removed:
- records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/ (duplicate submission)
- .private/substack_draft_notes.md (not part of the submission)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extra ~177KB compressed. Current artifact at 15.89MB, cap at 16MB. Tight but should fit — bigram weights are highly compressible (initialized near zero). Will validate on next run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Input-dependent gate: sigmoid(Linear(x)) applied per-head after SDPA. Init: weight=zeros, bias=4.0 (sigmoid(4)≈0.98, near-identity start). Eliminates attention sinks. ~0.002-0.003 bpb gain per PR openai#638 ablation. Stacks additively with VRL (combined: -0.017 in 9L ablation). ~45K params total (negligible). attn_gate added to control tensor patterns. Enabled by default (GA_ENABLED=1). Credit: PR openai#638, arXiv:2505.06708. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
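The gate described above can be sketched with NumPy (shapes and names are illustrative; the real implementation sits inside the attention module of train_gpt.py). The point of the zero weight / bias=4.0 init is that the gate starts at sigmoid(4) ≈ 0.98 everywhere, i.e. near-identity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_heads, d_model, n_tokens = 4, 32, 8

# Hypothetical per-head output gate: gate = sigmoid(x @ W + b), applied to the
# per-head SDPA output. Zero-init weight means the gate depends only on the
# bias at step 0; bias=4.0 gives sigmoid(4) ≈ 0.98, a near-identity start.
W = np.zeros((d_model, n_heads))
b = np.full(n_heads, 4.0)

x = np.random.randn(n_tokens, d_model)                       # residual-stream input
attn_out = np.random.randn(n_tokens, n_heads, d_model // n_heads)  # per-head SDPA output
gate = sigmoid(x @ W + b)                                    # (tokens, heads)
gated = attn_out * gate[:, :, None]                          # broadcast over head dim
```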
10-line training-time regularization that pushes weights toward flat minima where int6 rounding does less damage. Penalty per row: lambda * mean(w²) * (row_max/15)² / 12 Over-penalizes (uses /15 vs actual /31) for extra margin. Active only when QAT is enabled (warmdown phase). Zero eval cost. Fully legal per issue openai#677 (training-time only). CROWNQ_LAMBDA=0.01 (default). Credit: PR openai#693. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
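The per-row penalty formula above is simple enough to write out directly. A sketch under the stated formula (the function name and the toy weight matrix are illustrative; in training this term would be added to the loss during the QAT/warmdown phase only):

```python
import numpy as np

# Hypothetical flatness penalty per the commit message:
#   per row: lambda * mean(w^2) * (row_max / 15)^2 / 12
# Using /15 instead of the actual int6 divisor /31 deliberately
# over-penalizes, buying extra margin against rounding damage.
def crownq_penalty(w, lam=0.01):
    row_max = np.abs(w).max(axis=1)                 # per-row abs max (quant-scale proxy)
    per_row = lam * (w ** 2).mean(axis=1) * (row_max / 15.0) ** 2 / 12.0
    return per_row.sum()

w = np.ones((2, 4))        # toy weights: mean(w^2) = 1 and row_max = 1 per row
p = crownq_penalty(w)      # 2 rows * 0.01 * (1/15)^2 / 12
```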
…armdown" This reverts commit 5e0e793.
… output" This reverts commit 40ec415.
This reverts commit 08395d8.
Superseded by PR #175 (earlier timestamp, same results).
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from the initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
Summary
val_bpb = 1.1229 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All 3 artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovations
LeakyReLU(0.5)²: One-line activation swap preserving negative gradient flow through MLP. ~-0.002 BPB vs standard relu². Credit: PR #493 @parinzee, PR #518 @sofiabod.
Value Residual Learning (VRL): Layer 0's V output blended into all subsequent attention layers via learned sigmoid gates. Combats attention concentration (ResFormer, arXiv:2410.17897). +10 scalar params. Credit: PR #569 @gowtham0992.
lzma compression: Stdlib replacement for zstd-22, compresses 2-5% tighter on quantized weights. Recovers ~300-500KB headroom, enabling full MLP 3× + BigramHash 2048 under 16MB without capacity cuts. No external dependencies.
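The activation swap in the first innovation is a one-liner. A scalar sketch, assuming the straightforward composition LeakyReLU(x, 0.5) then square (the record's train_gpt.py applies this elementwise inside the MLP; function name here is illustrative):

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) then square: negative pre-activations keep a small,
    # scaled response (and a nonzero gradient) where plain relu()**2
    # would be exactly zero, avoiding dead neurons at 22M params.
    y = x if x > 0 else slope * x
    return y * y

leaky_relu_sq(2.0)   # 4.0
leaky_relu_sq(-2.0)  # (0.5 * -2)^2 = 1.0, where relu()**2 gives 0.0
```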
Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
Credits
Test plan
🤖 Generated with Claude Code