Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean) #657
anthony-maio wants to merge 20 commits into openai:main
Conversation
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from the PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K-token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce the proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
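The score-first protocol above can be sketched in plain Python. This is an illustrative toy (a scalar quadratic objective and hypothetical function names, not the actual train_gpt.py TTT path, which adapts transformer blocks and scores under inference_mode); it shows the key legality point: each chunk is scored with the current weights BEFORE any update on that chunk.

```python
import math

# Hypothetical score-first TTT sketch. SGD + momentum 0.9, lr=0.002 with cosine
# decay, 3 epochs per chunk, matching the commit message's protocol.
def ttt_score_first(chunks, loss_fn, grad_fn, w, lr0=0.002, momentum=0.9, epochs=3):
    scores, v, step = [], 0.0, 0
    total = len(chunks) * epochs
    for chunk in chunks:
        scores.append(loss_fn(w, chunk))  # score BEFORE training on this chunk
        for _ in range(epochs):           # then adapt on it
            lr = lr0 * 0.5 * (1 + math.cos(math.pi * step / total))  # cosine decay
            v = momentum * v - lr * grad_fn(w, chunk)
            w = w + v
            step += 1
    return scores, w

# Toy objective standing in for per-chunk bpb: loss = (w - target)^2
loss = lambda w, t: (w - t) ** 2
grad = lambda w, t: 2 * (w - t)
scores, w = ttt_score_first([1.0, 1.0, 1.0], loss, grad, w=0.0)
# scores[0] is measured with the untouched weights; later chunks benefit
# from adaptation on earlier ones, so scores decrease across chunks.
```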
RunPod parameter-golf template doesn't have flash-attn pre-installed. Falls back to F.scaled_dot_product_attention with GQA expansion. Slower (~120ms vs 84ms) but functional for testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Hopper interface is at flash_attn.flash_attn_interface, not flash_attn_interface (top-level). Added to the import chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…RIDE P1: TTT was running on the pre-quantization base_model instead of the int6 round-tripped eval_model. This overstated TTT gains since the artifact model has quantization noise. Now matches PR openai#473's approach. P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now honors the configured stride so TTT results stay consistent with the sliding window eval path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Saves layer 0's raw V output and blends it into all subsequent layers via learned sigmoid gates (initialized at -1.5 ≈ 18% mixing). PR openai#569 achieves 1.1175 with VRL+LeakyReLU²+Full GPTQ (no TTT). VRL is orthogonal to our existing VE128 (shared value embedding). Enabled by default (VRL_ENABLED=1). Gate adds 1 scalar param per layer (10 params total for 11L). Zero compute overhead beyond the gated blend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
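The gated blend above is cheap enough to sketch in full. A minimal stand-in (class and method names are illustrative, not from train_gpt.py), showing why a -1.5 logit init yields roughly 18% mixing of layer 0's V:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical VRL gate: one learned scalar per layer, initialized at -1.5,
# so sigmoid(-1.5) ≈ 0.182 -> each layer starts by mixing ~18% of layer 0's V.
class VRLGate:
    def __init__(self, init=-1.5):
        self.logit = init  # the single learned parameter for this layer

    def blend(self, v_layer, v0):
        g = sigmoid(self.logit)
        # element-wise: (1 - g) * this layer's V + g * layer 0's saved V
        return [(1 - g) * a + g * b for a, b in zip(v_layer, v0)]

gate = VRLGate()
out = gate.blend([1.0, 1.0], [0.0, 2.0])  # toy V vectors
```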
Saves 19.6KB of code size toward fitting under the 16MB artifact limit. Model binary (16.08MB) + code (60KB) = 16.14MB, still 139KB over. Next step: reduce model size via tighter GPTQ-lite or a smaller MLP. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1472 % 128 == 0, perfect for H100 tensor cores. Saves ~720K params (65K per block × 11), ~324KB compressed. Should bring artifact from 16.16MB to ~15.8MB, under the 16MB cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 42 produced a 16.72MB artifact (over the 16MB limit) despite seed 1337 fitting at 15.45MB. Two changes to ensure all seeds fit:
- bigram_vocab_size 2048→1536: saves ~192KB (64K fewer hash buckets)
- Control tensors (attn_scale, mlp_scale, resid_mix, etc.) stored as fp16 instead of fp32: saves ~57KB. The dequant path already handles the fp16→fp32 upcast.

Combined savings of ~250KB should keep all seeds under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key insight from council: PR openai#549 uses lzma, not zstd. lzma is stdlib (no pip install needed!) and compresses 2-5% tighter on quantized weights. This recovers ~300-800KB of headroom, enough to restore:
- mlp_mult: 2.875 → 3.0 (recovers ~0.001-0.002 bpb)
- bigram_vocab_size: 1536 → 2048 (recovers ~0.001 bpb)
- lzma.compress(data, preset=6) replaces zstd-22/zlib

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
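The compressor swap is a one-liner against the stdlib. A minimal round-trip sketch (the payload here is a synthetic stand-in for packed int6 weights, not the real artifact):

```python
import lzma

# Stdlib lzma replaces zstd-22/zlib in the artifact export path, as in the
# commit message: lzma.compress(data, preset=6). Quantized weights are
# low-entropy byte streams, which LZMA's larger match window exploits well.
payload = bytes(range(32)) * 4096  # stand-in for packed quantized weights

packed = lzma.compress(payload, preset=6)
restored = lzma.decompress(packed)  # exact round trip at eval/load time
```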
Seed 1337: 1.1234 bpb, 15.89MB artifact, 87.1ms/step, 6889 steps. Seeds 42 and 2025 running — logs will be added when complete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new 10min/16MB record entry implementing LeakyReLU(0.5)² + Value Residual Learning (VRL) and switching artifact compression to stdlib lzma, along with a companion “reproduce #414” entry for comparison.
Changes:
- New record training script and metadata for 2026-03-24_LeakyReLU2_VRL_LZMA (VRL + LeakyReLU² + lzma-compressed int6 artifact).
- Added README documenting the approach and reproduction steps.
- Added a 2026-03-23_Reproduce414_LegalTTT folder with a matching training script and placeholder submission metadata.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/train_gpt.py | Implements VRL gates and lzma-compressed int6 export + eval/roundtrip. |
| records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/submission.json | Captures record metadata (val_bpb, sizes, blurb). |
| records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/README.md | Describes innovations and provides reproduction instructions. |
| records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/train_gpt.py | Baseline/repro script (same core training/export pipeline). |
| records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/submission.json | Placeholder metadata for the repro entry (currently null fields). |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8ee0dc3b3
Seed 1337: 1.1234 bpb, 15.89MB
Seed 42: 1.1225 bpb, 15.88MB
Seed 2025: 1.1228 bpb, 15.89MB
Mean: 1.1229 (std 0.0005)

All 3 artifacts under 16MB. All logs attached. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove misleading 'int8+zlib' log labels; use the correct int6+compressor labels
- Remove unused late_k_layers variable in quantization
- Fill submission.json null fields with actual values
- All changes applied in both the 2026-03-23 and 2026-03-24 record folders

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Layer 0's V output is blended 50/50 into all subsequent layers' V. Prevents attention concentration, forces model to remember early content representations. Zero extra params, minimal speed cost. Proven in competition PR openai#657 (1.1229 BPB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PR should only add records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/. Removed:
- records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/ (duplicate submission)
- .private/substack_draft_notes.md (not part of the submission)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extra ~177KB compressed. Current artifact at 15.89MB, cap at 16MB. Tight but should fit — bigram weights are highly compressible (initialized near zero). Will validate on next run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Input-dependent gate: sigmoid(Linear(x)) applied per-head after SDPA. Init: weight=zeros, bias=4.0 (sigmoid(4)≈0.98, near-identity start). Eliminates attention sinks. ~0.002-0.003 bpb gain per PR openai#638 ablation. Stacks additively with VRL (combined: -0.017 in 9L ablation). ~45K params total (negligible). attn_gate added to control tensor patterns. Enabled by default (GA_ENABLED=1). Credit: PR openai#638, arXiv:2505.06708. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
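The gate described above can be sketched with NumPy (shapes and names are illustrative; the real implementation sits inside the attention module of train_gpt.py). The point of the zero weight / bias=4.0 init is that the gate starts at sigmoid(4) ≈ 0.98 everywhere, i.e. near-identity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_heads, d_model, n_tokens = 4, 32, 8

# Hypothetical per-head output gate: gate = sigmoid(x @ W + b), applied to the
# per-head SDPA output. Zero-init weight means the gate depends only on the
# bias at step 0; bias=4.0 gives sigmoid(4) ≈ 0.98, a near-identity start.
W = np.zeros((d_model, n_heads))
b = np.full(n_heads, 4.0)

x = np.random.randn(n_tokens, d_model)                       # residual-stream input
attn_out = np.random.randn(n_tokens, n_heads, d_model // n_heads)  # per-head SDPA output
gate = sigmoid(x @ W + b)                                    # (tokens, heads)
gated = attn_out * gate[:, :, None]                          # broadcast over head dim
```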
10-line training-time regularization that pushes weights toward flat minima where int6 rounding does less damage. Penalty per row: lambda * mean(w²) * (row_max/15)² / 12 Over-penalizes (uses /15 vs actual /31) for extra margin. Active only when QAT is enabled (warmdown phase). Zero eval cost. Fully legal per issue openai#677 (training-time only). CROWNQ_LAMBDA=0.01 (default). Credit: PR openai#693. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
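The per-row penalty formula above is simple enough to write out directly. A sketch under the stated formula (the function name and the toy weight matrix are illustrative; in training this term would be added to the loss during the QAT/warmdown phase only):

```python
import numpy as np

# Hypothetical flatness penalty per the commit message:
#   per row: lambda * mean(w^2) * (row_max / 15)^2 / 12
# Using /15 instead of the actual int6 divisor /31 deliberately
# over-penalizes, buying extra margin against rounding damage.
def crownq_penalty(w, lam=0.01):
    row_max = np.abs(w).max(axis=1)                 # per-row abs max (quant-scale proxy)
    per_row = lam * (w ** 2).mean(axis=1) * (row_max / 15.0) ** 2 / 12.0
    return per_row.sum()

w = np.ones((2, 4))        # toy weights: mean(w^2) = 1 and row_max = 1 per row
p = crownq_penalty(w)      # 2 rows * 0.01 * (1/15)^2 / 12
```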
…armdown" This reverts commit 5e0e793.
… output" This reverts commit 40ec415.
This reverts commit 08395d8.
Superseded by PR #175 (earlier timestamp, same results).
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from the initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
Summary
val_bpb = 1.1229 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All 3 artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovations
LeakyReLU(0.5)²: One-line activation swap preserving negative gradient flow through MLP. ~-0.002 BPB vs standard relu². Credit: PR #493 @parinzee, PR #518 @sofiabod.
Value Residual Learning (VRL): Layer 0's V output blended into all subsequent attention layers via learned sigmoid gates. Combats attention concentration (ResFormer, arXiv:2410.17897). +10 scalar params. Credit: PR #569 @gowtham0992.
lzma compression: Stdlib replacement for zstd-22, compresses 2-5% tighter on quantized weights. Recovers ~300-500KB headroom, enabling full MLP 3× + BigramHash 2048 under 16MB without capacity cuts. No external dependencies.
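The activation swap in the first innovation is a one-liner. A scalar sketch, assuming the straightforward composition LeakyReLU(x, 0.5) then square (the record's train_gpt.py applies this elementwise inside the MLP; function name here is illustrative):

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) then square: negative pre-activations keep a small,
    # scaled response (and a nonzero gradient) where plain relu()**2
    # would be exactly zero, avoiding dead neurons at 22M params.
    y = x if x > 0 else slope * x
    return y * y

leaky_relu_sq(2.0)   # 4.0
leaky_relu_sq(-2.0)  # (0.5 * -2)^2 = 1.0, where relu()**2 gives 0.0
```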
Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
Credits
Test plan
🤖 Generated with Claude Code