
Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean)#657

Closed
anthony-maio wants to merge 20 commits into openai:main from anthony-maio:submission/reproduce-414

Conversation


@anthony-maio anthony-maio commented Mar 24, 2026

Summary

val_bpb = 1.1229 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | Steps | val_bpb | Artifact (bytes) |
|------|----------|-------|---------|------------------|
| 1337 | 87.1 ms  | 6,889 | 1.1234  | 15,887,926 |
| 42   | 88.0 ms  | 6,818 | 1.1225  | 15,877,570 |
| 2025 | 87.5 ms  | 6,857 | 1.1228  | 15,890,566 |
| Mean | 87.5 ms  | 6,855 | 1.1229 (std 0.0005) | |

All 3 artifacts under 16,000,000 bytes. All 3 train logs attached.

Key Innovations

LeakyReLU(0.5)²: One-line activation swap preserving negative gradient flow through MLP. ~-0.002 BPB vs standard relu². Credit: PR #493 @parinzee, PR #518 @sofiabod.

Value Residual Learning (VRL): Layer 0's V output blended into all subsequent attention layers via learned sigmoid gates. Combats attention concentration (ResFormer, arXiv:2410.17897). +10 scalar params. Credit: PR #569 @gowtham0992.

lzma compression: Stdlib replacement for zstd-22, compresses 2-5% tighter on quantized weights. Recovers ~300-500KB headroom, enabling full MLP 3× + BigramHash 2048 under 16MB without capacity cuts. No external dependencies.

Architecture

PR #414 base + LeakyReLU² + VRL + lzma:

| Component | Details |
|-----------|---------|
| Layers | 11L, 512d, 8H/4KV (GQA), U-Net skips (5 enc, 6 dec) |
| MLP | 3× expansion (1536), LeakyReLU(0.5)² activation |
| Attention | XSA4, Partial RoPE 16/64, LN Scale 1/√(i+1), VRL |
| Embeddings | BigramHash(2048), VE128 (layers 9-10), SmearGate |
| Training | EMA(0.997) + Tight SWA, Late QAT (STE@0.15), OrthoInit |
| Optimizer | Muon WD=0.04, warmdown=3500, batch=786K tokens |
| Quantization | GPTQ-lite int6 + lzma (preset=6) |
| Attention kernel | FlashAttention 3 (Hopper native) |

Credits

Test plan

  • Seed 1337: 1.1234 bpb, 15.89MB valid
  • Seed 42: 1.1225 bpb, 15.88MB valid
  • Seed 2025: 1.1228 bpb, 15.89MB valid
  • 3-seed mean: 1.1229, std 0.0005
  • All 3 train logs attached
  • All artifacts under 16,000,000 bytes

🤖 Generated with Claude Code

anthony-maio and others added 11 commits March 22, 2026 21:03
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE,
LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22).

Added legal score-first TTT from PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce proven frontier instead of iterating on
our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding
legal TTT should push to ~1.121.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
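The score-first protocol above can be sketched as a small loop (a minimal sketch: `model(chunk)` returning a scalar mean loss is an assumption, and the commit's block-freezing and cosine-decay details are omitted):

```python
import torch

def score_then_train(model, chunks, optimizer, epochs=3):
    """Score-first TTT: each chunk is scored BEFORE the model adapts to it,
    so no chunk is ever evaluated after the model has trained on it."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        with torch.inference_mode():            # 1) score the chunk first
            total_loss += model(chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        for _ in range(epochs):                 # 2) then train on that chunk
            optimizer.zero_grad()
            model(chunk).backward()
            optimizer.step()
    return total_loss / total_tokens
```

The key property is that the reported loss only ever reflects unseen-at-score-time data, which is what makes the TTT gain legal to report.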
RunPod parameter-golf template doesn't have flash-attn pre-installed.
Falls back to F.scaled_dot_product_attention with GQA expansion.
Slower (~120ms vs 84ms) but functional for testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
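Such a fallback can be sketched as follows (assuming a (B, H, T, D) layout for q and (B, KV_H, T, D) for k/v, with the query head count a multiple of the KV head count):

```python
import torch
import torch.nn.functional as F

def sdpa_gqa(q, k, v):
    """Fallback when flash-attn is unavailable: expand the KV heads to
    match the query heads (GQA), then use PyTorch's built-in SDPA."""
    n_rep = q.size(1) // k.size(1)
    if n_rep > 1:
        k = k.repeat_interleave(n_rep, dim=1)
        v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

The `repeat_interleave` materializes the expanded KV tensors, which is part of why this path is slower than a fused GQA kernel.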
The Hopper interface is at flash_attn.flash_attn_interface, not
flash_attn_interface (top-level). Added to the import chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
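The resulting import chain might look like this (a sketch; only the two module paths named in the commit are assumed):

```python
# The Hopper interface lives at flash_attn.flash_attn_interface, not at a
# top-level module; try it first, then the top-level layout, then fall
# back to F.scaled_dot_product_attention.
try:
    from flash_attn.flash_attn_interface import flash_attn_func
except ImportError:
    try:
        from flash_attn_interface import flash_attn_func
    except ImportError:
        flash_attn_func = None  # SDPA fallback path will be used
```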
…RIDE

P1: TTT was running on the pre-quantization base_model instead of the
int6 round-tripped eval_model. This overstated TTT gains since the
artifact model has quantization noise. Now matches PR openai#473's approach.

P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now
honors the configured stride so TTT results stay consistent with
the sliding window eval path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb
from this change. LeakyReLU preserves gradient flow through negative
pre-activations while maintaining the sparsity/gating benefits of
squaring. At 22M params, dead neurons from hard ReLU are expensive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
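The swap itself is one line (a sketch; that the MLP uses the plain square of the activation, mirroring the upstream `relu().square()` convention, is an assumption):

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Drop-in replacement for relu(x).square(): slope 0.5 keeps gradient
    # flowing through negative pre-activations; squaring keeps the
    # sparsity/gating shape of the original activation.
    return F.leaky_relu(x, negative_slope=0.5).square()
```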
Saves layer 0's raw V output and blends it into all subsequent layers
via learned sigmoid gates (initialized at -1.5 ≈ 18% mixing).
PR openai#569 achieves 1.1175 with VRL+LeakyReLU²+Full GPTQ (no TTT).
VRL is orthogonal to our existing VE128 (shared value embedding).

Enabled by default (VRL_ENABLED=1). Gate adds 1 scalar param per layer
(10 params total for 11L). Zero compute overhead beyond the gated blend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
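A minimal sketch of one such gate (class and attribute names are hypothetical; only the -1.5 init and the sigmoid blend come from the commit):

```python
import torch
import torch.nn as nn

class VRLGate(nn.Module):
    """Blends layer 0's V output into a later layer's V via one learned
    scalar gate. Init -1.5 gives sigmoid(-1.5) ~ 0.18 initial mixing."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.tensor(-1.5))

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)
        return (1.0 - g) * v + g * v0  # v0 = layer 0's V output
```

One gate per non-zero layer gives the 10 extra scalar parameters for 11 layers.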
Saves 19.6KB of code size toward fitting under 16MB artifact limit.
Model binary (16.08MB) + code (60KB) = 16.14MB, still 139KB over.
Next step: reduce model size via tighter GPTQ-lite or smaller MLP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1472 % 128 == 0, perfect for H100 tensor cores.
Saves ~720K params (65K per block × 11), ~324KB compressed.
Should bring artifact from 16.16MB to ~15.8MB, under the 16MB cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 42 produced 16.72MB artifact (over 16MB limit) despite seed 1337
fitting at 15.45MB. Two changes to ensure all seeds fit:
- bigram_vocab_size 2048→1536: saves ~192KB (64K fewer hash buckets)
- Control tensors (attn_scale, mlp_scale, resid_mix, etc.) stored as
  fp16 instead of fp32: saves ~57KB. Dequant path already handles
  fp16→fp32 upcast.

Combined savings ~250KB should keep all seeds under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
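The fp16 round-trip for control tensors is a one-liner each way (a sketch; the function names are hypothetical):

```python
import torch

def pack_ctrl(t: torch.Tensor) -> torch.Tensor:
    return t.to(torch.float16)    # stored in the artifact (half the bytes)

def unpack_ctrl(t: torch.Tensor) -> torch.Tensor:
    return t.to(torch.float32)    # dequant-path upcast before use
```

fp16 carries ~11 bits of mantissa, so for small control tensors with O(1) values the round-trip error is on the order of 1e-3, which is negligible next to int6 quantization noise in the main weights.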
Key insight from council: PR openai#549 uses lzma, not zstd. lzma is stdlib
(no pip install needed!) and compresses 2-5% tighter on quantized
weights. This recovers ~300-800KB headroom, enough to restore:
- mlp_mult: 2.875 → 3.0 (recover ~0.001-0.002 bpb)
- bigram_vocab_size: 1536 → 2048 (recover ~0.001 bpb)
- lzma.compress(data, preset=6) replaces zstd-22/zlib

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
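The compressor swap itself is small (a sketch; the artifact's byte layout is unchanged, only the codec differs):

```python
import lzma

def compress_artifact(data: bytes) -> bytes:
    # Stdlib lzma replaces zstd-22/zlib: no pip install needed, and
    # typically a few percent tighter on the quantized int6 payload.
    return lzma.compress(data, preset=6)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```

`preset=6` is lzma's default trade-off; higher presets compress marginally tighter but cost more time at export.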
Seed 1337: 1.1234 bpb, 15.89MB artifact, 87.1ms/step, 6889 steps.
Seeds 42 and 2025 running — logs will be added when complete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 24, 2026 23:20

Copilot AI left a comment


Pull request overview

Adds a new 10min/16MB record entry implementing LeakyReLU(0.5)² + Value Residual Learning (VRL) and switching artifact compression to stdlib lzma, along with a companion “reproduce #414” entry for comparison.

Changes:

  • New record training script and metadata for 2026-03-24_LeakyReLU2_VRL_LZMA (VRL + LeakyReLU² + lzma-compressed int6 artifact).
  • Added README documenting the approach and reproduction steps.
  • Added a 2026-03-23_Reproduce414_LegalTTT folder with a matching training script and placeholder submission metadata.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/train_gpt.py Implements VRL gates and lzma-compressed int6 export + eval/roundtrip.
records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/submission.json Captures record metadata (val_bpb, sizes, blurb).
records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/README.md Describes innovations and provides reproduction instructions.
records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/train_gpt.py Baseline/repro script (same core training/export pipeline).
records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/submission.json Placeholder metadata for the repro entry (currently null fields).



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8ee0dc3b3


Seed 1337: 1.1234 bpb, 15.89MB
Seed 42:   1.1225 bpb, 15.88MB
Seed 2025: 1.1228 bpb, 15.89MB
Mean:      1.1229 (std 0.0005)

All 3 artifacts under 16MB. All logs attached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio changed the title from "Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1234" to "Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean)" on Mar 24, 2026
- Remove misleading 'int8+zlib' log labels, use correct int6+compressor
- Remove unused late_k_layers variable in quantization
- Fix submission.json null fields with actual values
- All changes in both 2026-03-23 and 2026-03-24 record folders

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Mar 25, 2026
Layer 0's V output is blended 50/50 into all subsequent layers' V.
Prevents attention concentration, forces model to remember early
content representations. Zero extra params, minimal speed cost.
Proven in competition PR openai#657 (1.1229 BPB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio and others added 7 commits March 25, 2026 13:48
PR should only add records/track_10min_16mb/2026-03-24_LeakyReLU2_VRL_LZMA/
Removed:
- records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/ (duplicate submission)
- .private/substack_draft_notes.md (not part of submission)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extra ~177KB compressed. Current artifact at 15.89MB, cap at 16MB.
Tight but should fit — bigram weights are highly compressible
(initialized near zero). Will validate on next run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Input-dependent gate: sigmoid(Linear(x)) applied per-head after SDPA.
Init: weight=zeros, bias=4.0 (sigmoid(4)≈0.98, near-identity start).
Eliminates attention sinks. ~0.002-0.003 bpb gain per PR openai#638 ablation.
Stacks additively with VRL (combined: -0.017 in 9L ablation).
~45K params total (negligible). attn_gate added to control tensor patterns.

Enabled by default (GA_ENABLED=1).
Credit: PR openai#638, arXiv:2505.06708.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
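A sketch of such an input-dependent per-head gate (class and attribute names are hypothetical; the zero-weight / bias-4.0 init is from the commit):

```python
import torch
import torch.nn as nn

class HeadGate(nn.Module):
    """Per-head output gate applied after SDPA. At init, weight=0 and
    bias=4.0 make every gate sigmoid(4) ~ 0.98, a near-identity start."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.proj = nn.Linear(dim, n_heads)
        nn.init.zeros_(self.proj.weight)
        nn.init.constant_(self.proj.bias, 4.0)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); attn_out: (B, H, T, D)
        g = torch.sigmoid(self.proj(x))                # (B, T, H)
        return attn_out * g.transpose(1, 2).unsqueeze(-1)
```

The near-identity init means training starts from the ungated baseline, and the model only learns to close gates (suppress attention sinks) where that helps.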
10-line training-time regularization that pushes weights toward flat
minima where int6 rounding does less damage. Penalty per row:
  lambda * mean(w²) * (row_max/15)² / 12
Over-penalizes (uses /15 vs actual /31) for extra margin.
Active only when QAT is enabled (warmdown phase). Zero eval cost.
Fully legal per issue openai#677 (training-time only).

CROWNQ_LAMBDA=0.01 (default). Credit: PR openai#693.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
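The penalty can be written directly from the per-row formula above (a sketch; the function name is hypothetical):

```python
import torch

def flatness_penalty(w: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """Per-row penalty lam * mean(w^2) * (row_max/15)^2 / 12, summed over
    rows. Uses /15 rather than the true int6 divisor (/31) so the penalty
    over-estimates rounding damage, giving extra margin."""
    row_max = w.abs().amax(dim=1)
    return (lam * w.pow(2).mean(dim=1) * (row_max / 15.0) ** 2 / 12.0).sum()
```

Added to the training loss only during the QAT/warmdown phase, it changes nothing at eval time.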
@anthony-maio

Superseded by PR #175 (earlier timestamp, same results).

pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
