Record: 11L XSA-all + LeakyReLU(0.5)² + VR + GA (val_bpb=1.1164, pending 3-seed)#638

Closed
Asukabot0 wants to merge 14 commits into openai:main from Asukabot0:submission/xsa-all-leakyrelu-vr-ga

Conversation

@Asukabot0 commented Mar 24, 2026

Summary

val_bpb = 1.1164 (single seed 1337, pending 3-seed validation) | 15.94 MB | 8xH100 SXM | No TTT

Non-TTT submission within 0.001 BPB of current non-TTT SOTA (1.1154, PR #609). Requesting compute grant for 8xH100 3-seed validation.

Architecture

  • 11L, 512d, 8H/4KV (GQA), MLP 3x, LeakyReLU(0.5)²
  • XSA on all 11 layers (-0.006 BPB vs XSA-last-4)
  • Value Residual + Gated Attention (-0.002 BPB combined)
  • SmearGate, BigramHash(4096), Partial RoPE 16/64, LN Scale
  • EMA(0.997), int6 per-row + zstd-21, U-Net skip connections
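
A minimal sketch of the activation named in the first bullet, assuming `LeakyReLU(0.5)²` means a leaky ReLU with negative slope 0.5 followed by an elementwise square. The function name is hypothetical and the PR's actual code may differ.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring (assumed reading of LeakyReLU(0.5)^2)."""
    # Negative inputs are scaled by the slope instead of being zeroed.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring keeps a nonzero response for negative inputs.
    return y * y


print([leaky_relu_squared(v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)])
# prints [1.0, 0.25, 0.0, 1.0, 4.0]
```

Under this reading, a negative input still produces a nonzero output, unlike a plain ReLU² which would return 0 for every negative value.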

Superseded by #761 (Score-First TTT + N-gram Backoff, 3-seed mean val_bpb=0.9581).

Non-TTT submission: XSA on all 11 layers, LeakyReLU(0.5)², Value Residual,
Gated Attention. Single-GPU 7500-step result, pending 8xH100 3-seed validation.
Artifact 15.94MB (zstd-21). Requesting compute grant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Asukabot0 and others added 7 commits March 25, 2026 16:03
12 defaults were inherited from old PR#398 base and didn't match
the actual p17 experiment config:
- WARMDOWN_ITERS: 1200 -> 3500
- MATRIX_LR: 0.04 -> 0.025
- SCALAR_LR: 0.04 -> 0.025
- TIED_EMBED_LR: 0.05 -> 0.035
- SWA_ENABLED: 1 -> 0
- XSA_LAST_N: 0 -> 11
- LEAKY_RELU: 0 -> 1
- MUON_MOMENTUM: 0.95 -> 0.99
- MUON_MOMENTUM_WARMUP_START: 0.85 -> 0.92
- MUON_MOMENTUM_WARMUP_STEPS: 500 -> 1500
- TTT_ENABLED: 1 -> 0
- ZSTD_LEVEL: 22 -> 21 (configurable via env var)

Now the code runs p17 config with zero env vars needed.
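
A rough sketch of how defaults like these can be baked in while staying overridable by environment variables. The values are copied from the list above; `P17_DEFAULTS` and `load_config` are hypothetical names, not the PR's actual code.

```python
import os

# Values copied from the commit message above; the names are illustrative.
P17_DEFAULTS = {
    "WARMDOWN_ITERS": 3500,
    "MATRIX_LR": 0.025,
    "SCALAR_LR": 0.025,
    "TIED_EMBED_LR": 0.035,
    "SWA_ENABLED": 0,
    "XSA_LAST_N": 11,
    "LEAKY_RELU": 1,
    "MUON_MOMENTUM": 0.99,
    "MUON_MOMENTUM_WARMUP_START": 0.92,
    "MUON_MOMENTUM_WARMUP_STEPS": 1500,
    "TTT_ENABLED": 0,
    "ZSTD_LEVEL": 21,
}


def load_config(environ=None):
    """Return the config, letting environment variables override defaults."""
    environ = os.environ if environ is None else environ
    config = {}
    for name, default in P17_DEFAULTS.items():
        raw = environ.get(name)
        # Cast an override to the default's type, so "22" -> 22 and "0.03" -> 0.03.
        config[name] = default if raw is None else type(default)(raw)
    return config


# With no overrides, the baked-in defaults are used as-is.
assert load_config({}) == P17_DEFAULTS
# A single variable can still be overridden, e.g. the zstd level.
assert load_config({"ZSTD_LEVEL": "22"})["ZSTD_LEVEL"] == 22
```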

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
find_unused_parameters=True was enabled for VR+GA (layer 0's vr_lambda
is unused when v0=None). This forces DDP to scan the entire autograd
graph every backward pass, causing ~3x slowdown on 8xH100 (288ms vs
expected ~87ms/step).

static_graph=True only checks once on first iteration then caches,
which is much more efficient with torch.compile.

This only affects multi-GPU runs (single GPU doesn't use DDP).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes for 8xH100 3-seed submission:
- Artifact auto-downgrade: try int6+zstd [16,1,17,2], fall back to
  int5 middle layers (L2-8) if still over 16MB
- Warmdown default 3000 (was 1200): 46.5% ratio on 8xH100 matches
  single-GPU 47%, fixes v9's 54% over-warmdown
- 5-gram eval cache auto-enabled on multi-GPU (world_size>1),
  alpha=0.20, order=5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of downgrading all middle layers (L2-8) to int5 at once
(wasting 2.1MB and +0.014 BPB), now downgrades one layer at a time
expanding outward from center (L5→L6→L4→L7→...).

Tested: single layer (L5) saves ~290KB, enough to fit most seeds.
BPB penalty reduced from ~0.014 to ~0.002.
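
The center-out order described above can be sketched as follows. This is only an illustration: it assumes 11 layers indexed from 0, a fixed per-layer saving, and a 16,000,000-byte limit, whereas a real implementation would presumably re-quantize, re-compress, and measure the artifact after each step. Both function names are hypothetical.

```python
def center_out_order(num_layers: int = 11) -> list[int]:
    """Layer indices starting at the center and alternating outward."""
    center = num_layers // 2
    order = [center]
    for offset in range(1, num_layers):
        # Visit the layer above the center first, then the one below.
        for idx in (center + offset, center - offset):
            if 0 <= idx < num_layers:
                order.append(idx)
    return order


def layers_to_downgrade(size_bytes: int, limit_bytes: int,
                        saving_per_layer: int, num_layers: int = 11) -> list[int]:
    """Pick layers to move from int6 to int5, one at a time, until the size fits."""
    chosen = []
    for layer in center_out_order(num_layers):
        if size_bytes <= limit_bytes:
            break
        chosen.append(layer)
        # Assumption: every layer saves the same amount (~290KB in the message).
        size_bytes -= saving_per_layer
    return chosen


print(center_out_order())
# prints [5, 6, 4, 7, 3, 8, 2, 9, 1, 10, 0]
print(layers_to_downgrade(16_200_000, 16_000_000, 290_000))
# prints [5]
```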

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Train 1 seed, then sweep alpha=[0.10-0.30] and order=[3-7]
using EVAL_ONLY mode. Each eval ~3min on 8xH100.
Total sweep time: ~10min train + 9×3min eval = ~37min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best from 20-point grid search on 8xH100:
  alpha=0.40 order=7 → 1.0336 BPB (vs 1.0517 at alpha=0.20 order=5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two eval-time improvements (no retraining needed):

1. Multi-order backoff (orders 2-7): When 7-gram has no cache hit,
   falls back to 6/5/4/3/2-gram. Dramatically increases cache hit rate
   on 8xH100 where per-GPU cache is sparse. PR openai#702 reports -0.018 BPB.

2. Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4.0))
   Model uncertain → trust n-gram more. Model confident → keep LM.
   Compliant: alpha depends only on model's own distribution.

Both configurable via env vars (NGRAM_ENTROPY=0 to disable adaptive).
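
A minimal sketch of the two ideas above, using plain Python dictionaries. The alpha formula is taken from the message; everything else is an assumption made for illustration: entropy measured in nats, linear interpolation between the LM and n-gram distributions, and the class and function names.

```python
import math
from collections import defaultdict


def adaptive_alpha(entropy: float) -> float:
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), as given in the message."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy - 4.0)))


def entropy_of(probs: dict) -> float:
    """Entropy of the model's own distribution (nats is an assumption)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


class BackoffNgramCache:
    """Counts n-grams of orders min_order..max_order, queried longest-first."""

    def __init__(self, max_order: int = 7, min_order: int = 2):
        assert 2 <= min_order <= max_order
        self.max_order = max_order
        self.min_order = min_order
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history: list, token) -> None:
        # Call only after `token` has been scored, to stay score-first.
        for order in range(self.min_order, self.max_order + 1):
            n_ctx = order - 1  # an order-n entry uses n-1 context tokens
            if len(history) >= n_ctx:
                self.counts[tuple(history[-n_ctx:])][token] += 1

    def predict(self, history: list):
        # Try the 7-gram context first, then back off to 6, 5, ..., 2.
        for order in range(self.max_order, self.min_order - 1, -1):
            n_ctx = order - 1
            if len(history) < n_ctx:
                continue
            seen = self.counts.get(tuple(history[-n_ctx:]))
            if seen:
                total = sum(seen.values())
                return {tok: c / total for tok, c in seen.items()}
        return None  # no cache hit at any order


def mix(lm_probs: dict, ngram_probs, alpha: float) -> dict:
    """Linear interpolation (assumed); falls back to the LM on a cache miss."""
    if ngram_probs is None:
        return dict(lm_probs)
    tokens = set(lm_probs) | set(ngram_probs)
    return {t: (1.0 - alpha) * lm_probs.get(t, 0.0)
               + alpha * ngram_probs.get(t, 0.0) for t in tokens}
```

At H = 4.0 the formula gives alpha = 0.05 + 0.55 × 0.5 = 0.325; alpha tends toward 0.05 for a confident model and toward 0.60 for an uncertain one.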

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 25, 2026
Input-dependent gate: sigmoid(Linear(x)) applied per-head after SDPA.
Init: weight=zeros, bias=4.0 (sigmoid(4)≈0.98, near-identity start).
Eliminates attention sinks. ~0.002-0.003 bpb gain per PR openai#638 ablation.
Stacks additively with VRL (combined: -0.017 in 9L ablation).
~45K params total (negligible). attn_gate added to control tensor patterns.

Enabled by default (GA_ENABLED=1).
Credit: PR openai#638, arXiv:2505.06708.
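
A framework-free sketch of the gate described above, for a single token. It assumes one scalar gate per head computed as sigmoid(w·x + b) and multiplied into that head's attention output; the function names and the list-based layout are illustrative only, and a real implementation would use batched tensors.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def head_gates(x, weight, bias):
    """x: [model_dim]; weight: [num_heads][model_dim]; bias: [num_heads]."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def gate_attention_output(attn_out, gates):
    """attn_out: [num_heads][head_dim]; scale each head's output by its gate."""
    return [[g * v for v in head] for head, g in zip(attn_out, gates)]


# Init from the commit message: weight = zeros, bias = 4.0.
model_dim, num_heads = 512, 8
weight = [[0.0] * model_dim for _ in range(num_heads)]
bias = [4.0] * num_heads

gates = head_gates([1.0] * model_dim, weight, bias)
print(round(gates[0], 3))  # prints 0.982, so the gate starts close to identity

# Parameter count if one such gate is added per layer of an 11-layer model.
print((model_dim * num_heads + num_heads) * 11)  # prints 45144
```

With zero weights the gate ignores the input at initialization, so every head starts at sigmoid(4) regardless of `x`. The 45,144 figure assumes the 512d, 8-head, 11-layer shape listed in this PR's architecture, and is consistent with the "~45K params" quoted above.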

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Asukabot0 and others added 6 commits March 26, 2026 03:31
1. Rewrite ttt_adapt() to score-first pattern (Issue openai#677 compliant):
   - Process val data in sequential chunks (TTT_CHUNK_TOKENS=131072)
   - Phase 1: score chunk under inference_mode (forward only)
   - Phase 2: train on scored tokens with AdamW (K epochs)
   - Each token scored BEFORE model trains on it

2. Switch TTT optimizer from SGD to AdamW (lr=0.0001, wd=0.0)
   - PR openai#700 showed AdamW >> SGD for TTT
   - Default 4 epochs, freeze first 2 blocks

3. Fix DDP find_unused_parameters → static_graph=True
   - Same 3x slowdown fix as submission directory

4. TTT defaults: disabled by default (TTT_ENABLED=0)
   - Enable with TTT_ENABLED=1 for TTT+n-gram combined eval
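
The score-first loop in item 1 can be illustrated with a toy stand-in for the network. In this sketch a Laplace-smoothed unigram counter replaces the real model, and "training" is a count update rather than K epochs of AdamW; `ToyUnigramModel` and `score_first_adapt` are hypothetical names. Only the ordering, score a chunk and then train on it, mirrors the description.

```python
import math
from collections import Counter


class ToyUnigramModel:
    """Stand-in for the real network: a Laplace-smoothed unigram model."""

    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll_bits(self, token: int) -> float:
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log2(p)

    def train(self, tokens) -> None:
        self.counts.update(tokens)
        self.total += len(tokens)


def score_first_adapt(model, tokens, chunk_tokens: int) -> float:
    """Return mean bits per token, scoring each chunk before training on it."""
    if not tokens:
        return 0.0
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # Phase 1: score the chunk with the model as it currently stands.
        total_bits += sum(model.nll_bits(t) for t in chunk)
        # Phase 2: only now adapt on the tokens that were just scored.
        model.train(chunk)
    return total_bits / len(tokens)


# The first chunk is always scored by the unadapted model (1 bit/token here).
print(score_first_adapt(ToyUnigramModel(vocab_size=2), [0, 0, 0, 0], 4))
# prints 1.0
```

Because the update happens after scoring, no token's loss can benefit from the model having already trained on that token.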

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10 defaults were wrong (inherited from old PR#398 base):
- MATRIX_LR: 0.04 -> 0.025
- SCALAR_LR: 0.04 -> 0.025
- TIED_EMBED_LR: 0.05 -> 0.035
- SWA_ENABLED: 1 -> 0
- XSA_LAST_N: 0 -> 11
- LEAKY_RELU: 0 -> 1
- MUON_MOMENTUM: 0.95 -> 0.99
- MUON_MOMENTUM_WARMUP_START: 0.85 -> 0.92
- MUON_MOMENTUM_WARMUP_STEPS: 500 -> 1500

Previous PR openai#727 runs worked because env vars were passed manually.
After cloud restart, defaults kicked in producing wrong model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inspired by PR openai#757 which found SGD LR=1.0 gives 16x better TTT gain
than conventional LR=0.002. Key changes:

- TTT_OPTIMIZER env var: "sgd" (default) or "adamw"
- Default LR: 0.0001 -> 1.0 (SGD)
- Default epochs: 4 -> 20
- Default freeze_blocks: 2 -> 0 (all unfrozen)

PR openai#757 showed: freeze=0 + high LR converges fine, extra capacity
absorbs aggressive learning rate. 20ep × ~16s = ~320s on 8xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka's 7-point sweep showed monotonic improvement with
higher slopes. 0.9 beats 0.5 by 0.013 BPB + 200 more steps (fewer
dead activations = faster per step).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defaults now match the exact config that produced the verified results:
- TTT: AdamW lr=0.0001, 4 epochs, freeze_blocks=2
- LeakyReLU slope: 0.5
- Score-first TTT (Issue openai#677 compliant)

3-seed results: 0.9576/0.9581/0.9585 (mean=0.9581, std=0.0005)
All artifacts <16MB, all eval <600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_bpb=0.9581)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Asukabot0 changed the title from "Record: 11L XSA-all + LeakyReLU(0.5)² + VR + GA (val_bpb=1.1164, pending 3-seed)" to "Record: Score-First TTT + N-gram Backoff (3-seed mean val_bpb=0.9581)" Mar 25, 2026
@Asukabot0 changed the title from "Record: Score-First TTT + N-gram Backoff (3-seed mean val_bpb=0.9581)" to "Record: 11L XSA-all + LeakyReLU(0.5)² + VR + GA (val_bpb=1.1164, pending 3-seed)" Mar 25, 2026
@Asukabot0
Author

Superseded by #761 (Score-First TTT + N-gram Backoff, 3-seed mean val_bpb=0.9581).

@Asukabot0 closed this Mar 25, 2026
pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 27, 2026
… integration

Final implementation batch:

1. VRL (Value Residual Learning, arXiv:2410.17897): First layer's V
   carried to all deeper layers via learned lambda mixing. Addresses
   attention concentration. Replaces VE128 (PR openai#638: -0.002 BPB).

2. Gated Attention: Per-head learned sigmoid gate on attention output.
   Initialized near-open (bias=4.0). Combined with VRL for -0.002 BPB.

3. BigramHash embedding: Hash-based word-pair lookup table.
   (prev_token, curr_token) → bucket → embedding → project to model_dim.
   3072 buckets × 128 dim. From PR openai#414: -0.003 BPB.
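
The lookup path in item 3 can be sketched roughly as below. The hash function is not given in the message, so the multiplier here is an arbitrary constant chosen for illustration; the function names are hypothetical, and the final projection to model_dim is omitted.

```python
import random


def bigram_bucket(prev_token: int, curr_token: int, num_buckets: int = 3072) -> int:
    """Map a (prev_token, curr_token) pair to a bucket index."""
    # Assumption: a simple multiplicative hash; the real hash may differ.
    return (prev_token * 1_000_003 + curr_token) % num_buckets


def make_table(num_buckets: int = 3072, dim: int = 128, seed: int = 0):
    """A 3072 x 128 table of small random values, standing in for learned weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)]


def bigram_embedding(table, prev_token: int, curr_token: int):
    """(prev_token, curr_token) -> bucket -> embedding row."""
    return table[bigram_bucket(prev_token, curr_token, len(table))]


table = make_table()
vec = bigram_embedding(table, prev_token=17, curr_token=42)
print(len(vec))  # prints 128
```

Distinct token pairs can collide in the same bucket; that is the trade-off that keeps the table small compared with one row per possible pair.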

Full stack verified locally — all 12 features work together:
  Architecture: 7×2 recurrence + MLP3x + XSA-all + VRL + GA + BigramHash
  Training: CROWN-Q + configurable int5/int6/int8
  Eval: 5-expert Hedge Mixer + TTT
  Total: 17.6M params, 1647 lines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>