
Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461

Open

Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/11L-ve128-partial-rope-legal-ttt

Conversation


@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Mar 22, 2026

Non-Record Submission: 11L Depth Recurrence + Legal Score-First TTT

val_bpb = 1.14458 | Pre-TTT: 1.1611 | TTT gain: −0.0165 | Artifact: 14.79 MB

Non-record unlimited-compute submission (trained on 4×A100-40GB).


Headline

This submission demonstrates that competition-legal test-time training can deliver large gains when properly tuned. The key finding is a TTT recipe — SGD with momentum, multiple epochs per chunk, freezing early blocks — that extracts 2.4× more improvement (−0.0165 BPB) than single-epoch AdamW over the full network (−0.0068 in our prior PR #456).

Every validation token is scored before any weight update that could use it, enforced by torch.inference_mode() during scoring.

Novel & Creative Contributions

  1. High-yield legal TTT via selective freezing + SGD momentum — Most TTT approaches use AdamW for 1 epoch over all parameters. We use SGD+momentum(0.9) for 3 epochs per 32K chunk while freezing the first 2 blocks. This is simpler, uses less memory, and gets 2.4× better TTT gains.

  2. Depth recurrence — 11 logical layers from 10 unique BlockCores (one core reused at two depths with independent normalization). Delivers 11-layer capacity at 10-layer parameter cost.

  3. Partial RoPE (16/64 dims) — Only 16 of 64 head dimensions use rotary embeddings, with NTK-aware scaling. The remaining 48 act as position-agnostic content channels, improving length generalization during TTT.

  4. Value Embeddings on deep layers only — 128-dim learned embeddings added to value projections on layers 9–10, giving deep layers direct token-identity access in the value stream.

  5. Layer-Norm depth scaling — 1/√(layer+1) scaling stabilizes training under depth recurrence.
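Contributions 2 and 5 can be sketched together. This is a minimal PyTorch illustration, not the submission's code: the module names, the reuse schedule, and the stand-in `BlockCore` body (an MLP only; the real core also has attention) are all assumptions.

```python
import torch
import torch.nn as nn

class BlockCore(nn.Module):
    """Stand-in core: a ReLU^2 MLP (the submission's core also has attention)."""
    def __init__(self, dim):
        super().__init__()
        self.up = nn.Linear(dim, 3 * dim)
        self.down = nn.Linear(3 * dim, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)) ** 2)

class RecurrentDepthStack(nn.Module):
    """11 logical depths from 10 cores: one core appears at two depths,
    each depth getting its own LayerNorm and a 1/sqrt(depth+1) scale."""
    def __init__(self, dim=512, n_cores=10, reuse_core=5):
        super().__init__()
        self.cores = nn.ModuleList(BlockCore(dim) for _ in range(n_cores))
        # Depths 0..n_cores-1 use distinct cores; the final depth reuses one.
        self.schedule = list(range(n_cores)) + [reuse_core]
        # Independent normalization per logical depth, even where cores repeat.
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in self.schedule)

    def forward(self, x):
        for depth, ci in enumerate(self.schedule):
            # 1/sqrt(depth+1) residual scaling stabilizes the deep/recurrent stack.
            x = x + (depth + 1) ** -0.5 * self.cores[ci](self.norms[depth](x))
        return x
```

The parameter cost is 10 cores plus 11 cheap LayerNorms, so the 11th logical layer is nearly free — the reused core sees depth-specific input statistics via its own norm.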

Architecture

| Component | Setting |
| --- | --- |
| Layers | 11 logical (10 shared BlockCores) |
| Dim / Heads | 512 / 8 (4 KV heads) |
| MLP | 3× (1536), ReLU²; SmearGate |
| BigramHash | 2048 |
| RoPE | 16/64 dims, NTK scaling |
| Value Embed | 128d on layers 9–10 |
| XSA | Last 4 layers |
| Quant | Int6 + zstd |
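A hedged sketch of the partial-RoPE entry above: rotate only the first 16 of 64 head dimensions and pass the remaining 48 through untouched. The frequency base and pairing convention are assumptions, and the NTK-aware rescaling of the base is omitted.

```python
import torch

def partial_rope(q, rot=16, base=10000.0):
    """Rotate only the first `rot` head dims of q
    (shape: batch, heads, seq, head_dim); the rest pass through unchanged."""
    t = q.size(2)
    half = rot // 2
    # Standard RoPE frequencies; NTK-aware scaling would adjust `base`.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()        # (seq, rot/2)
    x1, x2 = q[..., :half], q[..., half:rot]     # the paired halves to rotate
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    # The trailing head dims stay position-agnostic content channels.
    return torch.cat([rotated, q[..., rot:]], dim=-1)
```

Because the unrotated 48 dims carry no positional signal, they behave identically at any sequence length, which is the claimed source of better length generalization during TTT.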

TTT Protocol

for each 32K chunk:
    1. model.eval() + inference_mode → score chunk (NLL recorded)
    2. model.train() → SGD(lr=0.002, mom=0.9), 3 epochs, freeze blocks 0-1
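The two-step protocol above can be sketched as follows. This is a minimal illustration, not the submission's code: variable names are hypothetical, a constant LR is assumed (referenced follow-up commits add cosine decay across chunks), and BPB conversion is left to the caller.

```python
import torch
import torch.nn.functional as F

def legal_ttt_eval(model, chunks, lr=0.002, momentum=0.9, epochs=3, n_freeze=2):
    """Score each 32K chunk under inference_mode BEFORE training on it.
    Returns mean per-token NLL in nats (convert to BPB downstream)."""
    # Freeze the first n_freeze blocks for the entire evaluation.
    for block in model.blocks[:n_freeze]:
        for p in block.parameters():
            p.requires_grad_(False)
    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score: no gradients, no weight ever updated on this chunk yet.
        model.eval()
        with torch.inference_mode():
            logits = model(inputs)
            vocab = logits.size(-1)
            nll = F.cross_entropy(logits.view(-1, vocab),
                                  targets.view(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        # 2) Adapt on the chunk that was just scored.
        model.train()
        for _ in range(epochs):
            opt.zero_grad(set_to_none=True)
            loss = F.cross_entropy(model(inputs).view(-1, vocab),
                                   targets.view(-1))
            loss.backward()
            opt.step()
    return total_nll / total_tokens
```

The legality argument rests on ordering: every token's NLL is recorded inside `inference_mode()` before any optimizer step that could have seen that token.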

Size Budget

| Component | Bytes |
| --- | --- |
| Model (int6+zstd) | 14,717,713 |
| Code | 71,706 |
| Total | 14,789,419 (< 16,000,000) |

vs. Prior Submission (PR #456)

| Metric | PR #456 | This PR | Δ |
| --- | --- | --- | --- |
| val_bpb | 1.15321 | 1.14458 | −0.00863 |
| Pre-TTT | 1.1600 | 1.1611 | +0.0011 |
| TTT gain | −0.0068 | −0.0165 | 2.4× |

Pre-TTT baselines are nearly identical — the entire improvement comes from better TTT, validating the recipe.

Credits

This submission builds on work from many contributors to the parameter-golf competition:

Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.

@Christopher-Lee-McClendon Christopher-Lee-McClendon force-pushed the submission/11L-ve128-partial-rope-legal-ttt branch from 0f3b82d to 49cb183 Compare March 22, 2026 21:37
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) on openai#414 stack + Parameter Banking + Parallel Muon.
Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB.

Every token scored BEFORE model adapts (inference_mode enforced).
SGD+momentum, 3 epochs/32K chunk, freeze first 2 blocks, cosine LR decay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with
Parameter Banking + Parallel Muon (first introduced in PR openai#399).

Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s.
Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128.

Every token scored BEFORE model adapts (inference_mode enforced).
SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mohosy
Copy link
Copy Markdown

mohosy commented Mar 23, 2026

the legal ttt protocol here is really well thought out, scoring before updating is the right way to do it. depth recurrence at this scale is underexplored too imo

newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211.
PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better.

TTT recipe: 32K-token chunks, score-first (inference_mode), then
train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1,
cosine LR decay across chunks, grad clip 1.0).

Removed TTT burst (replaced by legal TTT eval).
1499 lines (under 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 23, 2026
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE,
LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22).

Added legal score-first TTT from PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce proven frontier instead of iterating on
our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding
legal TTT should push to ~1.121.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request Mar 23, 2026
- Same 11-layer architecture as PR openai#461, only change: TTT_EPOCHS 3 -> 30
- TTT gain of -0.0184 BPB (1.1609 -> 1.1425), 2.7x more than 3-epoch baseline
- Systematic epoch sweep: 3/5/10/20/30 epochs, monotonic improvement
- SGD+momentum(0.9) outperforms AdamW by 0.027 BPB for legal TTT
- 15.48MB total (520KB headroom under 16MB limit)
- Trained on 4xA100-40GB, eval 3662s on 1xA100
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…d mean)

Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0
on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
  Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
  Seed 42:   1.1216 bpb, 406s TTT, 15.99 MB
  Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
  Mean:      1.1214 (std 0.0009)

All artifacts under 16MB. All eval times under 600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…58 BPB)

- 11-layer depth-recurrence GPT (10 unique BlockCores) with legal score-first TTT
- Novel high-yield TTT recipe: SGD+momentum(0.9), 3 epochs/chunk, freeze first 2 blocks
  delivers 2.4x more TTT gain (-0.0165 BPB) than single-epoch AdamW (-0.0068)
- Partial RoPE (16/64 dims) with NTK-aware scaling for better length generalization
- Value Embeddings (128d) on deep layers 9-10 for richer value representations
- Layer-Norm depth scaling (1/sqrt(layer+1)) for stable deep training
- XSA last 4, BigramHash(2048), SmearGate, U-Net skips, SWA, Late QAT
- Int6+zstd quantization: 14.79MB total (1.2MB headroom under 16MB limit)
- Trained on 4xA100-40GB, 5200 steps (~41 min)
@Christopher-Lee-McClendon Christopher-Lee-McClendon force-pushed the submission/11L-ve128-partial-rope-legal-ttt branch from 49cb183 to f8ff803 Compare March 23, 2026 15:37
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT
(PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on
openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
  Seed 42:   1.1200 bpb, 408s TTT, 15.88 MB
  Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
  Seed 1337: pending (log will be added)
  Mean:      1.1195 (std 0.0008)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT
(PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on
openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
  Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
  Seed 42:   1.1200 bpb, 408s TTT, 15.88 MB
  Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
  Mean:      1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 23, 2026
…bpb 1.1178

3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate
with sigmoid soft-round in the backward pass during the final 2% of training,
giving bin-aware gradients that settle weights onto int6 grid points.

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
srchandrupatla added a commit to srchandrupatla/parameter-golf that referenced this pull request Mar 25, 2026
LeakyReLU(0.5)²: preserves negative gradient flow through MLP while
maintaining non-negative output. ~0.003 BPB improvement per PR openai#493.

Legal TTT (test-time training): at eval time, split val tokens into
32K-token chunks, score each chunk under inference_mode(), then train
on the already-scored chunk with SGD. Gives ~0.0025 BPB improvement
per PR openai#461. Score-first protocol guarantees no future information
leaks into scored tokens.
Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
benjamintd added a commit to benjamintd/parameter-golf that referenced this pull request Mar 26, 2026
Eval oracle starts EMPTY and is populated only from already-scored
validation chunks. This prevents future data leaking into n-gram
statistics. Matches PR openai#461 legal TTT protocol.
Also: deep experts (expert_depth=2), 3 experts default.
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026