Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461
Open
Christopher-Lee-McClendon wants to merge 1 commit into openai:main from
Conversation
0f3b82d to 49cb183
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) on openai#414 stack + Parameter Banking + Parallel Muon. Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Every token scored BEFORE model adapts (inference_mode enforced). SGD+momentum, 3 epochs/32K chunk, freeze first 2 blocks, cosine LR decay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with Parameter Banking + Parallel Muon (first introduced in PR openai#399). Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128. Every token scored BEFORE model adapts (inference_mode enforced). SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the legal ttt protocol here is really well thought out, scoring before updating is the right way to do it. depth recurrence at this scale is underexplored too imo
newjordan
referenced
this pull request
in newjordan/parameter-golf
Mar 23, 2026
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211. PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better. TTT recipe: 32K-token chunks, score-first (inference_mode), then train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1, cosine LR decay across chunks, grad clip 1.0). Removed TTT burst (replaced by legal TTT eval). 1499 lines (under 1500 limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio
added a commit
to anthony-maio/parameter-golf
that referenced
this pull request
Mar 23, 2026
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from PR openai#461/openai#473 protocol: - SGD + momentum 0.9, lr=0.002 with cosine decay - 3 epochs per 32K token chunk - Freeze blocks 0-1 - Score each chunk BEFORE training on it (inference_mode) - Expected ~0.002 bpb improvement over base Strategy shift: reproduce proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Christopher-Lee-McClendon
added a commit
to Christopher-Lee-McClendon/parameter-golf
that referenced
this pull request
Mar 23, 2026
- Same 11-layer architecture as PR openai#461, only change: TTT_EPOCHS 3 -> 30 - TTT gain of -0.0184 BPB (1.1609 -> 1.1425), 2.7x more than 3-epoch baseline - Systematic epoch sweep: 3/5/10/20/30 epochs, monotonic improvement - SGD+momentum(0.9) outperforms AdamW by 0.027 BPB for legal TTT - 15.48MB total (520KB headroom under 16MB limit) - Trained on 4xA100-40GB, eval 3662s on 1xA100
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 23, 2026
…d mean) Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9): Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB Mean: 1.1214 (std 0.0009) All artifacts under 16MB. All eval times under 600s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…58 BPB) - 11-layer depth-recurrence GPT (10 unique BlockCores) with legal score-first TTT - Novel high-yield TTT recipe: SGD+momentum(0.9), 3 epochs/chunk, freeze first 2 blocks delivers 2.4x more TTT gain (-0.0165 BPB) than single-epoch AdamW (-0.0068) - Partial RoPE (16/64 dims) with NTK-aware scaling for better length generalization - Value Embeddings (128d) on deep layers 9-10 for richer value representations - Layer-Norm depth scaling (1/sqrt(layer+1)) for stable deep training - XSA last 4, BigramHash(2048), SmearGate, U-Net skips, SWA, Late QAT - Int6+zstd quantization: 14.79MB total (1.2MB headroom under 16MB limit) - Trained on 4xA100-40GB, 5200 steps (~41 min)
49cb183 to f8ff803
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 23, 2026
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Seed 1337: pending (log will be added) Mean: 1.1195 (std 0.0008) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 23, 2026
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 23, 2026
newjordan
pushed a commit
to newjordan/parameter-golf-1
that referenced
this pull request
Mar 23, 2026
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211. PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better. TTT recipe: 32K-token chunks, score-first (inference_mode), then train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1, cosine LR decay across chunks, grad clip 1.0). Removed TTT burst (replaced by legal TTT eval). 1499 lines (under 1500 limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa
added a commit
to RoyiRa/parameter-golf
that referenced
this pull request
Mar 23, 2026
…bpb 1.1178 3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM. Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate with sigmoid soft-round in the backward pass during the final 2% of training, giving bin-aware gradients that settle weights onto int6 grid points. Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
srchandrupatla
added a commit
to srchandrupatla/parameter-golf
that referenced
this pull request
Mar 25, 2026
LeakyReLU(0.5)²: preserves negative gradient flow through MLP while maintaining non-negative output. ~0.003 BPB improvement per PR openai#493. Legal TTT (test-time training): at eval time, split val tokens into 32K-token chunks, score each chunk under inference_mode(), then train on the already-scored chunk with SGD. Gives ~0.0025 BPB improvement per PR openai#461. Score-first protocol guarantees no future information leaks into scored tokens.
Mistobaan
pushed a commit
to Mistobaan/parameter-golf
that referenced
this pull request
Mar 25, 2026
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mistobaan
pushed a commit
to Mistobaan/parameter-golf
that referenced
this pull request
Mar 25, 2026
TimS-ml
referenced
this pull request
in TimS-ml/parameter-golf-autoresearch
Mar 26, 2026
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR #461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on #414 stack with Parameter Banking + Parallel Muon (PR #399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TimS-ml
referenced
this pull request
in TimS-ml/parameter-golf-autoresearch
Mar 26, 2026
nedcut
pushed a commit
to nedcut/parameter-golf
that referenced
this pull request
Mar 26, 2026
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nedcut
pushed a commit
to nedcut/parameter-golf
that referenced
this pull request
Mar 26, 2026
benjamintd
added a commit
to benjamintd/parameter-golf
that referenced
this pull request
Mar 26, 2026
Eval oracle starts EMPTY and is populated only from already-scored validation chunks. This prevents future data leaking into n-gram statistics. Matches PR openai#461 legal TTT protocol. Also: deep experts (expert_depth=2), 3 experts default.
This was referenced Mar 27, 2026
nvemuri4649
pushed a commit
to thanushpatlolla/parameter-golf
that referenced
this pull request
Mar 27, 2026
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nvemuri4649
pushed a commit
to thanushpatlolla/parameter-golf
that referenced
this pull request
Mar 27, 2026
Non-Record Submission: 11L Depth Recurrence + Legal Score-First TTT
val_bpb = 1.14458 | Pre-TTT: 1.1611 | TTT gain: −0.0165 | Artifact: 14.79 MB
Headline
This submission demonstrates that competition-legal test-time training can deliver large gains when properly tuned. The key finding is a TTT recipe — SGD with momentum, multiple epochs per chunk, freezing early blocks — that extracts 2.4× more improvement (−0.0165 BPB) than single-epoch AdamW over the full network (−0.0068 in our prior PR #456).
Every validation token is scored before any weight update that could use it, enforced by torch.inference_mode() during scoring.
Novel & Creative Contributions
High-yield legal TTT via selective freezing + SGD momentum — Most TTT approaches use AdamW for 1 epoch over all parameters. We use SGD+momentum(0.9) for 3 epochs per 32K chunk while freezing the first 2 blocks. This is simpler, uses less memory, and gets 2.4× better TTT gains.
Depth recurrence — 11 logical layers from 10 unique BlockCores (one core reused at two depths with independent normalization). Delivers 11-layer capacity at 10-layer parameter cost.
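A minimal sketch of the depth-recurrence idea, under stated assumptions: the class, the `schedule` list, and the MLP stand-in for the real attention+MLP BlockCore are all illustrative, not the actual train_gpt.py code. One core appears at two depths, each depth owning its own LayerNorm, so 10 unique cores yield 11 logical layers:

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """11 logical layers from 10 unique cores: the schedule reuses one
    core (index 5) at two consecutive depths, with a separate LayerNorm
    per depth so the reused core sees independently normalized inputs."""
    def __init__(self, dim=64, n_cores=10, schedule=None):
        super().__init__()
        # core 5 appears at depths 5 and 6 -> 11 layers, 10 cores
        self.schedule = schedule or [0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9]
        self.cores = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_cores)
        )
        # one norm per logical depth (not per core): independent normalization
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in self.schedule)

    def forward(self, x):
        for depth, core_idx in enumerate(self.schedule):
            # 1/sqrt(depth+1) residual scaling stabilizes the deep/recurrent stack
            x = x + self.cores[core_idx](self.norms[depth](x)) * (depth + 1) ** -0.5
        return x
```

The per-depth norms are what let a shared core behave differently at its two depths; the residual scale is the 1/√(layer+1) rule described below.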
Partial RoPE (16/64 dims) — Only 16 of 64 head dimensions use rotary embeddings, with NTK-aware scaling. The remaining 48 act as position-agnostic content channels, improving length generalization during TTT.
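A sketch of partial RoPE under stated assumptions: the function name, tensor layout, and the particular NTK base-rescaling formula are illustrative. Only the first 16 of 64 head dimensions are rotated; the rest pass through unchanged as content channels:

```python
import torch

def partial_rope(q, base=10000.0, rot_dims=16, ntk_factor=1.0):
    """Rotate only the first rot_dims of each head (16/64 here); the
    remaining dims pass through as position-agnostic content channels.
    ntk_factor > 1 stretches the base so low frequencies extrapolate
    to longer contexts. q: (batch, seq, n_heads, head_dim)."""
    B, T, H, D = q.shape
    half = rot_dims // 2
    # NTK-aware rescaling of the rotary base (one common variant)
    ntk_base = base * ntk_factor ** (rot_dims / max(rot_dims - 2, 1))
    inv_freq = ntk_base ** (-torch.arange(half, dtype=torch.float32) / half)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq   # (T, half)
    cos = ang.cos()[None, :, None, :]
    sin = ang.sin()[None, :, None, :]
    x1, x2, rest = q[..., :half], q[..., half:rot_dims], q[..., rot_dims:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)  # content dims untouched
```

The untouched 48 dimensions are exactly what makes length generalization easier during TTT: their attention contribution is independent of absolute position.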
Value Embeddings on deep layers only — 128-dim learned embeddings added to value projections on layers 9–10, giving deep layers direct token-identity access in the value stream.
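As a sketch of the value-embedding idea, with hypothetical names (`DeepValueEmbedding`, `up`) and a plain linear up-projection standing in for whatever the real VE128 wiring is: a 128-d learned table is projected to model width and added to the value projection, so a deep layer sees raw token identity without it having to survive nine layers of mixing:

```python
import torch
import torch.nn as nn

class DeepValueEmbedding(nn.Module):
    """Per-token 128-d embedding added into the value stream of a deep
    layer (layers 9-10 in the submission), giving it direct
    token-identity access."""
    def __init__(self, vocab_size, model_dim, ve_dim=128):
        super().__init__()
        self.table = nn.Embedding(vocab_size, ve_dim)   # 128-d learned table
        self.up = nn.Linear(ve_dim, model_dim, bias=False)
        self.v_proj = nn.Linear(model_dim, model_dim, bias=False)

    def forward(self, hidden, token_ids):
        # value = W_v h + up(VE[token]): identity signal bypasses the stack
        return self.v_proj(hidden) + self.up(self.table(token_ids))
```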
Layer-Norm depth scaling — 1/√(layer+1) scaling stabilizes training under depth recurrence.
Architecture
TTT Protocol
Size Budget
vs. Prior Submission (PR #456)
Pre-TTT baselines are nearly identical — the entire improvement comes from better TTT, validating the recipe.
Credits
This submission builds on work from many contributors to the parameter-golf competition:
Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.