
Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds)#481

Closed
mrdavtan wants to merge 1 commit into openai:main from mrdavtan:cosine-ttt-record

Conversation

@mrdavtan

Summary

val_bpb=1.0970 (3-seed mean, std=0.0010). 15.4-15.8 MB artifact. 8xH100 SXM, FA2.

Seed   Steps   Pre-TTT bpb   Post-TTT bpb   Artifact
1337   7,101   1.1577        1.0959         15.4 MB
42     6,700   1.1588        1.0971         15.5 MB
7      6,987   1.1580        1.0979         15.8 MB

Training architecture follows the community stack; the main change relative to prior work is the TTT schedule. All runs used FA2. FA3 on Hopper would improve pre-TTT quality by making training steps faster, so more steps fit in the same time budget. The schedule is independent of the attention kernel and should apply to any architecture.

TTT scheduling

Two modifications to AdamW TTT (PR #442):

Cosine lr decay over 30 epochs. Starts at the full lr to repair large-scale quantization damage, then progressively decays to refine without overshooting. A flat lr must compromise between these two regimes.

Per-layer lr groups based on quantization damage. MLP output projections showed 3.4× higher relative quantization error than input projections on our trained checkpoint. TTT uses 3× the base lr for output projections, 0.5× for input projections, and 1× for everything else. Ratios are model-specific. A minimal sketch of both modifications follows below.
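A minimal PyTorch sketch of both modifications, assuming the mlp.proj / mlp.fc parameter naming used elsewhere in this thread; the actual names, decay floor, and group ratios in the repo may differ:

import math
import torch

def build_ttt_optimizer(model, base_lr=5e-4):
    """AdamW with per-layer lr groups: 3x base lr for MLP output
    projections, 0.5x for MLP input projections, 1x for everything else."""
    out_proj, in_proj, rest = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "mlp.proj" in name:    # output projection (higher quant damage)
            out_proj.append(p)
        elif "mlp.fc" in name:    # input projection (lower quant damage)
            in_proj.append(p)
        else:
            rest.append(p)
    return torch.optim.AdamW([
        {"params": out_proj, "lr": 3.0 * base_lr},
        {"params": in_proj,  "lr": 0.5 * base_lr},
        {"params": rest,     "lr": 1.0 * base_lr},
    ])

def cosine_scale(epoch, total_epochs=30, min_scale=0.0):
    """Cosine decay factor: 1.0 at epoch 0, min_scale at the final epoch."""
    progress = epoch / max(total_epochs - 1, 1)
    return min_scale + 0.5 * (1.0 - min_scale) * (1.0 + math.cos(math.pi * progress))

# Usage sketch: rescale every group's lr relative to its own base each epoch.
# opt = build_ttt_optimizer(model)
# base_lrs = [g["lr"] for g in opt.param_groups]
# for epoch in range(30):
#     s = cosine_scale(epoch)
#     for g, b in zip(opt.param_groups, base_lrs):
#         g["lr"] = b * s
#     ...  # one pass over the TTT token set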

TTT_OPTIMIZER=adamw  TTT_LR=0.0005  TTT_EPOCHS=30
TTT_COSINE=1  TTT_PERLAYER=1  TTT_FREEZE_BLOCKS=0
TTT_BATCH_SEQS=64 (per GPU, 512 total with DDP sharding)

30 epochs at ~15.5 s/epoch = ~465 s total. We also tested flat lr, SGD, focal loss, and a KL-divergence loss against the pre-quantization model; focal loss and KL divergence did not improve over cross-entropy. Full comparison in the README.

Other findings

See PR #212 for a non-record submission documenting 25+ experiments with negative results on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization, and depth recurrence.

Acknowledgments

Reproduction

git clone https://github.com/mrdavtan/parameter-golf.git
cd parameter-golf && git checkout next-gen
pip install flash-attn --no-cache-dir --no-build-isolation
pip install zstandard sentencepiece huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024
bash run_competition.sh 1337

Hardware: 8xH100 SXM (RunPod), PyTorch 2.9.1+cu128, Flash Attention 2

… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
ndokutovich added a commit to ndokutovich/parameter-golf that referenced this pull request Mar 23, 2026
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range
(127 levels) instead of int6 (31 levels) — 4x more quantization
precision for the most damage-prone parameters.

Default: INT8_SENSITIVE=attn.proj (attention output projections,
which suffer ~3.4x more quant damage per PR #481 analysis).

Controlled via env var, comma-separated patterns. Empty = disabled.
Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy).
We have 0.44MB headroom (15.56MB artifact, 16MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
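A rough sketch of how that pattern-based precision selection could work; the helper name and config handling here are illustrative assumptions, not the actual implementation:

import os

def quant_levels_for(layer_name, sensitive=os.environ.get("INT8_SENSITIVE", "attn.proj")):
    """Symmetric quant range per layer: int8 (+-127 levels) for layers matching
    any INT8_SENSITIVE pattern, int6 (+-31 levels) otherwise.
    An empty INT8_SENSITIVE disables the upgrade."""
    patterns = [p for p in sensitive.split(",") if p]
    if any(p in layer_name for p in patterns):
        return 127  # int8 range for the most damage-prone layers
    return 31       # int6 default

# quant_levels_for("blocks.3.attn.proj.weight") -> 127
# quant_levels_for("blocks.3.mlp.fc.weight")    -> 31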
@mrdavtan
Author

Updating to the sliding-window eval; new scores are in progress. The per-layer lr groups, cosine scheduling, and ablation findings are independent of the evaluation method.

sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
…enai#486)

- 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay
- per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc
- DDP gradient sync via all_reduce(AVG) + grad clip 1.0
- keep LeakyReLU(0.5)^2 from exp48
- expected: ~0.06 BPB gain (1.127 → ~1.07)
- modal timeout 3600s for 30-epoch TTT
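A sketch of the DDP gradient sync and clipping mentioned in that commit, assuming a PyTorch distributed setup with the NCCL backend; the repo's actual code may differ:

import torch
import torch.distributed as dist

def sync_and_clip(model, max_norm=1.0):
    """Average gradients across ranks via all_reduce(AVG), then clip the
    global grad norm to max_norm before the optimizer step.
    ReduceOp.AVG requires the NCCL backend."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)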
@mrdavtan
Author

Closing: multi-epoch TTT is invalid per the clarified rules in #402. The per-layer LR and cosine scheduling contributions remain available for legal single-pass (Case 2) TTT implementations.

@mrdavtan mrdavtan closed this Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range
(127 levels) instead of int6 (31 levels) — 4x more quantization
precision for the most damage-prone parameters.

Default: INT8_SENSITIVE=attn.proj (attention output projections,
which suffer ~3.4x more quant damage per PR openai#481 analysis).

Controlled via env var, comma-separated patterns. Empty = disabled.
Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy).
We have 0.44MB headroom (15.56MB artifact, 16MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 25, 2026
Upgrades TTT from PR openai#549's weak 3ep SGD (-0.0025 bpb) to PR openai#481's
proven AdamW 30ep cosine + per-layer LR recipe (expected -0.01 to -0.025).

Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post TTT metrics from logs
- README.md + submission.json: Updated for TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 25, 2026
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD
TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x).
3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 25, 2026
On PR openai#549: Replace 3ep SGD TTT (-0.0025 bpb) with PR openai#481's AdamW
recipe (30ep cosine decay, per-layer LR: mlp.proj 3x, mlp.fc 0.5x).
3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB, eval ~589s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>