
Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds)#481

Closed
mrdavtan wants to merge 1 commit into openai:main from mrdavtan:cosine-ttt-record

Conversation

@mrdavtan

Summary

val_bpb=1.0970 (3-seed mean, std=0.0010). 15.4-15.8 MB artifact. 8xH100 SXM, FA2.

Seed   Steps   Pre-TTT bpb   Post-TTT bpb   Artifact
1337   7,101   1.1577        1.0959         15.4 MB
42     6,700   1.1588        1.0971         15.5 MB
7      6,987   1.1580        1.0979         15.8 MB

Training architecture follows the community stack; the main change relative to prior work is the TTT schedule. All runs used FA2. FA3 on Hopper would improve pre-TTT quality by making training steps faster, so more steps fit in the same time budget. The schedule is independent of the attention kernel and should apply to any architecture.

TTT scheduling

Two modifications to AdamW TTT (PR #442):

Cosine lr decay over 30 epochs. Starts at the full lr to repair large-scale quantization damage, then progressively decays to refine without overshooting. A flat lr must compromise between these two regimes.

Per-layer lr groups based on quantization damage. MLP output projections showed 3.4× higher relative quantization error than input projections on our trained checkpoint. TTT uses 3× the base lr for output projections, 0.5× for input projections, and 1× for everything else. Ratios are model-specific. A minimal sketch of both modifications follows below.
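A minimal PyTorch sketch of both modifications, assuming the mlp.proj / mlp.fc parameter naming used elsewhere in this thread; the actual names, decay floor, and group ratios in the repo may differ:

import math
import torch

def build_ttt_optimizer(model, base_lr=5e-4):
    """AdamW with per-layer lr groups: 3x base lr for MLP output
    projections, 0.5x for MLP input projections, 1x for everything else."""
    out_proj, in_proj, rest = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "mlp.proj" in name:    # output projection (higher quant damage)
            out_proj.append(p)
        elif "mlp.fc" in name:    # input projection (lower quant damage)
            in_proj.append(p)
        else:
            rest.append(p)
    return torch.optim.AdamW([
        {"params": out_proj, "lr": 3.0 * base_lr},
        {"params": in_proj,  "lr": 0.5 * base_lr},
        {"params": rest,     "lr": 1.0 * base_lr},
    ])

def cosine_scale(epoch, total_epochs=30, min_scale=0.0):
    """Cosine decay factor: 1.0 at epoch 0, min_scale at the final epoch."""
    progress = epoch / max(total_epochs - 1, 1)
    return min_scale + 0.5 * (1.0 - min_scale) * (1.0 + math.cos(math.pi * progress))

# Usage sketch: rescale every group's lr relative to its own base each epoch.
# opt = build_ttt_optimizer(model)
# base_lrs = [g["lr"] for g in opt.param_groups]
# for epoch in range(30):
#     s = cosine_scale(epoch)
#     for g, b in zip(opt.param_groups, base_lrs):
#         g["lr"] = b * s
#     ...  # one pass over the TTT token set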

TTT_OPTIMIZER=adamw  TTT_LR=0.0005  TTT_EPOCHS=30
TTT_COSINE=1  TTT_PERLAYER=1  TTT_FREEZE_BLOCKS=0
TTT_BATCH_SEQS=64 (per GPU, 512 total with DDP sharding)

30 epochs at ~15.5 s/epoch = ~465 s total. We also tested flat lr, SGD, focal loss, and a KL-divergence loss against the pre-quantization model; focal loss and KL divergence did not improve over cross-entropy. Full comparison in the README.

Other findings

See PR #212 for a non-record submission documenting 25+ experiments with negative results on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization, and depth recurrence.

Acknowledgments

Reproduction

git clone https://github.com/mrdavtan/parameter-golf.git
cd parameter-golf && git checkout next-gen
pip install flash-attn --no-cache-dir --no-build-isolation
pip install zstandard sentencepiece huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024
bash run_competition.sh 1337

Hardware: 8xH100 SXM (RunPod), PyTorch 2.9.1+cu128, Flash Attention 2

… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
ndokutovich added a commit to ndokutovich/parameter-golf that referenced this pull request Mar 23, 2026
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range
(127 levels) instead of int6 (31 levels) — 4x more quantization
precision for the most damage-prone parameters.

Default: INT8_SENSITIVE=attn.proj (attention output projections,
which suffer ~3.4x more quant damage per PR #481 analysis).

Controlled via env var, comma-separated patterns. Empty = disabled.
Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy).
We have 0.44MB headroom (15.56MB artifact, 16MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
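A rough sketch of how that pattern-based precision selection could work; the helper name and config handling here are illustrative assumptions, not the actual implementation:

import os

def quant_levels_for(layer_name, sensitive=os.environ.get("INT8_SENSITIVE", "attn.proj")):
    """Symmetric quant range per layer: int8 (+-127 levels) for layers matching
    any INT8_SENSITIVE pattern, int6 (+-31 levels) otherwise.
    An empty INT8_SENSITIVE disables the upgrade."""
    patterns = [p for p in sensitive.split(",") if p]
    if any(p in layer_name for p in patterns):
        return 127  # int8 range for the most damage-prone layers
    return 31       # int6 default

# quant_levels_for("blocks.3.attn.proj.weight") -> 127
# quant_levels_for("blocks.3.mlp.fc.weight")    -> 31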
@mrdavtan
Author

Updating to the sliding-window eval; new scores are in progress. The per-layer lr groups, cosine scheduling, and ablation findings are independent of the evaluation method.

sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
…enai#486)

- 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay
- per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc
- DDP gradient sync via all_reduce(AVG) + grad clip 1.0
- keep LeakyReLU(0.5)^2 from exp48
- expected: ~0.06 BPB gain (1.127 → ~1.07)
- modal timeout 3600s for 30-epoch TTT
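A sketch of the DDP gradient sync and clipping mentioned in that commit, assuming a PyTorch distributed setup with the NCCL backend; the repo's actual code may differ:

import torch
import torch.distributed as dist

def sync_and_clip(model, max_norm=1.0):
    """Average gradients across ranks via all_reduce(AVG), then clip the
    global grad norm to max_norm before the optimizer step.
    ReduceOp.AVG requires the NCCL backend."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)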
@mrdavtan
Author

Closing: multi-epoch TTT is invalid per the clarified rules in #402. The per-layer LR and cosine scheduling contributions remain available for legal single-pass (Case 2) TTT implementations.

@mrdavtan mrdavtan closed this Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range
(127 levels) instead of int6 (31 levels) — 4x more quantization
precision for the most damage-prone parameters.

Default: INT8_SENSITIVE=attn.proj (attention output projections,
which suffer ~3.4x more quant damage per PR openai#481 analysis).

Controlled via env var, comma-separated patterns. Empty = disabled.
Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy).
We have 0.44MB headroom (15.56MB artifact, 16MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 25, 2026
Upgrades TTT from PR openai#549's weak 3ep SGD (-0.0025 bpb) to PR openai#481's
proven AdamW 30ep cosine + per-layer LR recipe (expected -0.01 to -0.025).

Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post TTT metrics from logs
- README.md + submission.json: Updated for TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 25, 2026
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD
TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x).
3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 25, 2026
On PR openai#549: Replace 3ep SGD TTT (-0.0025 bpb) with PR openai#481's AdamW
recipe (30ep cosine decay, per-layer LR: mlp.proj 3x, mlp.fc 0.5x).
3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB, eval ~589s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>