Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds) #481
Closed
mrdavtan wants to merge 1 commit into openai:main
Conversation
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
ndokutovich added a commit to ndokutovich/parameter-golf that referenced this pull request on Mar 23, 2026
newjordan referenced this pull request in newjordan/parameter-golf on Mar 23, 2026
Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range (127 levels) instead of int6 (31 levels) — 4x more quantization precision for the most damage-prone parameters. Default: INT8_SENSITIVE=attn.proj (attention output projections, which suffer ~3.4x more quant damage per PR #481 analysis). Controlled via env var, comma-separated patterns. Empty = disabled. Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy). We have 0.44MB headroom (15.56MB artifact, 16MB limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
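A minimal sketch of the env-var mechanism that commit message describes, assuming patterns are matched as substrings against layer names; the helper names here are illustrative, not the actual implementation in that repo:

```python
import os

def int8_sensitive_patterns() -> list[str]:
    # Comma-separated substring patterns; an empty value disables the feature.
    raw = os.environ.get("INT8_SENSITIVE", "attn.proj")
    return [p.strip() for p in raw.split(",") if p.strip()]

def quant_levels(layer_name: str, patterns: list[str]) -> int:
    # Matching layers get the int8 range (127 levels); all others int6 (31).
    return 127 if any(pat in layer_name for pat in patterns) else 31
```

For example, `quant_levels("h.3.attn.proj.weight", int8_sensitive_patterns())` returns 127, while an MLP weight falls through to 31.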
Author
Updating to sliding window eval. New scores in progress. The per-layer lr groups, cosine scheduling, and ablation findings are independent of the evaluation method.
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request on Mar 23, 2026
… (openai#486)
- 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay
- per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc
- DDP gradient sync via all_reduce(AVG) + grad clip 1.0
- keep LeakyReLU(0.5)^2 from exp48
- expected: ~0.06 BPB gain (1.127 → ~1.07)
- modal timeout 3600s for 30-epoch TTT
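A minimal sketch of the "all_reduce(AVG) + grad clip 1.0" step named in that commit message, assuming torch.distributed is already initialized and gradients are synced manually rather than via the DDP wrapper; the function name is illustrative:

```python
import torch
import torch.distributed as dist

def sync_and_clip_grads(model: torch.nn.Module, max_norm: float = 1.0) -> None:
    # Average each gradient across ranks, then clip the global norm,
    # mirroring "DDP gradient sync via all_reduce(AVG) + grad clip 1.0".
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```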
Author
Closing: multi-epoch TTT is invalid per the clarified rules in #402. The per-layer LR and cosine scheduling contributions remain available for legal single-pass (Case 2) TTT implementations.
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request on Mar 23, 2026
Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range (127 levels) instead of int6 (31 levels) — 4x more quantization precision for the most damage-prone parameters. Default: INT8_SENSITIVE=attn.proj (attention output projections, which suffer ~3.4x more quant damage per PR openai#481 analysis). Controlled via env var, comma-separated patterns. Empty = disabled. Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy). We have 0.44MB headroom (15.56MB artifact, 16MB limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 25, 2026
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 25, 2026
Upgrades TTT from PR openai#549's weak 3ep SGD (-0.0025 bpb) to PR openai#481's proven AdamW 30ep cosine + per-layer LR recipe (expected -0.01 to -0.025). Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post TTT metrics from logs
- README.md + submission.json: Updated for TTT-enabled submission
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 25, 2026
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x). 3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 25, 2026
On PR openai#549: Replace 3ep SGD TTT (-0.0025 bpb) with PR openai#481's AdamW recipe (30ep cosine decay, per-layer LR: mlp.proj 3x, mlp.fc 0.5x). 3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB, eval ~589s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 28, 2026
Summary
val_bpb=1.0970 (3-seed mean, std=0.0010). 15.4-15.8 MB artifact. 8xH100 SXM, FA2.
Training architecture follows the community stack; the main change from prior work is the TTT schedule. All runs used FA2; FA3 on Hopper would likely improve pre-TTT quality through faster training steps. The schedule is independent of the attention kernel and should apply to any architecture.
TTT scheduling
Two modifications to AdamW TTT (PR #442):
Cosine lr decay over 30 epochs. Starts at full lr to repair large-scale quantization damage, then progressively decays to refine without overshooting. A flat lr must compromise between these two regimes.
Per-layer lr groups based on quantization damage. MLP output projections showed 3.4× higher relative quantization error than input projections on our trained checkpoint. TTT uses 3× base lr for output projections, 0.5× for input projections, and 1× for the rest; ratios are model-specific. Both modifications are sketched in code below.
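A minimal sketch of both modifications, assuming a GPT-style model whose MLP output and input projections are reachable via the substrings `mlp.proj` and `mlp.fc` in parameter names; the function names, `base_lr` default, and loss interface are illustrative, not the exact train_gpt.py implementation:

```python
import math
import torch

def build_ttt_optimizer(model: torch.nn.Module, base_lr: float = 5e-4):
    # Group parameters by measured quantization damage:
    # output projections (3.4x higher relative quant error) get 3x base lr,
    # input projections 0.5x, everything else 1x.
    hi, lo, rest = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (hi if "mlp.proj" in name else lo if "mlp.fc" in name else rest).append(p)
    return torch.optim.AdamW([
        {"params": hi,   "lr": 3.0 * base_lr},
        {"params": lo,   "lr": 0.5 * base_lr},
        {"params": rest, "lr": base_lr},
    ])

def ttt_adapt(model, val_batches, epochs: int = 30, base_lr: float = 5e-4):
    opt = build_ttt_optimizer(model, base_lr)
    peak_lrs = [g["lr"] for g in opt.param_groups]
    for epoch in range(epochs):
        # Cosine decay: full lr early to repair coarse quantization damage,
        # near-zero late to refine without overshooting.
        scale = 0.5 * (1.0 + math.cos(math.pi * epoch / epochs))
        for g, peak in zip(opt.param_groups, peak_lrs):
            g["lr"] = peak * scale
        for x, y in val_batches:
            loss = model(x, y)  # assumes the model returns CE loss given targets
            opt.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
```

Scaling each group's peak lr by a shared cosine factor (rather than one scheduler over a single lr) keeps the 3×/0.5×/1× ratios fixed while all groups decay together.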
30 epochs at ~15.5 s/epoch ≈ 465 s total. We also tested flat lr, SGD, focal loss, and KL divergence against the pre-quantization model; neither focal loss nor KL divergence improved over cross-entropy. The full comparison is in the README.
Other findings
See PR #212 for a non-record submission documenting 25+ experiments with negative results on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization, and depth recurrence.
Acknowledgments
Reproduction
Hardware: 8xH100 SXM (RunPod), PyTorch 2.9.1+cu128, Flash Attention 2