
[non-record] Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT#1601

Open
SPThole wants to merge 13 commits into openai:main from SPThole:non_record_7

Conversation


@SPThole SPThole commented Apr 13, 2026

PR: Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT

Track: non-record
Hardware: 1×H100 80 GB SXM
Legal_ttt: 1.11898 | TTT delta: −0.02337 bpb
Status: non-record exploration

This PR investigates replacing the inner-loop SGD optimizer inside our FOMAML meta-TTT setup with Sharpness-Aware Minimization (SAM). Following the findings from #1502 (in which MetaSGD failed to raise the TTT capability ceiling and per-layer learning rates collapsed to uniform 1.0), we hypothesized that adapting over a flatter local minimum in the inner loop could yield a starting point more amenable to generalized Test-Time Training.

However, this experiment solidifies the central conclusion of this entire phase: the TTT adaptation ceiling is firmly architecture-limited. SAM failed to improve the TTT delta, indicating that the initialization basin is already near-optimal for the current parameterization (bank dimensionalities).


TL;DR — Key Learnings for the Community

  1. The TTT adaptation ceiling is strictly set by the architecture. Just like the original same-batch FOMAML, the no-meta baseline, and the cross-document delta-loss + MetaSGD variants, the SAM inner loop produces the same ~0.023 bpb TTT improvement delta. Changing the meta-training inner optimizer cannot overcome the dimensional constraints of the bank parameters.
  2. SAM specifically fails because the bank geometry is already highly isotropic. Analysis reveals a near-uniform singular-value distribution (uniformity 0.999). There is simply no geometric sharpness to avoid.
  3. Multi-epoch TTT evaluations erase the initialization signal. A 4-epoch standalone TTT evaluation washes out any minute flatness bias we carefully baked in with 1-step meta-training. The $128 \times$ magnitude gap between meta-step sizes and TTT accumulation obliterates the SAM advantage.
  4. SAM carries a large hidden memory penalty. SAM computes the gradient at an ascent-perturbed point, which requires holding the graphs of all 11 layers simultaneously for autograd.grad. SAM consumed 32.4 GB of peak memory, an unexpected $+0.7$ GB penalty over MetaSGD, costing roughly 89 useful training steps.

Architecture Overview

This iteration inherits the identical foundational architecture as exp106.

| Component | Configuration |
| --- | --- |
| Model | 11-layer U-Net GPT (5 encoder + 6 decoder with skips) |
| Hidden dim | 512 |
| Attention | 8Q / 4KV (GQA) |
| XSA | All 11 blocks |
| Bigram | 4096×64, position-conditional logic |
| Meta-TTT | Base FOMAML every=4, cross-chunk inner/outer split, delta-loss enabled |

Innovation — What This PR Introduces

Motivation: Why SAM was attempted

#1502 utilized MetaSGD but showed that the 66 learned per-layer learning rate scales converged right back to their initial value of 1.0. MetaSGD looks for scale differentiation, but if the local curvature is generally isotropic, learning rates won't diverge.

Sharpness-Aware Minimization (SAM) attempts something different. It perturbs the weights to a local point of maximum loss before computing the gradient for the update.

1. Compute the gradient g_0 at the current banks
2. Perturb: banks_adv = banks + ρ · g_0 / ‖g_0‖
3. Compute the inner-step gradient g_SAM at banks_adv
4. Update: banks' = banks − α · g_SAM

By performing the inner FOMAML update with g_SAM, we explicitly penalize the meta-learner for placing the initialization in a sharp minimum where small TTT adaptations hurt generalization.
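The four steps above can be sketched end-to-end in pure Python on a toy quadratic loss. The loss, curvature values, `rho`, and `alpha` below are illustrative stand-ins, not this PR's actual bank tensors or hyperparameters.

```python
import math

# Toy illustration of the four SAM steps on a 2-D quadratic loss
# L(w) = 0.5 * sum(a_i * w_i^2), whose gradient is a_i * w_i.
# "banks" here is a plain list standing in for the bank parameters;
# rho and alpha are hypothetical values, not the ones used in this PR.

def grad(w, curv):
    return [a * x for a, x in zip(curv, w)]

def sam_inner_step(banks, curv, rho=0.05, alpha=0.1):
    # 1. gradient g_0 at the current banks
    g0 = grad(banks, curv)
    norm = math.sqrt(sum(g * g for g in g0)) or 1.0
    # 2. ascend to the perturbed point banks + rho * g_0 / ||g_0||
    perturbed = [w + rho * g / norm for w, g in zip(banks, g0)]
    # 3. gradient g_SAM at the perturbed point
    g_sam = grad(perturbed, curv)
    # 4. descend from the ORIGINAL banks using g_SAM
    return [w - alpha * g for w, g in zip(banks, g_sam)]

# Along the sharper dimension (curvature 4.0), the SAM gradient is
# slightly larger than the plain gradient, so the step is longer.
updated = sam_inner_step([1.0, -2.0], [1.0, 4.0])
```

Note the update in step 4 is applied from the original banks, not the perturbed point: the perturbation exists only to evaluate the gradient at the locally worst-case weights.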

Why It Was Expected to Help:

If the baseline models were adapting into brittle local minima during the few-shot evaluation stage, SAM would widen the valleys, allowing the TTT steps to travel further without spiking the validation loss.


Results

exp107 — SAM Inner-Loop TTT

| Metric | Value | Note |
| --- | --- | --- |
| Steps completed | 6597 / 7500 | Wallclock cap hit due to SAM memory overhead |
| Post-EMA float baseline | 1.1384 | |
| Int6 canonical Legal_ttt | 1.11898 | Full 947/947 chunks evaluated successfully |
| TTT delta | −0.02337 | Matches exp101, exp105a, exp106 |
| Peak GPU memory | 32,397 MiB | $+0.7$ GB unaccounted memory bloat vs #1502 |
| Total submission size | 15.88 MB | Safely within the boundary constraint |

Analysis

The Verification of TTT Invariance

| Experiment | Baseline bpb | Post-TTT bpb | TTT delta |
| --- | --- | --- | --- |
| exp #1501 (vanilla FOMAML) | 1.13930 | 1.11588 | −0.02342 |
| exp #1502 (MetaSGD + cross-chunk) | 1.13767 | 1.11469 | −0.02299 |
| **this PR (SAM inner loop)** | 1.13840 | 1.11898 | −0.02337 |

This marks the fourth consecutive experiment validating a near-identical TTT delta in the −0.0230 to −0.0234 bpb range.

Weight-Space Evaluation

We performed a mode-connectivity and subspace analysis comparing this PR back to exp #1502:

  • Bank cosine similarity: 0.2025
  • Midpoint loss norm ratio: 0.839
  • Assessment: The midpoint ratio indicates they reside within effectively the exact same flat valley. The gradient perturbations injected by SAM simply walk the initialization into equivalent, neighboring basins without functionally changing the limits of the manifold.
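A minimal sketch of the two weight-space metrics quoted above, assuming the bank parameters of each checkpoint are flattened into plain vectors. The midpoint ratio here uses one common mode-connectivity definition (midpoint loss over mean endpoint loss); the PR does not spell out its exact formula, so treat this as an assumption.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two flattened bank vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def midpoint_ratio(loss_fn, w1, w2):
    # A ratio below 1 suggests the two solutions share one flat basin:
    # the loss does not spike along the straight line between them.
    mid = [(a + b) / 2.0 for a, b in zip(w1, w2)]
    mean_endpoint = 0.5 * (loss_fn(w1) + loss_fn(w2))
    return loss_fn(mid) / mean_endpoint
```

Under this definition, a low bank cosine similarity (0.2025) combined with a sub-unity midpoint ratio (0.839) is consistent with two distant points inside one wide, flat valley.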

Final Verdict & Recommendation

Verdict: SAM hurts — discard.

The TTT delta remains stagnant, showing that the optimizer modification fails its primary mandate. Furthermore, the absolute Legal_ttt score is worse than both #1502 and the simple ablation #1501.

With an added unpredicted memory penalty slowing down throughput, the cost-benefit ratio is unequivocally negative.

Future Recommendation: The Meta-TTT line of investigation is officially concluded. With 4 separate exhaustive variants yielding effectively invariant adaptation deltas, it is clear the cap is determined by the bank rank-dimensionality (dim=64 × TTT_epochs=4). All future parameter exploration should pivot to structurally raising the bank matrix dimensionality, or to swapping the offline TTT accumulation optimizer (AdaGrad/RMSProp).
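For the suggested follow-up of swapping the offline TTT accumulation optimizer, a minimal RMSProp step looks like the sketch below. All hyperparameters (`lr`, `decay`, `eps`) are illustrative defaults, not values from this PR.

```python
def rmsprop_step(w, g, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update over flat parameter/gradient lists.

    cache holds the running second moment of the gradients; the
    per-coordinate step is rescaled by its root, which damps steps
    along persistently large-gradient (sharp) directions.
    """
    new_w, new_cache = [], []
    for wi, gi, ci in zip(w, g, cache):
        ci = decay * ci + (1.0 - decay) * gi * gi  # running 2nd moment
        new_cache.append(ci)
        new_w.append(wi - lr * gi / (ci ** 0.5 + eps))
    return new_w, new_cache
```

The per-coordinate rescaling is what distinguishes this from the plain SGD accumulation used so far: it could compensate for anisotropy across bank dimensions without changing the bank rank itself.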
