[non-record] Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT #1601
Open
SPThole wants to merge 13 commits into openai:main from
PR: Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT
This PR investigates replacing the inner-loop SGD optimizer inside our FOMAML meta-TTT setup with Sharpness-Aware Minimization (SAM). Following the findings from #1502 (in which MetaSGD failed to raise the TTT capability ceiling and the per-layer learning rates collapsed to a uniform 1.0), we hypothesized that adapting from a flatter local minimum in the inner loop could yield a starting point more amenable to generalized Test-Time Training.

However, this experiment solidifies the central conclusion of this entire phase: the TTT adaptation ceiling is firmly architecture-limited. SAM failed to improve the TTT delta, indicating that the initialization basin is already near-optimal for the current parameterization (bank dimensionalities).
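For readers unfamiliar with the setup: a first-order MAML (FOMAML) outer step runs plain SGD adaptation per task, then applies the post-adaptation gradient directly to the initialization, skipping all second-order terms. A minimal NumPy sketch of that structure (function names and the toy tasks are illustrative, not this repo's code):

```python
import numpy as np

def fomaml_meta_step(w0, tasks, inner_lr=0.1, outer_lr=0.01, inner_steps=4):
    """First-order MAML: adapt per task with SGD, then apply the
    post-adaptation gradient straight to the initialization (no
    second-order terms). `tasks` yields per-task gradient callables."""
    meta_grad = np.zeros_like(w0)
    for grad_fn in tasks:
        w = w0.copy()
        for _ in range(inner_steps):       # inner-loop (TTT-style) adaptation
            w = w - inner_lr * grad_fn(w)
        meta_grad += grad_fn(w)            # first-order outer gradient
    return w0 - outer_lr * meta_grad / len(tasks)

# Toy check: two quadratic tasks whose optima sit at +1 and -1.
task_a = lambda w: w - 1.0                 # grad of 0.5*(w-1)^2
task_b = lambda w: w + 1.0                 # grad of 0.5*(w+1)^2
w0 = np.array([5.0])
w1 = fomaml_meta_step(w0, [task_a, task_b])   # moves the init toward the tasks
```

The outer update uses only `grad_fn(w)` at the adapted point, which is what makes the loop first-order and lets an alternative gradient (such as a SAM gradient) be dropped in directly.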
TL;DR — Key Learnings for the Community
- The TTT improvement delta is stuck at ~0.023 bpb. Changing the meta-training inner optimizer cannot overcome the dimensional constraints of the bank parameters.
- The inner-loop landscape is already flat (0.999). There is simply no geometric sharpness to avoid.
- SAM's extra gradient computation via autograd.grad is expensive: SAM consumed 32.4 GB of peak memory, an unexpected memory penalty.

Architecture Overview
This iteration inherits the same foundational architecture as exp106.
Innovation — What This PR Introduces
Motivation: Why SAM was attempted
#1502 utilized MetaSGD but showed that the 66 learned per-layer learning-rate scales converged right back to their initial value of 1.0. MetaSGD looks for scale differentiation, but if the local curvature is roughly isotropic, the learning rates won't diverge.

Sharpness-Aware Minimization (SAM) attempts something different: it perturbs the weights to a local point of maximum loss before computing the gradient for the update.
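The perturb-then-differentiate rule can be sketched in a few lines. This is a hypothetical NumPy illustration of the standard SAM step (ascend to the worst point within radius rho, then take the gradient there); names and the rho value are ours, not this PR's inner-loop code:

```python
import numpy as np

def sam_grad(w, grad_fn, rho=0.05):
    """One SAM gradient: step to the local loss-ascent point within
    radius rho, then evaluate the gradient at that perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # scaled ascent direction
    return grad_fn(w + eps)                        # gradient at the worst-case point

# Toy quadratic L(w) = 0.5 * w^T A w with anisotropic curvature.
A = np.diag([1.0, 10.0])
grad_fn = lambda w: A @ w
w = np.array([1.0, 1.0])
g_sam = sam_grad(w, grad_fn)   # sharper coordinate gets an amplified gradient
```

On the toy quadratic, the high-curvature coordinate's gradient grows after the perturbation, which is exactly the signal SAM uses to steer away from sharp directions.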
By performing the inner FOMAML loop with g_SAM, we explicitly penalize the meta-learner if it places the initialization in a sharp minimum where small TTT adaptations hurt generalization.

Why It Was Expected to Help:
If the baseline models were adapting into brittle local minima during the few-shot evaluation inference stage, SAM would widen the valleys, allowing the TTT steps to travel further without spiking the validation loss.
Results
exp107 — SAM Inner-Loop TTT
Baselines for comparison: exp101, exp105a, exp106

Analysis
The Verification of TTT Invariance
This marks the 4th consecutive experiment validating an essentially identical TTT delta, ranging from -0.0230 to -0.0234 bpb.

Weight-Space Evaluation
We performed a mode connectivity and subspace analysis comparing this PR back to exp #1502.
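For readers unfamiliar with the check: a common mode-connectivity probe evaluates the loss along the straight line between two weight vectors and reports the barrier above the linear interpolation of the endpoint losses; a barrier near zero suggests the two solutions share a basin. A toy sketch (hypothetical helper, not the PR's evaluation harness):

```python
import numpy as np

def linear_mode_barrier(loss_fn, w_a, w_b, n=21):
    """Max loss along the w_a -> w_b segment, measured above the
    straight-line interpolation of the two endpoint losses."""
    ts = np.linspace(0.0, 1.0, n)
    path = [loss_fn((1 - t) * w_a + t * w_b) for t in ts]
    ends = [(1 - t) * path[0] + t * path[-1] for t in ts]
    return max(p - e for p, e in zip(path, ends))

# Toy convex loss: any two points are linearly connected, so the barrier is 0.
loss = lambda w: float(w @ w)
barrier = linear_mode_barrier(loss, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```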
The two measured values were 0.2025 and 0.839.

Final Verdict & Recommendation
Verdict: SAM hurts — discard.
The TTT delta remains stagnant, showing that the optimizer modification fails to achieve its primary mandate. Furthermore, the absolute legal evaluation score is worse than both #1502 and the simple ablation #1501.
With the added, unpredicted memory penalty slowing down throughput, the cost-benefit ratio is unequivocally negative.
Future Recommendation: The Meta-TTT line of investigation is officially concluded. With 4 separate exhaustive variants yielding effectively invariant adaptation deltas, it is clear that the cap is determined by the bank rank-dimensionality (dim=64 × TTT_epochs=4). All future parameter exploration should pivot immediately to structurally raising the bank matrix dimensionality, or to swapping the offline TTT accumulation optimizer (AdaGrad/RMSProp).
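As a reference for the suggested pivot, an RMSProp-style accumulation step normalizes each coordinate by a running RMS of its gradients, so persistently small-gradient bank dimensions still take meaningfully sized steps. A generic sketch of the update rule (toy dimensions and hyperparameters, not project code):

```python
import numpy as np

def rmsprop_step(w, grad, state, lr=1e-3, beta=0.9, eps=1e-8):
    """Generic RMSProp update: divide each coordinate's gradient by a
    running RMS of its own history, making step sizes scale-invariant."""
    state = beta * state + (1 - beta) * grad**2
    w = w - lr * grad / (np.sqrt(state) + eps)
    return w, state

w, state = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(10):                     # accumulate TTT gradients offline
    g = np.array([0.001, 1.0])          # 1000x gradient-scale mismatch
    w, state = rmsprop_step(w, g, state)
```

Despite the 1000x difference in gradient magnitude between the two coordinates, both end up displaced by nearly the same amount, which is the property that could matter for low-magnitude bank directions.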