XSA-All 11L + LeakyReLU(0.75)² + Aggressive Legal TTT → 1.1219 BPB #1092
Open
teddyoweh wants to merge 3 commits into openai:main from
Conversation
Excellent combination of tweaks that synergize with more aggressive TTT. I'm surprised that the 15x learning rate was better, nice finding!
Results
val_bpb: 1.1219 | Artifact: 15,916,230 bytes (15.92 MB) | 8×H100 SXM
What's New
Three independently validated improvements on top of the PR #414 + PR #399 stack:
1. XSA on All 11 Layers (XSA_LAST_N=11)
Extending eXtended Self-Attention from the last 4 layers to all 11 yields -0.0007 BPB. The richer attention outweighs the ~4% slower step time (93.97 ms vs ~90 ms).
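As a rough illustration of how the XSA_LAST_N knob selects layers, here is a minimal sketch. The env-var name comes from the PR; the gating helper and the N_LAYERS constant are hypothetical stand-ins for however the actual script wires this up.

```python
import os

# Hypothetical sketch: gate which transformer blocks get eXtended
# Self-Attention. XSA_LAST_N=11 enables it on all 11 layers; the
# previous default of 4 covered only the last 4.
N_LAYERS = 11
xsa_last_n = int(os.environ.get("XSA_LAST_N", "4"))

def uses_xsa(layer_idx: int) -> bool:
    """A block uses XSA iff it is among the last `xsa_last_n` layers."""
    return layer_idx >= N_LAYERS - xsa_last_n
```

With XSA_LAST_N=11 every call returns True, matching the "all 11 layers" configuration in this PR.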
2. LeakyReLU(0.75)²
Higher negative slope than the current SOTA (0.75 vs 0.5). From PR #977's ablation, 0.75 is strictly better than 0.5 for the int6 stack. Preserves more gradient flow through the MLP.
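A minimal sketch of the activation, assuming the name is literal: a LeakyReLU with negative slope 0.75 followed by an elementwise square (the relu²-style shape, but with a nonzero gradient path for negative pre-activations). The module name is my own; the actual implementation in the PR may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredLeakyReLU(nn.Module):
    """LeakyReLU(slope) followed by an elementwise square.

    Assumed reading of "LeakyReLU(0.75)²": unlike plain relu², negative
    inputs keep a gradient path through the 0.75 slope before squaring.
    """
    def __init__(self, negative_slope: float = 0.75):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.leaky_relu(x, self.negative_slope)
        return y * y  # same as relu²(x) on the positive side
```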
3. Aggressive Legal TTT (lr=0.03)
Score-first TTT using PR #461's legal framework with a 15× higher learning rate (0.03 vs 0.002). Delivers -0.0033 BPB improvement (vs -0.0025 in SOTA). All blocks unfrozen, SGD with momentum 0.9, 3 epochs per chunk, cosine LR decay.
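The score-first loop described above can be sketched as follows. The model, chunking, and loss are placeholders; only the schedule (lr=0.03, SGD with momentum 0.9, 3 epochs per chunk, cosine LR decay, score before any update) comes from the PR text, and the per-token bits value here ignores the token-to-byte length ratio a real BPB metric needs.

```python
import math
import torch
import torch.nn.functional as F

def ttt_chunk(model, chunk_ids, base_lr=0.03, epochs=3):
    """Score a chunk with the current weights, then adapt on it."""
    # 1) Score first, statelessly: no weights change before this point.
    with torch.inference_mode():
        logits = model(chunk_ids[:, :-1])
        bits = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            chunk_ids[:, 1:].reshape(-1),
        ) / math.log(2)  # nats -> bits per token (placeholder for BPB)

    # 2) Only AFTER scoring, adapt all (unfrozen) blocks on the chunk.
    opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    for epoch in range(epochs):
        # cosine decay of the LR across the epochs
        lr = 0.5 * base_lr * (1 + math.cos(math.pi * epoch / epochs))
        for g in opt.param_groups:
            g["lr"] = lr
        logits = model(chunk_ids[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            chunk_ids[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return bits.item()
```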
torch.inference_mode() guarantees scoring is stateless: weights are only updated AFTER the chunk is scored.
FA3 Fallback
The script includes an automatic fallback from Flash Attention 3 to PyTorch SDPA.
Our run used SDPA (93.97ms/step → 6,173 steps). With FA3 (~84ms/step → ~7,100 steps), expected BPB would be in the 1.119x range.
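The fallback pattern can be sketched like this. The `flash_attn_interface` module name and the `flash_attn_func` return signature are assumptions about the FA3 package; the SDPA path uses only PyTorch's built-in `scaled_dot_product_attention`, which is what our run exercised.

```python
import torch
import torch.nn.functional as F

# Try the Hopper-only FA3 kernel; fall back to built-in SDPA otherwise.
try:
    from flash_attn_interface import flash_attn_func  # assumed FA3 entry point
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attend(q, k, v, causal=True):
    """Attention over (batch, seqlen, heads, headdim) tensors."""
    if HAS_FA3:
        # FA3 is assumed here to return (out, softmax_lse).
        out, _ = flash_attn_func(q, k, v, causal=causal)
        return out
    # SDPA expects (batch, heads, seqlen, headdim), so transpose around it.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal,
    )
    return out.transpose(1, 2)
```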
Timing
Run Command
Credits