SP8192 + Gated Attention + NorMuon + Norm-PCT-Dropout + Legal TTT — val_bpb 1.0824 (#1520)
Summary
Results
Novel Techniques
1. Gated Attention
Per-head learnable sigmoid gate on attention output. Each head learns when to attenuate its contribution, dynamically suppressing noisy or redundant heads at different training stages. Validated across 5 seeds.
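A minimal sketch of the per-head gate described above. The gate initialization (zero logits, so gates start at 0.5) and the placement of the gate on the pre-projection attention output are assumptions; the PR does not specify either.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Per-head learnable sigmoid gate on attention output (sketch)."""

    def __init__(self, n_heads: int):
        super().__init__()
        # One learnable gate logit per head. Zero init (gate = 0.5) is an
        # assumption; the actual init used in the PR is not stated.
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, n_heads, seq, head_dim)
        g = torch.sigmoid(self.gate_logits)      # (n_heads,) in (0, 1)
        return attn_out * g.view(1, -1, 1, 1)    # scale each head's output
```

Because the gates are learned per head, a head whose contribution is noisy can drive its logit negative and attenuate itself over training.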
2. NorMuon (Post-NS Row Normalization)
Row normalization applied after Newton-Schulz orthogonalization rather than before. Standard MuonEq-R normalizes rows pre-NS, which can wash out useful gradient directional structure. NorMuon preserves it. Validated across 2 seeds.
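A sketch of the post-NS ordering, using the quintic Newton-Schulz coefficients from the public Muon reference implementation. The exact update wiring (momentum, learning-rate scaling) is omitted; only the pre-NS vs post-NS normalization difference is shown.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate orthogonalization via quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference impl
    X = G / (G.norm() + eps)           # whole-matrix scaling, not row scaling
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if tall else X

def normuon_update(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # NorMuon: orthogonalize FIRST, then row-normalize. Normalizing rows
    # before NS would flatten the gradient's directional structure.
    O = newton_schulz(grad)
    return O / (O.norm(dim=1, keepdim=True) + eps)
```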
3. Norm-PCT-Dropout
Zeros the top 1% highest L2-norm rows of FFN intermediate activations during training. Unlike random dropout, this specifically targets dominant pathways — an implicit capacity regularizer that prevents the model from over-relying on a small set of neurons. Validated across 2 seeds.
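A sketch of the mechanism, assuming "rows" means per-token activation vectors in the flattened `(tokens, d_ff)` view; the exact axis and any rescaling of surviving rows are not specified in the PR and are assumptions here.

```python
import torch

def norm_pct_dropout(h: torch.Tensor, pct: float = 0.01, training: bool = True) -> torch.Tensor:
    """Zero the top-`pct` fraction of rows by L2 norm (sketch)."""
    if not training or pct <= 0:
        return h
    flat = h.reshape(-1, h.shape[-1])            # (tokens, d_ff) — assumed axis
    norms = flat.norm(dim=-1)                    # L2 norm per row
    k = max(1, int(pct * norms.numel()))         # how many rows to drop
    thresh = norms.topk(k).values.min()          # norm of the k-th largest row
    mask = (norms < thresh).to(h.dtype).unsqueeze(-1)
    return (flat * mask).reshape(h.shape)        # dominant rows zeroed
```

Unlike standard dropout, the mask is deterministic given the activations: the same dominant pathways are suppressed until the model redistributes capacity.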
4. Parallel Muon (Batched Newton-Schulz)
Groups parameters by shape and runs NS orthogonalization as batched matrix ops. ~3% throughput improvement, ~3 extra training steps in the 600s budget.
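The shape-grouping idea can be sketched as below: stack same-shape gradients into one `(N, m, n)` tensor so the Newton-Schulz matmuls run batched. `batched_newton_schulz` here is a hypothetical helper mirroring the 2-D iteration; the real integration into the optimizer step is not shown.

```python
from collections import defaultdict
import torch

def batched_newton_schulz(mats: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """NS orthogonalization over a (N, m, n) batch of same-shape matrices."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference impl
    X = mats / (mats.norm(dim=(1, 2), keepdim=True) + eps)
    tall = X.size(1) > X.size(2)
    if tall:
        X = X.transpose(1, 2)
    for _ in range(steps):
        A = X @ X.transpose(1, 2)
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.transpose(1, 2) if tall else X

def parallel_muon_orthogonalize(grads):
    """Group gradients by shape; one batched NS call per group."""
    groups = defaultdict(list)
    for i, g in enumerate(grads):
        groups[tuple(g.shape)].append(i)
    out = [None] * len(grads)
    for idxs in groups.values():
        orth = batched_newton_schulz(torch.stack([grads[i] for i in idxs]))
        for j, i in enumerate(idxs):
            out[i] = orth[j]
    return out
```

Batching replaces many small sequential matmuls with a few large ones, which is where the ~3% throughput gain plausibly comes from.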
Hardware Journey — 300+ Experiments
Speed Campaign Highlights (31 A/B experiments, RTX 3090)
Int8 Quantization Discovery
Converged small models show catastrophic GPTQ int6 failure — gap explodes from 0.02 to 3+ BPP:
Int8 eliminates the gap for small models but doesn't fit 16MB for 11L+4x (19.6MB). Final submission uses int6 — achieving better quant efficiency than SOTA (10.3 vs 11.7 mBPP).
How This Could Be Improved
Our novel techniques produce inherently more quantization-friendly weight distributions. Combined with our training speed infrastructure (2.14x over baseline) and identified fixes, the path to closing the +0.0014 gap is concrete:
Ready to deploy (1 run)
Disable the two-lane decoder path (`PARALLEL_START_LAYER=-1`). The two-lane fix is the highest-priority item: a pre-existing code path silently overrides our GPT-J parallel residuals with an unvalidated two-lane decoder split. Disabling it matches PR #1493's proven architecture exactly.
Techniques ready to implement
Why our approach has unique potential
Our quant gap (10.3 mBPP) is 12% more efficient than PR #1493's (11.7 mBPP). This suggests Gated Attention + NorMuon + Norm-PCT-Dropout produce weight distributions with fewer extreme outliers — the exact property that makes GPTQ struggle. Any future quantization improvement (AWQ, Hessian-SDClip, mixed-precision) would compound more favorably with our stack.
Our 2.14x training speedup infrastructure (torch.compile + Parallel Muon + max-autotune) means each iteration cycle is fast — we can A/B test fixes rapidly at ~$8/seed on 8xH100.
Compliance (Track B)
Per Issue #1017:
`torch.no_grad()` before SGD. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache.
Artifact note: Mean 16,051,190 bytes (~51KB over 16MB cap). Fix identified above.
Reproduction
Full experiment log: `experiments.md`

Test plan
🤖 Generated with Claude Code
Theoretical Limits & What We're Chasing
Per Issue #1017's analysis of the entropy floor:
There's ~0.08–0.28 BPP of theoretical headroom between current performance and the information-theoretic floor. The question is how much of that is accessible within the 16MB + 10min constraints.
Future Experiments
Near-term (identified, ready to run)
Two-lane decoder fix + CMP_QUANT_VALUE_DEDUP — single run, expected to close the +0.0014 gap and fix the artifact size. This is the immediate next step.
Hessian-aware SDClip (~30 LOC) — adaptive per-row GPTQ clipping using Hessian information (PR Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean) #1412). Our stack's better quant gap suggests this would compound favorably — if our weights already have fewer outliers, Hessian-aware clipping should preserve even more of the fine structure.
Adaptive TTT scheduling — instead of fixed 3 epochs per 32K chunk, allocate more epochs to high-loss chunks and fewer to easy ones. The current fixed schedule wastes adaptation budget on already-learned content.
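The loss-proportional allocation described in the adaptive-TTT bullet could be sketched as follows. The allocation rule (proportional to per-chunk loss, with a floor of one epoch) is an assumption; any real schedule would also need to respect the wall-clock budget.

```python
def allocate_ttt_epochs(chunk_losses, total_epochs):
    """Split a fixed TTT epoch budget across chunks in proportion to loss.

    Hypothetical policy: high-loss chunks get more adaptation epochs,
    already-learned chunks get the minimum of one. Rounding means the
    realized total may differ slightly from `total_epochs`.
    """
    total = sum(chunk_losses)
    raw = [total_epochs * loss / total for loss in chunk_losses]
    return [max(1, round(r)) for r in raw]
```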
Medium-term (novel directions)
Small-model + int8 Pareto frontier — our CHAMP_D discovery (6L+2x+int8 = 0.001 BPP quant gap on a single 3090) suggests an entirely different Pareto point exists. Scaling this with Gated Attention + NorMuon + more compute could reach competitive BPP at a fraction of the artifact size (~9.5 MB vs ~16 MB), leaving room for additional model capacity or code.
Cross-architecture technique transfer — our novel techniques (Gated Attention, NorMuon, Norm-PCT-Dropout) were validated on both 6L+2x and 11L+4x architectures. Testing them on the intermediate configs (8L+3x, 10L+4x) may reveal a sweet spot where our techniques provide maximum relative lift.
Learned quantization-aware training (QAT) — our stack produces inherently quantization-friendly weights (10.3 vs 11.7 mBPP gap). Adding explicit QAT during the warmdown phase could further reduce the gap, potentially making int8 viable even at 11L+4x scale if combined with better compression.
Visualization-driven improvement (planned)
We plan to use targeted visualizations to identify specific weak points for future novel techniques:
Per-token BPP heatmaps — identify which token types and contexts our model struggles with most (code blocks? rare languages? tabular data? numbers?). High-BPP regions are where novel architectural interventions would have the most leverage.
Gated Attention gate analysis — visualize the learned per-head gates across layers and training stages. Do certain heads consistently gate themselves off? Are the gates correlated with input structure (e.g., gates open wider for syntactically complex passages)? This could inform more sophisticated gating architectures.
Weight distribution comparison — histogram our weight distributions vs PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493's at each layer to understand WHY our quant gap is smaller. If specific layers show tighter distributions, targeted techniques could be applied to the remaining outlier-heavy layers.
Loss decomposition by document type — FineWeb contains diverse content (articles, forums, documentation, code). Decomposing val_bpb by content type would reveal whether our techniques help uniformly or disproportionately on certain content — guiding which novel approaches to develop next.
Layer-wise gradient flow analysis — compare gradient magnitudes through the network with and without NorMuon to verify the hypothesis that post-NS normalization preserves more useful gradient structure. If confirmed, this could be extended to other optimizer components.
These visualizations would enable data-driven novelty — finding the specific weak points rather than guessing, and designing techniques that target them. Combined with our fast iteration infrastructure (2.14x speedup, $8/seed on H100), each visualization insight can be tested in hours rather than days.