SP8192 + Gated Attention + NorMuon + Norm-PCT-Dropout + Legal TTT — val_bpb 1.0824 (#1520)
Summary
Results
Novel Techniques
1. Gated Attention
Per-head learnable sigmoid gate on attention output. Each head learns when to attenuate its contribution, dynamically suppressing noisy or redundant heads at different training stages. Validated across 5 seeds.
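A minimal sketch of the per-head gate described above. The gate initialization (zero logits, so gates start at 0.5) and the placement of the gate on the pre-projection attention output are assumptions; the PR does not specify either.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Per-head learnable sigmoid gate on attention output (sketch)."""

    def __init__(self, n_heads: int):
        super().__init__()
        # One learnable gate logit per head. Zero init (gate = 0.5) is an
        # assumption; the actual init used in the PR is not stated.
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, n_heads, seq, head_dim)
        g = torch.sigmoid(self.gate_logits)      # (n_heads,) in (0, 1)
        return attn_out * g.view(1, -1, 1, 1)    # scale each head's output
```

Because the gates are learned per head, a head whose contribution is noisy can drive its logit negative and attenuate itself over training.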
2. NorMuon (Post-NS Row Normalization)
Row normalization applied after Newton-Schulz orthogonalization rather than before. Standard MuonEq-R normalizes rows pre-NS, which can wash out useful gradient directional structure. NorMuon preserves it. Validated across 2 seeds.
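A sketch of the post-NS ordering, using the quintic Newton-Schulz coefficients from the public Muon reference implementation. The exact update wiring (momentum, learning-rate scaling) is omitted; only the pre-NS vs post-NS normalization difference is shown.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate orthogonalization via quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference impl
    X = G / (G.norm() + eps)           # whole-matrix scaling, not row scaling
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if tall else X

def normuon_update(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # NorMuon: orthogonalize FIRST, then row-normalize. Normalizing rows
    # before NS would flatten the gradient's directional structure.
    O = newton_schulz(grad)
    return O / (O.norm(dim=1, keepdim=True) + eps)
```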
3. Norm-PCT-Dropout
Zeros the top 1% highest L2-norm rows of FFN intermediate activations during training. Unlike random dropout, this specifically targets dominant pathways — an implicit capacity regularizer that prevents the model from over-relying on a small set of neurons. Validated across 2 seeds.
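A sketch of the mechanism, assuming "rows" means per-token activation vectors in the flattened `(tokens, d_ff)` view; the exact axis and any rescaling of surviving rows are not specified in the PR and are assumptions here.

```python
import torch

def norm_pct_dropout(h: torch.Tensor, pct: float = 0.01, training: bool = True) -> torch.Tensor:
    """Zero the top-`pct` fraction of rows by L2 norm (sketch)."""
    if not training or pct <= 0:
        return h
    flat = h.reshape(-1, h.shape[-1])            # (tokens, d_ff) — assumed axis
    norms = flat.norm(dim=-1)                    # L2 norm per row
    k = max(1, int(pct * norms.numel()))         # how many rows to drop
    thresh = norms.topk(k).values.min()          # norm of the k-th largest row
    mask = (norms < thresh).to(h.dtype).unsqueeze(-1)
    return (flat * mask).reshape(h.shape)        # dominant rows zeroed
```

Unlike standard dropout, the mask is deterministic given the activations: the same dominant pathways are suppressed until the model redistributes capacity.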
4. Parallel Muon (Batched Newton-Schulz)
Groups parameters by shape and runs NS orthogonalization as batched matrix ops. ~3% throughput improvement, ~3 extra training steps in the 600s budget.
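The shape-grouping idea can be sketched as below: stack same-shape gradients into one `(N, m, n)` tensor so the Newton-Schulz matmuls run batched. `batched_newton_schulz` here is a hypothetical helper mirroring the 2-D iteration; the real integration into the optimizer step is not shown.

```python
from collections import defaultdict
import torch

def batched_newton_schulz(mats: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """NS orthogonalization over a (N, m, n) batch of same-shape matrices."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference impl
    X = mats / (mats.norm(dim=(1, 2), keepdim=True) + eps)
    tall = X.size(1) > X.size(2)
    if tall:
        X = X.transpose(1, 2)
    for _ in range(steps):
        A = X @ X.transpose(1, 2)
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.transpose(1, 2) if tall else X

def parallel_muon_orthogonalize(grads):
    """Group gradients by shape; one batched NS call per group."""
    groups = defaultdict(list)
    for i, g in enumerate(grads):
        groups[tuple(g.shape)].append(i)
    out = [None] * len(grads)
    for idxs in groups.values():
        orth = batched_newton_schulz(torch.stack([grads[i] for i in idxs]))
        for j, i in enumerate(idxs):
            out[i] = orth[j]
    return out
```

Batching replaces many small sequential matmuls with a few large ones, which is where the ~3% throughput gain plausibly comes from.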
Hardware Journey — 300+ Experiments
Speed Campaign Highlights (31 A/B experiments, RTX 3090)
Int8 Quantization Discovery
Converged small models show catastrophic GPTQ int6 failure — gap explodes from 0.02 to 3+ BPP:
Int8 eliminates the gap for small models but doesn't fit 16MB for 11L+4x (19.6MB). Final submission uses int6 — achieving better quant efficiency than SOTA (10.3 vs 11.7 mBPP).
How This Could Be Improved
Our novel techniques produce inherently more quantization-friendly weight distributions. Combined with our training speed infrastructure (2.14x over baseline) and identified fixes, the path to closing the +0.0014 gap is concrete:
Ready to deploy (1 run)
Disable the two-lane decoder path (`PARALLEL_START_LAYER=-1`). The two-lane fix is the highest-priority item: a pre-existing code path silently overrides our GPT-J parallel residuals with an unvalidated two-lane decoder split. Disabling it matches PR #1493's proven architecture exactly.
Techniques ready to implement
Why our approach has unique potential
Our quant gap (10.3 mBPP) is 12% more efficient than PR #1493's (11.7 mBPP). This suggests Gated Attention + NorMuon + Norm-PCT-Dropout produce weight distributions with fewer extreme outliers — the exact property that makes GPTQ struggle. Any future quantization improvement (AWQ, Hessian-SDClip, mixed-precision) would compound more favorably with our stack.
Our 2.14x training speedup infrastructure (torch.compile + Parallel Muon + max-autotune) means each iteration cycle is fast — we can A/B test fixes rapidly at ~$8/seed on 8xH100.
Compliance (Track B)
Per Issue #1017:
`torch.no_grad()` before SGD. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache.
Artifact note: Mean 16,051,190 bytes (~51KB over 16MB cap). Fix identified above.
Reproduction
Full experiment log: `experiments.md`

Test plan
🤖 Generated with Claude Code
Theoretical Limits & What We're Chasing
Per Issue #1017's analysis of the entropy floor:
There's ~0.08–0.28 BPP of theoretical headroom between current performance and the information-theoretic floor. The question is how much of that is accessible within the 16MB + 10min constraints.
Future Experiments
Near-term (identified, ready to run)
Two-lane decoder fix + CMP_QUANT_VALUE_DEDUP — single run, expected to close the +0.0014 gap and fix the artifact size. This is the immediate next step.
Hessian-aware SDClip (~30 LOC) — adaptive per-row GPTQ clipping using Hessian information (PR Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean) #1412). Our stack's better quant gap suggests this would compound favorably — if our weights already have fewer outliers, Hessian-aware clipping should preserve even more of the fine structure.
Adaptive TTT scheduling — instead of fixed 3 epochs per 32K chunk, allocate more epochs to high-loss chunks and fewer to easy ones. The current fixed schedule wastes adaptation budget on already-learned content.
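The loss-proportional allocation described in the adaptive-TTT bullet could be sketched as follows. The allocation rule (proportional to per-chunk loss, with a floor of one epoch) is an assumption; any real schedule would also need to respect the wall-clock budget.

```python
def allocate_ttt_epochs(chunk_losses, total_epochs):
    """Split a fixed TTT epoch budget across chunks in proportion to loss.

    Hypothetical policy: high-loss chunks get more adaptation epochs,
    already-learned chunks get the minimum of one. Rounding means the
    realized total may differ slightly from `total_epochs`.
    """
    total = sum(chunk_losses)
    raw = [total_epochs * loss / total for loss in chunk_losses]
    return [max(1, round(r)) for r in raw]
```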
Medium-term (novel directions)
Small-model + int8 Pareto frontier — our CHAMP_D discovery (6L+2x+int8 = 0.001 BPP quant gap on a single 3090) suggests an entirely different Pareto point exists. Scaling this with Gated Attention + NorMuon + more compute could reach competitive BPP at a fraction of the artifact size (~9.5 MB vs ~16 MB), leaving room for additional model capacity or code.
Cross-architecture technique transfer — our novel techniques (Gated Attention, NorMuon, Norm-PCT-Dropout) were validated on both 6L+2x and 11L+4x architectures. Testing them on the intermediate configs (8L+3x, 10L+4x) may reveal a sweet spot where our techniques provide maximum relative lift.
Learned quantization-aware training (QAT) — our stack produces inherently quantization-friendly weights (10.3 vs 11.7 mBPP gap). Adding explicit QAT during the warmdown phase could further reduce the gap, potentially making int8 viable even at 11L+4x scale if combined with better compression.
Visualization-driven improvement (planned)
We plan to use targeted visualizations to identify specific weak points for future novel techniques:
Per-token BPP heatmaps — identify which token types and contexts our model struggles with most (code blocks? rare languages? tabular data? numbers?). High-BPP regions are where novel architectural interventions would have the most leverage.
Gated Attention gate analysis — visualize the learned per-head gates across layers and training stages. Do certain heads consistently gate themselves off? Are the gates correlated with input structure (e.g., gates open wider for syntactically complex passages)? This could inform more sophisticated gating architectures.
Weight distribution comparison — histogram our weight distributions vs PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493's at each layer to understand WHY our quant gap is smaller. If specific layers show tighter distributions, targeted techniques could be applied to the remaining outlier-heavy layers.
Loss decomposition by document type — FineWeb contains diverse content (articles, forums, documentation, code). Decomposing val_bpb by content type would reveal whether our techniques help uniformly or disproportionately on certain content — guiding which novel approaches to develop next.
Layer-wise gradient flow analysis — compare gradient magnitudes through the network with and without NorMuon to verify the hypothesis that post-NS normalization preserves more useful gradient structure. If confirmed, this could be extended to other optimizer components.
These visualizations would enable data-driven novelty — finding the specific weak points rather than guessing, and designing techniques that target them. Combined with our fast iteration infrastructure (2.14x speedup, $8/seed on H100), each visualization insight can be tested in hours rather than days.