Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean)#1212
Open
Gusanidas wants to merge 8 commits into openai:main
Conversation
12-layer split-bank U-Net with window attention (size=512 on layers 2,4,6,8,10), mixed seq_len training (5 GPUs at 2048 + 3 GPUs at 6144), fused Triton LeakyReLU-squared MLP, sigmoid-gated skip connections, brotli+byte-shuffle compression, GPTQ int6, sliding window eval (stride=128, seq_len=6144). 5-seed results: 1.1094, 1.1101, 1.1103, 1.1119, 1.1126 (mean 1.1108)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gusanidas
commented
Apr 1, 2026
Record: Window Attention + Mixed Seq_Len Training
val_bpb: 1.1108 (5-seed mean, std 0.0013) | 1.8755 nats | ~15.73 MB | 8xH100 SXM, 600s | No TTT
I started from PR #1130 (KitchenSinkV2 Improved), which added split early/late LR banks, MiLe margin loss, cache+backout residual, residual lambdas, bigger bigram/VE, and FA3 on top of the PR #549 stack. On top of that, I ported the fused Triton MLP from PR #1072 and the sigmoid-gated skips + brotli+byte-shuffle compression from PR #1089. I also increased to 12 layers and tuned qk_gain to 2.5.
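Of the ported pieces, the sigmoid-gated skip from PR #1089 is simple enough to sketch. A minimal numpy version (the function name and shapes are illustrative, not the PR's actual code):

```python
import numpy as np

def gated_skip(x, skip, gate):
    # x += sigmoid(gate) * skip: one learnable scalar `gate` per skip
    # connection; the sigmoid keeps the mixing weight inside (0, 1).
    return x + 1.0 / (1.0 + np.exp(-gate)) * skip

x = np.ones(4)
skip = np.full(4, 2.0)
print(gated_skip(x, skip, 0.0))  # gate=0 -> weight 0.5 -> [2. 2. 2. 2.]
```

Unlike an unconstrained learned scalar, the sigmoid bounds the skip weight, so a badly initialized gate cannot blow up the residual stream.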
The two main contributions of this submission are window attention and mixed seq_len training, described below.
Results (8xH100 80GB SXM, 600s, no TTT)
Current merged SOTA (2026-03-25 AR Self-Gen GPTQ + XSA-all + BigramHash 3072x112): 1.11473 BPB.
Delta vs current merged SOTA: -0.0039 BPB (-0.0066 nats).
Window attention
Instead of full causal attention on every layer, layers 2, 4, 6, 8, and 10 use a sliding window of 512 tokens via Flash Attention 3's window_size parameter. The remaining layers (0, 1, 3, 5, 7, 9, 11) keep full attention.
The motivation was to enable training at longer sequence lengths without proportionally increasing compute. Full quadratic attention at seq_len=6144 is expensive, but with window attention on 5 of 12 layers, those layers run in O(n * w) instead of O(n^2), cutting the per-step cost significantly. The layers with full attention still give the model access to the full context.
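To make the O(n * w) vs O(n^2) comparison concrete, here is an illustrative numpy mask for sliding-window causal attention — a sketch of the attention pattern that FA3's window_size implements, not the kernel itself:

```python
import numpy as np

def sliding_window_causal_mask(n, window):
    # Position i may attend to j iff j <= i (causal) and i - j < window.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

n, w = 6144, 512
full = sliding_window_causal_mask(n, n)      # plain causal: n*(n+1)/2 pairs
windowed = sliding_window_causal_mask(n, w)  # windowed: ~n*w pairs

# At n=6144, w=512 the windowed layers score ~16% of the attended pairs.
print(full.sum(), windowed.sum())
```

Counting the allowed pairs shows why the windowed layers get cheaper the longer the sequence: the full-attention count grows quadratically while the windowed count grows only linearly in n.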
I swept several configurations: window sizes (256, 512, 1024), which layers to window (sparse, dense, even), and how many layers. Window 512 on even-indexed layers was the sweet spot — enough layers windowed to get the speedup, enough full-attention layers to preserve long-range modeling.
At seq_len=2048, windowed attention adds a small overhead (~2-3%): the quadratic term is still cheap at that length, so the window saves little while adding kernel overhead. The benefit kicks in at longer sequences: 15% faster at 4096, 21% at 6144, 25% at 8192.
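A back-of-envelope check on those numbers, assuming attention cost scales with the number of attended pairs and 5 of 12 layers are windowed (this ignores everything outside attention, so end-to-end speedups are necessarily smaller than the attention-only fraction suggests):

```python
def attn_cost_fraction(n, w=512, windowed=5, layers=12):
    # Fraction of full-attention cost kept: (layers - windowed) layers stay
    # quadratic (n*n) while `windowed` layers drop to n*w.
    return ((layers - windowed) * n * n + windowed * n * w) / (layers * n * n)

for n in (2048, 4096, 6144, 8192):
    print(n, round(attn_cost_fraction(n), 3))
```

The fraction approaches 7/12 as n grows, i.e. the savings saturate once the full-attention layers dominate — consistent with the measured speedups flattening out toward 8192.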
Mixed seq_len training
Different GPUs train with different sequence lengths within the same step. In the final configuration, 5 GPUs train at seq_len=2048 and 3 GPUs train at seq_len=6144. The number of sequences per GPU is set so that the total ms per step stays roughly constant.
The idea came from noticing that the sliding-window eval (which uses long sequences) gave substantially better scores than the standard 2048-token eval, but training at long sequence lengths was slow. By having most GPUs train cheaply at 2048 and a few GPUs see long context at 6144, the model gets the best of both: high step throughput from the short-sequence GPUs and long-range learning from the long-sequence ones.
I ran an extensive sweep of seq_len combinations. For the final 8-GPU submission, I used 5x2048 + 3x6144, which balances throughput and long-context exposure.
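A hypothetical sketch of the per-rank setup (the parsing and `rank_config` helper are mine, mirroring the LOCAL_SEQS_PER_GPU / SEQS_PER_GPU values from the reproduction command at the end):

```python
# Each of the 8 ranks gets its own (num_seqs, seq_len) pair.
local_seqs = [int(x) for x in "36,36,36,36,36,10,10,10".split(",")]
seq_lens   = [int(x) for x in "2048,2048,2048,2048,2048,6144,6144,6144".split(",")]

def rank_config(rank):
    return local_seqs[rank], seq_lens[rank]

tokens = [n * s for n, s in zip(local_seqs, seq_lens)]
print(tokens)       # per-rank tokens per step
print(sum(tokens))  # total tokens per step
```

Note the per-rank token counts are deliberately unequal: the short-sequence ranks push 36*2048 = 73,728 tokens per step while the long-sequence ranks push 10*6144 = 61,440, because the counts are tuned to equalize wall-clock time per step, not tokens, and long sequences cost more per token.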
Other changes
Sigmoid-gated skips: `x += sigmoid(gate) * skip` replaces learned scalar skip weights.
Artifact size (worst-case, seed 2)
Under the 16,000,000 byte limit.
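The byte-shuffle trick groups the k-th byte of every fp16 weight together, so the highly repetitive exponent bytes sit contiguously where a generic compressor can exploit them. A stdlib-only sketch of the idea (zlib stands in for the brotli the PR actually uses; helper names are mine):

```python
import numpy as np
import zlib

def byte_shuffle(arr):
    # Reorder bytes from [e0b0, e0b1, e1b0, e1b1, ...] (element-major)
    # to [e0b0, e1b0, ..., e0b1, e1b1, ...] (byte-plane-major).
    b = arr.view(np.uint8).reshape(-1, arr.dtype.itemsize)
    return np.ascontiguousarray(b.T).tobytes()

def byte_unshuffle(data, dtype, count):
    itemsize = np.dtype(dtype).itemsize
    b = np.frombuffer(data, np.uint8).reshape(itemsize, count).T
    return np.ascontiguousarray(b).view(dtype).reshape(-1)

w = np.linspace(-1, 1, 4096).astype(np.float16)
packed = zlib.compress(byte_shuffle(w), 9)
restored = byte_unshuffle(zlib.decompress(packed), np.float16, w.size)
print(len(packed), "<", w.nbytes)  # shuffled stream compresses below raw size
```

The round trip is lossless; only the compressed on-disk representation changes, which is what keeps the artifact under the byte limit without touching model quality.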
Acknowledgments
This submission builds on many contributions from the parameter-golf community:
Flash Attention 3's `window_size` parameter for efficient window attention
Reproducibility
The main training runs used the following command:
```
SEED=$SEED \
MATRIX_LR=0.024 MATRIX_LR_LATE=0.019 \
SCALAR_LR=0.020 SCALAR_LR_LATE=0.038 \
TIED_EMBED_LR=0.022 \
MUON_MOMENTUM=0.985 WARMDOWN_ITERS=4000 \
TRAIN_BATCH_TOKENS=589824 \
NUM_LAYERS=12 BIGRAM_VOCAB_SIZE=5120 VE_DIM=128 \
WINDOW_SIZE=512 WINDOW_ATTN_LAYERS=2,4,6,8,10 \
LOCAL_SEQS_PER_GPU=36,36,36,36,36,10,10,10 \
SEQS_PER_GPU=2048,2048,2048,2048,2048,6144,6144,6144 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
brotli needs to be installed for the final artifact compression path. Flash Attention 3 (flash_attn_interface) is required.