[Non-Record] Hymba-LongContext: 32K context training and eval via hybrid SSM (1.1873 BPB)#914
Open
mkenney2 wants to merge 1 commit into openai:main from
Conversation
… SWA (1.1873 BPB)

Hybrid Mamba + Sliding Window Attention architecture enabling 32K context training with near-constant per-step cost. Going from 8K to 64K context adds virtually no overhead (~80-83 ms/step) because both SWA and Mamba have constant per-token cost.

Mean val_bpb: 1.1873 +/- 0.0008 (3 seeds)
Training: 600s on 8xH100, ~82 ms/step
Evaluation: score-first TTT, ~67-74s
Artifact: int6 + zstd-22, ~14.5 MB
Summary
This submission demonstrates that hybrid SSM architectures can train at 32x longer context (32,768 tokens) than the standard baseline (1,024 tokens) with near-constant cost as context length increases. Because it combines Mamba (a selective state space model) with sliding window attention (SWA-512), both branches have constant per-token cost: going from 8K to 64K context adds virtually no overhead (~80-83 ms/step). This enables ultra-long context training within the 10-minute wall-clock budget.
Results
Key Innovations
1. Ultra-Long Context Training (32K tokens)
The naive transformer baseline trains at 1,024-token context. We train at 32,768 tokens, a 32x increase. Increasing context from 8K to 64K adds virtually no overhead (~80-83 ms/step across that range) because both branches have constant per-token cost: SWA attends to a fixed 512-token window regardless of sequence length, and Mamba's recurrent scan processes each token in O(1). Since the total number of tokens per batch is fixed (~524K), the step time stays roughly constant regardless of how those tokens are divided into sequences.
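The fixed-token-budget arithmetic can be checked directly. A small sketch (the exact batch size of 524,288 = 2**19 tokens is an assumption based on the "~524K" figure above):

```python
# Fixed per-step token budget: a longer context just means fewer,
# longer sequences -- total tokens processed per step is unchanged.
TOTAL_TOKENS = 524_288  # assumed 2**19, i.e. the "~524K" stated above

for ctx in (8_192, 32_768, 65_536):
    n_seqs = TOTAL_TOKENS // ctx
    print(f"context={ctx:>6}  sequences/batch={n_seqs:>3}  "
          f"tokens/step={n_seqs * ctx}")
```

Because tokens/step is identical for every context length, constant per-token cost implies constant per-step cost.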
This is made possible by two architectural choices:
- The Mamba branch handles global context (full-sequence memory via its recurrent state).
- The SWA branch handles local pattern matching.
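A minimal sketch of why the SWA branch's per-token cost is capped (the 512-token window is from the text; the helper name is hypothetical):

```python
WINDOW = 512  # fixed sliding-window size from the architecture above

def swa_keys_attended(seq_len, window=WINDOW):
    # Under causal sliding-window attention, position i attends to at most
    # `window` keys (itself plus the previous window-1 positions), no matter
    # how long the full sequence is. Mamba's scan is similarly O(1) per
    # token: one recurrent state update per position.
    return [min(i + 1, window) for i in range(seq_len)]

for L in (1_024, 32_768):
    print(L, max(swa_keys_attended(L)))  # cap is 512 for both lengths
```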
2. Hymba Hybrid Architecture
Based on the Hymba paper (arXiv:2411.13676), each block runs attention and Mamba in parallel within a single layer:
Architecture: 7 layers, 512 dim, MLP 4x with LeakyReLU(0.9)^2, U-Net skip connections, SmearGate + BigramHash embedding, EMA (0.997).
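As an illustration only, here is a toy NumPy forward pass of a Hymba-style parallel block. The real model uses learned projections, Mamba's selective scan, and flex_attention; the simple mean fusion below stands in for the paper's learned normalization and gating:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(x, window=4):
    # Toy causal SWA: each position attends only to itself and the
    # previous window-1 positions (queries = keys = values = x here).
    L, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.full((L, L), -np.inf)
    for i in range(L):
        mask[i, max(0, i - window + 1):i + 1] = 0.0
    return softmax(scores + mask) @ x

def ssm_branch(x, a=0.9):
    # Toy linear recurrence standing in for Mamba's selective scan:
    # one O(1) state update per token carries global context forward.
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + (1 - a) * x[t]
        out[t] = h
    return out

def hymba_block(x, window=4):
    # Hymba-style block: attention and SSM run in parallel on the same
    # input and their outputs are fused (mean here, learned in the paper).
    return 0.5 * (sliding_window_attention(x, window) + ssm_branch(x))

x = np.random.default_rng(0).standard_normal((16, 8))
y = hymba_block(x)
print(y.shape)  # (16, 8)
```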
3. Score-First Test-Time Training (TTT)
Legal TTT following the PR #461 recipe:
Runs under `inference_mode` (no gradient). TTT improves post-quantization BPB by adapting the quantized model to validation data patterns.
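As a hedged toy of the score-first ordering (the actual PR #461 recipe is not reproduced here), a linear model stands in for the quantized LM: the metric is recorded before any adaptation, and only then is the model adapted on the same data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the quantized model: a linear predictor. The real
# submission adapts the language model on validation token patterns.
W = rng.standard_normal((8, 8)) * 0.1
X = rng.standard_normal((64, 8))
Y = X + 0.01 * rng.standard_normal((64, 8))  # target: identity map + noise

def loss(W):
    return float(np.mean((X @ W - Y) ** 2))

score_before = loss(W)           # 1) score FIRST: record the metric
for _ in range(50):              # 2) THEN adapt on the same data (TTT)
    grad = 2 * X.T @ (X @ W - Y) / X.size
    W -= 0.5 * grad
score_after = loss(W)
print(score_after < score_before)  # True: adaptation improved the metric
```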
4. Context Length Scaling Results
We systematically evaluated training context length while keeping all other hyperparameters fixed:
Per-step cost is nearly constant from 8K to 64K context because the per-token cost of both SWA and Mamba is independent of sequence length (see above). Quality improves with longer context up to ~32K, then plateaus.
Why This Matters
With constant per-token cost, our architecture can train and evaluate at long context without the quadratic overhead that full attention would incur. This frees the eval time budget for TTT adaptation rather than expensive sliding window overlap.
The competition README specifically requests "state-space models" and "super long context for evaluation or training" as novel directions. This submission demonstrates both, showing that hybrid SSM architectures naturally enable ultra-long context training regimes.
Run Command
Dependencies
Setup
Requires PyTorch >= 2.5 for flex_attention (sliding window). Tested on PyTorch 2.8.0+cu128.
Note: `--no-build-isolation` is critical to avoid a CUDA version mismatch during the mamba-ssm/causal-conv1d CUDA kernel builds.
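A sketch of the install ordering the note implies; the package names and flag are from the text, but the exact version pin is an assumption:

```shell
# Install PyTorch first so its CUDA toolchain is visible to the
# kernel builds below (the text requires >= 2.5 for flex_attention).
pip install "torch>=2.5"

# --no-build-isolation makes pip build mamba-ssm / causal-conv1d in the
# current environment, so their CUDA kernels compile against the already
# installed torch/CUDA instead of an isolated, possibly mismatched one.
pip install --no-build-isolation causal-conv1d mamba-ssm
```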