Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) #347
Open
FlashyFlash3011 wants to merge 34 commits into openai:main from
Conversation
Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding-window eval, stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64×4096 = 256×1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in the 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
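The sliding-window evaluation described above (window 4096, stride 256, so each scored chunk conditions on up to 3840 preceding tokens) can be sketched as follows. This is an illustrative reconstruction, not the submission's actual code; the function name and return layout are assumptions:

```python
def sliding_windows(n_tokens, window=4096, stride=256):
    """Plan a sliding-window eval pass: each window scores only its last
    `stride` tokens, conditioning on up to `window - stride` context tokens.
    Returns (start, end, n_scored) triples covering all n_tokens exactly once."""
    windows = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)  # context + scored tokens fit in one window
        windows.append((start, end, end - pos))
        pos = end
    return windows
```

With window=4096 and stride=256, once the sequence is long enough each scored chunk of 256 tokens sees exactly 3840 context tokens, matching the numbers quoted above.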
- warmdown_iters: 1600 → 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 → 41832 (proper NTK formula: 10000 × 4^(64/62) instead of naive 4× multiplication)
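The rope_base bump follows the NTK-aware scaling rule quoted above, base × scale^(d/(d−2)) with head dimension 64 and context scale 4. A small sketch; the defaults are taken from the commit message, and floating-point evaluation may land a few units away from the quoted 41832:

```python
def ntk_rope_base(base=10000.0, scale=4.0, head_dim=64):
    # NTK-aware RoPE scaling: stretch the base so the lowest-frequency
    # dimension covers `scale`x more positions, rather than naively
    # multiplying the base by `scale`.
    return base * scale ** (head_dim / (head_dim - 2))
```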
…penai#549)

- train_seq_len and eval_seq_len raised 2048 → 4096
- All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate, BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA, GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT
- Dynamic NTK auto-scales rope_base to ~48550 for 4096 context
- SDPA fallback added for flash_attn_3 unavailability (local testing)
- rocm-smi fallback for nvidia-smi on ROCm hardware
- QAT Int4 expected BPB estimate updated to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0) but export was int4; fixed to /7.0 with clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix, plus int4 pack/unpack/quant functions; export switched from int6 to int4

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus _fake_quant_int4_bank() applied to all bank weight slices in the forward pass; first time the ~95% of params in qo/kv/mlp banks are QAT-prepared

Also: add zstandard to requirements.txt; add missing README/submission.json
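A minimal numpy sketch of the int4 path described above: symmetric fake-quantisation with scale max|w|/7 and clamp to [-8, 7], plus nibble pack/unpack for export. Function names are illustrative, and the real CastedLinear QAT version also routes gradients through a straight-through estimator, which this forward-only sketch omits:

```python
import numpy as np

def fake_quant_int4(w):
    """Symmetric int4 fake-quant: scale so max|w| maps to 7, round,
    clamp to the int4 range [-8, 7], then dequantise back to float."""
    scale = np.abs(w).max() / 7.0
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale

def pack_int4(q):
    """Nibble-pack int4 values in [-8, 7], two per byte."""
    u = (q.astype(np.int16) + 8).astype(np.uint8)  # shift to [0, 15]
    return (u[0::2] << 4) | u[1::2]

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed int4 stream."""
    hi = (packed >> 4).astype(np.int16) - 8
    lo = (packed & 0x0F).astype(np.int16) - 8
    out = np.empty(packed.size * 2, dtype=np.int16)
    out[0::2], out[1::2] = hi, lo
    return out
```

The 6-bit bug above is visible in this framing: dividing by 31.0 with clamp(-32,31) trains the weights against an int6 grid, while the exporter snapped them to the coarser int4 grid, so the deployed weights no longer matched what QAT had optimised.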
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under the limit)
- Fix: target_mb budget uses decimal MB (1e6), not MiB (1024^2), matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
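The MB-vs-MiB fix matters because at a 16MB target the two units differ by about 777KB, easily the margin between fitting and busting the budget. A hypothetical helper (names are illustrative, not the submission's code):

```python
MB = 1_000_000    # decimal megabyte, as the competition rules count size
MiB = 1024 ** 2   # binary mebibyte, the unit the old code used by mistake

def within_budget(n_bytes, target_mb=16):
    # Budget check against decimal MB; using MiB here would silently
    # allow artifacts up to ~777KB over the real 16MB limit.
    return n_bytes <= target_mb * MB
```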
…, TTT epochs=1/freeze=4/lr=0.001
Submission
Experiment: `records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/`
Strategy: Pure Velocity & TTT Preservation
Initial attempts tried to maximize model capacity (GatedAttention, ValueResidual, BigramHash=2048). Ablations showed these features add ~1.5ms/step overhead and destabilize TTT, costing more in training steps than they gain in quality under the 10min/16MB constraint. The winning strategy strips the model to its leanest form.
Results (8×H100 80GB SXM)
Key Changes
- GATED_ATTENTION=0, VALUE_RESIDUAL=0
- SWA_ENABLED=0
- BANK_QAT_THRESHOLD=0
- LATE_QAT_THRESHOLD=0.15
- TRAIN_SEQ_LEN=2048

Features Explored but Disabled
These were implemented and tested but hurt under the 10min/16MB constraint. They remain in the codebase and are expected to help significantly with more budget.
Headroom & Scaling Evidence
The submission sits at 15.851–15.888MB across seeds (mean 15.866MB), roughly 134KB under the 16MB limit. Attempts to fill that headroom (BigramHash=1664, 2048) produced worse BPB and exceeded the size limit. In an uncapped scenario, all disabled levers can be opened simultaneously for significantly better BPB.