Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) #347

Open
FlashyFlash3011 wants to merge 34 commits into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l
Conversation

@FlashyFlash3011 FlashyFlash3011 commented Mar 21, 2026

Submission

Experiment: `records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/`


Strategy: Pure Velocity & TTT Preservation

Initial attempts tried to maximize model capacity (GatedAttention, ValueResidual, BigramHash=2048). Ablations showed these features add ~1.5ms/step overhead and destabilize TTT, costing more in training steps than they gain in quality under the 10min/16MB constraint. The winning strategy strips the model to its leanest form.

Results (8×H100 80GB SXM)

| Seed | step_avg | Steps | Pre-TTT BPB | Post-TTT BPB | TTT Gain | TTT Time | Artifact |
|------|----------|-------|-------------|--------------|----------|----------|----------|
| 1337 | 83.87ms | 7155 | 1.12163921 | 1.11901233 | -0.00262688 | 421.9s | 15.851MB |
| 42 | 83.86ms | 7156 | 1.12228806 | 1.11960558 | -0.00268248 | 423.2s | 15.858MB |
| 2025 | 83.89ms | 7154 | 1.12197720 | 1.11920302 | -0.00277418 | 423.4s | 15.888MB |
| Mean | 83.87ms | 7155 | 1.12196816 | 1.11927364 | -0.00269451 | 422.8s | 15.866MB |

Key Changes

| Change | Why |
|--------|-----|
| `GATED_ATTENTION=0`, `VALUE_RESIDUAL=0` | +1.5ms/step overhead → 130+ lost training steps in 600s |
| `SWA_ENABLED=0` | Was copying hundreds of MB GPU→CPU every 50 steps; EMA is used at the end, not SWA |
| `BANK_QAT_THRESHOLD=0` | Was snapping FP32 TTT weights back to Int6 mid-evaluation, causing catastrophic forgetting |
| `LATE_QAT_THRESHOLD=0.15` | QAT only in final 15% of warmdown, so no overhead during main training |
| `TRAIN_SEQ_LEN=2048` | Allows full warmdown (7155 steps vs ~5776 at 4096 ctx) |
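The two QAT thresholds above amount to a simple gate on training progress. A minimal sketch (the helper name is illustrative, not the repo's actual flag handling):

```python
def qat_active(step: int, total_steps: int, late_qat_threshold: float = 0.15) -> bool:
    """Return True once training enters the final `late_qat_threshold`
    fraction of steps, mirroring LATE_QAT_THRESHOLD=0.15: fake-quant
    overhead is paid only during the last 15% of the run."""
    return step >= (1.0 - late_qat_threshold) * total_steps
```

With `total_steps=7155` this first enables QAT at step 6082, so the preceding ~6081 steps run at full speed.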

Features Explored but Disabled

These were implemented and tested but hurt under the 10min/16MB constraint. They remain in the codebase and are expected to help significantly with more budget:

| Feature | Why disabled | Why it helps with more budget |
|---------|--------------|-------------------------------|
| GatedAttention, ValueResidual | +1.5ms/step → 130+ lost steps | Legitimate architectural gains with 30min+ training |
| BigramHash=2048 | Pushed artifact over 16MB | Better subword context modeling |
| QAT from step 1 | Overhead throughout training | Full-run quant adaptation reduces post-quant degradation |
| BANK_QAT_THRESHOLD > 0 | Corrupts TTT weights | Enables aggressive compression of larger models |

Headroom & Scaling Evidence

Submission sits at 15.851–15.888MB across seeds (mean 15.866MB) — ~134KB under the 16MB limit. Attempts to fill headroom (BigramHash=1664, 2048) produced worse BPB and exceeded the size limit. In an uncapped scenario, all disabled levers can be opened simultaneously for significantly better BPB.
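Headroom numbers like these can be sanity-checked directly with Python's stdlib `lzma` at preset 9 (the "lzma-9" in the title). This is a generic sketch, not the repo's packaging code; the payload is a stand-in:

```python
import lzma

MB = 1e6  # artifact size is counted in decimal megabytes

def lzma9_size_mb(raw: bytes) -> float:
    """Compress a serialized payload at lzma preset 9 and return the
    artifact size in decimal MB, as it counts against the 16 MB cap."""
    return len(lzma.compress(raw, preset=9)) / MB

payload = bytes(range(256)) * 4000   # stand-in for serialized weights
headroom = 16.0 - lzma9_size_mb(payload)
print(f"headroom: {headroom:.3f} MB")
```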

Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64x4096 = 256x1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
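The stride-256 sliding-window evaluation in submission 1 can be sketched as a window generator (hypothetical function, illustrative names):

```python
def sliding_eval_windows(n_tokens: int, seq_len: int = 4096, stride: int = 256):
    """Yield (start, end, n_scored) windows for sliding-window evaluation:
    after the first window, each step advances by `stride` and scores only
    its last `stride` tokens, so every scored position sees up to
    seq_len - stride = 3840 tokens of preceding context."""
    start = 0
    while start + seq_len <= n_tokens:
        yield start, start + seq_len, (seq_len if start == 0 else stride)
        start += stride
```

Every token is scored exactly once, so total eval cost stays at `n_tokens` scored positions regardless of stride.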
- warmdown_iters: 1600 -> 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 -> 41832 (proper NTK formula: 10000 x 4^(64/62)
  instead of naive 4x multiplication)
…penai#549)
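The NTK-aware base adjustment in the bullet above follows base * scale^(d/(d-2)). A quick check (head_dim=64 is an assumption inferred from the 64/62 exponent):

```python
def ntk_rope_base(base: float = 10000.0, scale: float = 4.0, head_dim: int = 64) -> float:
    """NTK-aware RoPE base: base * scale ** (d / (d - 2)).
    With base=10000, scale=4, d=64 this is 10000 * 4**(64/62),
    close to the 41832 quoted in the commit message (exact value
    depends on rounding conventions)."""
    return base * scale ** (head_dim / (head_dim - 2))
```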

- train_seq_len and eval_seq_len raised 2048 -> 4096
- All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate,
  BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA,
  GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT
- Dynamic NTK auto-scales rope_base to ~48550 for 4096 context
- SDPA fallback added for flash_attn_3 unavailability (local testing)
- rocm-smi fallback for nvidia-smi on ROCm hardware
- Update QAT Int4 expected BPB estimate to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0)
  but export was int4 — fixed to /7.0 clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix + adds int4 pack/unpack/quant
  functions and switches export from int6 to int4
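A minimal version of the int4 nibble packing mentioned above (two signed int4 values per byte) might look like this; function names are illustrative, not the repo's:

```python
def pack_int4(vals):
    """Pack signed int4 values (range [-8, 7]) two per byte, low nibble
    first, zero-padding to an even count."""
    assert all(-8 <= v <= 7 for v in vals)
    vals = list(vals)
    if len(vals) % 2:
        vals.append(0)  # pad so every value has a nibble partner
    out = bytearray()
    for lo, hi in zip(vals[0::2], vals[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, n):
    """Inverse of pack_int4: recover n signed int4 values."""
    vals = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals[:n]
```

Packing halves storage relative to int8, which is what lets 16 layers fit in the 16MB budget.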

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with
  QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus
  _fake_quant_int4_bank() applied to all bank weight slices in the forward
  pass — first time the ~95% of params in qo/kv/mlp banks are QAT-prepared
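The per-element math behind a bank fake-quant like `_fake_quant_int4_bank` can be sketched scalar-wise. In training this forward pass would be wrapped in a straight-through estimator so gradients bypass the rounding; the scale handling here is an assumption:

```python
def fake_quant_int4(w: float, scale: float) -> float:
    """Symmetric int4 fake-quantization: map a weight onto the int4 grid
    and back, matching the /7.0 clamp(-8,7) convention from the fix above.
    `scale` is assumed to be the per-bank max-abs weight."""
    q = round(w / scale * 7.0)   # quantize onto the int4 grid
    q = max(-8, min(7, q))       # clamp to the representable range
    return q / 7.0 * scale       # dequantize back to float
```

The earlier bug divided by 31.0 (a 6-bit grid), so weights trained under QAT never saw the coarser grid the int4 export actually used.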

Also: add zstandard to requirements.txt; add missing README/submission.json
@FlashyFlash3011 FlashyFlash3011 changed the title from "LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers" to "LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers" Mar 25, 2026
FlashyFlash3011 and others added 18 commits March 25, 2026 18:29
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6) not MiB (1024^2) matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
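The decimal-MB fix in the last commit matters: at a 16 MB target, the MiB reading over-budgets by roughly 0.78 MB. A sketch of the corrected calculation (names are illustrative):

```python
def target_bytes(target_mb: float = 16.0, decimal: bool = True) -> int:
    """Competition rules count decimal megabytes (1e6 bytes); using
    MiB (1024**2 bytes) would overshoot the real budget."""
    return int(target_mb * (1e6 if decimal else 1024 ** 2))
```

`target_bytes(16.0)` gives 16,000,000 bytes, while the MiB interpretation gives 16,777,216, a 777,216-byte overshoot that could push an artifact past the limit.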
@FlashyFlash3011 FlashyFlash3011 deleted the flashyflash3011/long-context-4096-qat-int4-16l branch March 27, 2026 13:06
@FlashyFlash3011 FlashyFlash3011 changed the title from "LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers" to "GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)" Mar 27, 2026
@FlashyFlash3011 FlashyFlash3011 marked this pull request as ready for review March 30, 2026 14:17
@FlashyFlash3011 FlashyFlash3011 marked this pull request as draft March 30, 2026 14:18
@FlashyFlash3011 FlashyFlash3011 marked this pull request as ready for review March 30, 2026 14:28
@FlashyFlash3011 FlashyFlash3011 changed the title from "GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)" to "Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)" Mar 30, 2026