Full GPTQ + XSA-all + SWA/EMA + Score-First TTT (compliant BPB=1.1175, 1-seed)#639
Robby955 wants to merge 1 commit into openai:main
Conversation
GPTQ/TTT interaction study with three key findings:

1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
2. AdamW TTT catastrophically destroys GPTQ-calibrated weights (+0.076 BPB)
3. SGD TTT preserves GPTQ quality; Born-rule SNR² provides conservative scaling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compliance Update

After reviewing the organizer ruling on PR #606 (GPTQ calibration must fit within the training time budget), I've re-run with compliant timing: training ends at 560s and GPTQ calibration runs from 560-600s, within the 600s training budget.

Compliant Results (560s training + 40s GPTQ calibration = 600s total)
TTT config (matching PR #606's recipe): AdamW lr=1e-4, 3 epochs, zero weight decay, freeze first 9/11 blocks, 128K token chunks. The original 1.1158 score came from a 600s training run where GPTQ calibration ran outside the training budget, the same issue that affected PR #606. The compliant score with the same architecture is 1.1175. 3-seed validation of the compliant config is in progress; I will update the PR title and body once it completes.
This submission doesn't have a compliant /records submission (no submission.json or train logs), and it uses GPTQ calibration on training data at eval time, which is disallowed. Closing for now.
Results (Compliant)
Best compliant score: 1.1175 BPB (single seed 1337, 3-seed validation in progress)
Compliance
Training time budget (600s total on 8×H100 SXM):
Test-time training (TTT):
Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal/streaming TTT pattern confirmed legal by the organizers (Issues #402 and #677). Full-epoch TTT (training on all validation data before scoring) is NOT used.
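A minimal sketch of the score-first loop described above, with a toy stand-in model. The names `score_first_ttt`, `score_fn`, and `adapt_fn` are illustrative, not from the submission; the point is only the ordering guarantee: every chunk is scored with the pre-adaptation weights, and adaptation happens strictly afterward.

```python
def score_first_ttt(model, chunks, score_fn, adapt_fn):
    """Score each chunk BEFORE adapting on it; no token is ever re-scored."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(model, chunk))  # scored with current (pre-adaptation) weights
        adapt_fn(model, chunk)                 # adapt only after the score is recorded
    return losses

# Toy demonstration: the "model" is a mutable offset; adaptation lowers it.
model = {"offset": 3.0}
score = lambda m, c: m["offset"] + c
adapt = lambda m, c: m.__setitem__("offset", m["offset"] - 1.0)
losses = score_first_ttt(model, [0.0, 0.0, 0.0], score, adapt)
# The first chunk is scored before any adaptation: losses == [3.0, 2.0, 1.0]
```

Because scoring always precedes adaptation, the first chunk's loss reflects the untouched weights, matching the causal/streaming pattern.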
Artifact: 15.87 MB (code ~94 KB, compressed weights ~15.78 MB), under the 16,000,000-byte limit.
Key Contributions
1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
Cholesky-based GPTQ with act-order column permutation and block-wise error compensation (block_size=128). Calibrated on 256 training batches with 1% diagonal damping.
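The Hessian-side setup can be sketched as follows, assuming the standard GPTQ formulation H = 2XXᵀ over calibration activations. The 1% diagonal damping and act-order (processing columns by decreasing Hessian diagonal) follow the description above; the function name and array shapes are illustrative.

```python
import numpy as np

def damped_cholesky_hessian(X, percdamp=0.01):
    """Build H = 2 X X^T from calibration activations, add 1% mean-diagonal
    damping (percdamp=0.01, as described), then Cholesky-factorize."""
    H = 2.0 * (X @ X.T)
    damp = percdamp * float(np.mean(np.diag(H)))
    H_damped = H + damp * np.eye(H.shape[0])
    return np.linalg.cholesky(H_damped), H_damped

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))        # (weight columns, calibration samples)
L, H = damped_cholesky_hessian(X)

# act-order: quantize columns in order of decreasing Hessian diagonal
order = np.argsort(-np.diag(H))
```

The damping keeps the Cholesky factorization numerically stable when calibration activations make H near-singular; the block-wise error compensation (block_size=128) of full GPTQ is omitted here.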
2. XSA on all 11 layers
Cross-Sequence Attention on all 11 transformer layers (vs. the last 4 in the baseline). Provides extended context beyond the training sequence length at eval time. Worth about -0.0013 BPB.
3. SWA/EMA weight blending
Stochastic Weight Averaging over the final warmdown phase (16 snapshots every 50 steps), blended 50/50 with EMA (decay=0.997). Smooths weight landscape before quantization.
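A toy sketch of the blending arithmetic above, with dictionaries standing in for weight tensors. Function names are hypothetical, and the demonstration uses a small decay for readability; the PR itself uses decay=0.997 and a 50/50 SWA/EMA blend.

```python
def swa_average(snapshots):
    """Equal-weight average of weight snapshots (SWA)."""
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}

def ema_update(ema, weights, decay=0.997):
    """One exponential-moving-average step over the weights."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in ema}

def blend(a, b, alpha=0.5):
    """alpha=0.5 gives the 50/50 SWA/EMA blend described above."""
    return {k: alpha * a[k] + (1.0 - alpha) * b[k] for k in a}

snapshots = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
swa = swa_average(snapshots)             # {"w": 2.0}
ema = {"w": 0.0}
for s in snapshots:
    ema = ema_update(ema, s, decay=0.5)  # toy decay for demonstration only
final = blend(swa, ema)
```

In the actual recipe the snapshots come from the final warmdown phase (16 snapshots, one every 50 steps), and the blended weights are what gets quantized.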
4. Score-first TTT (legal)
Sequential online adaptation: score chunk, then train on it with AdamW (lr=1e-4, 3 epochs, freeze first 9/11 blocks, 128K token chunks). Improves sliding BPB by -0.0007.
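The freezing and chunking described above can be sketched as follows. The helper names are hypothetical and the optimizer step itself is omitted; the constants (11 blocks, first 9 frozen, 128K-token chunks) match the config stated above.

```python
def trainable_mask(num_blocks=11, num_frozen=9):
    """True = block adapts during TTT; the first num_frozen blocks stay frozen."""
    return [i >= num_frozen for i in range(num_blocks)]

def chunk_tokens(tokens, chunk_size=128_000):
    """Split the validation token stream into fixed-size TTT chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

mask = trainable_mask()                       # only blocks 9 and 10 adapt
chunks = chunk_tokens(list(range(300_000)))   # 128K + 128K + 44K tokens
```

Freezing all but the last two blocks limits how far TTT can drift the quantized weights, which matters given the finding above that aggressive adaptation can destroy GPTQ-calibrated weights.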
Architecture & Training
Credits
🤖 Generated with Claude Code