Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation#1341
Open
himanshudongre wants to merge 1 commit intoopenai:mainfrom
Conversation
Evidence from 4 independent configurations (PR openai#461, PR openai#601, PR openai#1326, and my own experiments) showing GPTQ's compensatory weight structure is destroyed by SGD-based test-time training. Key finding: SGD TTT gives -0.0165 BPB on simple int6 but provides negligible to negative improvement on GPTQ-quantized models (-0.0001 to +0.030 BPB). Includes complete SGD TTT implementation (sgd_ttt_eval.py) following PR openai#461 protocol, and LoRA TTT implementation (clark_ttt_eval.py). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Test-time training (TTT) provides substantial BPB improvement on simple quantization but is fundamentally ineffective on GPTQ-quantized models. This work aggregates evidence from 4 independent configurations across 3 research groups showing that GPTQ's compensatory weight structure is destroyed by gradient-based adaptation, making TTT and GPTQ mutually exclusive optimization strategies.
This finding has immediate implications for the competition: teams using GPTQ (the dominant compression method) cannot benefit from TTT at eval time.
Evidence
The pattern is stark: SGD TTT improves BPB by -0.0165 on simple int6 quantization (PR #461) but provides zero benefit on GPTQ-quantized weights. When applied aggressively to GPTQ models, TTT actively degrades performance by +0.030 BPB (PR #601).
My LoRA TTT experiment used rank-8 adapters on Q and V projections of a GPTQ-quantized Clark-architecture model (11L, 512d, sp4096). Even this conservative approach — updating only ~2% of parameters — yielded negligible improvement (-0.0013 BPB).
PR #1326 (aryanbhosale) independently confirmed this: applying score-first TTT to the strongest current architecture (depth recurrence + parallel residuals + GPTQ int6) produced -0.0001 BPB improvement — statistically indistinguishable from zero.
Root Cause: GPTQ's Compensatory Weight Structure
GPTQ (Frantar et al., 2023) solves a per-layer Hessian-weighted least-squares problem:
Each quantized weight compensates for errors in previously quantized weights. The resulting weight matrix is not independently quantized — it's a globally optimized system where individual weights encode error-correction information for their neighbors.
SGD updates individual weights based on local gradients, ignoring the compensatory structure. After even one SGD step:
This is why TTT on GPTQ is not merely unhelpful — it can be actively harmful (+0.030 BPB in PR #601).
Implication: Compression vs Adaptation Tradeoff
The competition has two parallel optimization strategies that cannot be combined:
Compression path (GPTQ):
Adaptation path (TTT):
Teams must choose one. The current leaderboard shows GPTQ winning — but this may change if someone finds a way to bridge the gap.
Proposed Fix Directions
Quantization-aware TTT: Maintain full-precision master weights alongside GPTQ weights. Run TTT on masters, re-quantize per chunk. Preserves GPTQ structure while allowing adaptation. Cost: 2× memory + re-quantization overhead.
Structured TTT: Constrain SGD updates to respect GPTQ block boundaries. Only update weights in ways that maintain the compensatory structure. Requires understanding GPTQ's column ordering.
Higher-rank LoRA: My rank-8 LoRA gave -0.0013. Higher ranks (32, 64) may provide enough adaptation capacity without disturbing GPTQ weights. But higher rank = more parameters = potential artifact overhead.
Simple int6 + larger model: Skip GPTQ entirely. Use simple int6 with a model small enough to fit 16MB. TTT then provides -0.0165 BPB. The question: does the GPTQ compression advantage (larger model) outweigh the TTT adaptation advantage (better eval)?
None of these have been attempted in the competition.
SGD TTT Implementation
I implemented the full PR #461 TTT protocol: SGD with momentum=0.9, lr=0.002, cosine decay across 32K-token chunks, 3 epochs per chunk, freeze first 2 blocks, grad clip 1.0. Code:
sgd_ttt_eval.pyWhen applied to a GPTQ-quantized Clark 11L model (val_bpb ~1.10 pre-TTT), the result was -0.0013 BPB — consistent with PR #1326's finding of -0.0001 on a similar architecture.
Reproduction
# Run SGD TTT on a GPTQ-quantized model: python3 sgd_ttt_eval.py \ --model-path final_model.int6.ptz \ --data-dir ./data/ \ --ttt-lr 0.002 --ttt-epochs 3 \ --ttt-chunk-size 32768 --ttt-freeze-blocks 2Attribution
Analysis aggregates findings from PR #461 (Christopher-Lee-McClendon), PR #601 (community), PR #1326 (aryanbhosale), and my own experiments. GPTQ analysis based on Frantar et al. (2023). All experiments self-funded.