Full GPTQ + XSA-all + SWA/EMA + Score-First TTT (compliant BPB=1.1175, 1-seed)#639
Robby955 wants to merge 1 commit into openai:main
Conversation
GPTQ/TTT interaction study with three key findings:

1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
2. AdamW TTT catastrophically destroys GPTQ-calibrated weights (+0.076 BPB)
3. SGD TTT preserves GPTQ quality; Born-rule SNR² provides conservative scaling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compliance Update

After reviewing the organizer ruling on PR #606 (GPTQ calibration must fit within the training time budget), I've re-run with compliant timing: training ends at 560s and GPTQ calibration runs from 560-600s, within the 600s training budget.

Compliant Results (560s training + 40s GPTQ calibration = 600s total)
TTT config (matching PR #606's recipe): AdamW lr=1e-4, 3 epochs, zero weight decay, freeze first 9/11 blocks, 128K token chunks. The original 1.1158 score came from a 600s training run where GPTQ calibration ran outside the training budget, the same issue that affected PR #606. The compliant score with the same architecture is 1.1175. 3-seed validation of the compliant config is in progress; I will update the PR title and body once it completes.
This submission doesn't have a compliant /records submission (no submission.json or train logs), and it uses GPTQ calibration on training data at eval time, which is disallowed. Closing for now.
Results (Compliant)
Best compliant score: 1.1175 BPB (single seed 1337, 3-seed validation in progress)
Compliance
Training time budget (600s total on 8×H100 SXM):
Test-time training (TTT):
Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal/streaming TTT pattern confirmed legal by the organizers (Issues #402 and #677). Full-epoch TTT (training on all validation data before scoring) is NOT used.
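A minimal sketch of the score-first loop described above, with a toy stand-in model. The names `score_first_ttt`, `score_fn`, and `adapt_fn` are illustrative, not from the submission; the point is only the ordering guarantee: every chunk is scored with the pre-adaptation weights, and adaptation happens strictly afterward.

```python
def score_first_ttt(model, chunks, score_fn, adapt_fn):
    """Score each chunk BEFORE adapting on it; no token is ever re-scored."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(model, chunk))  # scored with current (pre-adaptation) weights
        adapt_fn(model, chunk)                 # adapt only after the score is recorded
    return losses

# Toy demonstration: the "model" is a mutable offset; adaptation lowers it.
model = {"offset": 3.0}
score = lambda m, c: m["offset"] + c
adapt = lambda m, c: m.__setitem__("offset", m["offset"] - 1.0)
losses = score_first_ttt(model, [0.0, 0.0, 0.0], score, adapt)
# The first chunk is scored before any adaptation: losses == [3.0, 2.0, 1.0]
```

Because scoring always precedes adaptation, the first chunk's loss reflects the untouched weights, matching the causal/streaming pattern.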
Artifact: 15.87 MB (code ~94 KB, compressed weights ~15.78 MB), under the 16,000,000-byte limit.
Key Contributions
1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
Cholesky-based GPTQ with act-order column permutation and block-wise error compensation (block_size=128). Calibrated on 256 training batches with 1% diagonal damping.
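The Hessian-side setup can be sketched as follows, assuming the standard GPTQ formulation H = 2XXᵀ over calibration activations. The 1% diagonal damping and act-order (processing columns by decreasing Hessian diagonal) follow the description above; the function name and array shapes are illustrative.

```python
import numpy as np

def damped_cholesky_hessian(X, percdamp=0.01):
    """Build H = 2 X X^T from calibration activations, add 1% mean-diagonal
    damping (percdamp=0.01, as described), then Cholesky-factorize."""
    H = 2.0 * (X @ X.T)
    damp = percdamp * float(np.mean(np.diag(H)))
    H_damped = H + damp * np.eye(H.shape[0])
    return np.linalg.cholesky(H_damped), H_damped

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))        # (weight columns, calibration samples)
L, H = damped_cholesky_hessian(X)

# act-order: quantize columns in order of decreasing Hessian diagonal
order = np.argsort(-np.diag(H))
```

The damping keeps the Cholesky factorization numerically stable when calibration activations make H near-singular; the block-wise error compensation (block_size=128) of full GPTQ is omitted here.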
2. XSA on all 11 layers
Cross-Sequence Attention on all 11 transformer layers (vs. the last 4 in the baseline). Provides extended context beyond the training sequence length at eval time. Worth about -0.0013 BPB.
3. SWA/EMA weight blending
Stochastic Weight Averaging over the final warmdown phase (16 snapshots every 50 steps), blended 50/50 with EMA (decay=0.997). Smooths weight landscape before quantization.
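A toy sketch of the blending arithmetic above, with dictionaries standing in for weight tensors. Function names are hypothetical, and the demonstration uses a small decay for readability; the PR itself uses decay=0.997 and a 50/50 SWA/EMA blend.

```python
def swa_average(snapshots):
    """Equal-weight average of weight snapshots (SWA)."""
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}

def ema_update(ema, weights, decay=0.997):
    """One exponential-moving-average step over the weights."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in ema}

def blend(a, b, alpha=0.5):
    """alpha=0.5 gives the 50/50 SWA/EMA blend described above."""
    return {k: alpha * a[k] + (1.0 - alpha) * b[k] for k in a}

snapshots = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
swa = swa_average(snapshots)             # {"w": 2.0}
ema = {"w": 0.0}
for s in snapshots:
    ema = ema_update(ema, s, decay=0.5)  # toy decay for demonstration only
final = blend(swa, ema)
```

In the actual recipe the snapshots come from the final warmdown phase (16 snapshots, one every 50 steps), and the blended weights are what gets quantized.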
4. Score-first TTT (legal)
Sequential online adaptation: score chunk, then train on it with AdamW (lr=1e-4, 3 epochs, freeze first 9/11 blocks, 128K token chunks). Improves sliding BPB by -0.0007.
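The freezing and chunking described above can be sketched as follows. The helper names are hypothetical and the optimizer step itself is omitted; the constants (11 blocks, first 9 frozen, 128K-token chunks) match the config stated above.

```python
def trainable_mask(num_blocks=11, num_frozen=9):
    """True = block adapts during TTT; the first num_frozen blocks stay frozen."""
    return [i >= num_frozen for i in range(num_blocks)]

def chunk_tokens(tokens, chunk_size=128_000):
    """Split the validation token stream into fixed-size TTT chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

mask = trainable_mask()                       # only blocks 9 and 10 adapt
chunks = chunk_tokens(list(range(300_000)))   # 128K + 128K + 44K tokens
```

Freezing all but the last two blocks limits how far TTT can drift the quantized weights, which matters given the finding above that aggressive adaptation can destroy GPTQ-calibrated weights.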
Architecture & Training
Credits
🤖 Generated with Claude Code