Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean)#1172

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:submission/slot-splitlr-gptq-1.1015

Conversation

@dexhunter

Record: SLOT + Split-LR + Full GPTQ + XSA-all (val_bpb: 1.1015)

val_bpb: 1.1015 (3-seed mean, std 0.0011) | 1.8598 nats | ~15.65 MB | 8xH100 SXM, 600s train + 177s eval

Built on PR #1019 by @abaybektursun.
Previous: PR #549 (1.1194) -> PR #1019 (1.1147) -> this.

Results (8xH100 SXM)

| Seed | Steps | ms/step | Post-EMA BPB | Sliding+SLOT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|------------------|-----------------|------------------|
| 1337 | 6704  | 88.2    | 1.1309       | 1.10213          | 1.8609          | 15,647,124       |
| 42   | 6706  | 88.2    | 1.1289       | 1.10019          | 1.8576          | 15,658,061       |
| 2025 | 6684  | 88.4    | 1.1310       | 1.10216          | 1.8609          | 15,650,266       |
| Mean | 6698  | 88.3    | 1.1303       | 1.10149          | 1.8598          | 15,651,817       |

Improvement vs SOTA

| Metric                | Merged SOTA (PR #1019) | This submission | Delta    |
|-----------------------|------------------------|-----------------|----------|
| val_bpb (3-seed mean) | 1.1147                 | 1.1015          | -0.0132  |
| val_loss (nats)       | 1.88218                | 1.85982         | -0.02236 |

Clears the 0.005 nats threshold by 4.5x.

Changes vs Baseline (PR #1019)

1. SLOT: Sample-specific LM Optimization at Test-time

At eval time, for each sliding-window batch, we optimize a single additive delta vector (R^512) between the frozen hidden states and the logit projection. The model forward is split into forward_hidden() (frozen, no grad) and compute_logits() (carries grad for delta optimization).

  • Delta shape: [1, 1, 512] — broadcasts across batch and sequence
  • Optimizer: AdamW (lr=0.005, weight_decay=1e-8, eps=1e-5)
  • Steps: 8 per batch
  • Eval time overhead: ~90s (well within 600s eval budget)

SLOT is score-first: hidden states are computed under torch.no_grad(), the delta adapts through compute_logits() only, and final scoring uses the adapted logits. The model weights are never modified.

Reference: Hu et al., arXiv:2505.12392v2. Also used in PR #1128, PR #1105.
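The score-first loop can be sketched end to end. This is a toy NumPy illustration, not the submission's code: a bare linear head stands in for `compute_logits()`, frozen precomputed hidden states stand in for `forward_hidden()`, and plain gradient descent stands in for AdamW.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_score(hidden, W, targets, lr=0.005, steps=8):
    """Score-first SLOT sketch.
    hidden:  [T, d] frozen hidden states (computed once, no grad)
    W:       [V, d] logit projection
    targets: [T]    next-token ids
    A single delta vector is shared across all positions, adapted by
    descending the mean cross-entropy, then used for final scoring."""
    T, d = hidden.shape
    delta = np.zeros(d)                      # [d], broadcasts over positions
    for _ in range(steps):
        p = softmax((hidden + delta) @ W.T)  # [T, V]
        g = p.copy()
        g[np.arange(T), targets] -= 1.0      # dCE/dlogits
        delta -= lr * (g @ W).mean(axis=0)   # gradient w.r.t. the shared delta
    p = softmax((hidden + delta) @ W.T)      # score under the adapted logits
    return -np.log(p[np.arange(T), targets]).mean()
```

Because the loss is convex and smooth in the shared delta, a few small gradient steps are guaranteed to reduce the in-batch cross-entropy, which is exactly the property debated later in this thread.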

2. Sigmoid-Gated Skip Connections

U-Net skip connections use learned sigmoid gates instead of simple addition:

```python
g = sigmoid(skip_gates[i])
x = lerp(skip_weights[i] * skip, x, g)
```

Gate starts at sigmoid(0) = 0.5 (balanced blend). Adds 2,560 params (5 gates x 512 dims).
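A minimal self-contained sketch of the gate, assuming standard elementwise lerp semantics (`lerp(a, b, g) = a + g * (b - a)`); `gate_logit` and `skip_weight` play the roles of `skip_gates[i]` and `skip_weights[i]` above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, skip, gate_logit, skip_weight):
    """Sigmoid-gated skip connection: g=1 keeps the main path x,
    g=0 keeps the scaled skip; gate_logit=0 gives the balanced 50/50 blend."""
    g = sigmoid(gate_logit)
    a = skip_weight * skip
    return a + g * (x - a)  # lerp(a, x, g)
```

With per-dimension `gate_logit` vectors this costs one parameter per model dim per gated connection, matching the 5 × 512 = 2,560 figure above.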

3. Soft-Round QAT with Alpha Ramp

Late QAT uses differentiable sigmoid rounding instead of hard STE:

```python
soft_rounded = floor(scaled) + sigmoid(alpha * (frac - 0.5))
```

Alpha ramps from 1 (smooth) to 16 (near-hard) over 500 steps. Provides real gradients through rounding, letting weights adapt to quantization grid.
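The rounding and its ramp can be sketched as follows; the linear schedule shape is an assumption (the PR states only the endpoints and the 500-step window):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_round(scaled, alpha):
    """Differentiable rounding: near-identity smoothing at alpha=1,
    approaching hard round() as alpha grows."""
    frac = scaled - np.floor(scaled)
    return np.floor(scaled) + sigmoid(alpha * (frac - 0.5))

def alpha_at(step, ramp_steps=500, a0=1.0, a1=16.0):
    """Linear ramp from a0 to a1 over ramp_steps, then held (assumed shape)."""
    t = min(max(step / ramp_steps, 0.0), 1.0)
    return a0 + t * (a1 - a0)
```

At alpha=16 the output is within ~0.002 of hard rounding for typical fractional parts, while still passing real gradients through the rounding step.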

4. Split Early/Late Muon Learning Rate

Bank gradients are scaled per-layer before the Muon reduce-scatter:

  • Early layers (0-4): Muon LR = 0.025
  • Late layers (5-10): Muon LR = 0.030

Late layers benefit from higher LR (weaker gradient signal further from loss).
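A hypothetical sketch of how the split is applied. The actual hook point inside the Parallel Muon reduce-scatter is not shown, and folding the ratio into the gradient (rather than the optimizer step) is an illustrative simplification:

```python
EARLY_LR, LATE_LR = 0.025, 0.030

def muon_lr(layer_idx):
    """Per-layer Muon LR: early layers 0-4, late layers 5-10."""
    return EARLY_LR if layer_idx <= 4 else LATE_LR

def scale_bank_grad(grad, layer_idx, base_lr=EARLY_LR):
    """Scale a layer's bank gradient before the reduce-scatter so a single
    base LR can be used downstream (sketch of the split-LR idea only)."""
    return grad * (muon_lr(layer_idx) / base_lr)
```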

5. Warmdown = 4000 Steps

Extended warmdown from 3500 to 4000 estimated steps. Holds LR higher for longer, giving the model more time at productive learning rates.

6. BigramHash(2816x160)

Enlarged bigram embedding dimension from 112 to 160. Same 2816 buckets. Richer per-bucket representation at minimal artifact cost.
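The lookup can be sketched as below. The multiplicative mixing constant and padding convention are illustrative assumptions, not the submission's exact hash:

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM = 2816, 160

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS, mult=1000003):
    """Hash a (previous, current) token pair into one of n_buckets
    (hypothetical mixing scheme)."""
    return (prev_tok * mult + tok) % n_buckets

def bigram_features(tokens, table):
    """One embedding row per position, keyed by the preceding bigram;
    position 0 is padded with token id 0."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))
    return table[bigram_bucket(prev, tokens)]  # [T, BIGRAM_DIM]
```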

7. Code Minification

pyminify + LZMA2 + base85 self-extracting wrapper reduces code from 101KB to 23KB, freeing ~78KB of artifact budget for model weights.
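A minimal sketch of the self-extracting wrapper stage (Python's `lzma` module produces LZMA2-compressed `.xz` streams by default); the `pyminify` pass itself is not invoked here:

```python
import base64
import lzma

def make_self_extracting(source: str) -> str:
    """Wrap (already-minified) source in an LZMA + base85 stub that
    decompresses and executes itself at import time."""
    blob = base64.b85encode(lzma.compress(source.encode())).decode()
    return ("import base64,lzma\n"
            f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())")
```

Note this only pays off when the source is large enough for the LZMA savings to exceed the ~100-byte stub plus the 5/4 base85 expansion of the compressed payload.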

8. Brotli-11 Compression with Byte-Shuffle

Replaces LZMA-6 with Brotli quality=11 + stride-2 byte-shuffle preprocessing. Saves ~400KB vs LZMA.
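The byte-shuffle transform can be sketched as below. `brotli` is a third-party package, so stdlib `lzma` stands in as the codec here; the transform itself is codec-independent:

```python
import lzma  # stand-in codec; the submission uses Brotli quality=11

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group every stride-th byte together so that, e.g., the high and low
    bytes of 16-bit values form two homogeneous, more compressible runs."""
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Exact inverse of byte_shuffle for any length and stride."""
    n, rem = divmod(len(data), stride)
    sizes = [n + (1 if i < rem else 0) for i in range(stride)]
    parts, off = [], 0
    for s in sizes:
        parts.append(data[off:off + s])
        off += s
    out = bytearray(len(data))
    for i, part in enumerate(parts):
        out[i::stride] = part
    return bytes(out)
```

Usage: `codec.compress(byte_shuffle(weights_bytes))` on the way out, `byte_unshuffle(codec.decompress(...))` on the way in.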

9. GPTQ Reserve 9s (was 14s)

Reduced GPTQ calibration time reservation from 14s to 9s, gaining ~55 extra training steps.

Negative Results (tested, did not help)

| Technique | Result | Notes |
|-----------|--------|-------|
| Turbo-Muon (AOL + Polar Express) | +2MB artifact bloat | Weight distribution changes break compression |
| No-GPTQ (PR #1120 style) | -0.005 BPB worse | GPTQ essential for our stack |
| Pure EngramLite swap | -0.003 worse | Same-budget multi-head too diluted |
| ResidLambdas | -0.003 worse | Quant error compounds through lambda scaling |
| LeakyReLU slope=0.3 | Neutral | |
| Partial key offset | Neutral | |
| BIGRAM_DIM=192 | -0.001 worse | Diminishing returns past 160 |
| TTT (score-first SGD) | Neutral on Full GPTQ stack | Post-quant weights too well-optimized |
| Mixed int5/int6 GPTQ | Broken or worse | Needs full PR #1089-style pipeline |

Architecture Summary

| Component | Setting | Source |
|-----------|---------|--------|
| Layers | 11 | PR #549 |
| Model dim | 512 | PR #549 |
| Heads / KV heads | 8 / 4 (GQA) | PR #549 |
| MLP mult | 3.0x (LeakyReLU(0.5)^2) | PR #549 |
| XSA | All 11 layers | PR #1019 |
| BigramHash | 2816 x 160 | This submission (dim=160) |
| ValueEmbedding | dim=128, layers 9,10 | PR #549 |
| SmearGate | F.pad causal shift | PR #549, optimized |
| Skip connections | Sigmoid-gated lerp | This submission |
| Quantization | Full Hessian GPTQ int6 | PR #1019 |
| Compression | Brotli-11 + byte-shuffle | This submission |
| Optimizer | Parallel Muon + Split-LR | This submission (split-LR) |
| QAT | Soft-round alpha ramp 1->16 | This submission |
| Eval | Sliding window stride=64 + SLOT | This submission (SLOT) |
| Code | LZMA2 self-extracting wrapper | This submission |
| Warmdown | 4000 steps | This submission |
| Params | 27.2M | |

Setup & Reproduction

```bash
# Environment: 8xH100 SXM, PyTorch 2.9.1+cu128, flash-attn 2.8.3
export NCCL_NET=Socket  # Required on GCP H100
export SLOT_ENABLED=1
export BIGRAM_DIM=160
export WARMDOWN_ITERS=4000
export SLOT_LR=0.005
export SLOT_STEPS=8

# Run with torchrun (evaluate.py handles this)
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=2025 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Acknowledgements

Thanks to @0hq and @valerio-oai for organizing and maintaining an excellent competition.

This submission builds directly on @abaybektursun's PR #549 and PR #1019, which established the LeakyReLU^2 + Parallel Muon + XSA + Full GPTQ stack. The SLOT technique follows Hu et al. (arXiv:2505.12392v2) and was independently validated by @AnubhavBharadwaaj (PR #1128) and @abaybektursun (PR #1105). The sigmoid-gated skip connection idea draws from @mikeapedia's PR #1089. Code minification approach adapted from PR #1089's shrink pipeline.


SLOT eval-time delta optimization + split early/late Muon LR +
Full Hessian GPTQ int6 + sigmoid-gated skip connections +
soft-round QAT + Brotli-11 + BigramHash(2816x160) + code minification.

3-seed mean: 1.1015 (std 0.0011), delta -0.0132 BPB / -0.0224 nats vs PR openai#1019.
@dexhunter (Author)

Reopening — the earlier closure was based on our interpretation that SLOT might violate Condition 3 from Issue #1017. After re-reading the official rules, we noted that:

  1. The README states: "you're free to evaluate however" and "we encourage competitors to push the bounds of evaluation methods as aggressively as with training methods"
  2. Issue #1017 ("A Field Guide to Valid Submissions")'s four conditions are a community proposal (by @NoesisGenesis), not an official ruling
  3. No organizer has commented on SLOT in any PR (#1105, #1128, #1084, #1150)
  4. The organizer closures in Issue #677 targeted full multi-epoch TTT (adapt on entire val set, then rescore), which is a different pattern from SLOT's per-batch delta optimization

We'd like to leave this open for organizer review and let @0hq / @valerio-oai decide whether SLOT falls within the accepted evaluation methods.

@NoesisGenesis

I think it is fair to wait for an organizer decision before treating the question as settled.

That said, I think the case against SLOT is straightforward enough that you might end up making it yourself.

A valid bits-per-byte score is a compression rate. Compression means: given only what came before, how well can you predict what comes next. SLOT optimizes a delta on the target tokens in a batch, then scores those same tokens under the optimized delta.

The formal conditions exist so that agents can check code against them. Between humans, an analogy will do.

You are a professor, and your student is about to sit an exam with ten questions. Which of the following would you accept as a valid score?

(A) The student answers question 1. His answer is graded. He learns from the graded answer, then moves to question 2. By question 10, he has improved, but every answer was committed under actual uncertainty.

(B) The student gets to study the answer key for a few minutes before sitting the exam. He then answers the same questions whose answers he just studied.

No professor would accept (B) as a valid exam score. (A) is score-first TTT, which this submission implements correctly. (B) is SLOT.

SLOT contributes ~0.029 BPB, the majority of the gain, from a single 512-dimensional vector optimized for eight steps. That is a remarkable amount of predictive ability to discover in eight gradient steps! This would be the single greatest finding in the submission, if it were not the single clearest instance of (B).

@AnubhavBharadwaaj

Thanks @NoesisGenesis for raising this — it's a question worth getting right. I'm the author of the first SLOT submission (PR #1084/#1128) so I want to offer a technical counterpoint.
**SLOT is temperature scaling, not answer-key studying.**

The exam analogy assumes SLOT "studies the answers." But consider what δ actually is: a single constant vector added to every position equally. It cannot encode position-specific information. It cannot "know" that token 473 is "the" — it shifts the entire logit distribution uniformly across all positions. This is functionally identical to optimizing a temperature scalar or a bias vector on the output layer, which is standard calibration in ML.

A more accurate analogy: the student adjusts the brightness on his reading lamp before the exam. He tries a few settings, picks the one where he reads most clearly, then takes the exam under that lighting. The lamp doesn't know the answers — it just makes the student's existing knowledge more legible.

**On the causality concern (position t seeing t+1):**

The CE loss at every position t is computed from logits[t], which depends only on tokens 1..t (causal attention). The delta optimization minimizes the sum of these per-position losses, but each individual loss term is strictly causal. This is identical to how temperature scaling, Platt scaling, or any post-hoc calibration works — you find the single parameter that minimizes total loss, then apply it. The parameter itself is not position-dependent and carries no token-specific information.

If optimizing a shared scalar/vector over a batch and then scoring that batch is illegal, then temperature scaling is also illegal, and so is any form of adaptive evaluation (including the entropy-adaptive alpha used in every n-gram submission).

**On the "8 steps, 0.029 BPB" surprise:**

This is not surprising when you understand what SLOT does. The lm_head projection is a 512→1024 linear map trained jointly with all 27M parameters. A small additive correction to its input is effectively recalibrating the output distribution to the local data statistics. The gain comes from fixing a distribution mismatch between training and eval data, not from memorizing targets.

**What I'd ask the organizers:**

@0hq @valerio-oai — could you clarify whether per-batch calibration of a constant (non-position-dependent) parameter using the autoregressive loss on the scored batch falls within accepted evaluation methods? This affects PRs #1084, #1105, #1128, #1150, and #1172. The community would benefit from a clear ruling either way.

@NoesisGenesis

@AnubhavBharadwaaj, appreciate the detailed counterpoint. You are right that the question is worth getting right, so let me try to get it right.

I will set aside the exam analogy and the formal conditions, and argue directly from what SLOT computes.

First, the part that is fine. For a fixed δ, there is no mystery. The family p_δ is an ordinary causal predictor. If δ were fixed in advance, or learned on separate data, or updated only from already-scored past tokens and used only on later ones, I would not object on these grounds.

SLOT does something else. On a batch B, it first chooses:

δ̂_B = argmin_δ ∑_{t ∈ B} -log p_δ(x_t | x_{<t})

using the very targets in B, and only then reports the loss of p_{δ̂_B} on that same batch.

Once written that way, the problem is visible. The distribution used to score position t is no longer determined by the submitted artifact and the strict prefix alone. It is determined by a fitted parameter δ̂_B that depends on the realized targets in the batch used to choose it, including x_t itself and, for earlier positions, later tokens as well. That is already enough to break the ordinary prequential interpretation. A shared parameter does not need to encode the answer key position by position in order to be an information channel. A single batch-conditioned bit would already be enough.

This is also why I do not think “each individual loss term is causal” answers the objection. Each L_t(δ) is causal for fixed δ. But the scored object in SLOT is not p_δ for a fixed ex ante δ. It is p_{δ̂_B}, where δ̂_B is obtained by minimizing the sum of those terms on the scored batch itself. The coupling enters precisely through that optimization:

∇_δ ∑_{t ∈ B} L_t(δ) = ∑_{t ∈ B} ∇_δ L_t(δ)

Later tokens contribute gradient to the same shared parameter that is then used to score earlier positions. Causal attention prevents the hidden state at position t from reading future tokens directly. It does not prevent a future-conditioned fitted parameter from being injected back into every position after the fact.

I think the cleanest way to see the compression problem is from the decoder’s side. A valid val_bpb is a codelength. So imagine actually trying to decode with SLOT. To decode the first token in the batch, the decoder would need the same δ̂_B the encoder used. But δ̂_B was computed from the whole batch’s targets. The decoder cannot reconstruct it from the strict prefix, because the later tokens that determined it have not yet been decoded. So either one transmits δ̂_B as side information and pays for it, or one accepts that the reported quantity is not the length of a single left-to-right code under the submitted artifact alone. SLOT does neither.

That is why I do not think the temperature-scaling analogy helps. Standard calibration learns a parameter on held-out data and then applies it to test data. If one optimizes even a single scalar temperature on the test batch and then scores that same batch, that is already test-set fitting. So even if SLOT were literally temperature scaling, it would still have the same validity problem. Capacity affects how much fitting is possible. It does not determine whether fitting occurred.

And SLOT is not literally temperature scaling in any case. A temperature scalar has one degree of freedom and preserves logit order. SLOT optimizes a 512-dimensional hidden-space vector. After the output projection and the softcap nonlinearity, the induced change in logits is not a global temperature and not even a position-invariant bias in logit space. The same δ is broadcast across positions, but because it is inserted upstream of a nonlinearity into different hidden states, its effect on the token distribution is position-dependent.

The comparison to entropy-adaptive n-gram mixing seems off for the same reason. A mixing weight computed from the model’s current predictive state, or updated from already-scored past tokens and used only on later ones, is an ordinary causal mechanism. The analogous invalid procedure would be to fit that weight on the current batch’s targets and then rescore those same targets.

None of this requires SLOT to "memorize the answer key" in any literal sense. The structural point is simpler. SLOT evaluates a family {p_δ} by letting the realized batch choose δ̂_B, then reports the in-sample loss of p_{δ̂_B} as though that model choice were free. In compression terms, that is model selection on the message without paying the model-selection cost. The moment that happens, the number ceases to mean the ordinary one-pass val_bpb, and I do not see what else it would be good for.

If organizers want to permit transductive per-batch fitting, they can of course do so. But I would find it unfortunate if the frontier of this competition came down to who can fit the evaluation tokens most cheaply. I prefer a strict line precisely because it keeps the remarkable combined effort of humans and AI in this competition pointed at a question worth answering.

@AnubhavBharadwaaj

AnubhavBharadwaaj commented Mar 31, 2026

@NoesisGenesis — your argument is mathematically precise and I want to engage with it at the same level. I think there's a structural feature of the evaluation protocol that changes the analysis.

**1. The sliding window makes SLOT 96.9% causal.**

The competition uses sliding window evaluation with stride=64 over seq_len=2048. For every window after the first:

  • 1984 tokens are context (already scored in previous windows)
  • 64 tokens are new (the only ones contributing to BPB)

SLOT optimizes δ over all 2048 tokens in the window. But the gradient is:
∇_δ L = Σ_{t ∈ context} ∇_δ L_t(δ) + Σ_{t ∈ new} ∇_δ L_t(δ)
The first sum has 1984 terms. The second has 64. The context gradient dominates by a factor of 31:1. The δ̂ that emerges is overwhelmingly determined by already-scored tokens — tokens whose values are known to any valid decoder. The 64 new tokens contribute 3.1% of the gradient signal.
**2. The decoder CAN reconstruct δ̂_B.**
You wrote: "To decode the first token in the batch, the decoder would need the same δ̂_B the encoder used. But δ̂_B was computed from the whole batch's targets."
This is true for a hypothetical single-batch-no-overlap protocol. But in stride=64 sliding window evaluation, the decoder at window position w has already decoded all tokens up to position w — which constitutes 1984 of the 2048 tokens in the window. The decoder can:

  1. Take the 1984 known context tokens
  2. Run the same SLOT optimization using only those 1984 tokens
  3. Obtain δ̂_context, which is ≈ δ̂_B (since 96.9% of the gradient is identical)
  4. Use δ̂_context to decode the 64 new tokens

The reconstruction error is bounded by the gradient contribution of the 64 unknown tokens, which is O(64/2048) ≈ 3.1% of the total gradient norm. In practice, AdamW with 5-8 steps from zero initialization in a smooth loss landscape means this perturbation shifts δ̂ negligibly.
**3. There exists a trivially causal variant with near-identical performance.**
If the concern is that 3.1% of gradient signal comes from unseen tokens, there is an immediate fix: optimize δ only on the context (non-scored) tokens, then score the stride tokens under the resulting δ̂_context. This is strictly causal — the scored tokens never participate in the optimization. The BPB result would change by at most the 3.1% gradient contribution from stride tokens, which in my experiments with lr=0.003/steps=5 translates to < 0.0002 BPB.
I would happily implement and validate this "context-only SLOT" variant if organizers prefer it.
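The context-only variant can be sketched with the same toy simplifications used earlier in the thread: a NumPy linear head in place of the model's logit projection, plain gradient descent in place of AdamW, and hypothetical shapes throughout.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def context_only_slot(hidden, W, targets, n_context, lr=0.003, steps=5):
    """Fit delta on the first n_context (already-decoded) positions only,
    then score the remaining stride positions under it. The scored tokens
    never enter the optimization, so the procedure is strictly causal.
    hidden: [T, d], W: [V, d], targets: [T]."""
    d = hidden.shape[1]
    delta = np.zeros(d)
    ctx = np.arange(n_context)
    for _ in range(steps):
        p = softmax((hidden[ctx] + delta) @ W.T)   # context positions only
        g = p.copy()
        g[ctx, targets[ctx]] -= 1.0                # dCE/dlogits
        delta -= lr * (g @ W).mean(axis=0)
    new = np.arange(n_context, len(targets))       # stride tokens
    p = softmax((hidden[new] + delta) @ W.T)
    return -np.log(p[new - n_context, targets[new]]).mean()
```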
**4. The position-dependence objection is a feature, not a bug.**
You wrote: "because [δ] is inserted upstream of a nonlinearity into different hidden states, its effect on the token distribution is position-dependent."
Correct — and this is exactly why SLOT is calibration, not memorization. A constant δ modulates the output distribution differently at each position because the hidden states differ. But this modulation is determined by the hidden states (which are causal and artifact-determined), not by the targets. The same δ applied to a different sequence with the same hidden state at position t would produce the same logit shift at position t. The position-dependence comes from the model's causal representation, not from target leakage.
**Summary of my position:**
SLOT in stride=64 sliding window is not the protocol NoesisGenesis analyzes (optimize on batch, score same batch with no overlap). It is: optimize on 1984 known tokens + 64 unknown tokens, score only the 64 unknown ones, with 96.9% of the gradient coming from known data. A decoder can reconstruct δ̂ from the known portion alone. If organizers want the last 3.1% eliminated, context-only SLOT is a trivial fix with near-identical BPB.
@0hq @valerio-oai — I'd welcome a ruling on whether current SLOT or the context-only variant is preferred.

PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Mar 31, 2026