Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean)#1172

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:submission/slot-splitlr-gptq-1.1015

Conversation

@dexhunter

Record: SLOT + Split-LR + Full GPTQ + XSA-all (val_bpb: 1.1015)

val_bpb: 1.1015 (3-seed mean, std 0.0011) | 1.8598 nats | ~15.65 MB | 8xH100 SXM, 600s train + 177s eval

Built on PR #1019 by @abaybektursun.
Previous: PR #549 (1.1194) -> PR #1019 (1.1147) -> this.

Results (8xH100 SXM)

| Seed | Steps | ms/step | Post-EMA BPB | Sliding+SLOT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|------------------|-----------------|------------------|
| 1337 | 6704  | 88.2    | 1.1309       | 1.10213          | 1.8609          | 15,647,124       |
| 42   | 6706  | 88.2    | 1.1289       | 1.10019          | 1.8576          | 15,658,061       |
| 2025 | 6684  | 88.4    | 1.1310       | 1.10216          | 1.8609          | 15,650,266       |
| Mean | 6698  | 88.3    | 1.1303       | 1.10149          | 1.8598          | 15,651,817       |

Improvement vs SOTA

| Metric                | Merged SOTA (PR #1019) | This submission | Delta    |
|-----------------------|------------------------|-----------------|----------|
| val_bpb (3-seed mean) | 1.1147                 | 1.1015          | -0.0132  |
| val_loss (nats)       | 1.88218                | 1.85982         | -0.02236 |

Clears the 0.005 nats threshold by 4.5x.

Changes vs Baseline (PR #1019)

1. SLOT: Sample-specific LM Optimization at Test-time

At eval time, for each sliding-window batch, we optimize a single additive delta vector (R^512) between the frozen hidden states and the logit projection. The model forward is split into forward_hidden() (frozen, no grad) and compute_logits() (carries grad for delta optimization).

  • Delta shape: [1, 1, 512] — broadcasts across batch and sequence
  • Optimizer: AdamW (lr=0.005, weight_decay=1e-8, eps=1e-5)
  • Steps: 8 per batch
  • Eval time overhead: ~90s (well within 600s eval budget)

SLOT is score-first: hidden states are computed under torch.no_grad(), the delta adapts through compute_logits() only, and final scoring uses the adapted logits. The model weights are never modified.

Reference: Hu et al., arXiv:2505.12392v2. Also used in PR #1128, PR #1105.
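The score-first loop can be sketched end to end. This is a toy NumPy illustration, not the submission's code: a bare linear head stands in for `compute_logits()`, frozen precomputed hidden states stand in for `forward_hidden()`, and plain gradient descent stands in for AdamW.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_score(hidden, W, targets, lr=0.005, steps=8):
    """Score-first SLOT sketch.
    hidden:  [T, d] frozen hidden states (computed once, no grad)
    W:       [V, d] logit projection
    targets: [T]    next-token ids
    A single delta vector is shared across all positions, adapted by
    descending the mean cross-entropy, then used for final scoring."""
    T, d = hidden.shape
    delta = np.zeros(d)                      # [d], broadcasts over positions
    for _ in range(steps):
        p = softmax((hidden + delta) @ W.T)  # [T, V]
        g = p.copy()
        g[np.arange(T), targets] -= 1.0      # dCE/dlogits
        delta -= lr * (g @ W).mean(axis=0)   # gradient w.r.t. the shared delta
    p = softmax((hidden + delta) @ W.T)      # score under the adapted logits
    return -np.log(p[np.arange(T), targets]).mean()
```

Because the loss is convex and smooth in the shared delta, a few small gradient steps are guaranteed to reduce the in-batch cross-entropy, which is exactly the property debated later in this thread.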

2. Sigmoid-Gated Skip Connections

U-Net skip connections use learned sigmoid gates instead of simple addition:

```python
g = sigmoid(skip_gates[i])
x = lerp(skip_weights[i] * skip, x, g)
```

Gate starts at sigmoid(0) = 0.5 (balanced blend). Adds 2,560 params (5 gates x 512 dims).
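A minimal self-contained sketch of the gate, assuming standard elementwise lerp semantics (`lerp(a, b, g) = a + g * (b - a)`); `gate_logit` and `skip_weight` play the roles of `skip_gates[i]` and `skip_weights[i]` above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, skip, gate_logit, skip_weight):
    """Sigmoid-gated skip connection: g=1 keeps the main path x,
    g=0 keeps the scaled skip; gate_logit=0 gives the balanced 50/50 blend."""
    g = sigmoid(gate_logit)
    a = skip_weight * skip
    return a + g * (x - a)  # lerp(a, x, g)
```

With per-dimension `gate_logit` vectors this costs one parameter per model dim per gated connection, matching the 5 × 512 = 2,560 figure above.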

3. Soft-Round QAT with Alpha Ramp

Late QAT uses differentiable sigmoid rounding instead of hard STE:

```python
soft_rounded = floor(scaled) + sigmoid(alpha * (frac - 0.5))
```

Alpha ramps from 1 (smooth) to 16 (near-hard) over 500 steps. Provides real gradients through rounding, letting weights adapt to quantization grid.
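The rounding and its ramp can be sketched as follows; the linear schedule shape is an assumption (the PR states only the endpoints and the 500-step window):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_round(scaled, alpha):
    """Differentiable rounding: near-identity smoothing at alpha=1,
    approaching hard round() as alpha grows."""
    frac = scaled - np.floor(scaled)
    return np.floor(scaled) + sigmoid(alpha * (frac - 0.5))

def alpha_at(step, ramp_steps=500, a0=1.0, a1=16.0):
    """Linear ramp from a0 to a1 over ramp_steps, then held (assumed shape)."""
    t = min(max(step / ramp_steps, 0.0), 1.0)
    return a0 + t * (a1 - a0)
```

At alpha=16 the output is within ~0.002 of hard rounding for typical fractional parts, while still passing real gradients through the rounding step.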

4. Split Early/Late Muon Learning Rate

Bank gradients are scaled per-layer before the Muon reduce-scatter:

  • Early layers (0-4): Muon LR = 0.025
  • Late layers (5-10): Muon LR = 0.030

Late layers benefit from higher LR (weaker gradient signal further from loss).
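A hypothetical sketch of how the split is applied. The actual hook point inside the Parallel Muon reduce-scatter is not shown, and folding the ratio into the gradient (rather than the optimizer step) is an illustrative simplification:

```python
EARLY_LR, LATE_LR = 0.025, 0.030

def muon_lr(layer_idx):
    """Per-layer Muon LR: early layers 0-4, late layers 5-10."""
    return EARLY_LR if layer_idx <= 4 else LATE_LR

def scale_bank_grad(grad, layer_idx, base_lr=EARLY_LR):
    """Scale a layer's bank gradient before the reduce-scatter so a single
    base LR can be used downstream (sketch of the split-LR idea only)."""
    return grad * (muon_lr(layer_idx) / base_lr)
```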

5. Warmdown = 4000 Steps

Extended warmdown from 3500 to 4000 estimated steps. Holds LR higher for longer, giving the model more time at productive learning rates.

6. BigramHash(2816x160)

Enlarged bigram embedding dimension from 112 to 160. Same 2816 buckets. Richer per-bucket representation at minimal artifact cost.
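The lookup can be sketched as below. The multiplicative mixing constant and padding convention are illustrative assumptions, not the submission's exact hash:

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM = 2816, 160

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS, mult=1000003):
    """Hash a (previous, current) token pair into one of n_buckets
    (hypothetical mixing scheme)."""
    return (prev_tok * mult + tok) % n_buckets

def bigram_features(tokens, table):
    """One embedding row per position, keyed by the preceding bigram;
    position 0 is padded with token id 0."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))
    return table[bigram_bucket(prev, tokens)]  # [T, BIGRAM_DIM]
```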

7. Code Minification

pyminify + LZMA2 + base85 self-extracting wrapper reduces code from 101KB to 23KB, freeing ~78KB of artifact budget for model weights.
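A minimal sketch of the self-extracting wrapper stage (Python's `lzma` module produces LZMA2-compressed `.xz` streams by default); the `pyminify` pass itself is not invoked here:

```python
import base64
import lzma

def make_self_extracting(source: str) -> str:
    """Wrap (already-minified) source in an LZMA + base85 stub that
    decompresses and executes itself at import time."""
    blob = base64.b85encode(lzma.compress(source.encode())).decode()
    return ("import base64,lzma\n"
            f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())")
```

Note this only pays off when the source is large enough for the LZMA savings to exceed the ~100-byte stub plus the 5/4 base85 expansion of the compressed payload.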

8. Brotli-11 Compression with Byte-Shuffle

Replaces LZMA-6 with Brotli quality=11 + stride-2 byte-shuffle preprocessing. Saves ~400KB vs LZMA.
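The byte-shuffle transform can be sketched as below. `brotli` is a third-party package, so stdlib `lzma` stands in as the codec here; the transform itself is codec-independent:

```python
import lzma  # stand-in codec; the submission uses Brotli quality=11

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group every stride-th byte together so that, e.g., the high and low
    bytes of 16-bit values form two homogeneous, more compressible runs."""
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Exact inverse of byte_shuffle for any length and stride."""
    n, rem = divmod(len(data), stride)
    sizes = [n + (1 if i < rem else 0) for i in range(stride)]
    parts, off = [], 0
    for s in sizes:
        parts.append(data[off:off + s])
        off += s
    out = bytearray(len(data))
    for i, part in enumerate(parts):
        out[i::stride] = part
    return bytes(out)
```

Usage: `codec.compress(byte_shuffle(weights_bytes))` on the way out, `byte_unshuffle(codec.decompress(...))` on the way in.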

9. GPTQ Reserve 9s (was 14s)

Reduced GPTQ calibration time reservation from 14s to 9s, gaining ~55 extra training steps.

Negative Results (tested, did not help)

| Technique | Result | Notes |
|-----------|--------|-------|
| Turbo-Muon (AOL + Polar Express) | +2MB artifact bloat | Weight distribution changes break compression |
| No-GPTQ (PR #1120 style) | -0.005 BPB worse | GPTQ essential for our stack |
| Pure EngramLite swap | -0.003 worse | Same-budget multi-head too diluted |
| ResidLambdas | -0.003 worse | Quant error compounds through lambda scaling |
| LeakyReLU slope=0.3 | Neutral | |
| Partial key offset | Neutral | |
| BIGRAM_DIM=192 | -0.001 worse | Diminishing returns past 160 |
| TTT (score-first SGD) | Neutral on Full GPTQ stack | Post-quant weights too well-optimized |
| Mixed int5/int6 GPTQ | Broken or worse | Needs full PR #1089-style pipeline |

Architecture Summary

| Component | Setting | Source |
|-----------|---------|--------|
| Layers | 11 | PR #549 |
| Model dim | 512 | PR #549 |
| Heads / KV heads | 8 / 4 (GQA) | PR #549 |
| MLP mult | 3.0x (LeakyReLU(0.5)^2) | PR #549 |
| XSA | All 11 layers | PR #1019 |
| BigramHash | 2816 x 160 | This submission (dim=160) |
| ValueEmbedding | dim=128, layers 9,10 | PR #549 |
| SmearGate | F.pad causal shift | PR #549, optimized |
| Skip connections | Sigmoid-gated lerp | This submission |
| Quantization | Full Hessian GPTQ int6 | PR #1019 |
| Compression | Brotli-11 + byte-shuffle | This submission |
| Optimizer | Parallel Muon + Split-LR | This submission (split-LR) |
| QAT | Soft-round alpha ramp 1->16 | This submission |
| Eval | Sliding window stride=64 + SLOT | This submission (SLOT) |
| Code | LZMA2 self-extracting wrapper | This submission |
| Warmdown | 4000 steps | This submission |
| Params | 27.2M | |

Setup & Reproduction

```bash
# Environment: 8xH100 SXM, PyTorch 2.9.1+cu128, flash-attn 2.8.3
export NCCL_NET=Socket  # Required on GCP H100
export SLOT_ENABLED=1
export BIGRAM_DIM=160
export WARMDOWN_ITERS=4000
export SLOT_LR=0.005
export SLOT_STEPS=8

# Run with torchrun (evaluate.py handles this)
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=2025 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Acknowledgements

Thanks to @0hq and @valerio-oai for organizing and maintaining an excellent competition.

This submission builds directly on @abaybektursun's PR #549 and PR #1019, which established the LeakyReLU^2 + Parallel Muon + XSA + Full GPTQ stack. The SLOT technique follows Hu et al. (arXiv:2505.12392v2) and was independently validated by @AnubhavBharadwaaj (PR #1128) and @abaybektursun (PR #1105). The sigmoid-gated skip connection idea draws from @mikeapedia's PR #1089. Code minification approach adapted from PR #1089's shrink pipeline.


SLOT eval-time delta optimization + split early/late Muon LR +
Full Hessian GPTQ int6 + sigmoid-gated skip connections +
soft-round QAT + Brotli-11 + BigramHash(2816x160) + code minification.

3-seed mean: 1.1015 (std 0.0011), delta -0.0132 BPB / -0.0224 nats vs PR openai#1019.
@dexhunter (Author)

Reopening — the earlier closure was based on our interpretation that SLOT might violate Condition 3 from Issue #1017. After re-reading the official rules, we noted that:

  1. The README states: "you're free to evaluate however" and "we encourage competitors to push the bounds of evaluation methods as aggressively as with training methods"
  2. Issue #1017 ("A Field Guide to Valid Submissions")'s four conditions are a community proposal (by @NoesisGenesis), not an official ruling
  3. No organizer has commented on SLOT in any PR (#1105, #1128, #1084, #1150)
  4. The organizer closures in Issue #677 targeted full multi-epoch TTT (adapt on entire val set, then rescore), which is a different pattern from SLOT's per-batch delta optimization

We'd like to leave this open for organizer review and let @0hq / @valerio-oai decide whether SLOT falls within the accepted evaluation methods.

@NoesisGenesis

I think it is fair to wait for an organizer decision before treating the question as settled.

That said, I think the case against SLOT is straightforward enough that you might end up making it yourself.

A valid bits-per-byte score is a compression rate. Compression means: given only what came before, how well can you predict what comes next. SLOT optimizes a delta on the target tokens in a batch, then scores those same tokens under the optimized delta.

The formal conditions exist so that agents can check code against them. Between humans, an analogy will do.

You are a professor, and your student is about to sit an exam with ten questions. Which of the following would you accept as a valid score?

(A) The student answers question 1. His answer is graded. He learns from the graded answer, then moves to question 2. By question 10, he has improved, but every answer was committed under actual uncertainty.

(B) The student gets to study the answer key for a few minutes before sitting the exam. He then answers the same questions whose answers he just studied.

No professor would accept (B) as a valid exam score. (A) is score-first TTT, which this submission implements correctly. (B) is SLOT.

SLOT contributes ~0.029 BPB, the majority of the gain, from a single 512-dimensional vector optimized for eight steps. That is a remarkable amount of predictive ability to discover in eight gradient steps! This would be the single greatest finding in the submission, if it were not the single clearest instance of (B).

@AnubhavBharadwaaj

Thanks @NoesisGenesis for raising this — it's a question worth getting right. I'm the author of the first SLOT submission (PR #1084/#1128) so I want to offer a technical counterpoint.
**SLOT is temperature scaling, not answer-key studying.**

The exam analogy assumes SLOT "studies the answers." But consider what δ actually is: a single constant vector added to every position equally. It cannot encode position-specific information. It cannot "know" that token 473 is "the" — it shifts the entire logit distribution uniformly across all positions. This is functionally identical to optimizing a temperature scalar or a bias vector on the output layer, which is standard calibration in ML.

A more accurate analogy: the student adjusts the brightness on his reading lamp before the exam. He tries a few settings, picks the one where he reads most clearly, then takes the exam under that lighting. The lamp doesn't know the answers — it just makes the student's existing knowledge more legible.

**On the causality concern (position t seeing t+1):**

The CE loss at every position t is computed from logits[t], which depends only on tokens 1..t (causal attention). The delta optimization minimizes the sum of these per-position losses, but each individual loss term is strictly causal. This is identical to how temperature scaling, Platt scaling, or any post-hoc calibration works — you find the single parameter that minimizes total loss, then apply it. The parameter itself is not position-dependent and carries no token-specific information.

If optimizing a shared scalar/vector over a batch and then scoring that batch is illegal, then temperature scaling is also illegal, and so is any form of adaptive evaluation (including the entropy-adaptive alpha used in every n-gram submission).

**On the "8 steps, 0.029 BPB" surprise:**

This is not surprising when you understand what SLOT does. The lm_head projection is a 512→1024 linear map trained jointly with all 27M parameters. A small additive correction to its input is effectively recalibrating the output distribution to the local data statistics. The gain comes from fixing a distribution mismatch between training and eval data, not from memorizing targets.

**What I'd ask the organizers:**

@0hq @valerio-oai — could you clarify whether per-batch calibration of a constant (non-position-dependent) parameter using the autoregressive loss on the scored batch falls within accepted evaluation methods? This affects PRs #1084, #1105, #1128, #1150, and #1172. The community would benefit from a clear ruling either way.

@NoesisGenesis

@AnubhavBharadwaaj, appreciate the detailed counterpoint. You are right that the question is worth getting right, so let me try to get it right.

I will set aside the exam analogy and the formal conditions, and argue directly from what SLOT computes.

First, the part that is fine. For a fixed δ, there is no mystery. The family p_δ is an ordinary causal predictor. If δ were fixed in advance, or learned on separate data, or updated only from already-scored past tokens and used only on later ones, I would not object on these grounds.

SLOT does something else. On a batch B, it first chooses:

δ̂_B = argmin_δ ∑_{t ∈ B} -log p_δ(x_t | x_{<t})

using the very targets in B, and only then reports the loss of p_{δ̂_B} on that same batch.

Once written that way, the problem is visible. The distribution used to score position t is no longer determined by the submitted artifact and the strict prefix alone. It is determined by a fitted parameter δ̂_B that depends on the realized targets in the batch used to choose it, including x_t itself and, for earlier positions, later tokens as well. That is already enough to break the ordinary prequential interpretation. A shared parameter does not need to encode the answer key position by position in order to be an information channel. A single batch-conditioned bit would already be enough.

This is also why I do not think “each individual loss term is causal” answers the objection. Each L_t(δ) is causal for fixed δ. But the scored object in SLOT is not p_δ for a fixed ex ante δ. It is p_{δ̂_B}, where δ̂_B is obtained by minimizing the sum of those terms on the scored batch itself. The coupling enters precisely through that optimization:

∇_δ ∑_{t ∈ B} L_t(δ) = ∑_{t ∈ B} ∇_δ L_t(δ)

Later tokens contribute gradient to the same shared parameter that is then used to score earlier positions. Causal attention prevents the hidden state at position t from reading future tokens directly. It does not prevent a future-conditioned fitted parameter from being injected back into every position after the fact.

I think the cleanest way to see the compression problem is from the decoder’s side. A valid val_bpb is a codelength. So imagine actually trying to decode with SLOT. To decode the first token in the batch, the decoder would need the same δ̂_B the encoder used. But δ̂_B was computed from the whole batch’s targets. The decoder cannot reconstruct it from the strict prefix, because the later tokens that determined it have not yet been decoded. So either one transmits δ̂_B as side information and pays for it, or one accepts that the reported quantity is not the length of a single left-to-right code under the submitted artifact alone. SLOT does neither.

That is why I do not think the temperature-scaling analogy helps. Standard calibration learns a parameter on held-out data and then applies it to test data. If one optimizes even a single scalar temperature on the test batch and then scores that same batch, that is already test-set fitting. So even if SLOT were literally temperature scaling, it would still have the same validity problem. Capacity affects how much fitting is possible. It does not determine whether fitting occurred.

And SLOT is not literally temperature scaling in any case. A temperature scalar has one degree of freedom and preserves logit order. SLOT optimizes a 512-dimensional hidden-space vector. After the output projection and the softcap nonlinearity, the induced change in logits is not a global temperature and not even a position-invariant bias in logit space. The same δ is broadcast across positions, but because it is inserted upstream of a nonlinearity into different hidden states, its effect on the token distribution is position-dependent.

The comparison to entropy-adaptive n-gram mixing seems off for the same reason. A mixing weight computed from the model’s current predictive state, or updated from already-scored past tokens and used only on later ones, is an ordinary causal mechanism. The analogous invalid procedure would be to fit that weight on the current batch’s targets and then rescore those same targets.

None of this requires SLOT to "memorize the answer key" in any literal sense. The structural point is simpler. SLOT evaluates a family {p_δ} by letting the realized batch choose δ̂_B, then reports the in-sample loss of p_{δ̂_B} as though that model choice were free. In compression terms, that is model selection on the message without paying the model-selection cost. The moment that happens, the number ceases to mean the ordinary one-pass val_bpb, and I do not see what else it would be good for.

If organizers want to permit transductive per-batch fitting, they can of course do so. But I would find it unfortunate if the frontier of this competition came down to who can fit the evaluation tokens most cheaply. I prefer a strict line precisely because it keeps the remarkable combined effort of humans and AI in this competition pointed at a question worth answering.

@AnubhavBharadwaaj

AnubhavBharadwaaj commented Mar 31, 2026

@NoesisGenesis — your argument is mathematically precise and I want to engage with it at the same level. I think there's a structural feature of the evaluation protocol that changes the analysis.

**1. The sliding window makes SLOT 96.9% causal.**

The competition uses sliding window evaluation with stride=64 over seq_len=2048. For every window after the first:

  • 1984 tokens are context (already scored in previous windows)
  • 64 tokens are new (the only ones contributing to BPB)

SLOT optimizes δ over all 2048 tokens in the window. But the gradient is:
∇_δ L = Σ_{t ∈ context} ∇_δ L_t(δ) + Σ_{t ∈ new} ∇_δ L_t(δ)
The first sum has 1984 terms. The second has 64. The context gradient dominates by a factor of 31:1. The δ̂ that emerges is overwhelmingly determined by already-scored tokens — tokens whose values are known to any valid decoder. The 64 new tokens contribute 3.1% of the gradient signal.
**2. The decoder CAN reconstruct δ̂_B.**
You wrote: "To decode the first token in the batch, the decoder would need the same δ̂_B the encoder used. But δ̂_B was computed from the whole batch's targets."
This is true for a hypothetical single-batch-no-overlap protocol. But in stride=64 sliding window evaluation, the decoder at window position w has already decoded all tokens up to position w — which constitutes 1984 of the 2048 tokens in the window. The decoder can:

  1. Take the 1984 known context tokens
  2. Run the same SLOT optimization using only those 1984 tokens
  3. Obtain δ̂_context, which is ≈ δ̂_B (since 96.9% of the gradient is identical)
  4. Use δ̂_context to decode the 64 new tokens

The reconstruction error is bounded by the gradient contribution of the 64 unknown tokens, which is O(64/2048) ≈ 3.1% of the total gradient norm. In practice, AdamW with 5-8 steps from zero initialization in a smooth loss landscape means this perturbation shifts δ̂ negligibly.
**3. There exists a trivially causal variant with near-identical performance.**
If the concern is that 3.1% of gradient signal comes from unseen tokens, there is an immediate fix: optimize δ only on the context (non-scored) tokens, then score the stride tokens under the resulting δ̂_context. This is strictly causal — the scored tokens never participate in the optimization. The BPB result would change by at most the 3.1% gradient contribution from stride tokens, which in my experiments with lr=0.003/steps=5 translates to < 0.0002 BPB.
I would happily implement and validate this "context-only SLOT" variant if organizers prefer it.
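The context-only variant can be sketched with the same toy simplifications used earlier in the thread: a NumPy linear head in place of the model's logit projection, plain gradient descent in place of AdamW, and hypothetical shapes throughout.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def context_only_slot(hidden, W, targets, n_context, lr=0.003, steps=5):
    """Fit delta on the first n_context (already-decoded) positions only,
    then score the remaining stride positions under it. The scored tokens
    never enter the optimization, so the procedure is strictly causal.
    hidden: [T, d], W: [V, d], targets: [T]."""
    d = hidden.shape[1]
    delta = np.zeros(d)
    ctx = np.arange(n_context)
    for _ in range(steps):
        p = softmax((hidden[ctx] + delta) @ W.T)   # context positions only
        g = p.copy()
        g[ctx, targets[ctx]] -= 1.0                # dCE/dlogits
        delta -= lr * (g @ W).mean(axis=0)
    new = np.arange(n_context, len(targets))       # stride tokens
    p = softmax((hidden[new] + delta) @ W.T)
    return -np.log(p[new - n_context, targets[new]]).mean()
```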
**4. The position-dependence objection is a feature, not a bug.**
You wrote: "because [δ] is inserted upstream of a nonlinearity into different hidden states, its effect on the token distribution is position-dependent."
Correct — and this is exactly why SLOT is calibration, not memorization. A constant δ modulates the output distribution differently at each position because the hidden states differ. But this modulation is determined by the hidden states (which are causal and artifact-determined), not by the targets. The same δ applied to a different sequence with the same hidden state at position t would produce the same logit shift at position t. The position-dependence comes from the model's causal representation, not from target leakage.
**Summary of my position:**
SLOT in stride=64 sliding window is not the protocol NoesisGenesis analyzes (optimize on batch, score same batch with no overlap). It is: optimize on 1984 known tokens + 64 unknown tokens, score only the 64 unknown ones, with 96.9% of the gradient coming from known data. A decoder can reconstruct δ̂ from the known portion alone. If organizers want the last 3.1% eliminated, context-only SLOT is a trivial fix with near-identical BPB.
@0hq @valerio-oai — I'd welcome a ruling on whether current SLOT or the context-only variant is preferred.

PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Mar 31, 2026