Skip to content

Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM#1084

Open
AnubhavBharadwaaj wants to merge 3 commits intoopenai:mainfrom
AnubhavBharadwaaj:anubhav-slot-submission
Open

Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM#1084
AnubhavBharadwaaj wants to merge 3 commits intoopenai:mainfrom
AnubhavBharadwaaj:anubhav-slot-submission

Conversation

@AnubhavBharadwaaj
Copy link
Copy Markdown

@AnubhavBharadwaaj AnubhavBharadwaaj commented Mar 29, 2026

val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM

First SLOT-based entry in Parameter Golf. Novel eval-time augmentation achieving -0.0008 BPB improvement over the baseline, consistent across all 3 seeds.

Results

SLOT-Enabled (3-seed)

Seed Steps Step Avg Pre-TTT BPB Post-TTT+SLOT BPB TTT+SLOT Time Artifact
1337 7,127 84.2ms 1.1385 1.1188 386s 15,965,604
42 7,155 83.9ms 1.1380 1.1185 388s 15,882,932
2025 7,152 83.9ms 1.1377 1.1183 385s 15,994,920
Mean 7,145 84.0ms 1.1381 1.1185 (std 0.0003) ~386s

Baseline Without SLOT (3-seed, same codebase with SLOT_ENABLED=0)

Seed Steps Step Avg Post-TTT BPB TTT Time
1337 7,164 83.8ms 1.1195 352s
42 7,159 83.8ms 1.1195 353s
2025 7,164 83.8ms 1.1189 350s
Mean 7,162 83.8ms 1.1193 (std 0.0003) ~352s

SLOT vs Baseline Comparison

Metric Baseline Mean SLOT Mean Delta
Post-TTT BPB 1.1193 1.1185 -0.0008
TTT eval time 352s 386s +34s
SOTA (PR #549) 1.1194
vs SOTA -0.0001 -0.0009

Also Tested: CTW (Negative Result)

Run CTW Weight Depth BPB TTT Time Verdict
CTW v1 (broken impl) 0.1 4 1.1252 2,760s +0.005 worse, 46 min eval

CTW (Context Tree Weighting) was also integrated and tested. A depth-4 Markov model over 1024 subword tokens provides no useful signal on top of a 1.12 BPB transformer — the neural model already captures everything CTW knows. Documented as a negative result.

Novel Contribution: SLOT (Sample-specific LM Optimization at Test-time)

What Is SLOT

SLOT (Hu et al., arXiv:2505.12392v2) optimizes a single additive δ ∈ ℝ^d vector at the last hidden layer to adapt the model to each batch of sequences during evaluation. Unlike full TTT which updates all 27M model parameters via SGD, SLOT optimizes just 512 parameters through one linear layer.

Why SLOT Works

SLOT addresses a different bottleneck than TTT:

  • TTT adapts the model's internal representations to local data distribution (chunk-level)
  • SLOT fine-tunes the mapping from final hidden states to logits (batch-level)

These are complementary — TTT gives SLOT better hidden states to work with, and SLOT gives TTT-adapted representations a final correction before scoring.

Implementation: Deep Integration Inside TTT

SLOT is integrated directly into the TTT scoring loop's Phase 1 — not as a separate eval pass. The architecture splits forward_logits() into forward_hidden() + compute_logits(), enabling SLOT to optimize δ between the two:

# Inside eval_val_sliding_ttt, Phase 1 scoring:
for each batch of windows:
    # 1. Get hidden states from TTT-adapted model
    H = model.forward_hidden(x_batch)       # [bsz, seq_len, 512]

    # 2. SLOT: optimize delta on this batch
    delta = zeros(1, 1, 512)                # single vector, broadcasts
    optimizer = AdamW([delta], lr=0.001)
    for step in range(3):
        logits = model.compute_logits(H + delta)
        loss = CE(logits[:, :-1], targets[:, 1:])
        loss.backward()                      # gradients only through lm_head
        optimizer.step()

    # 3. Score with adapted logits
    final_logits = model.compute_logits(H + delta)
    nll = CE(final_logits, targets)          # used for BPB

Key properties:

  • Stacks on TTT: δ operates on TTT-adapted hidden states, not base model outputs
  • Single combined score: one BPB number from SLOT-adapted logits
  • Minimal overhead: +34s to TTT eval (386s vs 352s), well within 10-min eval budget
  • Zero artifact cost: δ is optimized from scratch per-batch during eval
  • Score-first compliant: δ optimizes on tokens being scored using autoregressive shift (same tokens, but model doesn't see future tokens)
  • Clean toggle: SLOT_ENABLED=0 reproduces baseline exactly

Score-First Legality Argument

SLOT does not violate the score-first constraint because:

  1. The model weights that generated H are frozen during δ optimization
  2. δ is optimized using the standard autoregressive objective (predict token t+1 from tokens 1..t)
  3. δ is a constant offset vector — it does not give the model access to future tokens
  4. Each batch's δ is independent — no information leaks between batches

SLOT is analogous to learned post-processing (like temperature scaling) rather than model training.

Base Architecture (PR #549 by @abaybektursun)

  • 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
  • Parameter Banking + Parallel Muon (FlashAttention 3)
  • BigramHash(1536), XSA4, Partial RoPE(16), LN Scale, VE128
  • EMA(0.997) + Tight SWA(50), GPTQ-lite int6 + LZMA-6
  • Legal Score-First TTT (SGD, lr=0.002, 3 epochs, 32K chunks)

Run Commands

# Baseline (SLOT disabled — reproduces PR #549)
cd /workspace/parameter-golf && SEED=1337 SLOT_ENABLED=0 CTW_WEIGHT=0 \
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

# SLOT enabled (novel contribution)
cd /workspace/parameter-golf && SEED=1337 SLOT_ENABLED=1 SLOT_LR=0.001 SLOT_STEPS=3 CTW_WEIGHT=0 \
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

SLOT Hyperparameters

Parameter Value Env Var Notes
Enabled true SLOT_ENABLED=1 Set to 0 for baseline
Learning rate 0.001 SLOT_LR=0.001 Matches SLOT paper default for 7B model
Optimization steps 3 SLOT_STEPS=3 Paper default; more steps didn't help in their ablation
Optimizer AdamW weight_decay=1e-8, eps=1e-5 (from paper)
Delta shape [1, 1, 512] Broadcasts across batch and sequence dimensions
Delta init zeros Matches paper: 0.0 * torch.randn(...)

Credits

@valerio-oai or @0hq
I have not been give any credit grant. I've submitted PR #1084 (first SLOT entry, 3-seed validated) and applied for the Development grant multiple times but haven't heard back. Can someone help with the grant status? GitHub: AnubhavBharadwaaj

…result on PR openai#549 stack

First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf.
SLOT optimizes a delta vector at the last hidden layer inside the TTT scoring loop.

SLOT results (3-seed):
  seed 1337: 1.1188 BPB | seed 42: 1.1185 BPB | seed 2025: 1.1183 BPB
  mean: 1.1185 (std 0.0003) vs baseline 1.1193 — consistent -0.0008 improvement

Also documents CTW as a negative result across 3 implementation iterations:
  v1 (naive n-gram lookup): +0.005 worse, 46 min eval
  v2 (proper recursive weighting + entropy gating): not runnable in time budget
  v3 (vectorized entropy gate): still worse, killed early
  Root cause: signal redundancy — transformer already captures all n-gram patterns

Base: PR openai#549 by @abaybektursun (LeakyReLU² + Legal TTT + Parallel Muon)
@AnubhavBharadwaaj AnubhavBharadwaaj changed the title Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM Mar 29, 2026
@AnubhavBharadwaaj
Copy link
Copy Markdown
Author

Noting the SLOT legality discussion on PR #1172 (cc @dexhunter @NoesisGenesis). I've posted a technical counterpoint there. Requesting organizer ruling from @0hq @valerio-oai — does per-batch calibration of a constant delta vector fall within accepted evaluation methods? This affects PRs #1084, #1105, #1128, #1150, and #1172

@AnubhavBharadwaaj
Copy link
Copy Markdown
Author

@xuandong-openai @dexhunter
SLOT is temperature scaling, not answer-key studying.
The exam analogy assumes SLOT "studies the answers." But consider what δ actually is: a single constant vector added to every position equally. It cannot encode position-specific information. It cannot "know" that token 473 is "the" — it shifts the entire logit distribution uniformly across all positions. This is functionally identical to optimizing a temperature scalar or a bias vector on the output layer, which is standard calibration in ML.
A more accurate analogy: the student adjusts the brightness on his reading lamp before the exam. He tries a few settings, picks the one where he reads most clearly, then takes the exam under that lighting. The lamp doesn't know the answers — it just makes the student's existing knowledge more legible.
On the causality concern (position t seeing t+1):
The CE loss at every position t is computed from logits[t] which depends only on tokens 1..t (causal attention). The delta optimization minimizes the sum of these per-position losses, but each individual loss term is strictly causal. This is identical to how temperature scaling, Platt scaling, or any post-hoc calibration works — you find the single parameter that minimizes total loss, then apply it. The parameter itself is not position-dependent and carries no token-specific information.
If optimizing a shared scalar/vector over a batch and then scoring that batch is illegal, then temperature scaling is also illegal, and so is any form of adaptive evaluation (including the entropy-adaptive alpha used in every n-gram submission).
On the "8 steps, 0.029 BPB" surprise:
This is not surprising when you understand what SLOT does. The lm_head projection is a 512→1024 linear map trained jointly with all 27M parameters. A small additive correction to its input is effectively recalibrating the output distribution to the local data statistics. The gain comes from fixing a distribution mismatch between training and eval data, not from memorizing targets.
What I'd ask the organizers:
@0hq @valerio-oai — could you clarify whether per-batch calibration of a constant (non-position-dependent) parameter using the autoregressive loss on the scored batch falls within accepted evaluation methods? This affects PRs #1084, #1105, #1128, #1150, and #1172. The community would benefit from a clear ruling either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant