Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack | val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM | #1084
…result on PR openai#549 stack

First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. SLOT optimizes a delta vector at the last hidden layer inside the TTT scoring loop.

SLOT results (3-seed):
seed 1337: 1.1188 BPB | seed 42: 1.1185 BPB | seed 2025: 1.1183 BPB
mean: 1.1185 (std 0.0003) vs baseline 1.1193, a consistent -0.0008 improvement

Also documents CTW as a negative result across 3 implementation iterations:
v1 (naive n-gram lookup): +0.005 worse, 46 min eval
v2 (proper recursive weighting + entropy gating): not runnable in time budget
v3 (vectorized entropy gate): still worse, killed early
Root cause: signal redundancy; the transformer already captures the n-gram patterns CTW models

Base: PR openai#549 by @abaybektursun (LeakyReLU² + Legal TTT + Parallel Muon)
Noting the SLOT legality discussion on PR #1172 (cc @dexhunter @NoesisGenesis). I've posted a technical counterpoint there. Requesting organizer ruling from @0hq @valerio-oai — does per-batch calibration of a constant delta vector fall within accepted evaluation methods? This affects PRs #1084, #1105, #1128, #1150, and #1172 |
@xuandong-openai @dexhunter |
val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM
First SLOT-based entry in Parameter Golf. Novel eval-time augmentation achieving -0.0008 BPB improvement over the baseline, consistent across all 3 seeds.
Results
SLOT-Enabled (3-seed)
Baseline Without SLOT (3-seed, same codebase with SLOT_ENABLED=0)
SLOT vs Baseline Comparison
Also Tested: CTW (Negative Result)
CTW (Context Tree Weighting) was also integrated and tested. A depth-4 Markov model over 1024 subword tokens provides no useful signal on top of a 1.12 BPB transformer — the neural model already captures everything CTW knows. Documented as a negative result.
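For reference, a minimal sketch of the CTW idea on a binary alphabet (not this PR's depth-4, 1024-token version): a KT estimator at every context node, with each internal node mixing its local estimate against the product of its children's weighted probabilities. All names here are illustrative, not from the PR's code.

```python
import math

class Node:
    """One context node: KT counts plus log-probabilities."""
    __slots__ = ("a", "b", "log_pe", "log_pw", "log_pw_children", "children")
    def __init__(self):
        self.a = 0                  # zeros seen in this context
        self.b = 0                  # ones seen in this context
        self.log_pe = 0.0           # log KT estimate at this node
        self.log_pw = 0.0           # log weighted (mixed) probability
        self.log_pw_children = 0.0  # log of product of children's P_w
        self.children = {}

def ctw_log_prob(bits, depth):
    """Sequential CTW: natural-log probability of `bits` under a
    depth-`depth` binary context tree (initial context zero-padded)."""
    root = Node()
    for t, x in enumerate(bits):
        ctx = [0] * max(0, depth - t) + bits[max(0, t - depth):t]
        path = [root]
        for c in reversed(ctx):                    # most recent bit first
            path.append(path[-1].children.setdefault(c, Node()))
        olds = [n.log_pw for n in path]            # pre-update P_w values
        for d in range(len(path) - 1, -1, -1):     # update leaf -> root
            n = path[d]
            denom = n.a + n.b + 1.0                # KT sequential ratio
            n.log_pe += math.log((n.b + 0.5) / denom if x else (n.a + 0.5) / denom)
            if x: n.b += 1
            else: n.a += 1
            if d == len(path) - 1:
                n.log_pw = n.log_pe                # leaf: P_w = P_e
            else:                                  # internal: 1/2 P_e + 1/2 prod P_w(child)
                n.log_pw_children += path[d + 1].log_pw - olds[d + 1]
                hi = max(n.log_pe, n.log_pw_children)
                n.log_pw = hi + math.log(
                    0.5 * math.exp(n.log_pe - hi)
                    + 0.5 * math.exp(n.log_pw_children - hi))
    return root.log_pw
```

On a strongly patterned stream, CTW's code length (-log2 P_w) drops far below 1 bit/symbol, which is exactly the low-order structure a 1.12 BPB transformer has already absorbed.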
Novel Contribution: SLOT (Sample-specific LM Optimization at Test-time)
What Is SLOT
SLOT (Hu et al., arXiv:2505.12392v2) optimizes a single additive δ ∈ ℝ^d vector at the last hidden layer to adapt the model to each batch of sequences during evaluation. Unlike full TTT which updates all 27M model parameters via SGD, SLOT optimizes just 512 parameters through one linear layer.
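A minimal sketch of that inner loop, assuming the standard SLOT objective (cross-entropy on the scored batch itself) and the hyperparameters this PR reports (SLOT_LR=0.001, SLOT_STEPS=3). The tensor `hidden` and the bare linear head are stand-ins for the PR's actual forward_hidden()/compute_logits() split:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, vocab = 512, 1024                        # d matches the 512-parameter δ above

# Stand-ins: forward_hidden() would produce `hidden`; compute_logits()
# is modeled here by a single linear head.
hidden = torch.randn(8, 64, d)              # (batch, seq, d)
lm_head = nn.Linear(d, vocab, bias=False)
targets = torch.randint(0, vocab, (8, 64))

delta = torch.zeros(d, requires_grad=True)  # the SLOT δ: just 512 parameters
opt = torch.optim.AdamW([delta], lr=1e-3)   # SLOT_LR

for _ in range(3):                          # SLOT_STEPS
    logits = lm_head(hidden + delta)        # δ added at the last hidden layer
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():                       # scoring pass with the adapted δ
    final_logits = lm_head(hidden + delta)
```

Only `delta` is handed to the optimizer, so the 27M model weights are untouched; the whole adaptation is a handful of 512-dim gradient steps per batch.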
Why SLOT Works
SLOT addresses a different bottleneck than TTT:
These are complementary — TTT gives SLOT better hidden states to work with, and SLOT gives TTT-adapted representations a final correction before scoring.
Implementation: Deep Integration Inside TTT
SLOT is integrated directly into the TTT scoring loop's Phase 1, not as a separate eval pass. The architecture splits forward_logits() into forward_hidden() + compute_logits(), enabling SLOT to optimize δ between the two.
Key properties:
SLOT_ENABLED=0 reproduces the baseline exactly
Score-First Legality Argument
SLOT does not violate the score-first constraint because:
SLOT is analogous to learned post-processing (like temperature scaling) rather than model training.
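For intuition on the analogy: temperature scaling is itself a tiny eval-time optimization, fitting one scalar on the scored batch's own NLL. A generic sketch (toy data, not code from this PR):

```python
import torch

torch.manual_seed(0)
logits = 3.0 * torch.randn(512, 10)           # deliberately over-sharp toy logits
labels = torch.randint(0, 10, (512,))

log_t = torch.zeros((), requires_grad=True)   # learn T > 0 in log-space
opt = torch.optim.Adam([log_t], lr=0.1)

for _ in range(100):                          # fit T by minimizing batch NLL
    loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Like SLOT's δ, the learned T rescales the existing scores without touching any model weight, which is the sense in which both are post-processing rather than training.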
Base Architecture (PR #549 by @abaybektursun)
Run Commands
SLOT Hyperparameters
SLOT_ENABLED=1
SLOT_LR=0.001
SLOT_STEPS=3
δ init: 0.0 * torch.randn(...)
Credits
@valerio-oai or @0hq
I have not been given any credit grant. I've submitted PR #1084 (first SLOT entry, 3-seed validated) and applied for the Development grant multiple times but haven't heard back. Can someone help with the grant status? GitHub: AnubhavBharadwaaj