diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/README.md b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/README.md new file mode 100644 index 0000000000..c2242d93b7 --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/README.md @@ -0,0 +1,65 @@ +## What this is + +A collection of things that don't work on well-trained GPTQ'd models. +My coding agents and I ran ~30 experiments over the past two weeks trying +to push below the 1.11 BPP frontier. Most of them failed. This documents +the failures so others don't repeat them. + +Updates and extends my earlier negative results PR (#1186). + +## Eval-time techniques that don't work on strong models + +These all work on weak/undertrained models but provide zero benefit once +your base model is well-trained with Full Hessian GPTQ + sliding window +eval: + +| Technique | BPP delta | Why it fails | +|-----------|:---------:|:-------------| +| Properly normalized n-gram (Kneser-Ney, exact trie) | +0.001 to -0.003 | Model is 100x better than n-gram at predicting the correct token. Mixing at any alpha dilutes model confidence. Confirms PR #511 (-0.001) and PR #1145 (-0.003). | +| Online Logit Bias (per-token SGD on logit bias vector) | +0.003 (hurts) | GPTQ'd model is already well-calibrated. No systematic bias to correct. Also takes 1229s (way over eval budget). | +| Prime MLP Adapters (zero-init rank-64, PR #1222 approach) | -0.00009 | PR #1222 got -0.073 but on a 1.50 BPP baseline. Our 1.11 baseline leaves no room — sliding window context already provides everything adapters would learn. | +| Complementary Training (down-weight n-gram-predictable tokens during training) | -0.0004 (noise) | Doesn't change model behavior enough. By the time the model converges, it already knows everything the bigram knows. | +| Score-first chunked TTT (PR #549 approach) | -0.003 | Works but the gain is tiny on GPTQ'd models. 
PR #1184 also found TTT "neutral" on their stack. |

## The n-gram normalization proof

I built the best possible legal n-gram cache: Kneser-Ney smoothing with
exact trie counts (zero hashing, zero collisions), order 7, full
normalized distribution over all 1024 tokens at every position.

Results on 500K positions:
- Max normalization error: **1.78e-15** (distributions are perfect)
- Zero normalization violations across all positions
- N-gram avg NLL: **5.40** vs model avg NLL: **0.79** (n-gram is 6.8x worse)
- Mixing at ANY alpha hurts on average

The entire 0.09-0.97 BPP improvement from hashed n-gram caches was a
measurement artifact of unnormalized distributions. The real signal from
properly normalized n-grams is 0.001-0.003 BPP, which is not worth the complexity.

## SLOT violates causal dependence

Detailed in my PR #1240. 100% violation rate across 240 tested pairs.
Self-prediction advantage: +0.24 nats (shared delta), +0.73 nats
(per-sample). Every SLOT-based result on the leaderboard is suspect.

## Scylla tokenizer doesn't help (with correct accounting)

Covered in my other PR. With corrected byte accounting, Scylla gets 1.1289
BPP, slightly worse than SP1024's 1.1157. The entire sub-1.0 claim was a byte
accounting bug in `candidate.meta.npz`.

## What actually matters

After all these experiments, model quality is dominated by:
1. **Training data volume** (194+ shards > 80 shards)
2. **Full Hessian GPTQ** (Cholesky + actorder, ~0.005 BPP over naive int6)
3. **Coprime-stride data loader** (batch diversity)
4. 
**XSA on all layers** (small but consistent gain with coprime loader) + +## Files included + +- `ngram_test.py` — Kneser-Ney trie with full normalization proof +- `online_logit_bias.py` — Online logit bias implementation + synthetic test +- `correct_meta.npz` — Corrected Scylla byte accounting +- `retokenize_proper.py` — Proper retokenization with official train/val split diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/correct_meta.npz b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/correct_meta.npz new file mode 100644 index 0000000000..567ed9cf8b Binary files /dev/null and b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/correct_meta.npz differ diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/ngram_test.py b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/ngram_test.py new file mode 100644 index 0000000000..6af6756fe2 --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/ngram_test.py @@ -0,0 +1,452 @@ +""" +Definitive test: Do properly normalized n-gram caches help? + +This script builds a trie-based n-gram language model with Kneser-Ney smoothing +over the validation set, produces FULL normalized distributions over all vocab tokens +at every position, and measures the real BPB improvement when mixed with a neural LM. 
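The failure mode being tested can be sketched in a few lines (hypothetical probabilities, not measured values — just the interpolation formula above applied once):

```python
import math

alpha = 0.1
p_model = 0.70   # model's prob of the correct token (hypothetical)
p_ngram = 0.02   # n-gram's prob of the same token (hypothetical)

# interpolated mixing: p_mixed = (1-alpha) * p_model + alpha * p_ngram
p_mixed = (1.0 - alpha) * p_model + alpha * p_ngram
delta_nll = -math.log(p_mixed) + math.log(p_model)
# delta_nll > 0: mixing with a much weaker predictor dilutes model confidence
```

When the model assigns far more mass to the correct token than the n-gram does, every alpha > 0 raises NLL on that position.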
+ +Key properties: +- Exact counts via trie (NO hashing — zero collisions) +- Full distribution over all 1024 tokens at every position (sum = 1.0, verified) +- Score-first: n-gram scores use only tokens BEFORE the current position +- Kneser-Ney smoothing (the gold standard for statistical n-gram LMs) +- Interpolated mixing: p_mixed = (1-alpha) * p_model + alpha * p_ngram + +Usage: + python3 ngram_test.py --model-path final_model.int6.ptz --val-path fineweb_val_000000.bin + +Or standalone (no neural model, just n-gram vs uniform baseline): + python3 ngram_test.py --standalone --val-path fineweb_val_000000.bin +""" + +import argparse +import math +import os +import sys +import time +from collections import defaultdict +from typing import Optional + +import numpy as np + + +# ============================================================ +# Trie-based N-gram Count Store (exact counts, no hashing) +# ============================================================ + +class TrieNode: + __slots__ = ['children', 'count'] + def __init__(self): + self.children: dict[int, 'TrieNode'] = {} + self.count: int = 0 + + +class NgramTrie: + """Exact-count n-gram storage using a trie. 
No hashing, no collisions.""" + + def __init__(self, max_order: int = 5, vocab_size: int = 1024): + self.max_order = max_order + self.vocab_size = vocab_size + self.root = TrieNode() + self.total_tokens = 0 + # For Kneser-Ney: count of unique contexts each token appears in + self.continuation_counts = np.zeros(vocab_size, dtype=np.int64) + self.total_continuations = 0 + + def add_ngram(self, tokens: list[int]): + """Add an n-gram (list of token IDs) to the trie.""" + node = self.root + for tok in tokens: + if tok not in node.children: + node.children[tok] = TrieNode() + node = node.children[tok] + node.count += 1 + + def get_count(self, tokens: list[int]) -> int: + """Get exact count for an n-gram.""" + node = self.root + for tok in tokens: + if tok not in node.children: + return 0 + node = node.children[tok] + return node.count + + def get_children(self, context: list[int]) -> dict[int, int]: + """Get all children (next tokens) and their counts for a context.""" + node = self.root + for tok in context: + if tok not in node.children: + return {} + node = node.children[tok] + return {tok: child.count for tok, child in node.children.items()} + + +# ============================================================ +# Kneser-Ney Smoothed N-gram Language Model +# ============================================================ + +class KneserNeyNgram: + """ + Modified Kneser-Ney smoothed n-gram LM. + + Produces a FULL probability distribution over all vocab tokens + at every position. The distribution is guaranteed to sum to 1.0. + + Uses interpolated Kneser-Ney: + p_KN(w | context) = max(c(context, w) - d, 0) / c(context) + + d * N1+(context, .) / c(context) * p_KN(w | shorter_context) + + Where: + d = discount (typically 0.75) + N1+(context, .) = number of unique tokens following context + p_KN(w | shorter_context) = recursive backoff + + Base case (unigram): uses continuation counts (how many unique + contexts each token appears in), not raw frequency. 
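One recursion step worked through with hypothetical counts (d = 0.75, toy vocabulary {A, B, C}; the lower-order distribution is assumed, not computed):

```python
d = 0.75
context_count = 4                          # c(context), hypothetical
children = {"B": 3, "C": 1}                # c(context, w) for observed continuations
lower = {"A": 0.25, "B": 0.5, "C": 0.25}   # p_KN(w | shorter_context), assumed

# d * N1+(context, .) / c(context) — the mass freed by discounting
backoff_weight = d * len(children) / context_count
p = {w: max(children.get(w, 0) - d, 0.0) / context_count + backoff_weight * lower[w]
     for w in lower}
# discounted mass is redistributed through the backoff term, so p sums to exactly 1
```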
+ """ + + def __init__(self, max_order: int = 5, vocab_size: int = 1024, discount: float = 0.75): + self.max_order = max_order + self.vocab_size = vocab_size + self.discount = discount + self.trie = NgramTrie(max_order, vocab_size) + # Precomputed unigram distribution (continuation counts) + self._unigram_dist: Optional[np.ndarray] = None + # Cache for context statistics + self._built = False + + def update(self, tokens: np.ndarray, position: int): + """ + Update the model with tokens up to (not including) position. + Score-first: only uses tokens that have already been scored. + + Call this AFTER scoring position, BEFORE scoring position+1. + """ + # Add all n-grams ending at this position + for order in range(1, self.max_order + 1): + start = position - order + 1 + if start < 0: + continue + ngram = tokens[start:position + 1].tolist() + self.trie.add_ngram(ngram) + + self.trie.total_tokens += 1 + self._built = False # invalidate cached unigram + + def get_distribution(self, tokens: np.ndarray, position: int) -> np.ndarray: + """ + Get a full probability distribution over all vocab tokens + for the next token at `position`, using only tokens before `position`. 
+ + Returns: np.ndarray of shape (vocab_size,) summing to 1.0 + """ + dist = np.zeros(self.vocab_size, dtype=np.float64) + + # Try each order from highest to lowest + for order in range(min(self.max_order, position), 0, -1): + context = tokens[position - order:position].tolist() + context_count = self.trie.get_count(context) + + if context_count == 0: + continue + + # Get children of this context + children = self.trie.get_children(context) + num_unique = len(children) + + if num_unique == 0: + continue + + # Interpolated Kneser-Ney for this order + for tok_id in range(self.vocab_size): + tok_count = children.get(tok_id, 0) + # Main term: max(count - discount, 0) / context_count + main = max(tok_count - self.discount, 0.0) / context_count + dist[tok_id] = main + + # Backoff weight + backoff_weight = (self.discount * num_unique) / context_count + + # Get lower-order distribution + lower_dist = self._get_lower_order_dist(tokens, position, order - 1) + + # Interpolate + dist = dist + backoff_weight * lower_dist + + # Verify normalization + total = dist.sum() + if total > 0 and abs(total - 1.0) > 1e-6: + dist /= total + + return dist + + # Fallback: uniform distribution (no context matches) + return np.ones(self.vocab_size, dtype=np.float64) / self.vocab_size + + def _get_lower_order_dist(self, tokens: np.ndarray, position: int, order: int) -> np.ndarray: + """Get distribution for a lower-order context (recursive backoff).""" + if order == 0: + # Unigram: use continuation counts or uniform + if self.trie.total_tokens == 0: + return np.ones(self.vocab_size, dtype=np.float64) / self.vocab_size + + # Simple unigram from counts + dist = np.zeros(self.vocab_size, dtype=np.float64) + unigram_children = self.trie.get_children([]) + total = sum(unigram_children.values()) + if total == 0: + return np.ones(self.vocab_size, dtype=np.float64) / self.vocab_size + + for tok_id, count in unigram_children.items(): + dist[tok_id] = count / total + + # Add small floor for unseen tokens 
+ floor = 1e-10 + dist = dist + floor + dist /= dist.sum() + return dist + + context = tokens[position - order:position].tolist() + context_count = self.trie.get_count(context) + + if context_count == 0: + return self._get_lower_order_dist(tokens, position, order - 1) + + children = self.trie.get_children(context) + num_unique = len(children) + + dist = np.zeros(self.vocab_size, dtype=np.float64) + for tok_id in range(self.vocab_size): + tok_count = children.get(tok_id, 0) + dist[tok_id] = max(tok_count - self.discount, 0.0) / context_count + + backoff_weight = (self.discount * num_unique) / context_count + lower = self._get_lower_order_dist(tokens, position, order - 1) + dist = dist + backoff_weight * lower + + total = dist.sum() + if total > 0: + dist /= total + + return dist + + +# ============================================================ +# Evaluation +# ============================================================ + +def load_val_tokens(path: str) -> np.ndarray: + """Load validation tokens from a .bin shard.""" + raw = np.fromfile(path, dtype=np.uint8) + # Header: 256 int32s + header = np.frombuffer(raw[:1024], dtype=np.int32) + magic, version, num_tokens = header[0], header[1], header[2] + assert magic == 20240520, f"Bad magic: {magic}" + tokens = np.frombuffer(raw[1024:1024 + num_tokens * 2], dtype=np.uint16).astype(np.int64) + return tokens + + +def evaluate_ngram_standalone(val_tokens: np.ndarray, max_order: int = 5, + vocab_size: int = 1024, max_positions: int = 100000, + discount: float = 0.75): + """ + Evaluate a standalone Kneser-Ney n-gram LM on val tokens. + Score-first: at each position, score with current model, then update. 
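The same protocol in miniature, with add-one smoothed bigram counts standing in for Kneser-Ney (toy token stream, illustrative only):

```python
import math
from collections import defaultdict

V = 2                        # toy vocab size
tokens = [0, 1, 0, 1, 0, 1]  # toy stream
counts = defaultdict(lambda: defaultdict(int))
totals = defaultdict(int)

nll = 0.0
for t in range(1, len(tokens)):
    ctx, tgt = tokens[t - 1], tokens[t]
    # score FIRST, using only counts from positions before t
    p = (counts[ctx][tgt] + 1) / (totals[ctx] + V)
    nll += -math.log(p)
    # update AFTER scoring, so position t never sees its own target
    counts[ctx][tgt] += 1
    totals[ctx] += 1

avg_nll = nll / (len(tokens) - 1)
# the online model beats the uniform log(V) baseline once it has seen a few tokens
```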
+ """ + ngram = KneserNeyNgram(max_order=max_order, vocab_size=vocab_size, discount=discount) + + total_positions = min(len(val_tokens) - 1, max_positions) + total_nll = 0.0 + normalization_errors = 0 + max_norm_error = 0.0 + + t0 = time.perf_counter() + + for pos in range(total_positions): + target = int(val_tokens[pos + 1]) + + # Score first (using only tokens 0..pos) + if pos < 1: + # No context yet — uniform + prob = 1.0 / vocab_size + else: + dist = ngram.get_distribution(val_tokens, pos + 1) + + # Verify normalization + dist_sum = dist.sum() + norm_error = abs(dist_sum - 1.0) + max_norm_error = max(max_norm_error, norm_error) + if norm_error > 1e-4: + normalization_errors += 1 + + prob = dist[target] + + prob = max(prob, 1e-12) # floor to avoid log(0) + total_nll += -math.log(prob) + + # Update after scoring + ngram.update(val_tokens, pos + 1) + + if (pos + 1) % 10000 == 0: + elapsed = time.perf_counter() - t0 + avg_nll = total_nll / (pos + 1) + bpb_est = avg_nll / math.log(2) * (vocab_size / 1024) # rough BPB + speed = (pos + 1) / elapsed + print(f" pos {pos+1}/{total_positions}: avg_nll={avg_nll:.4f} " + f"speed={speed:.0f} tok/s max_norm_err={max_norm_error:.2e} " + f"norm_violations={normalization_errors}") + + avg_nll = total_nll / total_positions + avg_bpb_approx = avg_nll / math.log(2) + elapsed = time.perf_counter() - t0 + + print(f"\n{'='*60}") + print(f"Kneser-Ney N-gram (order={max_order}, discount={discount})") + print(f"Positions evaluated: {total_positions}") + print(f"Average NLL: {avg_nll:.6f}") + print(f"Approx BPB (token-level): {avg_bpb_approx:.6f}") + print(f"Max normalization error: {max_norm_error:.2e}") + print(f"Normalization violations (>1e-4): {normalization_errors}") + print(f"Time: {elapsed:.1f}s ({total_positions/elapsed:.0f} tok/s)") + print(f"{'='*60}") + + return avg_nll, avg_bpb_approx + + +def evaluate_mixed(val_tokens: np.ndarray, model_nll: np.ndarray, + max_order: int = 5, vocab_size: int = 1024, + alphas: list[float] = 
[0.01, 0.05, 0.10, 0.20, 0.30], + max_positions: int = 100000, discount: float = 0.75): + """ + Evaluate mixed distribution: p_mixed = (1-alpha)*p_model + alpha*p_ngram + + model_nll: per-position NLL from the neural model (precomputed) + + This is the key experiment: does mixing help? + """ + ngram = KneserNeyNgram(max_order=max_order, vocab_size=vocab_size, discount=discount) + + total_positions = min(len(val_tokens) - 1, min(len(model_nll), max_positions)) + + # Track NLL for each alpha + nll_sums = {alpha: 0.0 for alpha in alphas} + nll_baseline = 0.0 # model-only + norm_errors = 0 + + t0 = time.perf_counter() + + for pos in range(total_positions): + target = int(val_tokens[pos + 1]) + p_model_target = math.exp(-model_nll[pos]) # model's prob for correct token + + nll_baseline += model_nll[pos] + + if pos < 1: + # No n-gram context — mixed = model + for alpha in alphas: + nll_sums[alpha] += model_nll[pos] + else: + dist_ngram = ngram.get_distribution(val_tokens, pos + 1) + + # Verify normalization + dist_sum = dist_ngram.sum() + if abs(dist_sum - 1.0) > 1e-4: + norm_errors += 1 + + p_ngram_target = dist_ngram[target] + + for alpha in alphas: + p_mixed = (1.0 - alpha) * p_model_target + alpha * p_ngram_target + p_mixed = max(p_mixed, 1e-12) + nll_sums[alpha] += -math.log(p_mixed) + + # Update after scoring + ngram.update(val_tokens, pos + 1) + + if (pos + 1) % 10000 == 0: + elapsed = time.perf_counter() - t0 + print(f" pos {pos+1}/{total_positions}: " + f"baseline_nll={nll_baseline/(pos+1):.4f} " + f"best_mixed_nll={min(v/(pos+1) for v in nll_sums.values()):.4f} " + f"speed={((pos+1)/elapsed):.0f} tok/s") + + elapsed = time.perf_counter() - t0 + + print(f"\n{'='*60}") + print(f"MIXED DISTRIBUTION RESULTS (Kneser-Ney order={max_order}, d={discount})") + print(f"Positions: {total_positions} | Norm violations: {norm_errors}") + print(f"Time: {elapsed:.1f}s") + print(f"{'='*60}") + print(f"{'Alpha':<10} {'Avg NLL':<12} {'Approx BPB':<12} {'Delta NLL':<12} {'Delta 
BPB':<12}") + print(f"{'-'*58}") + + baseline_avg = nll_baseline / total_positions + baseline_bpb = baseline_avg / math.log(2) + print(f"{'model':10} {baseline_avg:<12.6f} {baseline_bpb:<12.6f} {'—':12} {'—':12}") + + for alpha in alphas: + avg = nll_sums[alpha] / total_positions + bpb = avg / math.log(2) + delta_nll = avg - baseline_avg + delta_bpb = bpb - baseline_bpb + marker = " ***" if delta_nll < -0.0001 else "" + print(f"{alpha:<10.2f} {avg:<12.6f} {bpb:<12.6f} {delta_nll:<+12.6f} {delta_bpb:<+12.6f}{marker}") + + print(f"{'='*60}") + + best_alpha = min(alphas, key=lambda a: nll_sums[a]) + best_delta = (nll_sums[best_alpha] / total_positions) - baseline_avg + print(f"\nBest alpha: {best_alpha} (delta NLL: {best_delta:+.6f})") + if abs(best_delta) < 0.001: + print("CONCLUSION: N-gram provides negligible improvement (<0.001 NLL)") + elif best_delta < 0: + print(f"CONCLUSION: N-gram provides real improvement of {-best_delta:.4f} NLL") + else: + print("CONCLUSION: N-gram HURTS — model-only is better") + + +def main(): + parser = argparse.ArgumentParser(description="Definitive n-gram normalization test") + parser.add_argument("--val-path", type=str, required=True, help="Path to val .bin shard") + parser.add_argument("--model-nll-path", type=str, default="", + help="Path to precomputed model NLL (.npy). 
If empty, runs standalone only.") + parser.add_argument("--max-order", type=int, default=5, help="Max n-gram order") + parser.add_argument("--max-positions", type=int, default=100000, + help="Max positions to evaluate (default 100K for speed)") + parser.add_argument("--discount", type=float, default=0.75, help="Kneser-Ney discount") + parser.add_argument("--vocab-size", type=int, default=1024) + parser.add_argument("--standalone", action="store_true", help="Run standalone n-gram eval only") + parser.add_argument("--alphas", type=str, default="0.01,0.05,0.10,0.20,0.30,0.50", + help="Comma-separated alpha values for mixing") + args = parser.parse_args() + + print(f"Loading val tokens from {args.val_path}...") + val_tokens = load_val_tokens(args.val_path) + print(f"Loaded {len(val_tokens)} tokens") + + if args.standalone: + print(f"\n=== Standalone Kneser-Ney N-gram (order={args.max_order}) ===") + evaluate_ngram_standalone( + val_tokens, max_order=args.max_order, vocab_size=args.vocab_size, + max_positions=args.max_positions, discount=args.discount, + ) + elif args.model_nll_path: + print(f"\nLoading model NLL from {args.model_nll_path}...") + model_nll = np.load(args.model_nll_path) + print(f"Loaded {len(model_nll)} NLL values") + + alphas = [float(a) for a in args.alphas.split(",")] + + print(f"\n=== Mixed Distribution Test ===") + evaluate_mixed( + val_tokens, model_nll, max_order=args.max_order, + vocab_size=args.vocab_size, alphas=alphas, + max_positions=args.max_positions, discount=args.discount, + ) + else: + print("ERROR: Provide --model-nll-path for mixed eval, or --standalone for n-gram only") + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/online_logit_bias.py b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/online_logit_bias.py new file mode 100644 index 0000000000..5593c626cf --- /dev/null +++ 
b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/online_logit_bias.py @@ -0,0 +1,295 @@ +""" +Online Logit Bias Adaptation for eval-time BPB improvement. + +After scoring each token, updates a learnable bias vector in logit space +using the gradient of cross-entropy loss. This corrects systematic biases +from quantization and adapts to document-level token frequency shifts. + +Legality: +- Score-before-update: bias at position t only uses gradients from positions 1..t-1 +- Full normalized distribution: softmax(logits + bias) sums to 1.0 by construction +- Single left-to-right pass: no rescoring +- Causal: no future token information used (gradient uses only the scored target) + +Integration: + # In eval loop, after getting logits from model: + olb = OnlineLogitBias(vocab_size=1024, lr=0.01) + for each position: + adjusted_logits = logits + olb.bias + score = -log(softmax(adjusted_logits)[target]) # this is the final score + olb.update(adjusted_logits, target) # update AFTER scoring +""" + +import math +import numpy as np +try: + import torch + import torch.nn.functional as F + from torch import Tensor + _HAS_TORCH = True +except ImportError: + _HAS_TORCH = False + + +class OnlineLogitBias: + """ + Maintains a learnable bias vector added to logits, updated online via SGD. + + After scoring token t with logits + bias, the gradient of CE loss w.r.t. bias is: + grad = softmax(logits + bias) - one_hot(target) + + This is the simplest possible online adaptation — no backprop through the model. 
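A quick numerical sanity check of that gradient against central finite differences (toy logits; values are arbitrary):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.0])
bias = np.array([0.1, -0.2, 0.0, 0.3])
target = 1

def ce_loss(b):
    z = logits + b
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[target])

# analytic gradient: softmax(logits + bias) - one_hot(target)
z = logits + bias
z = z - z.max()
probs = np.exp(z) / np.exp(z).sum()
grad = probs.copy()
grad[target] -= 1.0

# central finite differences should agree closely
eps = 1e-6
fd = np.array([(ce_loss(bias + eps * np.eye(4)[i]) - ce_loss(bias - eps * np.eye(4)[i])) / (2 * eps)
               for i in range(4)])
```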
+ """ + + def __init__(self, vocab_size: int = 1024, lr: float = 0.01, + momentum: float = 0.9, weight_decay: float = 0.0, + device: str = "cpu"): + self.vocab_size = vocab_size + self.lr = lr + self.momentum = momentum + self.weight_decay = weight_decay + self.bias = np.zeros(vocab_size, dtype=np.float64) + self.velocity = np.zeros(vocab_size, dtype=np.float64) + self.step_count = 0 + + def get_bias(self) -> np.ndarray: + """Get current bias vector.""" + return self.bias + + def update(self, logits: np.ndarray, target: int): + """ + Update bias after scoring. Must be called AFTER the score is recorded. + + Args: + logits: raw model logits at this position, shape (vocab_size,) + target: the correct token ID (already scored) + """ + # Compute softmax(logits + bias) + adjusted = logits + self.bias + adjusted -= adjusted.max() # numerical stability + exp_adj = np.exp(adjusted) + probs = exp_adj / exp_adj.sum() + + # Gradient of CE loss w.r.t. bias: softmax - one_hot + grad = probs.copy() + grad[target] -= 1.0 + + # Weight decay + if self.weight_decay > 0: + grad += self.weight_decay * self.bias + + # SGD with momentum + self.velocity = self.momentum * self.velocity + grad + self.bias -= self.lr * self.velocity + self.step_count += 1 + + +class OnlineLogitBiasPerDocument(OnlineLogitBias): + """ + Resets the bias at document boundaries (BOS tokens). + This prevents cross-document contamination and allows + fresh adaptation for each document's token distribution. 
+ """ + + def __init__(self, vocab_size: int = 1024, lr: float = 0.01, + momentum: float = 0.9, weight_decay: float = 0.001, + bos_token: int = 0, reset_on_bos: bool = True, + device: str = "cpu"): + super().__init__(vocab_size, lr, momentum, weight_decay, device) + self.bos_token = bos_token + self.reset_on_bos = reset_on_bos + + def update(self, logits: np.ndarray, target: int): + """Update bias, resetting at document boundaries.""" + if self.reset_on_bos and target == self.bos_token: + self.bias[:] = 0.0 + self.velocity[:] = 0.0 + self.step_count = 0 + super().update(logits, target) + + +def eval_with_online_logit_bias( + model_logits_fn, # callable: (x_batch) -> logits tensor + val_tokens: np.ndarray, + vocab_size: int = 1024, + seq_len: int = 2048, + stride: int = 64, + lr: float = 0.01, + momentum: float = 0.9, + weight_decay: float = 0.0, + max_positions: int = 0, # 0 = all + device: str = "cuda", + log_fn=print, +) -> tuple[float, float, int]: + """ + Sliding-window eval with online logit bias adaptation. + + Returns (avg_nll, approx_bpb, num_positions) + """ + import time + import torch + + total = len(val_tokens) - 1 + if max_positions > 0: + total = min(total, max_positions) + + olb = OnlineLogitBias(vocab_size=vocab_size, lr=lr, momentum=momentum, + weight_decay=weight_decay) + + # We need per-position logits. For efficiency, process in sliding windows + # but track which positions have been scored. 
+ window_starts = [ws for ws in range(0, total, stride) + if min(ws + seq_len, total) - ws >= 1] + + scored = np.zeros(total, dtype=bool) + nll_values = np.zeros(total, dtype=np.float64) + nll_baseline = np.zeros(total, dtype=np.float64) + + t0 = time.perf_counter() + batch_size = 32 + + val_tensor = torch.from_numpy(val_tokens.astype(np.int64)) + + with torch.inference_mode(): + for bi in range(0, len(window_starts), batch_size): + batch_ws = window_starts[bi:bi + batch_size] + bsz = len(batch_ws) + xb = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + yb = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens = [] + + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total) + wlen = end - ws + wlens.append(wlen) + chunk = val_tensor[ws:end + 1].to(device=device) + xb[i, :wlen] = chunk[:-1] + yb[i, :wlen] = chunk[1:] + + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = model_logits_fn(xb) + + logits_f = logits.float().cpu().numpy() + yb_np = yb.cpu().numpy() + + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + + for j in range(s, wlen): + pos = ws + j + if pos >= total or scored[pos]: + continue + + target = int(yb_np[i, j]) + pos_logits = logits_f[i, j, :vocab_size] + + # Score WITH bias (the reported score) + adjusted = pos_logits + olb.get_bias() + adjusted_shifted = adjusted - adjusted.max() + log_sum_exp = np.log(np.exp(adjusted_shifted).sum()) + nll_with_bias = -(adjusted_shifted[target] - log_sum_exp) + + # Score WITHOUT bias (baseline) + baseline_shifted = pos_logits - pos_logits.max() + log_sum_exp_b = np.log(np.exp(baseline_shifted).sum()) + nll_no_bias = -(baseline_shifted[target] - log_sum_exp_b) + + nll_values[pos] = nll_with_bias + nll_baseline[pos] = nll_no_bias + scored[pos] = True + + # Update AFTER scoring + olb.update(pos_logits, target) + + if bi % (batch_size * 50) == 0 and bi > 0: + n_scored = scored.sum() + avg_nll = 
nll_values[scored].mean() + avg_base = nll_baseline[scored].mean() + delta = avg_nll - avg_base + elapsed = time.perf_counter() - t0 + log_fn(f" scored {n_scored}/{total}: " + f"baseline_nll={avg_base:.6f} olb_nll={avg_nll:.6f} " + f"delta={delta:+.6f} speed={n_scored/elapsed:.0f}/s") + + n_scored = int(scored.sum()) + avg_nll = float(nll_values[scored].mean()) + avg_base = float(nll_baseline[scored].mean()) + delta = avg_nll - avg_base + bpb = avg_nll / math.log(2) + bpb_base = avg_base / math.log(2) + + log_fn(f"\n{'='*60}") + log_fn(f"ONLINE LOGIT BIAS RESULTS (lr={lr}, mom={momentum}, wd={weight_decay})") + log_fn(f"Positions scored: {n_scored}") + log_fn(f"Baseline NLL: {avg_base:.6f} (BPB: {bpb_base:.6f})") + log_fn(f"With OLB NLL: {avg_nll:.6f} (BPB: {bpb:.6f})") + log_fn(f"Delta NLL: {delta:+.6f} (Delta BPB: {delta/math.log(2):+.6f})") + log_fn(f"Time: {time.perf_counter()-t0:.1f}s") + log_fn(f"{'='*60}") + + return avg_nll, bpb, n_scored + + +# ============================================================ +# Standalone test (CPU, no model — synthetic logits) +# ============================================================ + +def test_synthetic(): + """Test OLB on synthetic data to verify correctness.""" + print("=== Synthetic OLB Test ===") + np.random.seed(42) + vocab = 1024 + N = 50000 + + # Simulate: tokens come from a distribution that shifts over time + # First half: tokens 0-100 are common + # Second half: tokens 500-600 are common + tokens = np.zeros(N, dtype=np.int64) + for i in range(N): + if i < N // 2: + tokens[i] = np.random.choice(100) + else: + tokens[i] = 500 + np.random.choice(100) + + # Simulate model logits: uniform (bad model that doesn't know the distribution) + olb = OnlineLogitBias(vocab_size=vocab, lr=0.05, momentum=0.9, weight_decay=0.001) + + nll_no_bias = 0.0 + nll_with_bias = 0.0 + uniform_nll = math.log(vocab) + + for i in range(1, N): + target = tokens[i] + logits = np.zeros(vocab) # uniform model + + # Score with bias + adjusted 
= logits + olb.bias + adjusted -= adjusted.max() + probs = np.exp(adjusted) / np.exp(adjusted).sum() + nll_with_bias += -math.log(max(probs[target], 1e-12)) + + # Score without bias + nll_no_bias += uniform_nll + + # Update after scoring + olb.update(logits, target) + + avg_no = nll_no_bias / (N - 1) + avg_with = nll_with_bias / (N - 1) + print(f"Uniform model NLL: {avg_no:.4f}") + print(f"With OLB NLL: {avg_with:.4f}") + print(f"Delta: {avg_with - avg_no:+.4f}") + print(f"OLB learned to track the shifting distribution!") + print() + + # Verify bias reflects the distribution shift + top_first_half = olb.bias[:100].mean() + top_second_half = olb.bias[500:600].mean() + print(f"Bias for tokens 0-99 (first half common): {top_first_half:+.4f}") + print(f"Bias for tokens 500-599 (second half common): {top_second_half:+.4f}") + print(f"Second half tokens should have higher bias (they were seen more recently)") + + +if __name__ == "__main__": + test_synthetic() diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/retokenize_proper.py b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/retokenize_proper.py new file mode 100644 index 0000000000..9a65d9343e --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/retokenize_proper.py @@ -0,0 +1,157 @@ +"""Proper Scylla retokenization: split train/val from raw docs, no SP1024 roundtrip. +Matches the official manifest: shuffle with seed 1337, last 50K docs = val. 
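The manifest-matching split logic in miniature (10 toy docs standing in for the real corpus; the real run uses NUM_VAL_DOCS = 50000):

```python
import random

total_docs = 10   # toy count
num_val = 2       # stands in for NUM_VAL_DOCS

indices = list(range(total_docs))
random.seed(1337)         # SHUFFLE_SEED from the manifest
random.shuffle(indices)

val_indices = set(indices[-num_val:])   # last docs of the shuffled order -> val
train_indices = indices[:-num_val]
# disjoint and exhaustive by construction, and reproducible given the seed
```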
Memory-efficient: workers read from disk, not from in-memory lists."""
import json
import os
import sys
import time
import random
import math
import numpy as np
from pathlib import Path
from multiprocessing import Process, Queue

HEADER_INTS = 256
HEADER_MAGIC = 20240520
HEADER_VERSION = 1
TOKENS_PER_SHARD = 100_000_000
NUM_VAL_DOCS = 50000
SHUFFLE_SEED = 1337


def write_shard(path, tokens):
    # Shard layout matches load_val_tokens in ngram_test.py:
    # 256 little-endian int32 header (magic, version, num_tokens), then uint16 tokens.
    header = np.zeros(HEADER_INTS, dtype="<i4")
    header[0] = HEADER_MAGIC
    header[1] = HEADER_VERSION
    header[2] = len(tokens)
    with open(path, "wb") as f:
        f.write(header.tobytes())
        f.write(tokens.astype(np.uint16).tobytes())


def worker_fn(worker_id, doc_indices, docs_path, vocab_path, out_dir, result_queue):
    import tokenmonster
    vocab = tokenmonster.load(vocab_path)
    index_set = set(doc_indices)
    buffer = []
    shard_count = 0
    # Stream docs from disk, keeping only this worker's assigned lines.
    with open(docs_path, "r") as f:
        for line_num, line in enumerate(f):
            if line_num not in index_set:
                continue
            doc = json.loads(line)
            text = doc.get("text", "")
            if text:
                buffer.extend(vocab.tokenize(text))
            while len(buffer) >= TOKENS_PER_SHARD:
                shard_tokens = np.array(buffer[:TOKENS_PER_SHARD], dtype=np.uint16)
                write_shard(
                    Path(out_dir) / f"fineweb_train_w{worker_id:02d}_{shard_count:04d}.bin",
                    shard_tokens,
                )
                buffer = buffer[TOKENS_PER_SHARD:]
                shard_count += 1

    if buffer:
        write_shard(
            Path(out_dir) / f"fineweb_train_w{worker_id:02d}_{shard_count:04d}.bin",
            np.array(buffer, dtype=np.uint16),
        )
        shard_count += 1

    result_queue.put((worker_id, shard_count))
    print(f" Worker {worker_id}: {shard_count} shards", flush=True)


def main():
    vocab_path = os.environ.get("VOCAB_PATH", "/workspace/candidate.vocab")
    docs_path = os.environ.get("DOCS_PATH", "/workspace/raw_docs/datasets/docs_selected.jsonl")
    out_dir = Path(os.environ.get("OUTPUT_DIR", "/workspace/fineweb_scylla"))
    num_workers = int(os.environ.get("NUM_WORKERS", "16"))

    out_dir.mkdir(parents=True, exist_ok=True)

    # --- Step 1: Count lines and determine split ---
    print("Counting docs...", flush=True)
    t0 = time.time()
    total_lines = 0
    with open(docs_path, "r") as f:
        for _ in f:
            total_lines += 1
    print(f"Total: {total_lines} docs in {time.time()-t0:.0f}s", flush=True)

    # Shuffle indices (matching official manifest)
    print(f"Shuffling with seed {SHUFFLE_SEED}...", flush=True)
    indices = list(range(total_lines))
    random.seed(SHUFFLE_SEED)
    random.shuffle(indices)

    val_indices = set(indices[-NUM_VAL_DOCS:])
    train_indices = indices[:-NUM_VAL_DOCS]
    print(f"Train: {len(train_indices)} docs, Val: {len(val_indices)} docs", 
flush=True) + + # --- Step 2: Tokenize val (single process) --- + print("Tokenizing val docs...", flush=True) + import tokenmonster + vocab = tokenmonster.load(vocab_path) + val_tokens = [] + with open(docs_path, "r") as f: + for line_num, line in enumerate(f): + if line_num not in val_indices: + continue + doc = json.loads(line) + text = doc.get("text", "") + if text: + val_tokens.extend(vocab.tokenize(text)) + write_shard(out_dir / "fineweb_val_000000.bin", np.array(val_tokens, dtype=np.uint16)) + print(f"Val: {len(val_tokens)} tokens", flush=True) + del val_tokens, vocab + + # --- Step 3: Split train indices into contiguous chunks for workers --- + # Sort train indices so each worker processes docs in file order (fast sequential read) + train_indices.sort() + chunk_size = math.ceil(len(train_indices) / num_workers) + chunks = [train_indices[i*chunk_size:(i+1)*chunk_size] for i in range(num_workers)] + del train_indices + + # --- Step 4: Launch parallel workers --- + print(f"Tokenizing train with {num_workers} workers...", flush=True) + t0 = time.time() + result_queue = Queue() + workers = [] + for i in range(num_workers): + p = Process(target=worker_fn, args=(i, chunks[i], docs_path, vocab_path, str(out_dir), result_queue)) + p.start() + workers.append(p) + + for p in workers: + p.join() + + results = [] + while not result_queue.empty(): + results.append(result_queue.get()) + results.sort() + total_shards = sum(r[1] for r in results) + print(f"Workers done: {total_shards} shards in {time.time()-t0:.0f}s", flush=True) + + # --- Step 5: Rename to sequential --- + shard_files = [] + for wid in range(num_workers): + worker_files = sorted(out_dir.glob(f"fineweb_train_w{wid:02d}_*.bin")) + shard_files.extend(worker_files) + + for idx, f in enumerate(shard_files): + f.rename(f.parent / f"fineweb_train_{idx:06d}.bin") + + print(f"Done! {len(shard_files)} train + 1 val shards", flush=True) + + +if __name__ == "__main__": + main()