diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/README.md b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/README.md new file mode 100644 index 0000000000..c2242d93b7 --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/README.md @@ -0,0 +1,65 @@ +## What this is + +A collection of things that don't work on well-trained GPTQ'd models. +My coding agents and I ran ~30 experiments over the past two weeks trying +to push below the 1.11 BPP frontier. Most of them failed. This documents +the failures so others don't repeat them. + +Updates and extends my earlier negative results PR (#1186). + +## Eval-time techniques that don't work on strong models + +These all work on weak/undertrained models but provide zero benefit once +your base model is well-trained with Full Hessian GPTQ + sliding window +eval: + +| Technique | BPP delta | Why it fails | +|-----------|:---------:|:-------------| +| Properly normalized n-gram (Kneser-Ney, exact trie) | +0.001 to -0.003 | Model is 100x better than n-gram at predicting the correct token. Mixing at any alpha dilutes model confidence. Confirms PR #511 (-0.001) and PR #1145 (-0.003). | +| Online Logit Bias (per-token SGD on logit bias vector) | +0.003 (hurts) | GPTQ'd model is already well-calibrated. No systematic bias to correct. Also takes 1229s (way over eval budget). | +| Prime MLP Adapters (zero-init rank-64, PR #1222 approach) | -0.00009 | PR #1222 got -0.073 but on a 1.50 BPP baseline. Our 1.11 baseline leaves no room — sliding window context already provides everything adapters would learn. | +| Complementary Training (down-weight n-gram-predictable tokens during training) | -0.0004 (noise) | Doesn't change model behavior enough. By the time the model converges, it already knows everything the bigram knows. | +| Score-first chunked TTT (PR #549 approach) | -0.003 | Works but the gain is tiny on GPTQ'd models. 
PR #1184 also found TTT "neutral" on their stack. |

## The n-gram normalization proof

I built the best possible legal n-gram cache: Kneser-Ney smoothing with
exact trie counts (zero hashing, zero collisions), order 7, full
normalized distribution over all 1024 tokens at every position.

Results on 500K positions:
- Max normalization error: **1.78e-15** (distributions are perfect)
- Zero normalization violations across all positions
- N-gram avg NLL: **5.40** vs model avg NLL: **0.79** (n-gram is 6.8x worse)
- Mixing at ANY alpha hurts on average

The entire 0.09-0.97 BPP improvement from hashed n-gram caches was a
measurement artifact of unnormalized distributions. The real signal from
properly normalized n-grams is 0.001-0.003 BPP, which is not worth the complexity.

## SLOT violates causal dependence

Detailed in my PR #1240. 100% violation rate across 240 tested pairs.
Self-prediction advantage: +0.24 nats (shared delta), +0.73 nats
(per-sample). Every SLOT-based result on the leaderboard is suspect.

## Scylla tokenizer doesn't help (with correct accounting)

Covered in my other PR. With corrected byte accounting, Scylla gets 1.1289
BPP, slightly worse than SP1024's 1.1157. The entire sub-1.0 claim was a byte
accounting bug in `candidate.meta.npz`.

## What actually matters

After all these experiments, model quality is dominated by:
1. **Training data volume** (194+ shards > 80 shards)
2. **Full Hessian GPTQ** (Cholesky + actorder, ~0.005 BPP over naive int6)
3. **Coprime-stride data loader** (batch diversity)
4. 
**XSA on all layers** (small but consistent gain with coprime loader) + +## Files included + +- `ngram_test.py` — Kneser-Ney trie with full normalization proof +- `online_logit_bias.py` — Online logit bias implementation + synthetic test +- `correct_meta.npz` — Corrected Scylla byte accounting +- `retokenize_proper.py` — Proper retokenization with official train/val split diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/correct_meta.npz b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/correct_meta.npz new file mode 100644 index 0000000000..567ed9cf8b Binary files /dev/null and b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/correct_meta.npz differ diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/ngram_test.py b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/ngram_test.py new file mode 100644 index 0000000000..6af6756fe2 --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/ngram_test.py @@ -0,0 +1,452 @@ +""" +Definitive test: Do properly normalized n-gram caches help? + +This script builds a trie-based n-gram language model with Kneser-Ney smoothing +over the validation set, produces FULL normalized distributions over all vocab tokens +at every position, and measures the real BPB improvement when mixed with a neural LM. 
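The failure mode being tested can be sketched in a few lines (hypothetical probabilities, not measured values — just the interpolation formula above applied once):

```python
import math

alpha = 0.1
p_model = 0.70   # model's prob of the correct token (hypothetical)
p_ngram = 0.02   # n-gram's prob of the same token (hypothetical)

# interpolated mixing: p_mixed = (1-alpha) * p_model + alpha * p_ngram
p_mixed = (1.0 - alpha) * p_model + alpha * p_ngram
delta_nll = -math.log(p_mixed) + math.log(p_model)
# delta_nll > 0: mixing with a much weaker predictor dilutes model confidence
```

When the model assigns far more mass to the correct token than the n-gram does, every alpha > 0 raises NLL on that position.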
+ +Key properties: +- Exact counts via trie (NO hashing — zero collisions) +- Full distribution over all 1024 tokens at every position (sum = 1.0, verified) +- Score-first: n-gram scores use only tokens BEFORE the current position +- Kneser-Ney smoothing (the gold standard for statistical n-gram LMs) +- Interpolated mixing: p_mixed = (1-alpha) * p_model + alpha * p_ngram + +Usage: + python3 ngram_test.py --model-path final_model.int6.ptz --val-path fineweb_val_000000.bin + +Or standalone (no neural model, just n-gram vs uniform baseline): + python3 ngram_test.py --standalone --val-path fineweb_val_000000.bin +""" + +import argparse +import math +import os +import sys +import time +from collections import defaultdict +from typing import Optional + +import numpy as np + + +# ============================================================ +# Trie-based N-gram Count Store (exact counts, no hashing) +# ============================================================ + +class TrieNode: + __slots__ = ['children', 'count'] + def __init__(self): + self.children: dict[int, 'TrieNode'] = {} + self.count: int = 0 + + +class NgramTrie: + """Exact-count n-gram storage using a trie. 
No hashing, no collisions.""" + + def __init__(self, max_order: int = 5, vocab_size: int = 1024): + self.max_order = max_order + self.vocab_size = vocab_size + self.root = TrieNode() + self.total_tokens = 0 + # For Kneser-Ney: count of unique contexts each token appears in + self.continuation_counts = np.zeros(vocab_size, dtype=np.int64) + self.total_continuations = 0 + + def add_ngram(self, tokens: list[int]): + """Add an n-gram (list of token IDs) to the trie.""" + node = self.root + for tok in tokens: + if tok not in node.children: + node.children[tok] = TrieNode() + node = node.children[tok] + node.count += 1 + + def get_count(self, tokens: list[int]) -> int: + """Get exact count for an n-gram.""" + node = self.root + for tok in tokens: + if tok not in node.children: + return 0 + node = node.children[tok] + return node.count + + def get_children(self, context: list[int]) -> dict[int, int]: + """Get all children (next tokens) and their counts for a context.""" + node = self.root + for tok in context: + if tok not in node.children: + return {} + node = node.children[tok] + return {tok: child.count for tok, child in node.children.items()} + + +# ============================================================ +# Kneser-Ney Smoothed N-gram Language Model +# ============================================================ + +class KneserNeyNgram: + """ + Modified Kneser-Ney smoothed n-gram LM. + + Produces a FULL probability distribution over all vocab tokens + at every position. The distribution is guaranteed to sum to 1.0. + + Uses interpolated Kneser-Ney: + p_KN(w | context) = max(c(context, w) - d, 0) / c(context) + + d * N1+(context, .) / c(context) * p_KN(w | shorter_context) + + Where: + d = discount (typically 0.75) + N1+(context, .) = number of unique tokens following context + p_KN(w | shorter_context) = recursive backoff + + Base case (unigram): uses continuation counts (how many unique + contexts each token appears in), not raw frequency. 
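One recursion step worked through with hypothetical counts (d = 0.75, toy vocabulary {A, B, C}; the lower-order distribution is assumed, not computed):

```python
d = 0.75
context_count = 4                          # c(context), hypothetical
children = {"B": 3, "C": 1}                # c(context, w) for observed continuations
lower = {"A": 0.25, "B": 0.5, "C": 0.25}   # p_KN(w | shorter_context), assumed

# d * N1+(context, .) / c(context) — the mass freed by discounting
backoff_weight = d * len(children) / context_count
p = {w: max(children.get(w, 0) - d, 0.0) / context_count + backoff_weight * lower[w]
     for w in lower}
# discounted mass is redistributed through the backoff term, so p sums to exactly 1
```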
+ """ + + def __init__(self, max_order: int = 5, vocab_size: int = 1024, discount: float = 0.75): + self.max_order = max_order + self.vocab_size = vocab_size + self.discount = discount + self.trie = NgramTrie(max_order, vocab_size) + # Precomputed unigram distribution (continuation counts) + self._unigram_dist: Optional[np.ndarray] = None + # Cache for context statistics + self._built = False + + def update(self, tokens: np.ndarray, position: int): + """ + Update the model with tokens up to (not including) position. + Score-first: only uses tokens that have already been scored. + + Call this AFTER scoring position, BEFORE scoring position+1. + """ + # Add all n-grams ending at this position + for order in range(1, self.max_order + 1): + start = position - order + 1 + if start < 0: + continue + ngram = tokens[start:position + 1].tolist() + self.trie.add_ngram(ngram) + + self.trie.total_tokens += 1 + self._built = False # invalidate cached unigram + + def get_distribution(self, tokens: np.ndarray, position: int) -> np.ndarray: + """ + Get a full probability distribution over all vocab tokens + for the next token at `position`, using only tokens before `position`. 
+ + Returns: np.ndarray of shape (vocab_size,) summing to 1.0 + """ + dist = np.zeros(self.vocab_size, dtype=np.float64) + + # Try each order from highest to lowest + for order in range(min(self.max_order, position), 0, -1): + context = tokens[position - order:position].tolist() + context_count = self.trie.get_count(context) + + if context_count == 0: + continue + + # Get children of this context + children = self.trie.get_children(context) + num_unique = len(children) + + if num_unique == 0: + continue + + # Interpolated Kneser-Ney for this order + for tok_id in range(self.vocab_size): + tok_count = children.get(tok_id, 0) + # Main term: max(count - discount, 0) / context_count + main = max(tok_count - self.discount, 0.0) / context_count + dist[tok_id] = main + + # Backoff weight + backoff_weight = (self.discount * num_unique) / context_count + + # Get lower-order distribution + lower_dist = self._get_lower_order_dist(tokens, position, order - 1) + + # Interpolate + dist = dist + backoff_weight * lower_dist + + # Verify normalization + total = dist.sum() + if total > 0 and abs(total - 1.0) > 1e-6: + dist /= total + + return dist + + # Fallback: uniform distribution (no context matches) + return np.ones(self.vocab_size, dtype=np.float64) / self.vocab_size + + def _get_lower_order_dist(self, tokens: np.ndarray, position: int, order: int) -> np.ndarray: + """Get distribution for a lower-order context (recursive backoff).""" + if order == 0: + # Unigram: use continuation counts or uniform + if self.trie.total_tokens == 0: + return np.ones(self.vocab_size, dtype=np.float64) / self.vocab_size + + # Simple unigram from counts + dist = np.zeros(self.vocab_size, dtype=np.float64) + unigram_children = self.trie.get_children([]) + total = sum(unigram_children.values()) + if total == 0: + return np.ones(self.vocab_size, dtype=np.float64) / self.vocab_size + + for tok_id, count in unigram_children.items(): + dist[tok_id] = count / total + + # Add small floor for unseen tokens 
+ floor = 1e-10 + dist = dist + floor + dist /= dist.sum() + return dist + + context = tokens[position - order:position].tolist() + context_count = self.trie.get_count(context) + + if context_count == 0: + return self._get_lower_order_dist(tokens, position, order - 1) + + children = self.trie.get_children(context) + num_unique = len(children) + + dist = np.zeros(self.vocab_size, dtype=np.float64) + for tok_id in range(self.vocab_size): + tok_count = children.get(tok_id, 0) + dist[tok_id] = max(tok_count - self.discount, 0.0) / context_count + + backoff_weight = (self.discount * num_unique) / context_count + lower = self._get_lower_order_dist(tokens, position, order - 1) + dist = dist + backoff_weight * lower + + total = dist.sum() + if total > 0: + dist /= total + + return dist + + +# ============================================================ +# Evaluation +# ============================================================ + +def load_val_tokens(path: str) -> np.ndarray: + """Load validation tokens from a .bin shard.""" + raw = np.fromfile(path, dtype=np.uint8) + # Header: 256 int32s + header = np.frombuffer(raw[:1024], dtype=np.int32) + magic, version, num_tokens = header[0], header[1], header[2] + assert magic == 20240520, f"Bad magic: {magic}" + tokens = np.frombuffer(raw[1024:1024 + num_tokens * 2], dtype=np.uint16).astype(np.int64) + return tokens + + +def evaluate_ngram_standalone(val_tokens: np.ndarray, max_order: int = 5, + vocab_size: int = 1024, max_positions: int = 100000, + discount: float = 0.75): + """ + Evaluate a standalone Kneser-Ney n-gram LM on val tokens. + Score-first: at each position, score with current model, then update. 
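The same protocol in miniature, with add-one smoothed bigram counts standing in for Kneser-Ney (toy token stream, illustrative only):

```python
import math
from collections import defaultdict

V = 2                        # toy vocab size
tokens = [0, 1, 0, 1, 0, 1]  # toy stream
counts = defaultdict(lambda: defaultdict(int))
totals = defaultdict(int)

nll = 0.0
for t in range(1, len(tokens)):
    ctx, tgt = tokens[t - 1], tokens[t]
    # score FIRST, using only counts from positions before t
    p = (counts[ctx][tgt] + 1) / (totals[ctx] + V)
    nll += -math.log(p)
    # update AFTER scoring, so position t never sees its own target
    counts[ctx][tgt] += 1
    totals[ctx] += 1

avg_nll = nll / (len(tokens) - 1)
# the online model beats the uniform log(V) baseline once it has seen a few tokens
```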
+ """ + ngram = KneserNeyNgram(max_order=max_order, vocab_size=vocab_size, discount=discount) + + total_positions = min(len(val_tokens) - 1, max_positions) + total_nll = 0.0 + normalization_errors = 0 + max_norm_error = 0.0 + + t0 = time.perf_counter() + + for pos in range(total_positions): + target = int(val_tokens[pos + 1]) + + # Score first (using only tokens 0..pos) + if pos < 1: + # No context yet — uniform + prob = 1.0 / vocab_size + else: + dist = ngram.get_distribution(val_tokens, pos + 1) + + # Verify normalization + dist_sum = dist.sum() + norm_error = abs(dist_sum - 1.0) + max_norm_error = max(max_norm_error, norm_error) + if norm_error > 1e-4: + normalization_errors += 1 + + prob = dist[target] + + prob = max(prob, 1e-12) # floor to avoid log(0) + total_nll += -math.log(prob) + + # Update after scoring + ngram.update(val_tokens, pos + 1) + + if (pos + 1) % 10000 == 0: + elapsed = time.perf_counter() - t0 + avg_nll = total_nll / (pos + 1) + bpb_est = avg_nll / math.log(2) * (vocab_size / 1024) # rough BPB + speed = (pos + 1) / elapsed + print(f" pos {pos+1}/{total_positions}: avg_nll={avg_nll:.4f} " + f"speed={speed:.0f} tok/s max_norm_err={max_norm_error:.2e} " + f"norm_violations={normalization_errors}") + + avg_nll = total_nll / total_positions + avg_bpb_approx = avg_nll / math.log(2) + elapsed = time.perf_counter() - t0 + + print(f"\n{'='*60}") + print(f"Kneser-Ney N-gram (order={max_order}, discount={discount})") + print(f"Positions evaluated: {total_positions}") + print(f"Average NLL: {avg_nll:.6f}") + print(f"Approx BPB (token-level): {avg_bpb_approx:.6f}") + print(f"Max normalization error: {max_norm_error:.2e}") + print(f"Normalization violations (>1e-4): {normalization_errors}") + print(f"Time: {elapsed:.1f}s ({total_positions/elapsed:.0f} tok/s)") + print(f"{'='*60}") + + return avg_nll, avg_bpb_approx + + +def evaluate_mixed(val_tokens: np.ndarray, model_nll: np.ndarray, + max_order: int = 5, vocab_size: int = 1024, + alphas: list[float] = 
[0.01, 0.05, 0.10, 0.20, 0.30], + max_positions: int = 100000, discount: float = 0.75): + """ + Evaluate mixed distribution: p_mixed = (1-alpha)*p_model + alpha*p_ngram + + model_nll: per-position NLL from the neural model (precomputed) + + This is the key experiment: does mixing help? + """ + ngram = KneserNeyNgram(max_order=max_order, vocab_size=vocab_size, discount=discount) + + total_positions = min(len(val_tokens) - 1, min(len(model_nll), max_positions)) + + # Track NLL for each alpha + nll_sums = {alpha: 0.0 for alpha in alphas} + nll_baseline = 0.0 # model-only + norm_errors = 0 + + t0 = time.perf_counter() + + for pos in range(total_positions): + target = int(val_tokens[pos + 1]) + p_model_target = math.exp(-model_nll[pos]) # model's prob for correct token + + nll_baseline += model_nll[pos] + + if pos < 1: + # No n-gram context — mixed = model + for alpha in alphas: + nll_sums[alpha] += model_nll[pos] + else: + dist_ngram = ngram.get_distribution(val_tokens, pos + 1) + + # Verify normalization + dist_sum = dist_ngram.sum() + if abs(dist_sum - 1.0) > 1e-4: + norm_errors += 1 + + p_ngram_target = dist_ngram[target] + + for alpha in alphas: + p_mixed = (1.0 - alpha) * p_model_target + alpha * p_ngram_target + p_mixed = max(p_mixed, 1e-12) + nll_sums[alpha] += -math.log(p_mixed) + + # Update after scoring + ngram.update(val_tokens, pos + 1) + + if (pos + 1) % 10000 == 0: + elapsed = time.perf_counter() - t0 + print(f" pos {pos+1}/{total_positions}: " + f"baseline_nll={nll_baseline/(pos+1):.4f} " + f"best_mixed_nll={min(v/(pos+1) for v in nll_sums.values()):.4f} " + f"speed={((pos+1)/elapsed):.0f} tok/s") + + elapsed = time.perf_counter() - t0 + + print(f"\n{'='*60}") + print(f"MIXED DISTRIBUTION RESULTS (Kneser-Ney order={max_order}, d={discount})") + print(f"Positions: {total_positions} | Norm violations: {norm_errors}") + print(f"Time: {elapsed:.1f}s") + print(f"{'='*60}") + print(f"{'Alpha':<10} {'Avg NLL':<12} {'Approx BPB':<12} {'Delta NLL':<12} {'Delta 
BPB':<12}") + print(f"{'-'*58}") + + baseline_avg = nll_baseline / total_positions + baseline_bpb = baseline_avg / math.log(2) + print(f"{'model':10} {baseline_avg:<12.6f} {baseline_bpb:<12.6f} {'—':12} {'—':12}") + + for alpha in alphas: + avg = nll_sums[alpha] / total_positions + bpb = avg / math.log(2) + delta_nll = avg - baseline_avg + delta_bpb = bpb - baseline_bpb + marker = " ***" if delta_nll < -0.0001 else "" + print(f"{alpha:<10.2f} {avg:<12.6f} {bpb:<12.6f} {delta_nll:<+12.6f} {delta_bpb:<+12.6f}{marker}") + + print(f"{'='*60}") + + best_alpha = min(alphas, key=lambda a: nll_sums[a]) + best_delta = (nll_sums[best_alpha] / total_positions) - baseline_avg + print(f"\nBest alpha: {best_alpha} (delta NLL: {best_delta:+.6f})") + if abs(best_delta) < 0.001: + print("CONCLUSION: N-gram provides negligible improvement (<0.001 NLL)") + elif best_delta < 0: + print(f"CONCLUSION: N-gram provides real improvement of {-best_delta:.4f} NLL") + else: + print("CONCLUSION: N-gram HURTS — model-only is better") + + +def main(): + parser = argparse.ArgumentParser(description="Definitive n-gram normalization test") + parser.add_argument("--val-path", type=str, required=True, help="Path to val .bin shard") + parser.add_argument("--model-nll-path", type=str, default="", + help="Path to precomputed model NLL (.npy). 
If empty, runs standalone only.") + parser.add_argument("--max-order", type=int, default=5, help="Max n-gram order") + parser.add_argument("--max-positions", type=int, default=100000, + help="Max positions to evaluate (default 100K for speed)") + parser.add_argument("--discount", type=float, default=0.75, help="Kneser-Ney discount") + parser.add_argument("--vocab-size", type=int, default=1024) + parser.add_argument("--standalone", action="store_true", help="Run standalone n-gram eval only") + parser.add_argument("--alphas", type=str, default="0.01,0.05,0.10,0.20,0.30,0.50", + help="Comma-separated alpha values for mixing") + args = parser.parse_args() + + print(f"Loading val tokens from {args.val_path}...") + val_tokens = load_val_tokens(args.val_path) + print(f"Loaded {len(val_tokens)} tokens") + + if args.standalone: + print(f"\n=== Standalone Kneser-Ney N-gram (order={args.max_order}) ===") + evaluate_ngram_standalone( + val_tokens, max_order=args.max_order, vocab_size=args.vocab_size, + max_positions=args.max_positions, discount=args.discount, + ) + elif args.model_nll_path: + print(f"\nLoading model NLL from {args.model_nll_path}...") + model_nll = np.load(args.model_nll_path) + print(f"Loaded {len(model_nll)} NLL values") + + alphas = [float(a) for a in args.alphas.split(",")] + + print(f"\n=== Mixed Distribution Test ===") + evaluate_mixed( + val_tokens, model_nll, max_order=args.max_order, + vocab_size=args.vocab_size, alphas=alphas, + max_positions=args.max_positions, discount=args.discount, + ) + else: + print("ERROR: Provide --model-nll-path for mixed eval, or --standalone for n-gram only") + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/online_logit_bias.py b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/online_logit_bias.py new file mode 100644 index 0000000000..5593c626cf --- /dev/null +++ 
b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/online_logit_bias.py @@ -0,0 +1,295 @@ +""" +Online Logit Bias Adaptation for eval-time BPB improvement. + +After scoring each token, updates a learnable bias vector in logit space +using the gradient of cross-entropy loss. This corrects systematic biases +from quantization and adapts to document-level token frequency shifts. + +Legality: +- Score-before-update: bias at position t only uses gradients from positions 1..t-1 +- Full normalized distribution: softmax(logits + bias) sums to 1.0 by construction +- Single left-to-right pass: no rescoring +- Causal: no future token information used (gradient uses only the scored target) + +Integration: + # In eval loop, after getting logits from model: + olb = OnlineLogitBias(vocab_size=1024, lr=0.01) + for each position: + adjusted_logits = logits + olb.bias + score = -log(softmax(adjusted_logits)[target]) # this is the final score + olb.update(adjusted_logits, target) # update AFTER scoring +""" + +import math +import numpy as np +try: + import torch + import torch.nn.functional as F + from torch import Tensor + _HAS_TORCH = True +except ImportError: + _HAS_TORCH = False + + +class OnlineLogitBias: + """ + Maintains a learnable bias vector added to logits, updated online via SGD. + + After scoring token t with logits + bias, the gradient of CE loss w.r.t. bias is: + grad = softmax(logits + bias) - one_hot(target) + + This is the simplest possible online adaptation — no backprop through the model. 
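A quick numerical sanity check of that gradient against central finite differences (toy logits; values are arbitrary):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.0])
bias = np.array([0.1, -0.2, 0.0, 0.3])
target = 1

def ce_loss(b):
    z = logits + b
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[target])

# analytic gradient: softmax(logits + bias) - one_hot(target)
z = logits + bias
z = z - z.max()
probs = np.exp(z) / np.exp(z).sum()
grad = probs.copy()
grad[target] -= 1.0

# central finite differences should agree closely
eps = 1e-6
fd = np.array([(ce_loss(bias + eps * np.eye(4)[i]) - ce_loss(bias - eps * np.eye(4)[i])) / (2 * eps)
               for i in range(4)])
```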
+ """ + + def __init__(self, vocab_size: int = 1024, lr: float = 0.01, + momentum: float = 0.9, weight_decay: float = 0.0, + device: str = "cpu"): + self.vocab_size = vocab_size + self.lr = lr + self.momentum = momentum + self.weight_decay = weight_decay + self.bias = np.zeros(vocab_size, dtype=np.float64) + self.velocity = np.zeros(vocab_size, dtype=np.float64) + self.step_count = 0 + + def get_bias(self) -> np.ndarray: + """Get current bias vector.""" + return self.bias + + def update(self, logits: np.ndarray, target: int): + """ + Update bias after scoring. Must be called AFTER the score is recorded. + + Args: + logits: raw model logits at this position, shape (vocab_size,) + target: the correct token ID (already scored) + """ + # Compute softmax(logits + bias) + adjusted = logits + self.bias + adjusted -= adjusted.max() # numerical stability + exp_adj = np.exp(adjusted) + probs = exp_adj / exp_adj.sum() + + # Gradient of CE loss w.r.t. bias: softmax - one_hot + grad = probs.copy() + grad[target] -= 1.0 + + # Weight decay + if self.weight_decay > 0: + grad += self.weight_decay * self.bias + + # SGD with momentum + self.velocity = self.momentum * self.velocity + grad + self.bias -= self.lr * self.velocity + self.step_count += 1 + + +class OnlineLogitBiasPerDocument(OnlineLogitBias): + """ + Resets the bias at document boundaries (BOS tokens). + This prevents cross-document contamination and allows + fresh adaptation for each document's token distribution. 
+ """ + + def __init__(self, vocab_size: int = 1024, lr: float = 0.01, + momentum: float = 0.9, weight_decay: float = 0.001, + bos_token: int = 0, reset_on_bos: bool = True, + device: str = "cpu"): + super().__init__(vocab_size, lr, momentum, weight_decay, device) + self.bos_token = bos_token + self.reset_on_bos = reset_on_bos + + def update(self, logits: np.ndarray, target: int): + """Update bias, resetting at document boundaries.""" + if self.reset_on_bos and target == self.bos_token: + self.bias[:] = 0.0 + self.velocity[:] = 0.0 + self.step_count = 0 + super().update(logits, target) + + +def eval_with_online_logit_bias( + model_logits_fn, # callable: (x_batch) -> logits tensor + val_tokens: np.ndarray, + vocab_size: int = 1024, + seq_len: int = 2048, + stride: int = 64, + lr: float = 0.01, + momentum: float = 0.9, + weight_decay: float = 0.0, + max_positions: int = 0, # 0 = all + device: str = "cuda", + log_fn=print, +) -> tuple[float, float, int]: + """ + Sliding-window eval with online logit bias adaptation. + + Returns (avg_nll, approx_bpb, num_positions) + """ + import time + import torch + + total = len(val_tokens) - 1 + if max_positions > 0: + total = min(total, max_positions) + + olb = OnlineLogitBias(vocab_size=vocab_size, lr=lr, momentum=momentum, + weight_decay=weight_decay) + + # We need per-position logits. For efficiency, process in sliding windows + # but track which positions have been scored. 
+ window_starts = [ws for ws in range(0, total, stride) + if min(ws + seq_len, total) - ws >= 1] + + scored = np.zeros(total, dtype=bool) + nll_values = np.zeros(total, dtype=np.float64) + nll_baseline = np.zeros(total, dtype=np.float64) + + t0 = time.perf_counter() + batch_size = 32 + + val_tensor = torch.from_numpy(val_tokens.astype(np.int64)) + + with torch.inference_mode(): + for bi in range(0, len(window_starts), batch_size): + batch_ws = window_starts[bi:bi + batch_size] + bsz = len(batch_ws) + xb = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + yb = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens = [] + + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total) + wlen = end - ws + wlens.append(wlen) + chunk = val_tensor[ws:end + 1].to(device=device) + xb[i, :wlen] = chunk[:-1] + yb[i, :wlen] = chunk[1:] + + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = model_logits_fn(xb) + + logits_f = logits.float().cpu().numpy() + yb_np = yb.cpu().numpy() + + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + + for j in range(s, wlen): + pos = ws + j + if pos >= total or scored[pos]: + continue + + target = int(yb_np[i, j]) + pos_logits = logits_f[i, j, :vocab_size] + + # Score WITH bias (the reported score) + adjusted = pos_logits + olb.get_bias() + adjusted_shifted = adjusted - adjusted.max() + log_sum_exp = np.log(np.exp(adjusted_shifted).sum()) + nll_with_bias = -(adjusted_shifted[target] - log_sum_exp) + + # Score WITHOUT bias (baseline) + baseline_shifted = pos_logits - pos_logits.max() + log_sum_exp_b = np.log(np.exp(baseline_shifted).sum()) + nll_no_bias = -(baseline_shifted[target] - log_sum_exp_b) + + nll_values[pos] = nll_with_bias + nll_baseline[pos] = nll_no_bias + scored[pos] = True + + # Update AFTER scoring + olb.update(pos_logits, target) + + if bi % (batch_size * 50) == 0 and bi > 0: + n_scored = scored.sum() + avg_nll = 
nll_values[scored].mean() + avg_base = nll_baseline[scored].mean() + delta = avg_nll - avg_base + elapsed = time.perf_counter() - t0 + log_fn(f" scored {n_scored}/{total}: " + f"baseline_nll={avg_base:.6f} olb_nll={avg_nll:.6f} " + f"delta={delta:+.6f} speed={n_scored/elapsed:.0f}/s") + + n_scored = int(scored.sum()) + avg_nll = float(nll_values[scored].mean()) + avg_base = float(nll_baseline[scored].mean()) + delta = avg_nll - avg_base + bpb = avg_nll / math.log(2) + bpb_base = avg_base / math.log(2) + + log_fn(f"\n{'='*60}") + log_fn(f"ONLINE LOGIT BIAS RESULTS (lr={lr}, mom={momentum}, wd={weight_decay})") + log_fn(f"Positions scored: {n_scored}") + log_fn(f"Baseline NLL: {avg_base:.6f} (BPB: {bpb_base:.6f})") + log_fn(f"With OLB NLL: {avg_nll:.6f} (BPB: {bpb:.6f})") + log_fn(f"Delta NLL: {delta:+.6f} (Delta BPB: {delta/math.log(2):+.6f})") + log_fn(f"Time: {time.perf_counter()-t0:.1f}s") + log_fn(f"{'='*60}") + + return avg_nll, bpb, n_scored + + +# ============================================================ +# Standalone test (CPU, no model — synthetic logits) +# ============================================================ + +def test_synthetic(): + """Test OLB on synthetic data to verify correctness.""" + print("=== Synthetic OLB Test ===") + np.random.seed(42) + vocab = 1024 + N = 50000 + + # Simulate: tokens come from a distribution that shifts over time + # First half: tokens 0-100 are common + # Second half: tokens 500-600 are common + tokens = np.zeros(N, dtype=np.int64) + for i in range(N): + if i < N // 2: + tokens[i] = np.random.choice(100) + else: + tokens[i] = 500 + np.random.choice(100) + + # Simulate model logits: uniform (bad model that doesn't know the distribution) + olb = OnlineLogitBias(vocab_size=vocab, lr=0.05, momentum=0.9, weight_decay=0.001) + + nll_no_bias = 0.0 + nll_with_bias = 0.0 + uniform_nll = math.log(vocab) + + for i in range(1, N): + target = tokens[i] + logits = np.zeros(vocab) # uniform model + + # Score with bias + adjusted 
= logits + olb.bias + adjusted -= adjusted.max() + probs = np.exp(adjusted) / np.exp(adjusted).sum() + nll_with_bias += -math.log(max(probs[target], 1e-12)) + + # Score without bias + nll_no_bias += uniform_nll + + # Update after scoring + olb.update(logits, target) + + avg_no = nll_no_bias / (N - 1) + avg_with = nll_with_bias / (N - 1) + print(f"Uniform model NLL: {avg_no:.4f}") + print(f"With OLB NLL: {avg_with:.4f}") + print(f"Delta: {avg_with - avg_no:+.4f}") + print(f"OLB learned to track the shifting distribution!") + print() + + # Verify bias reflects the distribution shift + top_first_half = olb.bias[:100].mean() + top_second_half = olb.bias[500:600].mean() + print(f"Bias for tokens 0-99 (first half common): {top_first_half:+.4f}") + print(f"Bias for tokens 500-599 (second half common): {top_second_half:+.4f}") + print(f"Second half tokens should have higher bias (they were seen more recently)") + + +if __name__ == "__main__": + test_synthetic() diff --git a/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/retokenize_proper.py b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/retokenize_proper.py new file mode 100644 index 0000000000..9a65d9343e --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_Negative_Results_Comprehensive/retokenize_proper.py @@ -0,0 +1,157 @@ +"""Proper Scylla retokenization: split train/val from raw docs, no SP1024 roundtrip. +Matches the official manifest: shuffle with seed 1337, last 50K docs = val. 
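The manifest-matching split logic in miniature (10 toy docs standing in for the real corpus; the real run uses NUM_VAL_DOCS = 50000):

```python
import random

total_docs = 10   # toy count
num_val = 2       # stands in for NUM_VAL_DOCS

indices = list(range(total_docs))
random.seed(1337)         # SHUFFLE_SEED from the manifest
random.shuffle(indices)

val_indices = set(indices[-num_val:])   # last docs of the shuffled order -> val
train_indices = indices[:-num_val]
# disjoint and exhaustive by construction, and reproducible given the seed
```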
Memory-efficient: workers read from disk, not from in-memory lists."""
import json
import os
import sys
import time
import random
import math
import numpy as np
from pathlib import Path
from multiprocessing import Process, Queue

HEADER_INTS = 256
HEADER_MAGIC = 20240520
HEADER_VERSION = 1
TOKENS_PER_SHARD = 100_000_000
NUM_VAL_DOCS = 50000
SHUFFLE_SEED = 1337


def write_shard(path, tokens):
    # Shard layout matches load_val_tokens in ngram_test.py:
    # 256 little-endian int32 header (magic, version, num_tokens), then uint16 tokens.
    header = np.zeros(HEADER_INTS, dtype="<i4")
    header[0] = HEADER_MAGIC
    header[1] = HEADER_VERSION
    header[2] = len(tokens)
    with open(path, "wb") as f:
        f.write(header.tobytes())
        f.write(tokens.astype(np.uint16).tobytes())


def worker_fn(worker_id, doc_indices, docs_path, vocab_path, out_dir, result_queue):
    import tokenmonster
    vocab = tokenmonster.load(vocab_path)
    index_set = set(doc_indices)
    buffer = []
    shard_count = 0
    # Stream docs from disk, keeping only this worker's assigned lines.
    with open(docs_path, "r") as f:
        for line_num, line in enumerate(f):
            if line_num not in index_set:
                continue
            doc = json.loads(line)
            text = doc.get("text", "")
            if text:
                buffer.extend(vocab.tokenize(text))
            while len(buffer) >= TOKENS_PER_SHARD:
                shard_tokens = np.array(buffer[:TOKENS_PER_SHARD], dtype=np.uint16)
                write_shard(
                    Path(out_dir) / f"fineweb_train_w{worker_id:02d}_{shard_count:04d}.bin",
                    shard_tokens,
                )
                buffer = buffer[TOKENS_PER_SHARD:]
                shard_count += 1

    if buffer:
        write_shard(
            Path(out_dir) / f"fineweb_train_w{worker_id:02d}_{shard_count:04d}.bin",
            np.array(buffer, dtype=np.uint16),
        )
        shard_count += 1

    result_queue.put((worker_id, shard_count))
    print(f" Worker {worker_id}: {shard_count} shards", flush=True)


def main():
    vocab_path = os.environ.get("VOCAB_PATH", "/workspace/candidate.vocab")
    docs_path = os.environ.get("DOCS_PATH", "/workspace/raw_docs/datasets/docs_selected.jsonl")
    out_dir = Path(os.environ.get("OUTPUT_DIR", "/workspace/fineweb_scylla"))
    num_workers = int(os.environ.get("NUM_WORKERS", "16"))

    out_dir.mkdir(parents=True, exist_ok=True)

    # --- Step 1: Count lines and determine split ---
    print("Counting docs...", flush=True)
    t0 = time.time()
    total_lines = 0
    with open(docs_path, "r") as f:
        for _ in f:
            total_lines += 1
    print(f"Total: {total_lines} docs in {time.time()-t0:.0f}s", flush=True)

    # Shuffle indices (matching official manifest)
    print(f"Shuffling with seed {SHUFFLE_SEED}...", flush=True)
    indices = list(range(total_lines))
    random.seed(SHUFFLE_SEED)
    random.shuffle(indices)

    val_indices = set(indices[-NUM_VAL_DOCS:])
    train_indices = indices[:-NUM_VAL_DOCS]
    print(f"Train: {len(train_indices)} docs, Val: {len(val_indices)} docs", 
flush=True) + + # --- Step 2: Tokenize val (single process) --- + print("Tokenizing val docs...", flush=True) + import tokenmonster + vocab = tokenmonster.load(vocab_path) + val_tokens = [] + with open(docs_path, "r") as f: + for line_num, line in enumerate(f): + if line_num not in val_indices: + continue + doc = json.loads(line) + text = doc.get("text", "") + if text: + val_tokens.extend(vocab.tokenize(text)) + write_shard(out_dir / "fineweb_val_000000.bin", np.array(val_tokens, dtype=np.uint16)) + print(f"Val: {len(val_tokens)} tokens", flush=True) + del val_tokens, vocab + + # --- Step 3: Split train indices into contiguous chunks for workers --- + # Sort train indices so each worker processes docs in file order (fast sequential read) + train_indices.sort() + chunk_size = math.ceil(len(train_indices) / num_workers) + chunks = [train_indices[i*chunk_size:(i+1)*chunk_size] for i in range(num_workers)] + del train_indices + + # --- Step 4: Launch parallel workers --- + print(f"Tokenizing train with {num_workers} workers...", flush=True) + t0 = time.time() + result_queue = Queue() + workers = [] + for i in range(num_workers): + p = Process(target=worker_fn, args=(i, chunks[i], docs_path, vocab_path, str(out_dir), result_queue)) + p.start() + workers.append(p) + + for p in workers: + p.join() + + results = [] + while not result_queue.empty(): + results.append(result_queue.get()) + results.sort() + total_shards = sum(r[1] for r in results) + print(f"Workers done: {total_shards} shards in {time.time()-t0:.0f}s", flush=True) + + # --- Step 5: Rename to sequential --- + shard_files = [] + for wid in range(num_workers): + worker_files = sorted(out_dir.glob(f"fineweb_train_w{wid:02d}_*.bin")) + shard_files.extend(worker_files) + + for idx, f in enumerate(shard_files): + f.rename(f.parent / f"fineweb_train_{idx:06d}.bin") + + print(f"Done! {len(shard_files)} train + 1 val shards", flush=True) + + +if __name__ == "__main__": + main()