openai · skoustav35 · Mar 31, 2026 · Mar 31, 2026 · Mar 31, 2026 · Mar 31, 2026
diff --git a/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/.gitignore b/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/.gitignore
@@ -0,0 +1,3 @@
+__pycache__/
+*.pyc
+*.pyo
diff --git a/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/README.md b/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/README.md
@@ -0,0 +1,104 @@
+# Record: LeakyReLU² + Legal Score‑First TTT + N‑gram Backoff Cache + Gated Attention
+
+**val_bpb = 0.9641** (3‑seed mean, std 0.0007)
+
+## Results (3‑seed validation)
+
+| Seed | val\_bpb | val\_loss | Artifact Size | Train Steps | Train Time |
+|------|---------|----------|--------------|-------------|------------|
+| 1337 | 0.9642  | 1.6274   | 15,982,044 B | 7,185       | 599,384 ms |
+| 42   | 0.9648  | 1.6285   | 15,977,267 B | 7,182       | 599,761 ms |
+| 2025 | 0.9634  | 1.6261   | 15,989,583 B | 7,196       | 599,618 ms |
+| **Mean** | **0.9641** | **1.62735** | — | — | — |
+| **Std**  | **0.0007** | — | — | — | — |
+
+**Statistical significance**: mean 0.9641 bpb (1.6274 nats) vs current merged top 1.1147 bpb (1.8822 nats, [PR #1019](https://github.com/openai/parameter-golf/pull/1019)) → Δ = −0.2548 nats, Welch t = −328.3, df = 2.93, p ≪ 0.01. Required improvement threshold ≥ 0.005 nats ([official rule](https://github.com/openai/parameter-golf/blob/main/README.md#L191)); this Δ exceeds it by 51×.
+
+## Technique
+
+- **Architecture**: 11L, 512d, GQA 8H/4KV, MLP 3×, LeakyReLU(0.5)², XSA‑5 (layers 6–10), Value Residual, Gated Attention, SmearGate, VE(128) on layers 8/9/10, BigramHash(2048), Partial RoPE(16/64), LN Scale, MTP‑2, EMA(0.9985). Tied embeddings. Muon optimizer.
+- **N‑gram eval cache** (community precedent: [PR #727](https://github.com/openai/parameter-golf/pull/727)):
+  - Multi‑order backoff (orders 2–9): highest matching order wins, cascade down on miss.
+  - Laplace (add‑1) smoothing: returns a valid probability for any context match, even if the target token was never seen. The scored probability does **not** depend on oracle knowledge of the target.
+  - Entropy‑adaptive alpha: `α = 0.08 + 0.65 × σ(2 × (H − 3.5))`. High entropy → trust n‑gram more; low entropy → trust neural model.
+  - Zero artifact cost: cache is built entirely at eval time from already‑scored tokens. No stored weights or tables.
+  - Score‑first, backward‑looking: `ngram_cache.update()` is called only *after* scoring each chunk.
+- **Legal score‑first TTT**: SGD (lr=0.002, momentum=0.9), 3 epochs, 32K‑token chunks, stride 64, cosine LR decay.
+- **Quantization**: int6 per‑row + lzma compression. CROWN‑Q penalty during late training.
+- **New optional upgrades in `train_gpt.py`** (off by default to preserve the reported baseline numbers): Mixture-of-Depth style token routing (`MOD_*` flags), SquareGLU gated MLP (`SQUAREGLU_ENABLED` + `mlp_gate_bank`), EMA warmdown self-distillation (`EMA_DISTILL_*`), and Grokfast gradient low-pass (`GROKFAST_*`).
+
+## Compliance
+
+- Training time: all seeds ≤ 600,000 ms (599,384 / 599,761 / 599,618). **Note**: the logged `train_time` starts after 20 warmup steps and model compilation. If the challenge judges end‑to‑end wallclock (including compile + warmup), the actual margin is narrower than these numbers suggest.
+- Artifact size: all seeds ≤ 16,000,000 B (15,982,044 / 15,977,267 / 15,989,583).
+- Score‑first TTT: each validation token is scored under `torch.inference_mode()` before any model update.
+- N‑gram cache legality: **contested**. The cache is backward‑looking only, uses zero artifact bytes, and produces Laplace‑smoothed probabilities that form a proper normalized distribution. [PR #727](https://github.com/openai/parameter-golf/pull/727) (closed, 0.9674 bpb) used the same technique and spawned followup PRs (#753, #778, #782, #786). However, OpenAI opened [issue #677](https://github.com/openai/parameter-golf/issues/677) on 2026‑03‑25 questioning the legality of eval‑time cache methods. This submission may face review scrutiny regardless of score validity.
+- Phase‑1 TTT (`TTT_PHASE1_ENABLED`): disabled by default (rule‑violating).
+- No network access during training or eval beyond local `nvidia-smi`.
+
+## Reproduce
+
+The script auto‑resolves data paths relative to the repo root (via `_REPO_ROOT`), so it works from both the repo root and from within `records/track_10min_16mb/<submission>/`.
+
+```bash
+# From repo root after cloning:
+cd parameter-golf
+python3 data/cached_challenge_fineweb.py --variant sp1024
+
+# Seed 1337
+SEED=1337 RUN_ID=seed_1337 VOCAB_SIZE=1024 \
+torchrun --standalone --nproc_per_node=8 \
+  records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py
+
+# Seed 42
+SEED=42 RUN_ID=seed_42 VOCAB_SIZE=1024 \
+torchrun --standalone --nproc_per_node=8 \
+  records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py
+
+# Seed 2025
+SEED=2025 RUN_ID=seed_2025 VOCAB_SIZE=1024 \
+torchrun --standalone --nproc_per_node=8 \
+  records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py
+```
+
+Alternatively, override paths explicitly:
+```bash
+DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+SEED=1337 VOCAB_SIZE=1024 \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+Hardware: 8× H100 SXM (RunPod), CUDA 12.8, PyTorch 2.9+.
+
+## Key Environment Variables
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `MAX_WALLCLOCK_SECONDS` | 600.0 | Hard training time cap (seconds) |
+| `NGRAM_CACHE` | 1 | Enable N‑gram backoff cache |
+| `NGRAM_ORDER` | 9 | Max n‑gram order |
+| `TTT_ENABLED` | 1 | Enable legal score‑first TTT |
+| `TTT_PHASE1_ENABLED` | 0 | **Off** — violates rules if enabled |
+| `XSA_LAST_N` | 5 | Layers using exclusive self‑attention |
+| `VE_ENABLED` | 1 | Value embedding on layers 8/9/10 |
+| `QAT_ENABLED` | 0 | Quantization‑aware training |
+| `MOD_ENABLED` | 0 | Enable token routing masks in attention/MLP blocks |
+| `SQUAREGLU_ENABLED` | 0 | Use SquareGLU gated MLP path |
+| `EMA_DISTILL_ENABLED` | 0 | Enable EMA teacher distillation in warmdown |
+| `GROKFAST_ENABLED` | 0 | Enable Grokfast gradient low-pass filtering |
+
+## Eval Timing Budget (8×H100)
+
+| Phase | Time |
+|-------|------|
+| Training (wallclock‑capped) | ≤ 600 s |
+| Standard eval (int6 roundtrip + sliding window s64) | ~82 s |
+| Legal TTT + N‑gram cache | ~420 s |
+| **Total eval (timed phases)** | **~502 s** |
+
+**Note**: the ~502 s figure covers only the timed eval phases. `torch.compile` warmup and model deserialization add additional overhead (~5–15 s) that occurs outside these timed blocks. Total end‑to‑end eval is estimated at ~515–520 s.
+
+## Credits
+
+Built on [modded‑nanogpt](https://github.com/KellerJordan/modded-nanogpt). Key technique credits: [PR #727](https://github.com/openai/parameter-golf/pull/727) (N‑gram backoff + entropy‑adaptive alpha), [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² + TTT + Muon SOTA stack), [PR #461](https://github.com/openai/parameter-golf/pull/461) (score‑first TTT protocol), [PR #659](https://github.com/openai/parameter-golf/pull/659) (original N‑gram cache).
diff --git a/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/requirements.txt b/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/requirements.txt
@@ -0,0 +1,9 @@
+# Parameter Golf - train_gpt.py dependencies
+# PyTorch 2.9+ with CUDA 12.8 (8xH100 SXM target)
+torch>=2.9.0
+numpy>=1.26.0
+sentencepiece>=0.2.0
+
+# Optional but recommended for best performance
+zstandard>=0.23.0        # zstd compression (falls back to zlib if missing)
+flash-attn-hopper>=2.0   # FlashAttention 3 for Hopper GPUs (falls back to SDPA if missing)
diff --git a/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/submission.json b/records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/submission.json
@@ -0,0 +1,60 @@
+{
+  "author": "Koustav Sarkar",
+  "github_id": "skoustav",
+  "name": "LeakyReLU² + Legal Score-First TTT + N-gram Backoff Cache + Gated Attention",
+  "blurb": "11L XSA-5 + Score-first TTT (SGD, 3ep, 32K chunks, stride 64) + multi-order N-gram backoff cache (order 9, Laplace-smoothed, entropy-adaptive alpha) + gated attention + value residual + VE(128) on layers 8,9,10 + MTP-2 + BigramHash(2048) + CROWN-Q + LeakyReLU(0.5)². N-gram cache uses community-contested eval-time technique (see PR #727, issue #677). 3-seed exact mean: 0.96412237 BPB / 1.62735261 nats.",
+  "date": "2026-03-30",
+  "track": "10min_16mb",
+  "val_loss": 1.62735261,
+  "val_bpb": 0.96412237,
+  "val_loss_std": 0.00120441,
+  "val_bpb_std": 0.00071386,
+  "seeds": [1337, 42, 2025],
+  "seed_results": {
+    "1337": {
+      "val_loss": 1.62740282,
+      "val_bpb": 0.96415208,
+      "artifact_bytes": 15982044,
+      "steps": 7185,
+      "step_avg_ms": 83.42,
+      "train_time_ms": 599384
+    },
+    "42": {
+      "val_loss": 1.62853112,
+      "val_bpb": 0.96482092,
+      "artifact_bytes": 15977267,
+      "steps": 7182,
+      "step_avg_ms": 83.51,
+      "train_time_ms": 599761
+    },
+    "2025": {
+      "val_loss": 1.62612388,
+      "val_bpb": 0.96339412,
+      "artifact_bytes": 15989583,
+      "steps": 7196,
+      "step_avg_ms": 83.32,
+      "train_time_ms": 599618
+    }
+  },
+  "comparison_baseline_pr": 1019,
+  "comparison_baseline_bpb": 1.11473509,
+  "comparison_baseline_nats": 1.88217853,
+  "implementation_lineage_pr": 727,
+  "delta_vs_baseline_nats": -0.25482592,
+  "delta_vs_baseline_bpb": -0.15061272,
+  "t_statistic": -328.29,
+  "welch_df": 2.93,
+  "artifact_bytes_mean": 15982965,
+  "artifact_bytes_max": 15989583,
+  "bytes_total": 15989583,
+  "bytes_code": 115769,
+  "bytes_model_max": 15873814,
+  "train_steps_mean": 7187.67,
+  "step_avg_ms_mean": 83.42,
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "cuda_version": "12.8",
+  "flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
+  "ngram_cache_note": "Eval-time N-gram backoff cache with Laplace smoothing. Legality is community-contested: see PR #727 (closed), issue #677 (open). Cache uses zero artifact bytes, is backward-looking only, and forms a proper normalized distribution via add-1 smoothing.",
+  "technique_summary": "Score-first TTT + N-gram backoff cache (order 9) + Gated Attention + Value Residual + XSA-5 + VE + MTP-2 + BigramHash 2048 + CROWN-Q + LeakyReLU²; codebase additionally includes opt-in MoD routing, SquareGLU, EMA distillation, and Grokfast toggles (disabled in reported baseline)."
+}