Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
__pycache__/
*.pyc
*.pyo
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# Record: LeakyReLU² + Legal Score‑First TTT + N‑gram Backoff Cache + Gated Attention

**val_bpb = 0.9641** (3‑seed mean, std 0.0007)

## Results (3‑seed validation)

| Seed | val\_bpb | val\_loss | Artifact Size | Train Steps | Train Time |
|------|---------|----------|--------------|-------------|------------|
| 1337 | 0.9642 | 1.6274 | 15,982,044 B | 7,185 | 599,384 ms |
| 42 | 0.9648 | 1.6285 | 15,977,267 B | 7,182 | 599,761 ms |
| 2025 | 0.9634 | 1.6261 | 15,989,583 B | 7,196 | 599,618 ms |
| **Mean** | **0.9641** | **1.62735** | — | — | — |
| **Std** | **0.0007** | — | — | — | — |

**Statistical significance**: mean 0.9641 bpb (1.6274 nats) vs current merged top 1.1147 bpb (1.8822 nats, [PR #1019](https://github.com/openai/parameter-golf/pull/1019)) → Δ = −0.2548 nats, Welch t = −328.3, df = 2.93, p ≪ 0.01. Required improvement threshold ≥ 0.005 nats ([official rule](https://github.com/openai/parameter-golf/blob/main/README.md#L191)); this Δ exceeds it by 51×.

## Technique

- **Architecture**: 11L, 512d, GQA 8H/4KV, MLP 3×, LeakyReLU(0.5)², XSA‑5 (layers 6–10), Value Residual, Gated Attention, SmearGate, VE(128) on layers 8/9/10, BigramHash(2048), Partial RoPE(16/64), LN Scale, MTP‑2, EMA(0.9985). Tied embeddings. Muon optimizer.
- **N‑gram eval cache** (community precedent: [PR #727](https://github.com/openai/parameter-golf/pull/727)):
- Multi‑order backoff (orders 2–9): highest matching order wins, cascade down on miss.
- Laplace (add‑1) smoothing: returns a valid probability for any context match, even if the target token was never seen. The scored probability does **not** depend on oracle knowledge of the target.
- Entropy‑adaptive alpha: `α = 0.08 + 0.65 × σ(2 × (H − 3.5))`. High entropy → trust n‑gram more; low entropy → trust neural model.
- Zero artifact cost: cache is built entirely at eval time from already‑scored tokens. No stored weights or tables.
- Score‑first, backward‑looking: `ngram_cache.update()` is called only *after* scoring each chunk.
- **Legal score‑first TTT**: SGD (lr=0.002, momentum=0.9), 3 epochs, 32K‑token chunks, stride 64, cosine LR decay.
- **Quantization**: int6 per‑row + lzma compression. CROWN‑Q penalty during late training.
- **New optional upgrades in `train_gpt.py`** (off by default to preserve the reported baseline numbers): Mixture-of-Depth style token routing (`MOD_*` flags), SquareGLU gated MLP (`SQUAREGLU_ENABLED` + `mlp_gate_bank`), EMA warmdown self-distillation (`EMA_DISTILL_*`), and Grokfast gradient low-pass (`GROKFAST_*`).

## Compliance

- Training time: all seeds ≤ 600,000 ms (599,384 / 599,761 / 599,618). **Note**: the logged `train_time` starts after 20 warmup steps and model compilation. If the challenge judges end‑to‑end wallclock (including compile + warmup), the actual margin is narrower than these numbers suggest.
- Artifact size: all seeds ≤ 16,000,000 B (15,982,044 / 15,977,267 / 15,989,583).
- Score‑first TTT: each validation token is scored under `torch.inference_mode()` before any model update.
- N‑gram cache legality: **contested**. The cache is backward‑looking only, uses zero artifact bytes, and produces Laplace‑smoothed probabilities that form a proper normalized distribution. [PR #727](https://github.com/openai/parameter-golf/pull/727) (closed, 0.9674 bpb) used the same technique and spawned followup PRs (#753, #778, #782, #786). However, OpenAI opened [issue #677](https://github.com/openai/parameter-golf/issues/677) on 2026‑03‑25 questioning the legality of eval‑time cache methods. This submission may face review scrutiny regardless of score validity.
- Phase‑1 TTT (`TTT_PHASE1_ENABLED`): disabled by default (rule‑violating).
- No network access during training or eval beyond local `nvidia-smi`.

## Reproduce

The script auto‑resolves data paths relative to the repo root (via `_REPO_ROOT`), so it works from both the repo root and from within `records/track_10min_16mb/<submission>/`.

```bash
# From repo root after cloning:
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024

# Seed 1337
SEED=1337 RUN_ID=seed_1337 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py

# Seed 42
SEED=42 RUN_ID=seed_42 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py

# Seed 2025
SEED=2025 RUN_ID=seed_2025 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py
```

Alternatively, override paths explicitly:
```bash
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Hardware: 8× H100 SXM (RunPod), CUDA 12.8, PyTorch 2.9+.

## Key Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `MAX_WALLCLOCK_SECONDS` | 600.0 | Hard training time cap (seconds) |
| `NGRAM_CACHE` | 1 | Enable N‑gram backoff cache |
| `NGRAM_ORDER` | 9 | Max n‑gram order |
| `TTT_ENABLED` | 1 | Enable legal score‑first TTT |
| `TTT_PHASE1_ENABLED` | 0 | **Off** — violates rules if enabled |
| `XSA_LAST_N` | 5 | Layers using exclusive self‑attention |
| `VE_ENABLED` | 1 | Value embedding on layers 8/9/10 |
| `QAT_ENABLED` | 0 | Quantization‑aware training |
| `MOD_ENABLED` | 0 | Enable token routing masks in attention/MLP blocks |
| `SQUAREGLU_ENABLED` | 0 | Use SquareGLU gated MLP path |
| `EMA_DISTILL_ENABLED` | 0 | Enable EMA teacher distillation in warmdown |
| `GROKFAST_ENABLED` | 0 | Enable Grokfast gradient low-pass filtering |

## Eval Timing Budget (8×H100)

| Phase | Time |
|-------|------|
| Training (wallclock‑capped) | ≤ 600 s |
| Standard eval (int6 roundtrip + sliding window s64) | ~82 s |
| Legal TTT + N‑gram cache | ~420 s |
| **Total eval (timed phases)** | **~502 s** |

**Note**: the ~502 s figure covers only the timed eval phases. `torch.compile` warmup and model deserialization add additional overhead (~5–15 s) that occurs outside these timed blocks. Total end‑to‑end eval is estimated at ~515–520 s.

## Credits

Built on [modded‑nanogpt](https://github.com/KellerJordan/modded-nanogpt). Key technique credits: [PR #727](https://github.com/openai/parameter-golf/pull/727) (N‑gram backoff + entropy‑adaptive alpha), [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² + TTT + Muon SOTA stack), [PR #461](https://github.com/openai/parameter-golf/pull/461) (score‑first TTT protocol), [PR #659](https://github.com/openai/parameter-golf/pull/659) (original N‑gram cache).
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Parameter Golf - train_gpt.py dependencies
# PyTorch 2.9+ with CUDA 12.8 (8xH100 SXM target)
torch>=2.9.0
numpy>=1.26.0
sentencepiece>=0.2.0

# Optional but recommended for best performance
zstandard>=0.23.0 # zstd compression (falls back to zlib if missing)
flash-attn-hopper>=2.0 # FlashAttention 3 for Hopper GPUs (falls back to SDPA if missing)
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
{
"author": "Koustav Sarkar",
"github_id": "skoustav",
"name": "LeakyReLU² + Legal Score-First TTT + N-gram Backoff Cache + Gated Attention",
"blurb": "11L XSA-5 + Score-first TTT (SGD, 3ep, 32K chunks, stride 64) + multi-order N-gram backoff cache (order 9, Laplace-smoothed, entropy-adaptive alpha) + gated attention + value residual + VE(128) on layers 8,9,10 + MTP-2 + BigramHash(2048) + CROWN-Q + LeakyReLU(0.5)². N-gram cache uses community-contested eval-time technique (see PR #727, issue #677). 3-seed exact mean: 0.96412237 BPB / 1.62735261 nats.",
"date": "2026-03-30",
"track": "10min_16mb",
"val_loss": 1.62735261,
"val_bpb": 0.96412237,
"val_loss_std": 0.00120441,
"val_bpb_std": 0.00071386,
"seeds": [1337, 42, 2025],
"seed_results": {
"1337": {
"val_loss": 1.62740282,
"val_bpb": 0.96415208,
"artifact_bytes": 15982044,
"steps": 7185,
"step_avg_ms": 83.42,
"train_time_ms": 599384
},
"42": {
"val_loss": 1.62853112,
"val_bpb": 0.96482092,
"artifact_bytes": 15977267,
"steps": 7182,
"step_avg_ms": 83.51,
"train_time_ms": 599761
},
"2025": {
"val_loss": 1.62612388,
"val_bpb": 0.96339412,
"artifact_bytes": 15989583,
"steps": 7196,
"step_avg_ms": 83.32,
"train_time_ms": 599618
}
},
"comparison_baseline_pr": 1019,
"comparison_baseline_bpb": 1.11473509,
"comparison_baseline_nats": 1.88217853,
"implementation_lineage_pr": 727,
"delta_vs_baseline_nats": -0.25482592,
"delta_vs_baseline_bpb": -0.15061272,
"t_statistic": -328.29,
"welch_df": 2.93,
"artifact_bytes_mean": 15982965,
"artifact_bytes_max": 15989583,
"bytes_total": 15989583,
"bytes_code": 115769,
"bytes_model_max": 15873814,
"train_steps_mean": 7187.67,
"step_avg_ms_mean": 83.42,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"cuda_version": "12.8",
"flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
"ngram_cache_note": "Eval-time N-gram backoff cache with Laplace smoothing. Legality is community-contested: see PR #727 (closed), issue #677 (open). Cache uses zero artifact bytes, is backward-looking only, and forms a proper normalized distribution via add-1 smoothing.",
"technique_summary": "Score-first TTT + N-gram backoff cache (order 9) + Gated Attention + Value Residual + XSA-5 + VE + MTP-2 + BigramHash 2048 + CROWN-Q + LeakyReLU²; codebase additionally includes opt-in MoD routing, SquareGLU, EMA distillation, and Grokfast toggles (disabled in reported baseline)."
}
Loading