26 commits
f2352b6
v6.0 moonshot: LoRA TTT + In-Place TTT + surprise gating + eval-only XSA
sunnypatneedi Mar 23, 2026
3e72caf
Add Full GPTQ (Hessian-aware quantization) — 31% quant gap reduction
sunnypatneedi Mar 23, 2026
da5af1b
Fix critical bugs in In-Place TTT: add scoring + weight reset
sunnypatneedi Mar 24, 2026
fe5f0a7
Update README with full v6.0 feature set and experiment plan
sunnypatneedi Mar 24, 2026
69bd791
Fix artifact size (disable int7) + reduce TTT epochs (20→5)
sunnypatneedi Mar 24, 2026
2ee91b8
Disable GPTQ — root cause of 0.18 bpb quant penalty in Run 1
sunnypatneedi Mar 24, 2026
fe2eaba
Run 0 (baseline) + Run 1 (LeakyReLU only) — retro-disciplined approach
sunnypatneedi Mar 24, 2026
c9cb225
Run 0/1 rebased on merged SOTA PR #414 (1.1228), not unverified PR #548
sunnypatneedi Mar 24, 2026
f03921a
Run 2: + temperature calibration sweep (T=0.95-0.99)
sunnypatneedi Mar 24, 2026
2a7b9a4
Run 3: int5 quantization + 3.5x MLP (~33.6M params in 16MB)
sunnypatneedi Mar 24, 2026
f4f7811
Add submission template — ready to fill after 3-seed validation
sunnypatneedi Mar 24, 2026
15c598b
Session 3 retro: In-Place TTT falsified, GradQuant over budget
sunnypatneedi Mar 24, 2026
4d3dca6
Fix flash_attn_interface import + add SDPA fallback for all runs
sunnypatneedi Mar 24, 2026
4c473be
Fix flash_attn import: try FA3, then FA2, then SDPA fallback
sunnypatneedi Mar 24, 2026
53d1c27
v8.0 Phase 1: PR #549 baseline + TTT enabled (2 lines changed)
sunnypatneedi Mar 24, 2026
5969a30
Add AdamW TTT (PR #481 recipe) to submission script
sunnypatneedi Mar 25, 2026
7705d5a
Add torch._dynamo.reset() after TTT to fix cross-seed compile crash
sunnypatneedi Mar 25, 2026
4860252
Add submission: AdamW TTT (30ep cosine + per-layer LR) — val_bpb 1.0705
sunnypatneedi Mar 25, 2026
902f42a
Fix submission.json: add seeds, track, rename bytes_total to artifact…
sunnypatneedi Mar 26, 2026
bd91c4a
Add v10 moonshot: ternary MLP quant + scaled model + hedge mixer + en…
sunnypatneedi Mar 26, 2026
7341e5f
Add validate_configs.py and initial experiments.jsonl
sunnypatneedi Mar 26, 2026
26f1d02
Merge pull request #1 from sunnypatneedi/claude/peaceful-mclean
sunnypatneedi Mar 26, 2026
c6ec05f
Merge pull request #2 from sunnypatneedi/claude/priceless-rosalind
sunnypatneedi Mar 26, 2026
dd92512
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)
sunnypatneedi Mar 26, 2026
b50702a
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)
sunnypatneedi Mar 26, 2026
8834070
Merge pull request #4 from sunnypatneedi/claude/quizzical-joliot
sunnypatneedi Mar 27, 2026
215 changes: 215 additions & 0 deletions CLAUDE.md
# CLAUDE.md — Parameter Golf AI Agent Instructions

---

## TL;DR

**Parameter Golf**: Train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100 SXM GPUs, scored by compression quality (bits-per-byte) on FineWeb validation.

**Challenge**: https://github.com/openai/parameter-golf | https://openai.com/index/parameter-golf/

**Core Delivery**: Lowest val_bpb score | 16MB artifact constraint | 10-min train budget | 10-min eval budget | Tokenizer-agnostic BPB metric
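
The tokenizer-agnostic BPB score mentioned above normalizes model loss by raw byte count rather than token count. A minimal sketch of the usual definition (the grader's exact formula is not restated in this document, so treat this as an assumption): total next-token NLL in nats, divided by ln 2 times the UTF-8 byte length of the evaluated text.

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Convert a summed next-token NLL (in nats) into bits-per-byte.

    Normalizing by raw bytes (not tokens) makes the metric comparable
    across tokenizers with different vocabulary sizes.
    """
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```

A model that spends exactly ln 2 nats per byte scores 1.0 bpb under this definition.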

**NOT**: A general LLM training framework | A production inference system | A data engineering project. This is a constrained optimization competition — every decision optimizes for val_bpb within the 16MB/10-min constraints.

**Stack**: Python 3 + PyTorch (CUDA/H100) + MLX (Apple Silicon local dev) + SentencePiece + zstd/zlib compression

**Repo**: Single-repo with baseline scripts at root and competition submissions in `records/`

---

## Critical Rules

1. **16MB artifact limit**: code (`train_gpt.py`) + compressed model weights must be < 16,000,000 bytes (decimal, not MiB). Check artifact size on EVERY experiment.
2. **No network during eval**: The artifact must be fully self-contained. No downloads, no API calls during evaluation.
3. **Validation data is sacred**: NEVER access validation data during training. Test-time training is ONLY allowed on validation tokens that have already been evaluated (i.e., already scored).
4. **train_gpt.py is the submission**: All counted code lives in this single file. Submissions are self-contained folders in `records/`.
5. **Don't edit baseline scripts for competition work**: `train_gpt.py` (root) and `train_gpt_mlx.py` are onboarding scripts. Competition work goes in `records/` folders.
6. **Statistical significance required**: New SOTA must beat existing by >=0.005 nats with p<0.01 across 3 seeds.
7. **MLX is for learning, not tuning**: MLX and CUDA have different numerical paths (float32 vs bf16 Muon). Never trust absolute bpb numbers from MLX runs.
8. **Always run the quantization roundtrip**: Post-quant val_bpb is the submission score, not pre-quant.
9. **Shut down RunPod pods when idle** — $3+/hr adds up fast.
10. Plan before building — non-trivial changes get a written hypothesis first.
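
Rule 1 can be enforced mechanically at the end of every run. A minimal sketch (`artifact_bytes` is an illustrative helper, not a repo function, and `zlib` stands in for the zstd used in real submissions):

```python
import os
import zlib

LIMIT = 16_000_000  # decimal bytes (rule 1), not 16 MiB

def artifact_bytes(code_path: str, weight_blob: bytes, level: int = 9) -> int:
    """Code size plus compressed-weight size; check against LIMIT on
    EVERY experiment, since quant/format changes shift both terms."""
    return os.path.getsize(code_path) + len(zlib.compress(weight_blob, level))
```

Failing fast here is cheaper than discovering an oversized artifact after a 10-minute train run.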

---

## Security & Safety

### Destructive Operations

**PROHIBITED without explicit user confirmation**:
- **RunPod**: Deleting pods with unsaved work, terminating running experiments
- **Git**: push --force, reset --hard, deleting branches with experiment results
- **Data**: Deleting downloaded dataset shards (16GB+ redownload)

### Agent Boundaries

**NEVER autonomously**: spend RunPod credits (always confirm before launching pods) | modify the root `train_gpt.py` or `train_gpt_mlx.py` for competition purposes | submit PRs to the upstream repo | delete experiment logs

**ALWAYS**: track experiment results with hypothesis and verdict | verify artifact size < 16,000,000 bytes | run 3 seeds before claiming a result | check competition rules before novel eval approaches

---

## Context Documents

| File | When to Read |
| ---- | ------------ |
| `README.md` | Challenge rules, leaderboard, submission process, FAQ |
| `documents/raise-the-floor.md` | When output quality drops or agent oscillates between good and bad |
| `documents/testing-guide.md` | Before designing experiments or validating results |
| `data/README.md` | Dataset download, tokenizer variants, shard format |
| `records/track_10min_16mb/2026-03-22_FullStack_v51/README.md` | Our v5.1 submission (full stack) |
| PR #486 (branch `pr-486`) | Current SOTA: TrigramHash + ValueResidual + GradQuant + TTT |
| PR #503 (branch `pr-503`) | XSA on all layers + legal score-first TTT + Partial RoPE |
| PR #481 (branch `pr-481`) | Best TTT reference: cosine + per-layer LR |
| PR #490 (branch `pr-490`) | Value Residual + Gated Attention + TTT |
| `records/track_10min_16mb/2026-03-20_10L_Int5MLP_.../README.md` | Previous SOTA (1.1428) techniques and ablation |
| `records/track_10min_16mb/2026-03-17_LoRA_TTT/README.md` | LoRA TTT reference (abandoned — hurts) |

---

## Project Structure

```
parameter-golf/
├── train_gpt.py # Baseline CUDA training script (1126 lines) — DO NOT edit for competition
├── train_gpt_mlx.py # Baseline MLX script for local dev (1104 lines) — DO NOT edit for competition
├── requirements.txt # Python dependencies reference
├── data/
│ ├── cached_challenge_fineweb.py # Dataset downloader (supports sp1024/sp2048/sp4096)
│ ├── datasets/ # Downloaded training shards + validation
│ └── tokenizers/ # SentencePiece models
├── records/
│ ├── track_10min_16mb/ # Competition submissions (17 entries)
│ │ ├── 2026-03-17_NaiveBaseline/ # Baseline: 1.2244 val_bpb
│ │ ├── 2026-03-20_10L_Int5MLP_*/ # SOTA: 1.1428 val_bpb
│ │ └── ...
│ └── track_non_record_16mb/ # Unlimited compute submissions
└── logs/ # Training run logs
```

**Commands**:
```bash
# Local (MLX, Apple Silicon)
RUN_ID=test ITERATIONS=200 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 python3 train_gpt_mlx.py

# RunPod (CUDA, 1xH100)
torchrun --standalone --nproc_per_node=1 train_gpt.py

# RunPod (CUDA, 8xH100 — final validation only)
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

---

## Session Protocol

**Start**: Check current SOTA on leaderboard (README.md) → Review experiment log → Read active plan

**End**: Log experiment results (hypothesis, numbers, verdict) → Stop RunPod pod if running → Update plan if approach changed

---

## Competition Strategy

**Merged leaderboard SOTA**: 1.1228 val_bpb (signalrush, 2026-03-22)
**Best open PR (unmerged)**: 1.0865 val_bpb (PR #548 LoquiAuris, per-doc LoRA TTT, pending verification)
**Target**: Beat merged SOTA by >=0.005 nats. If open PRs merge first, target moves.

**The AdamW TTT revolution**: LoRA TTT hurts (+0.004 bpb). AdamW TTT with aggressive config gives **-0.04 to -0.06 bpb** — the single biggest unlock. Every sub-1.10 submission uses it. Config: 30 epochs, cosine LR decay, lr=0.0005, per-layer LR (MLP output 3x, input 0.5x).
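
The per-layer LR part of that config can be sketched as a schedule function. Group names and the decay-to-zero endpoint are assumptions for illustration; the authoritative recipe is in PR #481.

```python
import math

BASE_LR = 5e-4
EPOCHS = 30
# Per-layer multipliers from the text: MLP output 3x, MLP input 0.5x.
LR_MULT = {"mlp_out": 3.0, "mlp_in": 0.5}

def lr_at(epoch: int, group: str = "default") -> float:
    """Cosine decay from BASE_LR toward 0 over EPOCHS, scaled per group."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / EPOCHS))
    return BASE_LR * LR_MULT.get(group, 1.0) * cos_factor
```

In PyTorch this maps onto one optimizer `param_groups` entry per multiplier, with the scheduler updating each group's `lr` every epoch.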

**Our approach (v5.1 — full stack)**:
1. **AdamW TTT** (30 epochs, cosine, per-layer LR) — -0.04 to -0.06 bpb (from PR #481)
2. **XSA on all 11 layers** — exclusive self-attention, -0.002 to -0.005 bpb (from PR #503)
3. **Value Residual (ResFormer)** — blend V vectors from layer 0, 22 params (from PR #486)
4. **GradQuant** — gradient-guided adaptive Int5/6/7 quantization (from PR #486)
5. **TrigramHash(4096)** — 3-gram context embedding (from PR #486)
6. **Partial RoPE (16/64)**, **LN Scale**, **EMA (0.997) + SWA (every 50)**, **11 layers**
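
Item 3's Value Residual can be sketched as a per-layer blend of each layer's value vectors with layer 0's. The two-scalars-per-layer parameterization below is an assumption chosen to match the "22 params" count (2 × 11 layers); PR #486 may gate the blend differently (e.g., through a sigmoid).

```python
def value_residual(v_layer, v_layer0, w_self: float, w_first: float):
    """ResFormer-style value residual: each layer's V is a learned blend
    of its own values and layer 0's values, elementwise."""
    return [w_self * v + w_first * v0 for v, v0 in zip(v_layer, v_layer0)]
```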

**Key insight**: No submission combines ALL proven techniques. PR #486 lacks XSA. PR #503 has XSA but conservative TTT. Our edge is the combination.

**Key reference PRs**: #486 (SOTA 1.0887), #490 (1.0891), #481 (1.0970, best TTT ref), #503 (1.1218, XSA ref)

**Abandoned approaches**: LoRA TTT (hurts), product quantization (SWA-incompatible), larger vocab (embedding cost), custom Triton kernels (poor expected value), int4 without QAT (quality-destructive at this scale), eval stride=32 (exceeds time budget with 30-epoch TTT).

---

## Technique Reference

| Technique | Approx Δ bpb | Status |
|-----------|-------------|--------|
| **AdamW TTT (30 ep, cosine, per-layer LR)** | **-0.04 to -0.06** | **In SOTA + our submission** |
| Sliding window eval (stride=64) | -0.032 | In SOTA |
| TrigramHash + ValueResidual + GradQuant | -0.023 | In SOTA (PR #486) |
| 3× MLP expansion | -0.015 | In SOTA |
| Int6 QAT + GradQuant adaptive Int5/6/7 | -0.010 | In SOTA |
| **XSA (all 11 layers)** | **-0.002 to -0.005** | **Our addition** |
| SmearGate + BigramHash(4096) | -0.006 | In SOTA |
| Value Residual (ResFormer) | -0.005 to -0.017 | In SOTA |
| 11 layers | -0.003 | In SOTA |
| EMA (0.997) + SWA (every 50) | -0.002 | In SOTA |
| Partial RoPE (16/64) + LN Scale | -0.002 | In SOTA |
| Orthogonal init + Muon WD=0.04 | -0.003 | In SOTA |
| LoRA TTT | **+0.004 (HURTS)** | **Abandoned** |
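
The stride=64 sliding-window eval in the table can be sketched as a span schedule: each window scores only its last `stride` tokens, so every scored token sees up to `window - stride` tokens of context. The window length of 1024 is an assumed context size, not stated in this document.

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 64):
    """Yield (start, end, score_from) spans: feed tokens [start, end) to
    the model but accumulate loss only on tokens [score_from, end)."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        yield (start, min(pos + stride, n_tokens), pos)
        pos += stride
```

Smaller strides buy more context per scored token at the cost of proportionally more forward passes, which is why stride=32 blew the eval budget.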

---

## Experiment Tracking

One row per run in `logs/experiments.md`:
```
Date | Exp ID | Change | val_bpb (slide) | Artifact bytes | Steps | Hypothesis → Verdict
```

Rules:
- Change ONE thing per run
- Record negative results explicitly
- 3 seeds only for submission-quality results
- Current byte headroom: ~660 KB (SOTA artifact is 15.34MB / 16.00MB with GradQuant)
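
The 3-seed significance rule can be checked with a paired t-test (df = n−1 = 2). Because the stdlib has no t-distribution CDF, the sketch below compares against a hard-coded critical value instead of computing a p-value; the threshold logic is an illustration, not the official grader's test.

```python
import math
from statistics import mean, stdev

T_CRIT_2DF_P01 = 9.925  # two-sided p < 0.01 critical t for df = 2

def is_new_sota(baseline_bpb, candidate_bpb, min_gain=0.005):
    """True only if the mean per-seed gain meets the 0.005 threshold AND
    the paired t statistic clears the p < 0.01 critical value."""
    diffs = [b - c for b, c in zip(baseline_bpb, candidate_bpb)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return mean(diffs) >= min_gain and t > T_CRIT_2DF_P01
```

With only 3 seeds the critical value is brutal (≈9.9), which is exactly why marginal gains like stride=32's −0.0005 fail the bar.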

---

## Key Constraints Cheat Sheet

| Constraint | Value |
|-----------|-------|
| Artifact size | < 16,000,000 bytes (code + compressed model) |
| Training time | 10 minutes on 8xH100 SXM |
| Eval time | 10 minutes on 8xH100 SXM (separate budget) |
| Network during eval | Prohibited |
| Val data during training | Prohibited |
| TTT rule | Only on tokens already evaluated |
| SOTA improvement threshold | >=0.005 nats, p<0.01, 3 seeds |
| Competition deadline | April 30, 2026 |

---

## Lessons Learned

### Session 1 (2026-03-22)
1. **Ship experiments first, debate strategy second.** Time-box planning to 30 min. Run a GPU experiment in the first hour, not the fifth.
2. **Always use `nohup` for RunPod commands.** SSH drops on 15-min runs. Pattern: `nohup bash -c 'CMD > /workspace/run.log 2>&1' &`
3. **Never launch parallel torchrun on the same pod.** Two jobs on 8xH100 corrupt each other. Run sequentially.
4. **1xH100 cannot run SOTA-class models.** Only use for baseline-scale experiments or code debugging. Always use 8xH100 for SOTA work.
5. **The leaderboard moves daily.** Check BOTH merged leaderboard AND open PRs before every session.
6. **TTT gains diminish on stronger bases.** -0.075 on 1.16 base → -0.022 on 1.11 base. Always verify TTT improvement on YOUR architecture first.
7. **Stride=32 is not significant.** Tested 3 seeds: only -0.0005 nats over stride=64. Don't revisit.

### Session 2 (2026-03-23)
8. **NEVER ship unverified quantization code.** GPTQ caused 0.18 bpb quant penalty (expected 0.003). Always compare pre-quant vs post-quant bpb before adding new quant methods. Quantization bugs are silent killers.
9. **First GPU run = UNMODIFIED baseline.** Establish baseline numbers before adding ANY changes. Then add ONE change at a time. Shipping 578 new lines in one run made debugging impossible.
10. **Compute TTT time budget before setting epochs.** `epochs × batches × time/batch`. 20 epochs × 71 batches × ~1s = 1420s. Basic math catches budget blowouts.
11. **Check disk quota before downloading data.** RunPod disk quotas are per-pod, not per-filesystem. 80 shards = ~16GB. Verify space first.
12. **Depth recurrence is falsified.** PR #540 got 1.2092 bpb (worse than 1.2244 baseline). Do not attempt.
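
Lesson 10's budget math is trivial to automate, and doing so before launching a run catches blowouts like the one described:

```python
def ttt_eval_seconds(epochs: int, batches: int, sec_per_batch: float) -> float:
    """Projected TTT cost inside the 600 s eval budget."""
    return epochs * batches * sec_per_batch

# The Run-1 blowout from lesson 10: 20 epochs x 71 batches x ~1 s.
assert ttt_eval_seconds(20, 71, 1.0) == 1420  # well over the 600 s budget
```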

### Session 3 (2026-03-24)
13. **In-Place TTT is HARMFUL.** Loss INCREASES (2.63+, going up not down). MLP output projections are NOT good TTT targets at this scale. Do not attempt.
14. **GradQuant int5/int6 mix exceeds 16MB.** Even without int7, the artifact was 34KB over. Use uniform int6 or match PR #414's exact quantization scheme.
15. **PR #486 baseline reproduced at 1.1249** (vs reported 1.1233). Within seed variance. This is our verified baseline.
16. **The v7.0 incremental plan works.** Run 0→1→2→3 from PR #414 base. Each run adds ONE thing. Stop doing moonshots with 500+ new lines.

## Golden Rules

Every change must answer: "Does this lower val_bpb within the 16MB/10-min constraints?" If the answer is unclear, run a quick experiment on 1xH100 before investing more time. Compression and eval tricks are as valuable as architecture changes. The cheapest experiment that gives signal is the best experiment. Speed > perfection — submit early, iterate after.

_Updated: 2026-03-23 (v6.0 — LoRA TTT + In-Place TTT moonshot, GPTQ disabled after Run 1 failure)_
76 changes: 76 additions & 0 deletions records/track_10min_16mb/2026-03-22_FullStack_v51/README.md
## Record: v6.0 Moonshot — Dual TTT + Full Architecture Stack

**Target: <1.03 val_bpb** (stretch: <0.99) | 8xH100 SXM, 600s train + 600s eval

### Novel Contributions

1. **Two independent TTT methods** (user chooses via config):
- **In-Place TTT** (ICLR 2026 Oral): Updates MLP output projections per-document using NTP loss with apply-then-update ordering. Targets completely different parameters than LoRA TTT.
- **Per-document LoRA TTT** (PR #548): Rank-8 LoRA on Q/V/LM head with surprise-gated training (Titans-inspired — only top-K% highest-loss tokens get gradient updates).

2. **Full GPTQ** (Hessian-aware quantization): 256-sample calibration, per-layer Hessian H=X^TX, column-wise int6 with Cholesky error compensation. 31% quantization gap reduction over naive int6.

3. **LeakyReLU(0.5)^2 activation**: Drop-in replacement for relu^2 that preserves gradients through negative activations. -0.0015 bpb, replicated by 4+ teams.

4. **Eval-only XSA** on all 11 layers: Exclusive Self-Attention removes self-position contribution during eval, forcing context-only prediction. Training proceeds without XSA to avoid regression.
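
The surprise gating in contribution 1 can be sketched as a token mask over per-token losses. This is a minimal illustration of the top-K% selection described above; PR #548's actual gating may differ (e.g., per-batch vs per-document selection).

```python
def surprise_mask(token_losses, top_frac=0.5):
    """Titans-style surprise gating: keep gradient updates only for the
    top_frac fraction of tokens with the highest next-token loss.
    Ties at the cutoff are all kept, so the mask can exceed k tokens."""
    k = max(1, int(len(token_losses) * top_frac))
    cutoff = sorted(token_losses, reverse=True)[k - 1]
    return [loss >= cutoff for loss in token_losses]
```

In training, the boolean mask multiplies the per-token loss before the backward pass, so low-surprise tokens contribute no gradient.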

### Architecture (from PR #486 base)

- 11 layers, 512 dim, 8 heads / 4 KV heads (GQA)
- 3x MLP LeakyReLU(0.5)^2 + SmearGate + BigramHash(4096) + TrigramHash(4096)
- Value Residual (ResFormer) across all layers
- GradQuant: gradient-guided adaptive Int5/6/7
- Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1))
- EMA (decay=0.997), OrthoInit
- Full GPTQ + zstd-22
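
The LeakyReLU(0.5)^2 activation above, read literally, is the leaky output squared. Note that squaring makes the negative branch positive; if PR #493 instead keeps the sign (sign(x)·y²), this sketch would need that adjustment.

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """LeakyReLU(slope) followed by squaring: like relu(x)**2, but the
    negative branch keeps a scaled value, so gradients flow through
    negative pre-activations instead of being zeroed."""
    y = x if x > 0 else slope * x
    return y * y
```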

### TTT Configuration (LoRA mode)

```bash
# LoRA TTT (default: INPLACE_TTT_ENABLED=0)
TTT_LORA_LR=0.01 # LoRA optimizer LR
TTT_LORA_RANK=8 # LoRA rank
TTT_EPOCHS=20 # Epochs per document
TTT_BATCH_SEQS=32 # Documents per GPU batch
TTT_SURPRISE_TOPK=0.5 # Train on top 50% highest-loss tokens
```

### TTT Configuration (In-Place mode)

```bash
# In-Place TTT (INPLACE_TTT_ENABLED=1)
INPLACE_TTT_LR=0.001 # MLP proj update LR
INPLACE_TTT_CHUNK=256 # Chunk size for apply-then-update
```
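
The apply-then-update ordering behind `INPLACE_TTT_CHUNK` can be sketched as a scoring loop. `score_fn` and `update_fn` are hypothetical callbacks standing in for the model's loss computation and the MLP-projection gradient step; per-document weight reset (added in commit da5af1b) happens outside this loop.

```python
def inplace_ttt(n_tokens: int, score_fn, update_fn, chunk: int = 256):
    """Score each chunk with the CURRENT weights first, then update on
    it, so no token is ever predicted by weights that already trained
    on that token. Returns the total accumulated loss."""
    total = 0.0
    for start in range(0, n_tokens, chunk):
        end = min(start + chunk, n_tokens)
        total += score_fn(start, end)   # apply: NTP loss under current weights
        update_fn(start, end)           # update: gradient step on the scored chunk
    return total
```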

### Run Commands

```bash
# LoRA TTT (proven, batched)
INPLACE_TTT_ENABLED=0 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

# In-Place TTT (novel, per-document MLP adaptation)
INPLACE_TTT_ENABLED=1 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Experiment Plan

1. Reproduce PR #486 baseline (pre-TTT ~1.11, post-LoRA-TTT ~1.09)
2. Compare LoRA TTT vs In-Place TTT on our architecture
3. Tune surprise-gating threshold: {25%, 50%, 75%}
4. Tune In-Place TTT LR: {0.0005, 0.001, 0.002}
5. 3-seed validation with best config

### Results

_Pending — needs 8xH100 validation runs._

### Provenance

- Architecture: PR #486 (ndokutovich)
- LoRA TTT: PR #548 (LoquiAuris)
- In-Place TTT: "In-Place Test-Time Training" (ICLR 2026 Oral, Feng et al.)
- Surprise gating: Inspired by "Titans: Learning to Memorize at Test Time" (NeurIPS 2025)
- LeakyReLU^2: PR #493 (parinzee)
- Full GPTQ: PR #535 (raahilshah)
- XSA: PR #503 (EthanYangTW)
18 changes: 18 additions & 0 deletions records/track_10min_16mb/2026-03-22_FullStack_v51/submission.json
{
"name": "11L XSA + TrigramHash + ValueResidual + GradQuant + TTT",
"author": "sunnypatneedi",
"date": "2026-03-22",
"track": "10min_16mb",
"base_prs": [486, 503],
"techniques": [
"XSA (all 11 layers)",
"TrigramHash(4096)",
"ValueResidual (ResFormer)",
"GradQuant (adaptive Int5/6/7)",
"AdamW TTT (30 epochs, cosine, per-layer LR)",
"Partial RoPE (16/64)",
"LN Scale (1/sqrt(layer+1))",
"EMA (0.997) + SWA (every 50)",
"SmearGate + BigramHash(4096)"
]
}