Closed

42 commits
6818cb1
Squashed 'modded-nanogpt/' content from commit 1cafa69
rarce Mar 20, 2026
e6f175f
Merge commit '6818cb1272096e12f2d1641a595c9a02e9833d49' as 'modded-na…
rarce Mar 20, 2026
68078b7
Squashed 'slowrun/' content from commit 7677821
rarce Mar 20, 2026
96d1aa4
Merge commit '68078b7590c21c866f86b44fe9be644e1d775597' as 'slowrun'
rarce Mar 20, 2026
f9ce8bf
Add SOTA review knowledge base with analysis of speedrun and slowrun …
rarce Mar 20, 2026
084125e
Add deep research on sub-100M model techniques for Parameter Golf
rarce Mar 20, 2026
f633889
Update knowledge base with submission analysis and R&D directions
rarce Mar 20, 2026
edda324
Update CLAUDE.md with knowledge base, frontier stack, and current state
rarce Mar 20, 2026
41ae39c
Implement consensus stack in train_gpt_mlx.py
rarce Mar 20, 2026
9d98490
Add int6 quantization (step=4 rounding) to consensus stack
rarce Mar 20, 2026
f8be9bb
Port consensus stack to train_gpt.py (PyTorch/CUDA)
rarce Mar 20, 2026
eb6ee74
Add RunPod automation script (bash + curl, no dependencies)
rarce Mar 20, 2026
35a8620
Update CLAUDE.md with consensus stack status and RunPod workflow
rarce Mar 20, 2026
78fbed9
Fix int6 dequantization: add weights_only=False to torch.load
rarce Mar 20, 2026
e150a6a
Fix int6 quantization: use post-hoc step rounding like reference impl
rarce Mar 20, 2026
1f7528f
Fix runpod.sh: match actual REST API v1 response format
rarce Mar 20, 2026
4d898d5
Apply int6 only to middle layers (mixed precision like reference)
rarce Mar 21, 2026
836c152
Fix runpod.sh setup: handle empty dir from template
rarce Mar 21, 2026
15f730b
Fix runpod.sh run: create logs/ dir before tee
rarce Mar 21, 2026
1c74307
Add XSA, EMA, and Late QAT to consensus stack
rarce Mar 21, 2026
6de7803
Fix Late QAT: move after approx_training_time_ms is computed
rarce Mar 21, 2026
16e3f90
Add deploy/tail/done shortcuts to runpod.sh
rarce Mar 21, 2026
eed689e
Merge remote-tracking branch 'upstream/main'
rarce Mar 21, 2026
9e92db8
Merge branch 'main' into sota-review
rarce Mar 21, 2026
e7d8bf0
Add complete competition analysis and original model proposal
rarce Mar 21, 2026
f8fdedd
Implement RingGolf Phase 1: depth recurrence architecture
rarce Mar 22, 2026
1a1de2c
Fix RingGolf: eliminate state_dict duplication and unroll core loop
rarce Mar 22, 2026
0b34042
Merge remote-tracking branch 'upstream/main'
rarce Mar 22, 2026
ad0d82b
Merge branch 'main' into sota-review
rarce Mar 22, 2026
3336675
Update all docs with experiment results and new PR analysis
rarce Mar 22, 2026
d51a1b8
Implement PR #374 eval-time optimization stack in MLX
rarce Mar 22, 2026
7b54711
WIP: Port PR #374 stack to train_gpt.py (CUDA) — partial
rarce Mar 22, 2026
3e79da5
Complete port of PR #374 eval-time stack to train_gpt.py (CUDA)
rarce Mar 22, 2026
a35d555
Fix DDP: find_unused_parameters=True for VE scales
rarce Mar 22, 2026
64bed5a
Add zstd-22 compression (falls back to zlib if unavailable)
rarce Mar 23, 2026
9830366
Update original_model.md with 8xH100 results
rarce Mar 23, 2026
be0c64f
Int6-all + 10% magnitude pruning to fit artifact under 16MB
rarce Mar 23, 2026
b50ea20
Reduce MLP hidden to 1408, int6 on layers 1-9 (not all)
rarce Mar 23, 2026
46241c8
MLX: MLP hidden 1408 + int6 on layers 1-9
rarce Mar 23, 2026
dcb70a2
Add spot instance support to runpod.sh
rarce Mar 23, 2026
8aa29f6
Non-record: 11L PR374 stack + GPTQ-lite (val_bpb=1.1804, 15.95MB)
rarce Mar 23, 2026
a9ec87a
Update submission: add requirements.txt, fix paths for records folder
rarce Mar 23, 2026
5 changes: 4 additions & 1 deletion .gitignore
@@ -8,4 +8,7 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/
.claude/settings.local.json
.env
.runpod_pod_id
117 changes: 117 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,117 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Is

OpenAI's "Parameter Golf" challenge: train the best language model that fits in a **16MB artifact** (code + compressed weights) and trains in **under 10 minutes on 8×H100s**. Scored by bits-per-byte (val_bpb) on a fixed FineWeb validation set. Inspired by NanoGPT Speedrunning but optimizing L(N) — lowest loss for fixed parameter count.

- **Baseline**: 9 layers, dim 512, vocab 1024, ~17M params → val_bpb 1.2244
- **SOTA merged**: 10 layers, FP16 embed, Muon WD, OvertoneInit → val_bpb 1.1748
- **Frontier (open PRs)**: 11L, int6 QAT, SWA, SmearGate, MLP 3× → val_bpb ~1.13
- **Our consensus stack**: 11L, dim 512, MLP 3×, ~26.5M params, int6 QAT+SWA+SmearGate → artifact 4.3MB (int6+zlib)

## Key Commands

### Data Download
```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
# Smaller subset for local iteration
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
```

### Training (CUDA, RunPod / H100)
```bash
RUN_ID=consensus_v1 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
Use `--nproc_per_node=8` for 8×H100. Override `MAX_WALLCLOCK_SECONDS=0` to remove the 10-minute cap.

### Training (MLX, Apple Silicon)
```bash
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm
RUN_ID=mlx_smoke ITERATIONS=200 TRAIN_BATCH_TOKENS=16384 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=16384 GRAD_ACCUM_STEPS=4 python3 train_gpt_mlx.py
```
Note: with seq_len=2048, TRAIN_BATCH_TOKENS must be at least 2048 (one full sequence). Use GRAD_ACCUM_STEPS=4 to fit in 16GB of unified memory.
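
Gradient accumulation trades wall-clock for peak memory by splitting each batch into micro-batches and applying one averaged update. A generic numpy sketch of the pattern (not the MLX script's actual code; `grad_fn` is a stand-in for the loss gradient):

```python
import numpy as np

def sgd_with_grad_accum(w, micro_batches, grad_fn, lr=0.1, accum_steps=4):
    """Accumulate gradients over accum_steps micro-batches, then apply one
    averaged update -- same effective batch size at lower peak memory."""
    acc = np.zeros_like(w)
    for i, mb in enumerate(micro_batches, 1):
        acc += grad_fn(w, mb)
        if i % accum_steps == 0:
            w = w - lr * acc / accum_steps  # one optimizer step per accum window
            acc = np.zeros_like(w)
    return w
```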

### RunPod Automation
```bash
source .env # loads RUNPOD_API_KEY
./scripts/runpod.sh create 1 # create 1×H100 pod (~$3/hr)
./scripts/runpod.sh setup # clone repo + download data
./scripts/runpod.sh run consensus_v1 # run training
./scripts/runpod.sh fetch consensus_v1 # get results locally
./scripts/runpod.sh terminate # delete pod
```

## Architecture

Both training scripts implement the **consensus stack** (hard cap: 1500 lines each):

- **`train_gpt.py`** (~1460 lines) — PyTorch/CUDA for RunPod/H100. GPT model with: 11 layers, MLP 3×, GQA, RoPE, SmearGate, logit softcap (20), tied embeddings. Muon optimizer with WD=0.038. Int6 QAT (STE), SWA during warmdown, FP16 embedding passthrough. Int6 step=4 quantization + zlib compression. LoRA TTT at eval. DDP multi-GPU.

- **`train_gpt_mlx.py`** (~1214 lines) — MLX port for local Apple Silicon iteration. Same consensus stack, adapted optimizers, eager eval mode for 16GB machines.

- **`data/`** — Dataset download and retokenization. Shards in `data/datasets/`, tokenizers in `data/tokenizers/`.

- **`records/`** — Submission history. Two tracks: `track_10min_16mb/` (leaderboard) and `track_non_record_16mb/` (unlimited compute).

- **`scripts/`** — Automation tools. `runpod.sh` for RunPod pod lifecycle (create/setup/run/fetch/terminate) using REST API with curl.
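
The int6 QAT path relies on a straight-through estimator (STE): the forward pass sees quantized weights, the backward pass sees the identity. A minimal PyTorch sketch of the idea, assuming symmetric per-tensor scaling (the function name and scaling choice are illustrative, not the script's actual API):

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Fake-quantize weights in the forward pass; the straight-through
    estimator makes the backward pass see the identity function."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()                      # forward: q, backward: identity
```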

## Consensus Stack (implemented)

| Technique | Env Var | Default | Status |
|-----------|---------|---------|--------|
| 11 layers | `NUM_LAYERS` | 11 | Implemented |
| MLP 3× | `MLP_MULT` | 3 | Implemented |
| Seq 2048 | `TRAIN_SEQ_LEN` | 2048 | Implemented |
| SmearGate | — | always on | Implemented |
| Logit softcap 20 | `LOGIT_SOFTCAP` | 20 | Implemented |
| Muon WD 0.038 | `MUON_WEIGHT_DECAY` | 0.038 | Implemented |
| Int6 QAT (STE) | `QAT_ENABLED`, `QAT_BITS` | 1, 6 | Implemented |
| SWA | `SWA_ENABLED`, `SWA_EVERY` | 1, 50 | Implemented |
| FP16 embed passthrough | `FP16_EMBED_PASSTHROUGH` | 1 | Implemented |
| Int6 quantization | `QUANT_BITS` | 6 | Implemented |
| OrthoInit + muP | — | — | Not yet |
| BigramHash | — | — | Not yet |
| Sliding window eval | — | — | Not yet |
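
The logit softcap in the table bounds logits smoothly via tanh rather than hard clipping. A minimal sketch of the standard formulation (the exact call site in the training scripts may differ):

```python
import math

def softcap(logit: float, cap: float = 20.0) -> float:
    """Smoothly bound a logit to (-cap, cap); near-identity when |logit| << cap."""
    return cap * math.tanh(logit / cap)
```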

## Knowledge Base

Research and analysis in `docs/`:

- **`docs/README.md`** — Problem definition, leaderboard, submission analysis (merged vs PR frontier), R&D directions across 3 tiers
- **`docs/nanogpt-speedrun.md`** — 77 records from modded-nanogpt. Transferable: SmearGate, BigramHash, NorMuon, value embeddings, sliding window, U-net skips
- **`docs/nanogpt-slowrun.md`** — 27 records across 3 tracks. Key: heavy regularization, value projections from x0, per-head gating, layer looping, EMA/SWA
- **`docs/small-model-research.md`** — Sub-100M model techniques: MobileLLM, Depth Delusion, RingFormer, QAT, BitNet, optimizer advances

Reference subtrees: `modded-nanogpt/`, `slowrun/`

## Experiment Logs

Local runs are stored in `logs/` (gitignored). Each run has a subdirectory with README.md, training log, and model artifacts.

| Run | Params | Iters | val_bpb | Artifact | Notes |
|-----|--------|-------|---------|----------|-------|
| mlx_smoke_baseline | 17M | 200 | 2.3244 | 10.1MB int8 | Baseline, Apple Silicon |
| consensus_smoke_int8 | 26.5M | 10 | 3.6078 | 8.6MB int8 | Consensus stack, int8 quant |
| consensus_smoke_int6 | 26.5M | 10 | 3.6285 | **4.3MB int6** | Consensus stack, int6 quant |
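
The val_bpb numbers above convert mean per-token cross-entropy (nats) into bits per byte of raw validation text. A sketch of the conversion, assuming the scripts account tokens and bytes this way:

```python
import math

def nats_per_token_to_bpb(loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte of raw text."""
    total_bits = loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```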

## Submission Rules

- New SOTA must beat existing by ≥0.005 nats at p < 0.01 (typically 3 run logs)
- Artifact size = code bytes + compressed model bytes ≤ 16,000,000 (decimal, not MiB)
- Submissions are PRs that add a folder under `records/` with: `README.md`, `submission.json`, `train_gpt.py`, `train.log`
- No external downloads or network calls allowed during evaluation
- Eval time limit: 10 minutes on 8×H100 (separate from training time)
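
The size rule can be checked before submitting; a small sketch, where the zlib level and path handling are illustrative rather than the scorer's exact procedure:

```python
import os
import zlib

LIMIT = 16_000_000  # decimal bytes, not 16 MiB

def artifact_size(code_path: str, weight_bytes: bytes) -> int:
    """Artifact size as scored: raw code bytes plus compressed weight bytes."""
    return os.path.getsize(code_path) + len(zlib.compress(weight_bytes, 9))
```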

## Key Hyperparameters (env vars)

All configured via environment variables. See the `Hyperparameters` class in each script for the full list. Key new ones: `QAT_ENABLED`, `QAT_BITS`, `SWA_ENABLED`, `SWA_EVERY`, `SWA_BLEND_FINAL`, `MUON_WEIGHT_DECAY`, `QUANT_BITS`, `FP16_EMBED_PASSTHROUGH`.
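
The env-var pattern is the usual `os.environ` lookup with typed defaults; a hedged sketch with a handful of illustrative fields (the real `Hyperparameters` class has many more and may differ in structure):

```python
import os
from dataclasses import dataclass, field

def env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, default))

@dataclass
class Hparams:
    # Illustrative subset; defaults mirror the consensus-stack table above
    num_layers: int = field(default_factory=lambda: env_int("NUM_LAYERS", 11))
    mlp_mult: int = field(default_factory=lambda: env_int("MLP_MULT", 3))
    qat_bits: int = field(default_factory=lambda: env_int("QAT_BITS", 6))
    quant_bits: int = field(default_factory=lambda: env_int("QUANT_BITS", 6))
    muon_weight_decay: float = field(
        default_factory=lambda: float(os.environ.get("MUON_WEIGHT_DECAY", "0.038")))
```

Using `default_factory` means the environment is read at instantiation time, so setting a variable before constructing `Hparams()` takes effect.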

## Dependencies

Core: `torch`, `numpy`, `sentencepiece`, `tqdm`. MLX path adds `mlx`. Data scripts need `huggingface-hub`, `datasets`. RunPod automation needs `curl`, `jq`. See `requirements.txt`.