64 lines (49 loc) · 2.96 KB

PLANS.md — Operational Queue

Last updated: 2026-03-25

Active Agent

(none)

Baseline

Target to beat: 1.0781 BPB (PR #672, TTT_EPOCHS=30 Cosine TTT, 3-seed mean std=0.0041)
Merged SOTA: 1.1194 BPB (March 23, LeakyReLU² + Legal TTT + Parallel Muon)
Base script: records/track_10min_16mb/PR672_CosineTTT30_1.0781/train_gpt.py
Stock baseline: 1.2304 BPB (8xH200, 10 min)

Experiment Status

KEEP (validated results)

Name	BPB	Baseline BPB	Delta	GPU	Notes
XSAAll_H200_Clean	2.4259	2.4514	-0.0255	1xH200	XSA on all 11 layers confirmed positive. Only 860 steps (1xH200), not leaderboard-scale.

SUBMITTED (awaiting results — DO NOT resubmit or cancel)

Name	Jobs	GPU	Technique
MuonVS_A100_Clean	5408902/5408903	1xA100	Variance-scaled Muon optimizer
TTT30_Clean	5408918/5408919	1xH200	Legal TTT with 30 epochs
DepthRecurrence6L2X_Clean	5408935/5408936	1xA100	6-layer U-Net with 2x repetition
FullGPTQ_H200_Clean	5408981/5408982	1xH200	Full Hessian GPTQ quantization
SwiGLU_A100_Clean	5409004/5409005	1xA100	SwiGLU gated MLPs (parameter-neutral)

PENDING (clean scripts, re-evaluate against PR672 baseline before running)

TTTPerLayerLR_Clean — Per-layer TTT LR groups (3x output, 0.5x input)
TTTGradClip08_Clean — TTT grad clip 0.8 vs 1.0
CurriculumSeqLen_A100_Clean — Curriculum seq len 512→2048
TTT5ep_Clean — TTT 5 epochs + skip non-TTT eval
AdEMAMix_A100_Clean — AdEMAMix on Adam-side groups
XSAAll_Warmdown4000_A100 — XSA-all + warmdown 4000 iters
XSA11_VR_GA_Clean — XSA-all + value residual + gated attention

ARCHIVED

~50 stale attempt folders moved to attempts/archive/
~55 stale backlog scripts moved to job_backlog/archive/
See attempts/results.tsv for the full historical record

Next Directions

Since PR #672 already maxes TTT at 30 epochs (590s eval), the highest-leverage improvements are orthogonal to TTT:

XSA-all-11 — Confirmed positive (delta -0.0255 on 1xH200). Stack into every new experiment.
Full Hessian GPTQ — Halves quantization gap. Awaiting results from FullGPTQ_H200_Clean.
SwiGLU MLP — Parameter-neutral gated MLP. Awaiting results from SwiGLU_A100_Clean.
Muon-VS — Variance-adaptive optimizer. Awaiting results from MuonVS_A100_Clean.
Architecture search — Deeper/wider within 16MB constraint. Depth recurrence test running.

See plan.md for detailed research priorities and technique analysis.

Session Handoff Protocol

Before ending any session:

Collect results from submitted jobs (rsync logs from Explorer)
Update attempts/results.tsv with any new results
Update this file: move completed experiments to KEEP/ARCHIVED, add new PENDING items
Clear the Active Agent field