Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 89 additions & 0 deletions records/track_10min_16mb/2026-03-22_CosineTTT_PerLayer/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# Cosine TTT Scheduling with Per-Layer Learning Rates

Mean val_bpb = 1.0970 (3 seeds, std=0.0010) | 8×H100 SXM | 600s train + 465s TTT + 187s eval

## Results

| Seed | Steps | Pre-TTT | Post-TTT | Artifact |
|------|-------|---------|----------|----------|
| 1337 | 7,101 | 1.1577 | 1.0959 | 15.4 MB |
| 42 | 6,700 | 1.1588 | 1.0971 | 15.5 MB |
| 7 | 6,987 | 1.1580 | 1.0979 | 15.8 MB |

## Background

Starting from the community stack (PRs #162, #180, #315, #398), we spent several days exploring ways to improve compression and eval-time adaptation. Many of these did not improve the result but informed the direction that eventually worked.

### Compression research (did not improve score)

We analyzed trained checkpoints to evaluate alternative quantization and compression approaches:

- **Learned codebook quantization** (K-means, K=256): 87% lower reconstruction MSE than uniform int6, but 25% larger compressed artifact under zstd-22. Codebook indices have higher byte entropy than clamped int6 values.
- **Symmetry-transport** (Procrustes alignment across layers): Layers share 91-93% rotational structure, but storing the rotation matrices costs more than storing the weights directly. Low-rank approximation of the rotation delta (rank-128) captured only 16.6% of variance.
- **Embedding low-rank factorization** (SVD): Rank-64 explains 41.9% of variance on tok_emb (1024×512). Not viable at this vocabulary size.
- **Magnitude pruning**: Non-monotonic interaction with zstd-22. 3% pruning increased artifact size by 728KB on our checkpoint.

These results indicated that int6+zstd is close to optimal for this model architecture and that compression was not the path to further improvement.

### Architectural exploration (did not improve score)

- **Progressive layer dropping**: Randomly skipping layers during training for regularization. Caused 0.06 BPB regression at step 1000 when combined with head dropout. The DDP implementation also introduced higher-order ops incompatible with torch.compile + DDPOptimizer.
- **Depth recurrence** (Huginn-style, 3 shared blocks × 3 loops): Blocks learned position-specific functions rather than general refiners. Eval at 2× trained depth produced val_bpb 4.34. Not viable below ~100M params per unique layer.
- **Neural cache** (cross-window KV caching at eval): Implemented but not validated on hardware. The original proposal (PR #318) was blocked by a torch.compile issue.

### TTT analysis (led to the finding)

Analyzing our trained checkpoint, we observed:

1. **Quantization error is uniformly distributed** — the top 1% of weights by error magnitude account for only 3.9% of total reconstruction error. This confirmed that outlier protection approaches would not be effective.
2. **Quantization damage varies 3.4× across layer types** — MLP output projections (512×1536) have systematically higher relative error than input projections (1536×512).
3. **TTT improvement exceeds quantization repair** — the TTT contribution (~0.06 BPB on our model) is roughly 2.4× larger than the quantization gap (~0.008), indicating TTT performs distribution adaptation beyond repairing quantization damage.

These observations motivated exploring the TTT schedule rather than the training architecture or compression scheme.

## TTT schedule

Two modifications to AdamW TTT (PR #442):

**Cosine lr decay** over 30 epochs instead of flat lr over 10 epochs. Quantization introduces both large-scale damage (outlier weight rounding) and distributed noise (small perturbations across all weights). A flat lr must compromise between these two regimes. Cosine decay applies full lr early to address large damage, then progressively reduces to refine without overshooting.

**Per-layer lr groups** based on the quantization damage measurements above. MLP output projections receive 3× base lr, input projections 0.5×, all other parameters 1×. This allocates more adaptation capacity to more damaged layers. The ratios are specific to our model — other architectures may show different damage profiles.

We tested 34 TTT configurations across optimizers (AdamW, Adam, SGD), learning rates (1e-4 to 2e-3), epoch counts (3 to 30), schedules (flat, cosine, warmup+cosine), per-layer groupings, freeze strategies, and loss functions (cross-entropy, focal loss γ=1-3, KL divergence from pre-quant model).

Focal loss did not improve over cross-entropy — hard tokens appear to be unpredictable rather than undertrained. KL divergence from the pre-quant model was less effective than cross-entropy — the pre-quant and post-quant models are similar enough that the KL signal is weak relative to the cross-entropy signal from the validation data.

## TTT config

```
TTT_OPTIMIZER=adamw TTT_LR=0.0005 TTT_EPOCHS=30
TTT_COSINE=1 TTT_PERLAYER=1 TTT_FREEZE_BLOCKS=0
TTT_BATCH_SEQS=64 (per GPU, 512 total with DDP sharding)
```

Each GPU processes a contiguous 1/8 shard of the validation tokens with gradient all_reduce (ReduceOp.AVG). 30 epochs at ~15.5s/epoch = ~465s total.

## Training config

Standard community stack. 11L, 512d, 8H/4KV (GQA), 3x MLP (relu-squared), U-Net skips, SmearGate, BigramHash(2048), OrthoInit, Partial RoPE (16/64 dims), LN Scale, EMA(0.997), tied embeddings. XSA disabled. Int6 per-row + zstd-22.

## Notes

- All runs used FA2. FA3 Hopper would improve pre-TTT quality through faster training steps. The TTT schedule is independent of the attention kernel.
- The cosine + per-layer schedule adds no artifact cost and minimal code complexity over flat-lr TTT.
- See PR #212 for a non-record submission documenting 25+ additional experiments.

## Reproduction

```bash
git clone https://github.com/mrdavtan/parameter-golf.git
cd parameter-golf && git checkout next-gen
pip install flash-attn --no-cache-dir --no-build-isolation
pip install zstandard sentencepiece huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024
bash run_competition.sh 1337
```

Hardware: 8×H100 SXM (RunPod), PyTorch 2.9.1+cu128, Flash Attention 2

Builds on PRs #162, #180, #77, #398, #442, #417, #315, and modded-nanogpt.
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
{
"author": "mrdavtan",
"github_id": "mrdavtan",
"name": "Cosine TTT scheduling with per-layer lr (mean val_bpb=1.0970, 3 seeds)",
"blurb": "AdamW TTT with cosine lr decay and per-layer lr groups. 30 epochs, 3x lr for MLP output projections, 0.5x for input projections.",
"date": "2026-03-22",
"val_loss": 1.8504,
"val_bpb": 1.0959,
"mean_val_bpb": 1.0970,
"std_val_bpb": 0.0010,
"seed": 1337,
"num_seeds": 3,
"seed_results": {
"1337": 1.0959,
"42": 1.0971,
"7": 1.0979
},
"step_stop": 7101,
"wallclock_seconds": 600.0,
"ttt_time_seconds": 465.4,
"eval_time_seconds": 186.5,
"bytes_total": 15362557,
"bytes_model_int8_zstd": 15258143,
"bytes_code": 104414,
"hardware": "8xH100 SXM (RunPod), PyTorch 2.9.1+cu128, FA2",
"track": "track_10min_16mb",
"model": {
"num_layers": 11,
"model_dim": 512,
"num_heads": 8,
"num_kv_heads": 4,
"mlp_mult": 3,
"vocab_size": 1024,
"tie_embeddings": true,
"total_params": 26829913
},
"ttt_config": {
"optimizer": "adamw",
"lr": 0.0005,
"epochs": 30,
"cosine": true,
"perlayer": true,
"perlayer_proj_mult": 3.0,
"perlayer_fc_mult": 0.5,
"freeze_blocks": 0,
"batch_seqs_per_gpu": 64
}
}
Loading