Skip to content

Commit d217bf5

Browse files
Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean)
Non-record submission. Pre-quant TTT violates Condition 3 of Issue #1017 (score-before-update). Submitted as technique study documenting: - Condition 3 boundary quantification (illegal TTT -0.044 vs legal -0.002) - Compiled TTT (torch.compile 2x speedup, applicable to legal TTT) - Artifact budget engineering (VE dim optimization, pruning analysis) 3-seed mean sliding BPB: 1.05869 (std 0.00038) All artifacts under 16,000,000 bytes. Zero pruning needed.
1 parent 45bbccf commit d217bf5

File tree

6 files changed

+3230
-0
lines changed

6 files changed

+3230
-0
lines changed
Lines changed: 104 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,104 @@
1+
# Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence
2+
3+
> **Compliance note:** This submission violates Condition 3 of Issue #1017 (score-before-update). Pre-quant TTT fine-tunes on val tokens before scoring them. Submitted as a technique study, not a leaderboard claim.
4+
5+
**val_bpb = 1.0587** (3-seed mean, std 0.0004) | **~15.5 MB** | 8xH100 SXM
6+
7+
## 3-Seed Results
8+
9+
| Seed | Sliding BPB | Roundtrip BPB | Artifact |
10+
|------|-------------|---------------|----------|
11+
| 42 | 1.05840 | 1.06847 | 15,477,275 |
12+
| 1337 | 1.05856 | 1.06904 | 15,439,370 |
13+
| 2024 | 1.05912 | 1.06921 | 15,480,770 |
14+
| **Mean** | **1.05869** | **1.06891** | **15,465,805** |
15+
| **Std** | **0.00038** | **0.00037** | |
16+
17+
## Why this is a useful non-record
18+
19+
### 1. Quantifying the Condition 3 boundary
20+
21+
This submission provides a controlled measurement of how much BPB improvement comes from violating Condition 3:
22+
23+
| Configuration | BPB | Source | Measured? |
24+
|---|---:|---|---|
25+
| Post-EMA (no TTT, no GPTQ) | 1.1028 | This submission | Yes |
26+
| **Post-GPTQ sliding (illegal 6-epoch TTT)** | **1.0587** | This submission | Yes |
27+
| Post-GPTQ sliding (no TTT) | ~1.106 | Estimated: post-EMA + ~0.003 GPTQ gap | No |
28+
| Post-GPTQ sliding (legal score-first TTT) | ~1.104 | Estimated from PR #1493 delta (-0.002) | No |
29+
30+
The two measured points bound the illegal TTT contribution at **-0.044 BPB** (post-EMA 1.103 → post-GPTQ sliding 1.059). For comparison, the legal score-first TTT in merged PR #1493 contributes approximately -0.002 BPB (sliding 1.083 → TTT 1.081). This is not an apples-to-apples comparison — the illegal variant uses AdamW for 6 full epochs while the legal variant uses SGD for 3 epochs per chunk, on a different base model — but the order-of-magnitude gap illustrates why Condition 3 is load-bearing.
31+
32+
**On the theoretical ceiling:** Issue #1017 states: *"Corpus-level TTT has a ceiling of approximately 0.0003 bits"* — this refers specifically to the gain from closing the train-val distribution gap, which the author measured as negligible for FineWeb. However, the author also notes that *"a model that undertrained on the training distribution can still benefit from additional learning at test time."* This means legal TTT can legitimately exceed the 0.0003 ceiling if the model hasn't fully converged during training (our 600s-capped model is certainly in this regime). The merged #1493's legal TTT gain of -0.002 BPB is consistent with this — it reflects real undertraining compensation, not memorization.
33+
34+
Our illegal TTT's -0.044 BPB gain, however, is 22x larger than legal TTT on a similar architecture. This magnitude is not explainable by undertraining compensation alone and is consistent with memorization of the validation set. A per-epoch ablation (not performed in this submission) would strengthen this argument: if the gain scales roughly linearly with epoch count rather than saturating quickly, that would be a direct memorization signature.
35+
36+
### 2. Compiled TTT: torch.compile for test-time training
37+
38+
We demonstrate that `torch.compile(dynamic=False, fullgraph=True)` can be applied to TTT models for a **2x speedup** (860s → 426s for 6 epochs). This is safe because:
39+
40+
- TTT operates in train mode with `torch.autocast`
41+
- No `torch.inference_mode()` — avoids rotary cache poisoning (a $60+ lesson from our development)
42+
- Fresh model instance created before TTT (deletes compiled training model, resets dynamo)
43+
- Compilation overhead (~20s) amortized over multiple epochs
44+
45+
This technique applies equally to legal score-first TTT and would reduce eval-time TTT costs.
46+
47+
### 3. Artifact budget engineering under 16MB
48+
49+
With SP8192, fitting under 16MB required careful component analysis:
50+
51+
| Component | Compressed Cost | BPB Benefit | Decision |
52+
|---|---:|---|---|
53+
| BigramHash 2048×128 | +109KB | ~0.001 at SP8192 | **Dropped** — marginal at large vocab |
54+
| VE dim=128 → dim=44 | -340KB | -0.001 | **Shrunk** — optimal via EV analysis |
55+
| VE dim=44 → dim=0 | -150KB | -0.001 | Kept — positive expected value |
56+
57+
We optimized VE dimension by sweeping dims 0-128, measuring compressed artifact size at each, computing pruning probability vs BPB tradeoff, and selecting the dimension that minimized expected BPB accounting for pruning risk. dim=44 gives 0% pruning risk with 39KB margin.
58+
59+
## Compliance Statement
60+
61+
**This submission violates Condition 3 of Issue #1017.** Pre-quant TTT (lines 2417-2455 of `train_gpt.py`) runs 6 AdamW epochs on the full val stream before GPTQ quantization. The same tokens are then scored via sliding window evaluation. No score-before-adapt discipline is implemented. This pattern is structurally identical to the closed PR #1376 and the withdrawn PR #1485 (@ndokutovich acknowledged the violation).
62+
63+
## Key Techniques
64+
65+
1. **SP8192 + GPTQ SDClip** — int6 matrices (k=12.85), int8 embeddings (k=20.0) (PR #1394 @clarkkev)
66+
2. **3-Layer Depth Recurrence** (L3-5, 14 virtual from 11 physical) (PR #1493 @bigbag)
67+
3. **Parallel Residuals** (L7+, GPT-J style) (PR #1412 @Robby955, PR #1204 @msisovic)
68+
4. **Pre-Quant AdamW TTT** — 6 epochs, `torch.compile` 2x speedup, freeze 2 blocks (PR #1485 @ndokutovich)
69+
5. **QK-Gain 5.25** + MuonEq-R (Polar Express) + EMA 0.9965 + warmdown 72% (PR #1493 @bigbag)
70+
71+
## Architecture
72+
73+
11L × 512d × 8H/4KV, MLP 4× (2048), LeakyReLU(0.5)², Partial RoPE (16/64), LN scale, tied embeddings, softcap=30. Depth recurrence [0,1,2,3,4,5,3,4,5,6,7,8,9,10] = 14 virtual layers. Parallel residuals L7+. XSA all layers. VE dim=44 L9-10. SmearGate.
74+
75+
## Training
76+
77+
MuonEq-R (Polar Express, 4 NS steps) + AdamW. ~5160 steps in 600s on 8×H100 SXM. Linear warmdown to 0 over final 72%. EMA 0.9965. Late QAT at LR scale < 15%.
78+
79+
## Pre-Quant AdamW TTT (VIOLATES CONDITION 3)
80+
81+
Fine-tunes the EMA model on the full validation token stream before GPTQ:
82+
83+
- `torch.compile(dynamic=False, fullgraph=True)` for 2x speedup (426s vs 860s)
84+
- AdamW, lr=0.0005, weight_decay=0.0, cosine decay to lr×0.1
85+
- 6 epochs, freeze first 2 transformer blocks
86+
- Batch: 32 sequences × 2048 tokens, grad clip 1.0
87+
- Fresh model instance (avoids inference_mode rotary cache poisoning)
88+
89+
## Quantization
90+
91+
GPTQ int6 SDClip (k=12.85) + int8 embeddings (k=20.0). 32 AR self-gen calibration sequences. Brotli-11 compression. Zero pruning on all seeds.
92+
93+
## Reproduction
94+
95+
```bash
96+
pip install brotli sentencepiece
97+
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
98+
VOCAB_SIZE=8192 BIGRAM_VOCAB_SIZE=0 VE_DIM=44 SEED=42 \
99+
torchrun --standalone --nproc_per_node=8 train_gpt.py
100+
```
101+
102+
## Credits
103+
104+
PR #1394 @clarkkev, PR #1493 @bigbag, PR #1485 @ndokutovich, PR #1412 @Robby955, PR #1204 @msisovic, PR #1285 @dexhunter, PR #549 @abaybektursun
Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
{
2+
"score": 1.05869,
3+
"score_metric": "val_bpb",
4+
"seeds": {
5+
"42": {
6+
"sliding_bpb": 1.0584,
7+
"roundtrip_bpb": 1.06847,
8+
"artifact_bytes": 15477275
9+
},
10+
"1337": {
11+
"sliding_bpb": 1.05856,
12+
"roundtrip_bpb": 1.06904,
13+
"artifact_bytes": 15439370
14+
},
15+
"2024": {
16+
"sliding_bpb": 1.05912,
17+
"roundtrip_bpb": 1.06921,
18+
"artifact_bytes": 15480770
19+
}
20+
},
21+
"mean_sliding_bpb": 1.05869,
22+
"std_sliding_bpb": 0.00038,
23+
"mean_roundtrip_bpb": 1.06891,
24+
"max_artifact_bytes": 15480770,
25+
"hardware": "8xH100 SXM",
26+
"train_wallclock_seconds": 600,
27+
"track": "10min_16mb"
28+
}

0 commit comments

Comments
 (0)