records/track_10min_16mb/2026-03-25_TTT_Aggressive_SGD/README.md
# Aggressive SGD TTT (val_bpb: 1.1124)

**3-seed mean val_bpb: 1.1124** (std=0.0008) | **15.4 MB artifact** | 8xH100 SXM, 600s training + 591s eval

## Results

| Seed | val_bpb (sliding, stride=64) | Artifact (bytes) |
|------|------------------------------|------------------|
| 1337 | 1.1129 | 15,405,733 |
| 42 | 1.1128 | ~15.4M |
| 2024 | 1.1114 | ~15.4M |
| **Mean ± Std** | **1.1124 ± 0.0008** | |
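The headline mean and (sample) standard deviation check out against the per-seed numbers:

```python
# Quick check of the headline numbers. statistics.stdev computes the sample
# standard deviation, which is what matches the reported 0.0008.
from statistics import mean, stdev

bpb = {1337: 1.1129, 42: 1.1128, 2024: 1.1114}
print(f"{mean(bpb.values()):.4f} +/- {stdev(bpb.values()):.4f}")  # 1.1124 +/- 0.0008
```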

## Approach

Standard 11L architecture, nothing exotic on the model side. The interesting part is the test-time training (TTT). The base model trains for 600s, then TTT adapts all weights via SGD for 30 epochs on the validation data (score-first protocol).
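The adaptation loop itself is plain SGD with momentum. Here is a dependency-free sketch of the score-first shape: score the eval data once with the un-adapted weights, then update on the same data. A scalar quadratic stands in for the LM loss, and all names are hypothetical; the real run adapts the full transformer.

```python
# Score-first TTT sketch: the scoring pass runs BEFORE any weight update,
# so the reported score never uses weights that have seen the eval data.
def ttt_score_first(w, batches, loss_fn, grad_fn, lr=1.0, momentum=0.9, epochs=30):
    score = sum(loss_fn(w, b) for b in batches) / len(batches)  # pre-adaptation score
    v = 0.0                                  # heavy-ball momentum buffer
    for _ in range(epochs):
        for b in batches:
            v = momentum * v + grad_fn(w, b)
            w = w - lr * v
    return score, w

# Toy usage: fit a scalar to per-batch targets under loss (w - t)^2 / 2
# (a tame lr for the toy problem; the real run uses lr=1.0).
targets = [1.0, 1.2, 0.8]
loss = lambda w, t: 0.5 * (w - t) ** 2
grad = lambda w, t: w - t
s, w = ttt_score_first(0.0, targets, loss, grad, lr=0.1, epochs=50)
```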

The conventional wisdom is TTT at LR=0.002 for 3 epochs. I ran 20+ configurations on 4xH200 and found that cranking the LR to 1.0 and unfreezing every block turns a -0.0025 BPB technique into a -0.041 BPB technique. That's a 16x improvement from the same underlying method. It's like finding out your car has a sport mode you never tried.

## TTT Configuration

I swept this on 4xH200 before validating on 8xH100. The sweep told the whole story.

| Parameter | Our Value | PR #549 (merged SOTA) |
|-----------|-----------|----------------------|
| LR | 1.0 | 0.002 |
| Epochs | 30 | 3 |
| Freeze blocks | 0 (all unfrozen) | 0 |
| Momentum | 0.9 | 0.9 |
| TTT gain | -0.041 BPB | -0.0025 BPB |

### TTT LR Sweep (4xH200, 20 epochs, freeze=2)
| LR | Sliding BPB |
|----|------------|
| 0.01 | 1.1489 |
| 0.02 | 1.1471 |
| 0.05 | 1.1444 |
| 0.1 | 1.1422 |
| 0.2 | 1.1400 |
| 0.5 | 1.1351 |
| **0.7** | **1.1327** |
| 0.8 | 1.1355 |
| 1.0 | 1.1585 (diverged) |

BPB just keeps getting better as LR goes up... until it doesn't. Peak at 0.7 with 2 frozen blocks.

### Unfreezing all blocks (4xH200, 20 epochs)
| LR | freeze=2 | freeze=0 | Delta |
|----|----------|----------|-------|
| 0.7 | 1.1327 | 1.1255 | -0.007 |
| 1.0 | diverged | 1.1183 | — |
| **1.5** | diverged | **1.1110** | — |

This was the breakthrough. With 2 frozen blocks, LR=1.0 diverges. Unfreeze everything and it converges fine. The extra capacity from unfreezing absorbs the aggressive learning rate. It also shifts the optimal LR from 0.7 all the way up to 1.5.
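The freeze knob reduces to a name-prefix filter over the parameter list. A minimal sketch, assuming blocks register parameters under `blocks.<i>.` (hypothetical naming):

```python
# Select parameters to adapt, skipping the first `freeze_blocks` blocks.
# The trailing dot keeps "blocks.1." from also matching "blocks.10.*".
# With freeze_blocks=0, str.startswith(()) is always False, so nothing
# is filtered and every block is adapted.
def ttt_params(named_params, freeze_blocks):
    frozen = tuple(f"blocks.{i}." for i in range(freeze_blocks))
    return [p for name, p in named_params if not name.startswith(frozen)]
```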

### Epoch scaling (4xH200, LR=1.0, freeze=0)
| Epochs | Sliding BPB | TTT time |
|--------|------------|----------|
| 20 | 1.1183 | 569s |
| **30** | **1.1076** | **854s** |

On 8xH100, each TTT epoch runs in ~16.6s (vs 28.5s on 4xH200), so 30 epochs fits within the 10-minute eval budget.
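Back-of-envelope for the budget, using the numbers above:

```python
# Eval-phase budget arithmetic on 8xH100 (numbers from this README).
epoch_s = 16.6            # measured TTT epoch time on 8xH100
ttt_s = 30 * epoch_s      # ~498 s of TTT
eval_s = ttt_s + 92 + 2   # + sliding-window eval + int6 roundtrip
assert eval_s < 600       # fits the 10-minute eval budget
# On 4xH200 (28.5 s/epoch), TTT alone would be ~855 s, consistent with the
# measured 854 s above -- 30 epochs only fit on the faster eval box.
```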

## Architecture

| Component | Detail |
|-----------|--------|
| Layers | 11 |
| Dim | 512 |
| Heads | 8 (4 KV, GQA) |
| MLP | 3x, relu-squared |
| XSA | Last 4 layers |
| EMA | 0.997 |
| Late QAT | Int6 STE when lr_scale < 0.1 |
| Value Embeddings | 128-dim, 5 sets |
| BigramHash | 6144 buckets |
| SmearGate | Learned token blending |
| Warmdown | 1600 iterations |
| Seq length | 2048 (train), 1024 (eval) |
| Sliding window | stride=64 |
| Quantization | Int6 per-row + zstd-22 |
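The quantization row can be sketched as symmetric per-row int6. This is an assumed scheme (one fp scale per row, signed codes clipped to [-31, 31]); zstd level 22 is then applied to the serialized codes, not shown here since zstandard is a third-party dependency.

```python
# Symmetric per-row int6 quantization sketch (assumed scheme, not the
# repo's exact packing code).
def quant_int6_row(row):
    scale = max(abs(x) for x in row) / 31.0 or 1.0   # guard all-zero rows
    q = [max(-31, min(31, round(x / scale))) for x in row]
    return q, scale

def dequant_row(q, scale):
    return [c * scale for c in q]
```

Per-row scales keep the quantization error proportional to each row's own magnitude, which matters when weight rows differ widely in scale.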

## Training

- Muon optimizer (matrix_lr=0.025, momentum=0.99 with warmup from 0.85)
- AdamW for embeddings/scalars (WD=0.04)
- Flash Attention v3 (Hopper) where available, SDPA fallback
- 6039 steps in 600s on 8xH100 (~99ms/step)
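The optimizer split can be expressed as a routing rule over parameters. The rule below is an assumption based on common speedrun convention (2-D weight matrices to Muon at matrix_lr; embeddings and scalars to AdamW with WD=0.04), not a quote of this repo's code:

```python
# Hypothetical Muon/AdamW routing: 2-D matrices (except embeddings) go to
# Muon; everything else -- embeddings, scalars, gains -- goes to AdamW.
def split_param_groups(named_shapes):
    muon, adamw = [], []
    for name, shape in named_shapes:
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```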

## Evaluation

Three phases, all within the 10-minute eval budget:

1. Int6+zstd quantization roundtrip
2. TTT: SGD(lr=1.0, momentum=0.9), 30 epochs, all blocks unfrozen, score-first
3. Sliding window eval (stride=64, seq_len=1024)

Total eval time: ~591s (TTT 497s + sliding window 92s + roundtrip 2s)
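One common reading of the stride-64 sliding-window protocol: advance a 1024-token context by 64 tokens and score only the newest 64 tokens of each window, so every token is predicted with near-maximal left context. The exact protocol isn't spelled out in this README, so treat this span bookkeeping as an assumption:

```python
# Yield (ctx_start, score_start, score_end): score tokens
# [score_start, score_end) given context tokens [ctx_start, score_end).
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - seq_len)
        yield ctx_start, score_start, score_end
```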

## Run Command

```bash
TTT_ENABLED=1 TTT_LR=1.0 TTT_EPOCHS=30 TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 \
VE_ENABLED=1 WARMDOWN_ITERS=1600 NUM_LAYERS=11 XSA_LAST_N=4 \
EMA_ENABLED=1 LATE_QAT=1 BIGRAM_VOCAB_SIZE=6144 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## How I Got Here

~20 hours on 4xH200, 54 experiments. Started from the 9L baseline and worked forward:

1. Baseline (9L, no extras): 1.1808
2. +11L, XSA, EMA, QAT: 1.1619
3. +Flash Attention v3: 1.1527
4. +Value Embeddings, warmdown tuning: 1.1521
5. +TTT (LR=0.01, 10ep, freeze=2): 1.1489
6. TTT LR sweep to 0.7: 1.1327
7. Unfreeze all blocks: 1.1255
8. LR=1.5, 20ep: 1.1110
9. 30ep, LR=1.0: 1.1076
10. 8xH100 (more training steps): **1.1124**

Step 7 was where it got fun. Everything before that was incremental hill climbing. Unfreezing all blocks during TTT changed the optimization landscape enough that learning rates that previously diverged started converging, and the whole curve shifted.

## Schrödinger's SOTA

This beats the merged leaderboard (1.1194) by 0.007 BPB. I haven't checked the pending PRs. Until they're merged, this is simultaneously a record and not a record, and I'm choosing to live in that superposition for a bit.

## Credits

Built on the community's collective work, especially PR #414 (signalrush), PR #461 (Christopher-Lee-McClendon), and PR #549 (abaybektursun).
records/track_10min_16mb/2026-03-25_TTT_Aggressive_SGD/seed1337.log
W0325 17:22:48.042000 3334 torch/distributed/run.py:803]
W0325 17:22:48.042000 3334 torch/distributed/run.py:803] *****************************************
W0325 17:22:48.042000 3334 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 17:22:48.042000 3334 torch/distributed/run.py:803] *****************************************
logs/93d96033-aba6-425b-8a12-b5797527a51c.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27354201
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9322 val_bpb:4.1057 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9333 train_time:135ms step_avg:135.40ms
step:2/20000 train_loss:10.1819 train_time:208ms step_avg:103.88ms
step:3/20000 train_loss:8.5707 train_time:311ms step_avg:103.51ms
step:4/20000 train_loss:7.8164 train_time:405ms step_avg:101.17ms
step:5/20000 train_loss:7.1907 train_time:504ms step_avg:100.74ms
step:6/20000 train_loss:7.0114 train_time:603ms step_avg:100.56ms
step:7/20000 train_loss:6.9916 train_time:706ms step_avg:100.91ms
step:8/20000 train_loss:6.9090 train_time:812ms step_avg:101.52ms
step:9/20000 train_loss:6.4970 train_time:905ms step_avg:100.61ms
step:10/20000 train_loss:6.2038 train_time:1004ms step_avg:100.36ms
step:200/20000 train_loss:2.3810 train_time:20198ms step_avg:100.99ms
step:400/20000 train_loss:2.4305 train_time:42406ms step_avg:106.01ms
step:600/20000 train_loss:2.3302 train_time:60150ms step_avg:100.25ms
step:800/20000 train_loss:2.2242 train_time:79419ms step_avg:99.27ms
step:1000/20000 train_loss:2.2554 train_time:98822ms step_avg:98.82ms
step:1000/20000 val_loss:2.2063 val_bpb:1.3067 train_time:98845ms step_avg:98.85ms
step:1200/20000 train_loss:2.3216 train_time:119556ms step_avg:99.63ms
step:1400/20000 train_loss:2.1486 train_time:139903ms step_avg:99.93ms
step:1600/20000 train_loss:2.0417 train_time:159203ms step_avg:99.50ms
step:1800/20000 train_loss:2.1425 train_time:179656ms step_avg:99.81ms
step:2000/20000 train_loss:2.0560 train_time:198038ms step_avg:99.02ms
step:2000/20000 val_loss:2.1191 val_bpb:1.2551 train_time:198062ms step_avg:99.03ms
step:2200/20000 train_loss:2.1250 train_time:218099ms step_avg:99.14ms
step:2400/20000 train_loss:2.0585 train_time:235653ms step_avg:98.19ms
step:2600/20000 train_loss:2.1026 train_time:256551ms step_avg:98.67ms
step:2800/20000 train_loss:2.1529 train_time:276768ms step_avg:98.85ms
step:3000/20000 train_loss:2.1536 train_time:297220ms step_avg:99.07ms
step:3000/20000 val_loss:2.0918 val_bpb:1.2389 train_time:297258ms step_avg:99.09ms
step:3200/20000 train_loss:2.1706 train_time:318928ms step_avg:99.66ms
step:3400/20000 train_loss:2.0239 train_time:338681ms step_avg:99.61ms
step:3600/20000 train_loss:2.0969 train_time:358234ms step_avg:99.51ms
step:3800/20000 train_loss:2.0788 train_time:376669ms step_avg:99.12ms
step:4000/20000 train_loss:1.9838 train_time:397318ms step_avg:99.33ms
step:4000/20000 val_loss:2.0768 val_bpb:1.2300 train_time:397342ms step_avg:99.34ms
step:4200/20000 train_loss:2.1655 train_time:417491ms step_avg:99.40ms
step:4400/20000 train_loss:2.0535 train_time:436234ms step_avg:99.14ms
step:4600/20000 train_loss:1.8614 train_time:457455ms step_avg:99.45ms
step:4800/20000 train_loss:2.4414 train_time:475619ms step_avg:99.09ms
step:5000/20000 train_loss:2.1070 train_time:497156ms step_avg:99.43ms
step:5000/20000 val_loss:2.0260 val_bpb:1.1999 train_time:497194ms step_avg:99.44ms
step:5200/20000 train_loss:2.0378 train_time:515549ms step_avg:99.14ms
step:5400/20000 train_loss:2.0361 train_time:536193ms step_avg:99.30ms
step:5600/20000 train_loss:1.9364 train_time:555328ms step_avg:99.17ms
step:5800/20000 train_loss:1.9671 train_time:574408ms step_avg:99.04ms
late_qat:enabled step:5893 scale:0.0996
step:6000/20000 train_loss:1.9119 train_time:596711ms step_avg:99.45ms
step:6000/20000 val_loss:1.9455 val_bpb:1.1522 train_time:596738ms step_avg:99.46ms
step:6039/20000 val_loss:1.9441 val_bpb:1.1514 train_time:600008ms step_avg:99.36ms
stopping_early: wallclock_cap train_time:600008ms step:6039/20000
peak memory allocated: 20607 MiB reserved: 20654 MiB
ema:applying EMA weights
saved ema_checkpoint.pt
Serialized model: 106832383 bytes
Code size: 81771 bytes
Serialized model int6+zstd: 15323962 bytes
Total submission size int6+zstd: 15405733 bytes
ttt:start lr=1.0 momentum=0.9 epochs=30 freeze_blocks=0
ttt_epoch:1/30 loss:5.7241 time:16.9s
ttt_epoch:2/30 loss:5.9672 time:33.5s
ttt_epoch:3/30 loss:5.7740 time:50.1s
ttt_epoch:4/30 loss:5.4566 time:66.7s
ttt_epoch:5/30 loss:3.7144 time:83.3s
ttt_epoch:6/30 loss:2.4766 time:99.9s
ttt_epoch:7/30 loss:2.1376 time:116.5s
ttt_epoch:8/30 loss:1.9896 time:133.1s
ttt_epoch:9/30 loss:1.9714 time:149.7s
ttt_epoch:10/30 loss:2.0874 time:166.3s
ttt_epoch:11/30 loss:2.0820 time:182.9s
ttt_epoch:12/30 loss:2.0418 time:199.5s
ttt_epoch:13/30 loss:1.9634 time:216.1s
ttt_epoch:14/30 loss:1.9549 time:232.7s
ttt_epoch:15/30 loss:1.9497 time:249.3s
ttt_epoch:16/30 loss:1.9454 time:265.9s
ttt_epoch:17/30 loss:1.9419 time:282.6s
ttt_epoch:18/30 loss:1.9387 time:299.1s
ttt_epoch:19/30 loss:1.9359 time:315.6s
ttt_epoch:20/30 loss:1.9333 time:332.1s
ttt_epoch:21/30 loss:1.9308 time:348.7s
ttt_epoch:22/30 loss:1.9284 time:365.2s
ttt_epoch:23/30 loss:1.9262 time:381.7s
ttt_epoch:24/30 loss:1.9241 time:398.3s
ttt_epoch:25/30 loss:1.9222 time:414.8s
ttt_epoch:26/30 loss:1.9201 time:431.3s
ttt_epoch:27/30 loss:1.9183 time:447.8s
ttt_epoch:28/30 loss:1.9165 time:464.4s
ttt_epoch:29/30 loss:1.9145 time:480.9s
ttt_epoch:30/30 loss:1.9128 time:497.4s
ttt:done elapsed=497.4s
ttt:elapsed=497.4s
final_int6_roundtrip val_loss:1.9123 val_bpb:1.1326 eval_time:1922ms
final_int6_roundtrip_exact val_loss:1.91228463 val_bpb:1.13256267
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:91736ms
final_int6_sliding_window_exact val_loss:1.87914996 val_bpb:1.11294140
records/track_10min_16mb/2026-03-25_TTT_Aggressive_SGD/seed2024.log
W0325 18:09:10.743000 47278 torch/distributed/run.py:803]
W0325 18:09:10.743000 47278 torch/distributed/run.py:803] *****************************************
W0325 18:09:10.743000 47278 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 18:09:10.743000 47278 torch/distributed/run.py:803] *****************************************
logs/08bcdeaa-9436-442c-ba64-802e468cc64d.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27354201
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2024
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9283 val_bpb:4.1033 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9298 train_time:136ms step_avg:136.19ms
step:2/20000 train_loss:9.8882 train_time:209ms step_avg:104.51ms
step:3/20000 train_loss:8.3873 train_time:296ms step_avg:98.73ms
step:4/20000 train_loss:7.7438 train_time:384ms step_avg:95.91ms
step:5/20000 train_loss:7.1677 train_time:471ms step_avg:94.12ms
step:6/20000 train_loss:6.9261 train_time:558ms step_avg:92.94ms
step:7/20000 train_loss:7.0807 train_time:646ms step_avg:92.23ms
step:8/20000 train_loss:6.9515 train_time:740ms step_avg:92.51ms
step:9/20000 train_loss:6.5535 train_time:838ms step_avg:93.06ms
step:10/20000 train_loss:6.2269 train_time:954ms step_avg:95.44ms
step:200/20000 train_loss:2.3709 train_time:18934ms step_avg:94.67ms
step:400/20000 train_loss:2.4110 train_time:39929ms step_avg:99.82ms
step:600/20000 train_loss:2.3238 train_time:57721ms step_avg:96.20ms
step:800/20000 train_loss:2.2141 train_time:76856ms step_avg:96.07ms
step:1000/20000 train_loss:2.2511 train_time:96331ms step_avg:96.33ms
step:1000/20000 val_loss:2.2028 val_bpb:1.3046 train_time:96376ms step_avg:96.38ms
step:1200/20000 train_loss:2.3247 train_time:118870ms step_avg:99.06ms
step:1400/20000 train_loss:2.1505 train_time:138805ms step_avg:99.15ms
step:1600/20000 train_loss:2.0428 train_time:158317ms step_avg:98.95ms
step:1800/20000 train_loss:2.1351 train_time:179370ms step_avg:99.65ms
step:2000/20000 train_loss:2.0526 train_time:199062ms step_avg:99.53ms
step:2000/20000 val_loss:2.1197 val_bpb:1.2554 train_time:199085ms step_avg:99.54ms
step:2200/20000 train_loss:2.1258 train_time:219886ms step_avg:99.95ms
step:2400/20000 train_loss:2.0584 train_time:237782ms step_avg:99.08ms
step:2600/20000 train_loss:2.1061 train_time:259097ms step_avg:99.65ms
step:2800/20000 train_loss:2.1589 train_time:280381ms step_avg:100.14ms
step:3000/20000 train_loss:2.1575 train_time:298165ms step_avg:99.39ms
step:3000/20000 val_loss:2.0933 val_bpb:1.2398 train_time:298191ms step_avg:99.40ms
step:3200/20000 train_loss:2.1760 train_time:318013ms step_avg:99.38ms
step:3400/20000 train_loss:2.0219 train_time:336253ms step_avg:98.90ms
step:3600/20000 train_loss:2.0957 train_time:357559ms step_avg:99.32ms
step:3800/20000 train_loss:2.0760 train_time:377699ms step_avg:99.39ms
step:4000/20000 train_loss:1.9864 train_time:398735ms step_avg:99.68ms
step:4000/20000 val_loss:2.0784 val_bpb:1.2309 train_time:398762ms step_avg:99.69ms
step:4200/20000 train_loss:2.1704 train_time:418022ms step_avg:99.53ms
step:4400/20000 train_loss:2.0550 train_time:436039ms step_avg:99.10ms
step:4600/20000 train_loss:1.8647 train_time:456177ms step_avg:99.17ms
step:4800/20000 train_loss:2.4504 train_time:475025ms step_avg:98.96ms
step:5000/20000 train_loss:2.1093 train_time:494156ms step_avg:98.83ms
step:5000/20000 val_loss:2.0304 val_bpb:1.2025 train_time:494180ms step_avg:98.84ms
step:5200/20000 train_loss:2.0425 train_time:511924ms step_avg:98.45ms
step:5400/20000 train_loss:2.0405 train_time:534622ms step_avg:99.00ms
step:5600/20000 train_loss:1.9377 train_time:553899ms step_avg:98.91ms
step:5800/20000 train_loss:1.9686 train_time:573023ms step_avg:98.80ms
late_qat:enabled step:5919 scale:0.0998
step:6000/20000 train_loss:1.9159 train_time:592251ms step_avg:98.71ms
step:6000/20000 val_loss:1.9486 val_bpb:1.1541 train_time:592274ms step_avg:98.71ms
step:6092/20000 val_loss:1.9445 val_bpb:1.1516 train_time:600009ms step_avg:98.49ms
stopping_early: wallclock_cap train_time:600009ms step:6092/20000
peak memory allocated: 20606 MiB reserved: 20656 MiB
ema:applying EMA weights
saved ema_checkpoint.pt
Serialized model: 106832383 bytes
Code size: 81771 bytes
Serialized model int6+zstd: 15299337 bytes
Total submission size int6+zstd: 15381108 bytes
ttt:start lr=1.0 momentum=0.9 epochs=30 freeze_blocks=0
ttt_epoch:1/30 loss:5.1426 time:16.9s
ttt_epoch:2/30 loss:3.4041 time:33.5s
ttt_epoch:3/30 loss:2.2817 time:50.1s
ttt_epoch:4/30 loss:2.1099 time:66.7s
ttt_epoch:5/30 loss:1.9857 time:83.3s
ttt_epoch:6/30 loss:1.9701 time:99.9s
ttt_epoch:7/30 loss:1.9618 time:116.5s
ttt_epoch:8/30 loss:1.9560 time:133.1s
ttt_epoch:9/30 loss:2.0294 time:149.7s
ttt_epoch:10/30 loss:2.0532 time:166.3s
ttt_epoch:11/30 loss:1.9803 time:183.0s
ttt_epoch:12/30 loss:1.9490 time:199.6s
ttt_epoch:13/30 loss:1.9441 time:216.2s
ttt_epoch:14/30 loss:1.9405 time:232.8s
ttt_epoch:15/30 loss:1.9373 time:249.4s
ttt_epoch:16/30 loss:1.9345 time:266.0s
ttt_epoch:17/30 loss:1.9320 time:282.6s
ttt_epoch:18/30 loss:1.9296 time:299.2s
ttt_epoch:19/30 loss:1.9274 time:315.8s
ttt_epoch:20/30 loss:1.9252 time:332.4s
ttt_epoch:21/30 loss:1.9232 time:349.0s
ttt_epoch:22/30 loss:1.9212 time:365.6s
ttt_epoch:23/30 loss:1.9193 time:382.3s
ttt_epoch:24/30 loss:1.9177 time:398.9s
ttt_epoch:25/30 loss:1.9158 time:415.5s
ttt_epoch:26/30 loss:1.9140 time:432.1s
ttt_epoch:27/30 loss:1.9124 time:448.7s
ttt_epoch:28/30 loss:1.9118 time:465.3s
ttt_epoch:29/30 loss:2.0171 time:481.9s
ttt_epoch:30/30 loss:1.9348 time:498.5s
ttt:done elapsed=498.6s
ttt:elapsed=498.6s
final_int6_roundtrip val_loss:1.9093 val_bpb:1.1308 eval_time:1930ms
final_int6_roundtrip_exact val_loss:1.90931604 val_bpb:1.13080451
final_int6_sliding_window val_loss:1.8766 val_bpb:1.1114 stride:64 eval_time:74088ms
final_int6_sliding_window_exact val_loss:1.87657671 val_bpb:1.11141737