81 changes: 81 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/README.md
# Kitchen Sink V2 — Improved

Built on PR #549 / KitchenSinkV2 with the following additions:

1. **Split early/late LR banks** — separate Muon and Adam optimizer banks for the first and second halves of the layer stack (see the param-group sketch after this list)
2. **MiLe margin loss** — triangle-scheduled margin loss with gamma=0.75, clamp_min=0.2
3. **Cache + backout residual** — layer 7 output cached and mixed back via learnable gate
4. **LeakyReLU(0.5)²** activation in MLP
5. **XSA on last 7 layers** (up from default 4)
6. **Coprime-stride multi-shard data loader** (PR #726 / #1060 style; sketched after this list)
7. **Train-data GPTQ int6 calibration** (PR #1060) — calibration uses training data within the training budget (14s reserved from 600s)
8. **Residual lambdas** — learnable per-sublayer residual scaling (init sqrt(1.1), 5x scalar LR, no weight decay; included in the param-group sketch below)
9. **Bigger bigram hash** — 6144 buckets (up from 2048), reducing the collision rate
10. **Bigger value embeddings** — dim=196 on layers 5,9,10 (up from dim=128 on layers 9,10)
11. **Flash Attention 3** via flash_attn_interface
12. **Sliding window eval** with stride=64 (sketched at the end of this README)
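
Items 1 and 8 both come down to how parameters are bucketed into optimizer groups. Below is a minimal sketch of one way to wire that up; the `blocks` layout, the `resid_lambda` parameter name, and the Muon construction are assumptions, not the actual train_gpt.py code:

```python
import torch
import torch.nn as nn

def build_param_groups(blocks: nn.ModuleList,
                       matrix_lr=0.036, matrix_lr_late=0.044,
                       scalar_lr=0.028, scalar_lr_late=0.018):
    """Split params into early/late banks; matrices go to Muon, the rest to Adam."""
    half = len(blocks) // 2
    matrices = {"early": [], "late": []}
    scalars = {"early": [], "late": []}
    lambdas = {"early": [], "late": []}
    for i, block in enumerate(blocks):
        bank = "early" if i < half else "late"
        for name, p in block.named_parameters():
            if "resid_lambda" in name:      # item 8 (init sqrt(1.1) at construction)
                lambdas[bank].append(p)
            elif p.ndim >= 2:               # weight matrices -> Muon bank
                matrices[bank].append(p)
            else:                           # gains/biases -> Adam bank
                scalars[bank].append(p)
    adam = torch.optim.Adam([
        {"params": scalars["early"], "lr": scalar_lr},
        {"params": scalars["late"], "lr": scalar_lr_late},
        # residual lambdas: 5x their bank's scalar LR, no weight decay
        {"params": lambdas["early"], "lr": 5 * scalar_lr, "weight_decay": 0.0},
        {"params": lambdas["late"], "lr": 5 * scalar_lr_late, "weight_decay": 0.0},
    ])
    # two Muon instances, one per bank (Muon itself not shown here):
    # muon_early = Muon(matrices["early"], lr=matrix_lr)
    # muon_late  = Muon(matrices["late"],  lr=matrix_lr_late)
    return adam, matrices
```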

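The coprime-stride loader (item 6) admits a compact sketch: walk each shard at a stride coprime to its batch count, so `(i * stride) % n` visits every batch exactly once per pass without materializing a permutation. The shard format and dtype below are assumptions in the PR #726 / #1060 spirit, not the exact loader:

```python
import math
import numpy as np

def pick_coprime_stride(n: int, seed: int) -> int:
    """Pick a stride coprime with n, so i -> (i * stride) % n is a bijection."""
    if n <= 1:
        return 1
    rng = np.random.default_rng(seed)
    while True:
        s = int(rng.integers(1, n))
        if math.gcd(s, n) == 1:
            return s

def iter_shard_batches(shard_paths, tokens_per_batch, seed=0):
    for k, path in enumerate(shard_paths):
        data = np.memmap(path, dtype=np.uint16, mode="r")
        n = len(data) // tokens_per_batch          # whole batches in this shard
        stride = pick_coprime_stride(n, seed + k)
        for i in range(n):
            j = (i * stride) % n                   # scrambled but exhaustive order
            yield data[j * tokens_per_batch:(j + 1) * tokens_per_batch]
```
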
## Results (12 seeds)

| Seed | val_loss (nats) | val_bpb |
|------|----------------|---------|
| 2 | 1.8793 | 1.1130 |
| 9999 | 1.8800 | 1.1134 |
| 22 | 1.8801 | 1.1135 |
| 7 | 1.8807 | 1.1139 |
| 2222 | 1.8807 | 1.1139 |
| 1337 | 1.8808 | 1.1139 |
| 99 | 1.8808 | 1.1139 |
| 2026 | 1.8814 | 1.1143 |
| 77 | 1.8815 | 1.1143 |
| 42 | 1.8817 | 1.1145 |
| 777 | 1.8818 | 1.1145 |
| 222 | 1.8820 | 1.1147 |

| Metric | val_loss (nats) | val_bpb |
|--------|----------------|---------|
| Mean | 1.8809 | 1.1140 |
| Std | 0.0008 | 0.0005 |

### Statistical significance

Current leader: 1.1194 bpb (~1.8901 nats).

- **Mean improvement: 0.0091 nats / 0.0054 bpb**
- One-sample t-test of per-seed val_loss against (leader - 0.005 nats), one-sided: t = -17.26, df = 11, **p < 0.0001**
- One-sample t-test of per-seed val_bpb against (leader - 0.005 bpb), one-sided: t = -2.93, df = 11, **p = 0.007**
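
The test can be reproduced from the per-seed table with `scipy`; `alternative="less"` asks whether this run's loss sits below the leader minus the required improvement:

```python
import numpy as np
from scipy import stats

val_loss = np.array([1.8793, 1.8800, 1.8801, 1.8807, 1.8807, 1.8808,
                     1.8808, 1.8814, 1.8815, 1.8817, 1.8818, 1.8820])
val_bpb = np.array([1.1130, 1.1134, 1.1135, 1.1139, 1.1139, 1.1139,
                    1.1139, 1.1143, 1.1143, 1.1145, 1.1145, 1.1147])

# one-sided one-sample t-tests against (leader - 0.005)
print(stats.ttest_1samp(val_loss, 1.8901 - 0.005, alternative="less"))
print(stats.ttest_1samp(val_bpb, 1.1194 - 0.005, alternative="less"))
# t ~ -17.x and ~ -2.9 with df=11; small differences from the numbers
# above come from rounding the leader's nats figure
```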

## Artifact size (worst-case, seed 777)

| Component | Bytes |
|-----------|-------|
| Model (int6+lzma) | 15,758,116 |
| Code | 126,292 |
| **Total** | **15,884,408** |

Under the 16,000,000 byte limit.
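
A minimal sketch of the budget check (the int6 packing itself is out of scope here; `model` and the code path are placeholders):

```python
import io
import lzma
import torch

def submission_bytes(model, code_path="train_gpt.py"):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)          # already int6-packed tensors
    model_bytes = len(lzma.compress(buf.getvalue(), preset=9))
    with open(code_path, "rb") as f:
        code_bytes = len(f.read())
    assert model_bytes + code_bytes <= 16_000_000, (model_bytes, code_bytes)
    return model_bytes, code_bytes
```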

## Hyperparameters

| Parameter | Value |
|-----------|-------|
| MATRIX_LR (early) | 0.036 |
| MATRIX_LR (late) | 0.044 |
| SCALAR_LR (early) | 0.028 |
| SCALAR_LR (late) | 0.018 |
| TIED_EMBED_LR | 0.022 |
| TRAIN_BATCH_TOKENS | 548,864 |
| BIGRAM_VOCAB_SIZE | 6,144 |
| VE_DIM | 196 |
| VE_LAYERS | 5,9,10 |
| RESID_LAMBDA_INIT | sqrt(1.1) |
| RESID_LAMBDA_LR | 5x scalar_lr |
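
Assuming train_gpt.py picks these up as environment-variable overrides with the defaults above (the exact mechanism is a guess), the mapping to the command below is direct:

```python
import os

MATRIX_LR = float(os.environ.get("MATRIX_LR", 0.036))
MATRIX_LR_LATE = float(os.environ.get("MATRIX_LR_LATE", 0.044))
SCALAR_LR = float(os.environ.get("SCALAR_LR", 0.028))
SCALAR_LR_LATE = float(os.environ.get("SCALAR_LR_LATE", 0.018))
TIED_EMBED_LR = float(os.environ.get("TIED_EMBED_LR", 0.022))
TRAIN_BATCH_TOKENS = int(os.environ.get("TRAIN_BATCH_TOKENS", 548_864))
SEED = int(os.environ.get("SEED", 1337))
```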

## Command

```bash
SEED=2 MATRIX_LR=0.036 MATRIX_LR_LATE=0.044 \
SCALAR_LR=0.028 SCALAR_LR_LATE=0.018 \
TIED_EMBED_LR=0.022 TRAIN_BATCH_TOKENS=548864 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
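
For reference, a hedged sketch of the stride-64 sliding-window eval (item 12): slide a full-length window across the validation stream and score only the trailing `stride` targets of each window, so nearly every token is predicted with maximal left context. `model` and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_loss(model, tokens, seq_len=2048, stride=64):
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens) - seq_len, stride):
        window = tokens[start:start + seq_len + 1]
        x, y = window[:-1], window[1:]
        logits = model(x.unsqueeze(0)).squeeze(0)        # (seq_len, vocab)
        # score all targets in the first window, then only the last `stride`
        keep = slice(None) if start == 0 else slice(-stride, None)
        total_nll += F.cross_entropy(logits[keep], y[keep],
                                     reduction="sum").item()
        total_tokens += y[keep].numel()
    return total_nll / total_tokens                      # nats per token
```

val_bpb then divides the per-token nats by ln(2) times the average bytes per token of the validation set.
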
87 changes: 87 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/logs/seed_1337.log
W0330 06:54:26.676000 777320 torch/distributed/run.py:851]
W0330 06:54:26.676000 777320 torch/distributed/run.py:851] *****************************************
W0330 06:54:26.676000 777320 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0330 06:54:26.676000 777320 torch/distributed/run.py:851] *****************************************
logs/a8570f39-0395-48c5-808c-9aefeca530ab.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/alejandro/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/home/alejandro/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27605108
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_7 active_layers:[4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.022 head_lr:0.0 matrix_lr:0.036 matrix_lr_late:0.044 scalar_lr:0.028 scalar_lr_late:0.018 leaky_slope:0.5
train_batch_tokens:548864 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
gptq:reserving 14000ms from training budget, effective=586000ms
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9308 val_bpb:4.1048 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9247 train_time:110ms step_avg:109.71ms
step:2/20000 train_loss:6.7770 train_time:152ms step_avg:75.85ms
step:3/20000 train_loss:6.4930 train_time:214ms step_avg:71.50ms
step:4/20000 train_loss:6.5366 train_time:277ms step_avg:69.25ms
step:5/20000 train_loss:6.5985 train_time:340ms step_avg:67.91ms
step:6/20000 train_loss:6.4721 train_time:402ms step_avg:67.03ms
step:7/20000 train_loss:6.2092 train_time:464ms step_avg:66.33ms
step:8/20000 train_loss:5.9450 train_time:527ms step_avg:65.83ms
step:9/20000 train_loss:5.6933 train_time:589ms step_avg:65.46ms
step:10/20000 train_loss:5.4641 train_time:651ms step_avg:65.13ms
step:500/20000 train_loss:2.3845 train_time:31395ms step_avg:62.79ms
step:1000/20000 train_loss:2.3372 train_time:62873ms step_avg:62.87ms
step:1500/20000 train_loss:2.0842 train_time:94426ms step_avg:62.95ms
step:2000/20000 train_loss:2.0495 train_time:126006ms step_avg:63.00ms
step:2500/20000 train_loss:2.0005 train_time:157628ms step_avg:63.05ms
step:3000/20000 train_loss:2.0028 train_time:189254ms step_avg:63.08ms
step:3500/20000 train_loss:1.9619 train_time:220898ms step_avg:63.11ms
step:4000/20000 train_loss:1.9232 train_time:252542ms step_avg:63.14ms
step:4000/20000 val_loss:2.1009 val_bpb:1.2443 train_time:252564ms step_avg:63.14ms
step:4500/20000 train_loss:1.9716 train_time:284189ms step_avg:63.15ms
step:5000/20000 train_loss:1.9463 train_time:315859ms step_avg:63.17ms
step:5500/20000 train_loss:1.9674 train_time:347524ms step_avg:63.19ms
step:6000/20000 train_loss:1.9196 train_time:379162ms step_avg:63.19ms
step:6500/20000 train_loss:1.9817 train_time:410803ms step_avg:63.20ms
step:7000/20000 train_loss:1.9584 train_time:442437ms step_avg:63.21ms
step:7500/20000 train_loss:1.9635 train_time:474047ms step_avg:63.21ms
step:8000/20000 train_loss:2.0453 train_time:505674ms step_avg:63.21ms
step:8000/20000 val_loss:1.9886 val_bpb:1.1778 train_time:505697ms step_avg:63.21ms
swa:start step:8500
step:8500/20000 train_loss:1.9046 train_time:537273ms step_avg:63.21ms
late_qat:enabled step:8666 scale:0.1500
step:9000/20000 train_loss:1.9402 train_time:569449ms step_avg:63.27ms
step:9258/20000 val_loss:1.9168 val_bpb:1.1353 train_time:586070ms step_avg:63.30ms
stopping_early: wallclock_cap train_time:586070ms step:9258/20000
peak memory allocated: 15105 MiB reserved: 15554 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9152 val_bpb:1.1343 eval_time:2046ms
Serialized model: 107418558 bytes
Code size: 126292 bytes
gptq:building non-banked model for Hessian collection...
gptq:calibrating with 256 batches (train data)...
gptq:collected hessians for 68 layers (train data)
Serialized model int6+lzma: 15679992 bytes
Total submission size int6+lzma: 15806284 bytes
final_int6_roundtrip val_loss:1.9209 val_bpb:1.1376 eval_time:17594ms
final_int6_roundtrip_exact val_loss:1.92086865 val_bpb:1.13764661
final_int6_sliding_window val_loss:1.8808 val_bpb:1.1139 stride:64 eval_time:97835ms
final_int6_sliding_window_exact val_loss:1.88081447 val_bpb:1.11392722
87 changes: 87 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/logs/seed_2.log
W0330 07:09:28.436000 792140 torch/distributed/run.py:851]
W0330 07:09:28.436000 792140 torch/distributed/run.py:851] *****************************************
W0330 07:09:28.436000 792140 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0330 07:09:28.436000 792140 torch/distributed/run.py:851] *****************************************
logs/667b6924-e538-45f7-9318-3507e2f546ee.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/alejandro/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/home/alejandro/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27605108
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_7 active_layers:[4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.022 head_lr:0.0 matrix_lr:0.036 matrix_lr_late:0.044 scalar_lr:0.028 scalar_lr_late:0.018 leaky_slope:0.5
train_batch_tokens:548864 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2
gptq:reserving 14000ms from training budget, effective=586000ms
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9294 val_bpb:4.1040 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9241 train_time:109ms step_avg:108.97ms
step:2/20000 train_loss:6.7906 train_time:152ms step_avg:75.86ms
step:3/20000 train_loss:6.5068 train_time:214ms step_avg:71.21ms
step:4/20000 train_loss:6.4991 train_time:277ms step_avg:69.15ms
step:5/20000 train_loss:6.6625 train_time:339ms step_avg:67.77ms
step:6/20000 train_loss:6.5477 train_time:402ms step_avg:66.95ms
step:7/20000 train_loss:6.2685 train_time:464ms step_avg:66.24ms
step:8/20000 train_loss:6.0249 train_time:526ms step_avg:65.71ms
step:9/20000 train_loss:5.7601 train_time:588ms step_avg:65.35ms
step:10/20000 train_loss:5.5040 train_time:650ms step_avg:65.01ms
step:500/20000 train_loss:2.3872 train_time:31397ms step_avg:62.79ms
step:1000/20000 train_loss:2.3405 train_time:62872ms step_avg:62.87ms
step:1500/20000 train_loss:2.0889 train_time:94433ms step_avg:62.96ms
step:2000/20000 train_loss:2.0409 train_time:126034ms step_avg:63.02ms
step:2500/20000 train_loss:1.9906 train_time:157648ms step_avg:63.06ms
step:3000/20000 train_loss:1.9979 train_time:189269ms step_avg:63.09ms
step:3500/20000 train_loss:1.9596 train_time:220895ms step_avg:63.11ms
step:4000/20000 train_loss:1.9324 train_time:252536ms step_avg:63.13ms
step:4000/20000 val_loss:2.1005 val_bpb:1.2441 train_time:252559ms step_avg:63.14ms
step:4500/20000 train_loss:1.9775 train_time:284170ms step_avg:63.15ms
step:5000/20000 train_loss:1.9436 train_time:315811ms step_avg:63.16ms
step:5500/20000 train_loss:1.9671 train_time:347446ms step_avg:63.17ms
step:6000/20000 train_loss:1.9125 train_time:379056ms step_avg:63.18ms
step:6500/20000 train_loss:1.9808 train_time:410680ms step_avg:63.18ms
step:7000/20000 train_loss:1.9565 train_time:442304ms step_avg:63.19ms
step:7500/20000 train_loss:1.9611 train_time:473918ms step_avg:63.19ms
step:8000/20000 train_loss:2.0385 train_time:505525ms step_avg:63.19ms
step:8000/20000 val_loss:1.9871 val_bpb:1.1769 train_time:505549ms step_avg:63.19ms
swa:start step:8500
step:8500/20000 train_loss:1.9000 train_time:537140ms step_avg:63.19ms
late_qat:enabled step:8669 scale:0.1499
step:9000/20000 train_loss:1.9396 train_time:569204ms step_avg:63.24ms
step:9261/20000 val_loss:1.9156 val_bpb:1.1345 train_time:586035ms step_avg:63.28ms
stopping_early: wallclock_cap train_time:586035ms step:9261/20000
peak memory allocated: 15105 MiB reserved: 15554 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9139 val_bpb:1.1335 eval_time:2051ms
Serialized model: 107418558 bytes
Code size: 126292 bytes
gptq:building non-banked model for Hessian collection...
gptq:calibrating with 256 batches (train data)...
gptq:collected hessians for 68 layers (train data)
Serialized model int6+lzma: 15743224 bytes
Total submission size int6+lzma: 15869516 bytes
final_int6_roundtrip val_loss:1.9195 val_bpb:1.1369 eval_time:17520ms
final_int6_roundtrip_exact val_loss:1.91954305 val_bpb:1.13686152
final_int6_sliding_window val_loss:1.8793 val_bpb:1.1130 stride:64 eval_time:97387ms
final_int6_sliding_window_exact val_loss:1.87928859 val_bpb:1.11302350
87 changes: 87 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/logs/seed_2026.log
W0330 09:07:18.516000 888790 torch/distributed/run.py:851]
W0330 09:07:18.516000 888790 torch/distributed/run.py:851] *****************************************
W0330 09:07:18.516000 888790 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0330 09:07:18.516000 888790 torch/distributed/run.py:851] *****************************************
logs/3311b6a2-03a4-4d56-9782-cd1fc50a9206.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/alejandro/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/home/alejandro/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27605108
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_7 active_layers:[4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.022 head_lr:0.0 matrix_lr:0.036 matrix_lr_late:0.044 scalar_lr:0.028 scalar_lr_late:0.018 leaky_slope:0.5
train_batch_tokens:548864 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2026
gptq:reserving 14000ms from training budget, effective=586000ms
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9290 val_bpb:4.1037 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9243 train_time:110ms step_avg:109.63ms
step:2/20000 train_loss:6.7677 train_time:152ms step_avg:76.04ms
step:3/20000 train_loss:6.5152 train_time:214ms step_avg:71.39ms
step:4/20000 train_loss:6.4932 train_time:276ms step_avg:69.09ms
step:5/20000 train_loss:6.5611 train_time:339ms step_avg:67.78ms
step:6/20000 train_loss:6.4592 train_time:401ms step_avg:66.82ms
step:7/20000 train_loss:6.1394 train_time:464ms step_avg:66.31ms
step:8/20000 train_loss:5.8797 train_time:526ms step_avg:65.75ms
step:9/20000 train_loss:5.6550 train_time:588ms step_avg:65.34ms
step:10/20000 train_loss:5.4873 train_time:651ms step_avg:65.10ms
step:500/20000 train_loss:2.3741 train_time:31431ms step_avg:62.86ms
step:1000/20000 train_loss:2.3251 train_time:62952ms step_avg:62.95ms
step:1500/20000 train_loss:2.0813 train_time:94551ms step_avg:63.03ms
step:2000/20000 train_loss:2.0445 train_time:126252ms step_avg:63.13ms
step:2500/20000 train_loss:1.9945 train_time:158000ms step_avg:63.20ms
step:3000/20000 train_loss:2.0020 train_time:189740ms step_avg:63.25ms
step:3500/20000 train_loss:1.9535 train_time:221468ms step_avg:63.28ms
step:4000/20000 train_loss:1.9343 train_time:253198ms step_avg:63.30ms
step:4000/20000 val_loss:2.1019 val_bpb:1.2448 train_time:253220ms step_avg:63.31ms
step:4500/20000 train_loss:1.9815 train_time:284945ms step_avg:63.32ms
step:5000/20000 train_loss:1.9464 train_time:316690ms step_avg:63.34ms
step:5500/20000 train_loss:1.9712 train_time:348459ms step_avg:63.36ms
step:6000/20000 train_loss:1.9147 train_time:380227ms step_avg:63.37ms
step:6500/20000 train_loss:1.9848 train_time:411966ms step_avg:63.38ms
step:7000/20000 train_loss:1.9573 train_time:443717ms step_avg:63.39ms
step:7500/20000 train_loss:1.9588 train_time:475440ms step_avg:63.39ms
step:8000/20000 train_loss:2.0435 train_time:507168ms step_avg:63.40ms
step:8000/20000 val_loss:1.9877 val_bpb:1.1772 train_time:507191ms step_avg:63.40ms
swa:start step:8450
step:8500/20000 train_loss:1.9047 train_time:538967ms step_avg:63.41ms
late_qat:enabled step:8639 scale:0.1499
step:9000/20000 train_loss:1.9395 train_time:571120ms step_avg:63.46ms
step:9231/20000 val_loss:1.9177 val_bpb:1.1358 train_time:586026ms step_avg:63.48ms
stopping_early: wallclock_cap train_time:586026ms step:9231/20000
peak memory allocated: 15105 MiB reserved: 15496 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9160 val_bpb:1.1348 eval_time:2047ms
Serialized model: 107418558 bytes
Code size: 126292 bytes
gptq:building non-banked model for Hessian collection...
gptq:calibrating with 256 batches (train data)...
gptq:collected hessians for 68 layers (train data)
Serialized model int6+lzma: 15625596 bytes
Total submission size int6+lzma: 15751888 bytes
final_int6_roundtrip val_loss:1.9216 val_bpb:1.1381 eval_time:17631ms
final_int6_roundtrip_exact val_loss:1.92160494 val_bpb:1.13808269
final_int6_sliding_window val_loss:1.8814 val_bpb:1.1143 stride:64 eval_time:97664ms
final_int6_sliding_window_exact val_loss:1.88143159 val_bpb:1.11429271