81 changes: 81 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/README.md
# Kitchen Sink V2 — Improved

Built on PR #549 / KitchenSinkV2 with the following additions:

1. **Split early/late LR banks** — separate Muon and Adam optimizer banks for the first and second halves of the layer stack (see the param-group sketch after this list)
2. **MiLe margin loss** — triangle-scheduled margin loss with gamma=0.75, clamp_min=0.2
3. **Cache + backout residual** — layer 7 output cached and mixed back via learnable gate
4. **LeakyReLU(0.5)²** activation in MLP
5. **XSA on last 7 layers** (up from default 4)
6. **Coprime-stride multi-shard data loader** (PR #726 / #1060 style; sketched after this list)
7. **Train-data GPTQ int6 calibration** (PR #1060) — calibration uses training data within the training budget (14s reserved from 600s)
8. **Residual lambdas** — learnable per-sublayer residual scaling (init sqrt(1.1), 5x scalar LR, no weight decay; included in the param-group sketch below)
9. **Bigger bigram hash** — 6144 buckets (up from 2048), reducing the collision rate
10. **Bigger value embeddings** — dim=196 on layers 5,9,10 (up from dim=128 on layers 9,10)
11. **Flash Attention 3** via flash_attn_interface
12. **Sliding window eval** with stride=64 (sketched at the end of this README)
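
Items 1 and 8 both come down to how parameters are bucketed into optimizer groups. Below is a minimal sketch of one way to wire that up; the `blocks` layout, the `resid_lambda` parameter name, and the Muon construction are assumptions, not the actual train_gpt.py code:

```python
import torch
import torch.nn as nn

def build_param_groups(blocks: nn.ModuleList,
                       matrix_lr=0.036, matrix_lr_late=0.044,
                       scalar_lr=0.028, scalar_lr_late=0.018):
    """Split params into early/late banks; matrices go to Muon, the rest to Adam."""
    half = len(blocks) // 2
    matrices = {"early": [], "late": []}
    scalars = {"early": [], "late": []}
    lambdas = {"early": [], "late": []}
    for i, block in enumerate(blocks):
        bank = "early" if i < half else "late"
        for name, p in block.named_parameters():
            if "resid_lambda" in name:      # item 8 (init sqrt(1.1) at construction)
                lambdas[bank].append(p)
            elif p.ndim >= 2:               # weight matrices -> Muon bank
                matrices[bank].append(p)
            else:                           # gains/biases -> Adam bank
                scalars[bank].append(p)
    adam = torch.optim.Adam([
        {"params": scalars["early"], "lr": scalar_lr},
        {"params": scalars["late"], "lr": scalar_lr_late},
        # residual lambdas: 5x their bank's scalar LR, no weight decay
        {"params": lambdas["early"], "lr": 5 * scalar_lr, "weight_decay": 0.0},
        {"params": lambdas["late"], "lr": 5 * scalar_lr_late, "weight_decay": 0.0},
    ])
    # two Muon instances, one per bank (Muon itself not shown here):
    # muon_early = Muon(matrices["early"], lr=matrix_lr)
    # muon_late  = Muon(matrices["late"],  lr=matrix_lr_late)
    return adam, matrices
```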

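The coprime-stride loader (item 6) admits a compact sketch: walk each shard at a stride coprime to its batch count, so `(i * stride) % n` visits every batch exactly once per pass without materializing a permutation. The shard format and dtype below are assumptions in the PR #726 / #1060 spirit, not the exact loader:

```python
import math
import numpy as np

def pick_coprime_stride(n: int, seed: int) -> int:
    """Pick a stride coprime with n, so i -> (i * stride) % n is a bijection."""
    if n <= 1:
        return 1
    rng = np.random.default_rng(seed)
    while True:
        s = int(rng.integers(1, n))
        if math.gcd(s, n) == 1:
            return s

def iter_shard_batches(shard_paths, tokens_per_batch, seed=0):
    for k, path in enumerate(shard_paths):
        data = np.memmap(path, dtype=np.uint16, mode="r")
        n = len(data) // tokens_per_batch          # whole batches in this shard
        stride = pick_coprime_stride(n, seed + k)
        for i in range(n):
            j = (i * stride) % n                   # scrambled but exhaustive order
            yield data[j * tokens_per_batch:(j + 1) * tokens_per_batch]
```
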
## Results (12 seeds)

| Seed | val_loss (nats) | val_bpb |
|------|----------------|---------|
| 2 | 1.8793 | 1.1130 |
| 9999 | 1.8800 | 1.1134 |
| 22 | 1.8801 | 1.1135 |
| 7 | 1.8807 | 1.1139 |
| 2222 | 1.8807 | 1.1139 |
| 1337 | 1.8808 | 1.1139 |
| 99 | 1.8808 | 1.1139 |
| 2026 | 1.8814 | 1.1143 |
| 77 | 1.8815 | 1.1143 |
| 42 | 1.8817 | 1.1145 |
| 777 | 1.8818 | 1.1145 |
| 222 | 1.8820 | 1.1147 |

| Metric | val_loss (nats) | val_bpb |
|--------|----------------|---------|
| Mean | 1.8809 | 1.1140 |
| Std | 0.0008 | 0.0005 |

### Statistical significance

Current leader: 1.1194 bpb (~1.8901 nats).

- **Mean improvement: 0.0091 nats / 0.0054 bpb**
- One-sample t-test of per-seed val_loss against (leader - 0.005 nats), one-sided: t = -17.26, df = 11, **p < 0.0001**
- One-sample t-test of per-seed val_bpb against (leader - 0.005 bpb), one-sided: t = -2.93, df = 11, **p = 0.007**
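
The test can be reproduced from the per-seed table with `scipy`; `alternative="less"` asks whether this run's loss sits below the leader minus the required improvement:

```python
import numpy as np
from scipy import stats

val_loss = np.array([1.8793, 1.8800, 1.8801, 1.8807, 1.8807, 1.8808,
                     1.8808, 1.8814, 1.8815, 1.8817, 1.8818, 1.8820])
val_bpb = np.array([1.1130, 1.1134, 1.1135, 1.1139, 1.1139, 1.1139,
                    1.1139, 1.1143, 1.1143, 1.1145, 1.1145, 1.1147])

# one-sided one-sample t-tests against (leader - 0.005)
print(stats.ttest_1samp(val_loss, 1.8901 - 0.005, alternative="less"))
print(stats.ttest_1samp(val_bpb, 1.1194 - 0.005, alternative="less"))
# t ~ -17.x and ~ -2.9 with df=11; small differences from the numbers
# above come from rounding the leader's nats figure
```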

## Artifact size (worst-case, seed 777)

| Component | Bytes |
|-----------|-------|
| Model (int6+lzma) | 15,758,116 |
| Code | 126,292 |
| **Total** | **15,884,408** |

Under the 16,000,000 byte limit.
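
A minimal sketch of the budget check (the int6 packing itself is out of scope here; `model` and the code path are placeholders):

```python
import io
import lzma
import torch

def submission_bytes(model, code_path="train_gpt.py"):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)          # already int6-packed tensors
    model_bytes = len(lzma.compress(buf.getvalue(), preset=9))
    with open(code_path, "rb") as f:
        code_bytes = len(f.read())
    assert model_bytes + code_bytes <= 16_000_000, (model_bytes, code_bytes)
    return model_bytes, code_bytes
```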

## Hyperparameters

| Parameter | Value |
|-----------|-------|
| MATRIX_LR (early) | 0.036 |
| MATRIX_LR (late) | 0.044 |
| SCALAR_LR (early) | 0.028 |
| SCALAR_LR (late) | 0.018 |
| TIED_EMBED_LR | 0.022 |
| TRAIN_BATCH_TOKENS | 548,864 |
| BIGRAM_VOCAB_SIZE | 6,144 |
| VE_DIM | 196 |
| VE_LAYERS | 5,9,10 |
| RESID_LAMBDA_INIT | sqrt(1.1) |
| RESID_LAMBDA_LR | 5x scalar_lr |
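
Assuming train_gpt.py picks these up as environment-variable overrides with the defaults above (the exact mechanism is a guess), the mapping to the command below is direct:

```python
import os

MATRIX_LR = float(os.environ.get("MATRIX_LR", 0.036))
MATRIX_LR_LATE = float(os.environ.get("MATRIX_LR_LATE", 0.044))
SCALAR_LR = float(os.environ.get("SCALAR_LR", 0.028))
SCALAR_LR_LATE = float(os.environ.get("SCALAR_LR_LATE", 0.018))
TIED_EMBED_LR = float(os.environ.get("TIED_EMBED_LR", 0.022))
TRAIN_BATCH_TOKENS = int(os.environ.get("TRAIN_BATCH_TOKENS", 548_864))
SEED = int(os.environ.get("SEED", 1337))
```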

## Command

```bash
SEED=2 MATRIX_LR=0.036 MATRIX_LR_LATE=0.044 \
SCALAR_LR=0.028 SCALAR_LR_LATE=0.018 \
TIED_EMBED_LR=0.022 TRAIN_BATCH_TOKENS=548864 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
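
For reference, a hedged sketch of the stride-64 sliding-window eval (item 12): slide a full-length window across the validation stream and score only the trailing `stride` targets of each window, so nearly every token is predicted with maximal left context. `model` and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_loss(model, tokens, seq_len=2048, stride=64):
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens) - seq_len, stride):
        window = tokens[start:start + seq_len + 1]
        x, y = window[:-1], window[1:]
        logits = model(x.unsqueeze(0)).squeeze(0)        # (seq_len, vocab)
        # score all targets in the first window, then only the last `stride`
        keep = slice(None) if start == 0 else slice(-stride, None)
        total_nll += F.cross_entropy(logits[keep], y[keep],
                                     reduction="sum").item()
        total_tokens += y[keep].numel()
    return total_nll / total_tokens                      # nats per token
```

val_bpb then divides the per-token nats by ln(2) times the average bytes per token of the validation set.
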
87 changes: 87 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/logs/seed_1337.log
W0330 06:54:26.676000 777320 torch/distributed/run.py:851]
W0330 06:54:26.676000 777320 torch/distributed/run.py:851] *****************************************
W0330 06:54:26.676000 777320 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0330 06:54:26.676000 777320 torch/distributed/run.py:851] *****************************************
logs/a8570f39-0395-48c5-808c-9aefeca530ab.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/alejandro/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/home/alejandro/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27605108
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_7 active_layers:[4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.022 head_lr:0.0 matrix_lr:0.036 matrix_lr_late:0.044 scalar_lr:0.028 scalar_lr_late:0.018 leaky_slope:0.5
train_batch_tokens:548864 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
gptq:reserving 14000ms from training budget, effective=586000ms
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9308 val_bpb:4.1048 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9247 train_time:110ms step_avg:109.71ms
step:2/20000 train_loss:6.7770 train_time:152ms step_avg:75.85ms
step:3/20000 train_loss:6.4930 train_time:214ms step_avg:71.50ms
step:4/20000 train_loss:6.5366 train_time:277ms step_avg:69.25ms
step:5/20000 train_loss:6.5985 train_time:340ms step_avg:67.91ms
step:6/20000 train_loss:6.4721 train_time:402ms step_avg:67.03ms
step:7/20000 train_loss:6.2092 train_time:464ms step_avg:66.33ms
step:8/20000 train_loss:5.9450 train_time:527ms step_avg:65.83ms
step:9/20000 train_loss:5.6933 train_time:589ms step_avg:65.46ms
step:10/20000 train_loss:5.4641 train_time:651ms step_avg:65.13ms
step:500/20000 train_loss:2.3845 train_time:31395ms step_avg:62.79ms
step:1000/20000 train_loss:2.3372 train_time:62873ms step_avg:62.87ms
step:1500/20000 train_loss:2.0842 train_time:94426ms step_avg:62.95ms
step:2000/20000 train_loss:2.0495 train_time:126006ms step_avg:63.00ms
step:2500/20000 train_loss:2.0005 train_time:157628ms step_avg:63.05ms
step:3000/20000 train_loss:2.0028 train_time:189254ms step_avg:63.08ms
step:3500/20000 train_loss:1.9619 train_time:220898ms step_avg:63.11ms
step:4000/20000 train_loss:1.9232 train_time:252542ms step_avg:63.14ms
step:4000/20000 val_loss:2.1009 val_bpb:1.2443 train_time:252564ms step_avg:63.14ms
step:4500/20000 train_loss:1.9716 train_time:284189ms step_avg:63.15ms
step:5000/20000 train_loss:1.9463 train_time:315859ms step_avg:63.17ms
step:5500/20000 train_loss:1.9674 train_time:347524ms step_avg:63.19ms
step:6000/20000 train_loss:1.9196 train_time:379162ms step_avg:63.19ms
step:6500/20000 train_loss:1.9817 train_time:410803ms step_avg:63.20ms
step:7000/20000 train_loss:1.9584 train_time:442437ms step_avg:63.21ms
step:7500/20000 train_loss:1.9635 train_time:474047ms step_avg:63.21ms
step:8000/20000 train_loss:2.0453 train_time:505674ms step_avg:63.21ms
step:8000/20000 val_loss:1.9886 val_bpb:1.1778 train_time:505697ms step_avg:63.21ms
swa:start step:8500
step:8500/20000 train_loss:1.9046 train_time:537273ms step_avg:63.21ms
late_qat:enabled step:8666 scale:0.1500
step:9000/20000 train_loss:1.9402 train_time:569449ms step_avg:63.27ms
step:9258/20000 val_loss:1.9168 val_bpb:1.1353 train_time:586070ms step_avg:63.30ms
stopping_early: wallclock_cap train_time:586070ms step:9258/20000
peak memory allocated: 15105 MiB reserved: 15554 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9152 val_bpb:1.1343 eval_time:2046ms
Serialized model: 107418558 bytes
Code size: 126292 bytes
gptq:building non-banked model for Hessian collection...
gptq:calibrating with 256 batches (train data)...
gptq:collected hessians for 68 layers (train data)
Serialized model int6+lzma: 15679992 bytes
Total submission size int6+lzma: 15806284 bytes
final_int6_roundtrip val_loss:1.9209 val_bpb:1.1376 eval_time:17594ms
final_int6_roundtrip_exact val_loss:1.92086865 val_bpb:1.13764661
final_int6_sliding_window val_loss:1.8808 val_bpb:1.1139 stride:64 eval_time:97835ms
final_int6_sliding_window_exact val_loss:1.88081447 val_bpb:1.11392722
87 changes: 87 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/logs/seed_2.log
W0330 07:09:28.436000 792140 torch/distributed/run.py:851]
W0330 07:09:28.436000 792140 torch/distributed/run.py:851] *****************************************
W0330 07:09:28.436000 792140 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0330 07:09:28.436000 792140 torch/distributed/run.py:851] *****************************************
logs/667b6924-e538-45f7-9318-3507e2f546ee.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/alejandro/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/home/alejandro/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27605108
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_7 active_layers:[4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.022 head_lr:0.0 matrix_lr:0.036 matrix_lr_late:0.044 scalar_lr:0.028 scalar_lr_late:0.018 leaky_slope:0.5
train_batch_tokens:548864 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2
gptq:reserving 14000ms from training budget, effective=586000ms
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9294 val_bpb:4.1040 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9241 train_time:109ms step_avg:108.97ms
step:2/20000 train_loss:6.7906 train_time:152ms step_avg:75.86ms
step:3/20000 train_loss:6.5068 train_time:214ms step_avg:71.21ms
step:4/20000 train_loss:6.4991 train_time:277ms step_avg:69.15ms
step:5/20000 train_loss:6.6625 train_time:339ms step_avg:67.77ms
step:6/20000 train_loss:6.5477 train_time:402ms step_avg:66.95ms
step:7/20000 train_loss:6.2685 train_time:464ms step_avg:66.24ms
step:8/20000 train_loss:6.0249 train_time:526ms step_avg:65.71ms
step:9/20000 train_loss:5.7601 train_time:588ms step_avg:65.35ms
step:10/20000 train_loss:5.5040 train_time:650ms step_avg:65.01ms
step:500/20000 train_loss:2.3872 train_time:31397ms step_avg:62.79ms
step:1000/20000 train_loss:2.3405 train_time:62872ms step_avg:62.87ms
step:1500/20000 train_loss:2.0889 train_time:94433ms step_avg:62.96ms
step:2000/20000 train_loss:2.0409 train_time:126034ms step_avg:63.02ms
step:2500/20000 train_loss:1.9906 train_time:157648ms step_avg:63.06ms
step:3000/20000 train_loss:1.9979 train_time:189269ms step_avg:63.09ms
step:3500/20000 train_loss:1.9596 train_time:220895ms step_avg:63.11ms
step:4000/20000 train_loss:1.9324 train_time:252536ms step_avg:63.13ms
step:4000/20000 val_loss:2.1005 val_bpb:1.2441 train_time:252559ms step_avg:63.14ms
step:4500/20000 train_loss:1.9775 train_time:284170ms step_avg:63.15ms
step:5000/20000 train_loss:1.9436 train_time:315811ms step_avg:63.16ms
step:5500/20000 train_loss:1.9671 train_time:347446ms step_avg:63.17ms
step:6000/20000 train_loss:1.9125 train_time:379056ms step_avg:63.18ms
step:6500/20000 train_loss:1.9808 train_time:410680ms step_avg:63.18ms
step:7000/20000 train_loss:1.9565 train_time:442304ms step_avg:63.19ms
step:7500/20000 train_loss:1.9611 train_time:473918ms step_avg:63.19ms
step:8000/20000 train_loss:2.0385 train_time:505525ms step_avg:63.19ms
step:8000/20000 val_loss:1.9871 val_bpb:1.1769 train_time:505549ms step_avg:63.19ms
swa:start step:8500
step:8500/20000 train_loss:1.9000 train_time:537140ms step_avg:63.19ms
late_qat:enabled step:8669 scale:0.1499
step:9000/20000 train_loss:1.9396 train_time:569204ms step_avg:63.24ms
step:9261/20000 val_loss:1.9156 val_bpb:1.1345 train_time:586035ms step_avg:63.28ms
stopping_early: wallclock_cap train_time:586035ms step:9261/20000
peak memory allocated: 15105 MiB reserved: 15554 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9139 val_bpb:1.1335 eval_time:2051ms
Serialized model: 107418558 bytes
Code size: 126292 bytes
gptq:building non-banked model for Hessian collection...
gptq:calibrating with 256 batches (train data)...
gptq:collected hessians for 68 layers (train data)
Serialized model int6+lzma: 15743224 bytes
Total submission size int6+lzma: 15869516 bytes
final_int6_roundtrip val_loss:1.9195 val_bpb:1.1369 eval_time:17520ms
final_int6_roundtrip_exact val_loss:1.91954305 val_bpb:1.13686152
final_int6_sliding_window val_loss:1.8793 val_bpb:1.1130 stride:64 eval_time:97387ms
final_int6_sliding_window_exact val_loss:1.87928859 val_bpb:1.11302350
87 changes: 87 additions & 0 deletions records/track_10min_16mb/2026-03-29_KitchenSinkV2/logs/seed_2026.log
W0330 09:07:18.516000 888790 torch/distributed/run.py:851]
W0330 09:07:18.516000 888790 torch/distributed/run.py:851] *****************************************
W0330 09:07:18.516000 888790 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0330 09:07:18.516000 888790 torch/distributed/run.py:851] *****************************************
logs/3311b6a2-03a4-4d56-9782-cd1fc50a9206.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/alejandro/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/home/alejandro/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27605108
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_7 active_layers:[4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.022 head_lr:0.0 matrix_lr:0.036 matrix_lr_late:0.044 scalar_lr:0.028 scalar_lr_late:0.018 leaky_slope:0.5
train_batch_tokens:548864 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2026
gptq:reserving 14000ms from training budget, effective=586000ms
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9290 val_bpb:4.1037 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9243 train_time:110ms step_avg:109.63ms
step:2/20000 train_loss:6.7677 train_time:152ms step_avg:76.04ms
step:3/20000 train_loss:6.5152 train_time:214ms step_avg:71.39ms
step:4/20000 train_loss:6.4932 train_time:276ms step_avg:69.09ms
step:5/20000 train_loss:6.5611 train_time:339ms step_avg:67.78ms
step:6/20000 train_loss:6.4592 train_time:401ms step_avg:66.82ms
step:7/20000 train_loss:6.1394 train_time:464ms step_avg:66.31ms
step:8/20000 train_loss:5.8797 train_time:526ms step_avg:65.75ms
step:9/20000 train_loss:5.6550 train_time:588ms step_avg:65.34ms
step:10/20000 train_loss:5.4873 train_time:651ms step_avg:65.10ms
step:500/20000 train_loss:2.3741 train_time:31431ms step_avg:62.86ms
step:1000/20000 train_loss:2.3251 train_time:62952ms step_avg:62.95ms
step:1500/20000 train_loss:2.0813 train_time:94551ms step_avg:63.03ms
step:2000/20000 train_loss:2.0445 train_time:126252ms step_avg:63.13ms
step:2500/20000 train_loss:1.9945 train_time:158000ms step_avg:63.20ms
step:3000/20000 train_loss:2.0020 train_time:189740ms step_avg:63.25ms
step:3500/20000 train_loss:1.9535 train_time:221468ms step_avg:63.28ms
step:4000/20000 train_loss:1.9343 train_time:253198ms step_avg:63.30ms
step:4000/20000 val_loss:2.1019 val_bpb:1.2448 train_time:253220ms step_avg:63.31ms
step:4500/20000 train_loss:1.9815 train_time:284945ms step_avg:63.32ms
step:5000/20000 train_loss:1.9464 train_time:316690ms step_avg:63.34ms
step:5500/20000 train_loss:1.9712 train_time:348459ms step_avg:63.36ms
step:6000/20000 train_loss:1.9147 train_time:380227ms step_avg:63.37ms
step:6500/20000 train_loss:1.9848 train_time:411966ms step_avg:63.38ms
step:7000/20000 train_loss:1.9573 train_time:443717ms step_avg:63.39ms
step:7500/20000 train_loss:1.9588 train_time:475440ms step_avg:63.39ms
step:8000/20000 train_loss:2.0435 train_time:507168ms step_avg:63.40ms
step:8000/20000 val_loss:1.9877 val_bpb:1.1772 train_time:507191ms step_avg:63.40ms
swa:start step:8450
step:8500/20000 train_loss:1.9047 train_time:538967ms step_avg:63.41ms
late_qat:enabled step:8639 scale:0.1499
step:9000/20000 train_loss:1.9395 train_time:571120ms step_avg:63.46ms
step:9231/20000 val_loss:1.9177 val_bpb:1.1358 train_time:586026ms step_avg:63.48ms
stopping_early: wallclock_cap train_time:586026ms step:9231/20000
peak memory allocated: 15105 MiB reserved: 15496 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9160 val_bpb:1.1348 eval_time:2047ms
Serialized model: 107418558 bytes
Code size: 126292 bytes
gptq:building non-banked model for Hessian collection...
gptq:calibrating with 256 batches (train data)...
gptq:collected hessians for 68 layers (train data)
Serialized model int6+lzma: 15625596 bytes
Total submission size int6+lzma: 15751888 bytes
final_int6_roundtrip val_loss:1.9216 val_bpb:1.1381 eval_time:17631ms
final_int6_roundtrip_exact val_loss:1.92160494 val_bpb:1.13808269
final_int6_sliding_window val_loss:1.8814 val_bpb:1.1143 stride:64 eval_time:97664ms
final_int6_sliding_window_exact val_loss:1.88143159 val_bpb:1.11429271