# Non-record (WIP): Multi-Order N-gram Backoff + Entropy-Adaptive Alpha

**Status: WIP** — validated on 1xH100 SXM proxy run, pending 8xH100 SXM verification for official record.

**Proxy val_bpb = 0.8004** (1xH100, 876 steps, 59% eval coverage) | **15.18 MB** | Seed 42

## Summary

Fork of the PR #828 approach (10L + Multi-Order N-gram Backoff) with `MATRIX_LR=0.03`. The n-gram backoff eval cache yields a large BPB improvement over the neural-only model (1.3796 → 0.8004 on this proxy run) by mixing model predictions with backward-looking n-gram statistics at eval time.

## 1xH100 Proxy Results

| Metric | Value |
|--------|-------|
| Training steps | 876 (1xH100, 600s wall clock) |
| Pre-quant val_bpb | 1.3796 |
| **N-gram eval BPB** | **0.8004** |
| Artifact size | 15.18 MB |
| Eval coverage | 59.4% (570s failsafe) |
| N-gram orders | 2-7, entropy-adaptive alpha |

**Note**: This is a proxy run on 1xH100 with only 876 training steps (vs ~7000 on 8xH100). The base model quality (1.38 BPB) is significantly weaker than what 8xH100 would produce (~1.15 BPB). On 8xH100, we expect the final n-gram BPB to be ~0.90-0.92, consistent with PR #828's reported 0.9076.

## Architecture

- 10L, 512d, GQA 8H/4KV, MLP 3x LeakyReLU(0.5)^2
- BigramHash(4096, dim=128), SmearGate, Value Residual, Gated Attention
- XSA last 4 layers, Partial RoPE 16/64, LN Scale
- U-Net skip connections, tied embeddings, logit softcap=30
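The logit softcap in the list above can be sketched with the standard tanh formulation; this is an assumption, since the PR does not show its exact implementation:

```python
import math

def softcap_logits(logits, cap=30.0):
    # Smoothly bound logits to (-cap, cap) via tanh; the common softcap form
    # (assumed here -- the PR only states softcap=30, not the exact code).
    # Near zero the map is ~identity; large logits saturate toward +/-cap.
    return [cap * math.tanh(x / cap) for x in logits]
```

For small logits the output is nearly unchanged, while extreme logits are squashed below the cap, which keeps the loss well-behaved without a hard clip.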

## Training

- Muon optimizer: lr=0.03, momentum ramped 0.92 → 0.99, WD=0.04
- EMA(0.997), warmdown=3500 steps
- Mixed int5-MLP/int6-attn quantization + zstd-22
- 3% magnitude pruning
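The 3% magnitude pruning can be sketched as zeroing the smallest 3% of weights by absolute value; the PR does not state the pruning granularity (per-tensor vs global), so this per-array version is illustrative:

```python
def magnitude_prune(weights, frac=0.03):
    # Zero out the smallest `frac` fraction of weights by absolute value.
    # Illustrative sketch: operates on a flat list; the PR's actual
    # granularity (per-tensor vs global) is not shown.
    n_prune = int(len(weights) * frac)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Pruning before int5/int6 quantization concentrates the quantizer's range on the weights that matter, and the resulting zeros compress well under zstd.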

## Eval: Multi-Order N-gram Backoff

- Score-first backward-looking n-gram cache (orders 2-7)
- Highest matching order wins (backoff from 7-gram to bigram)
- Entropy-adaptive alpha: `alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))`
- 4M XOR-hash buckets, min_count=2
- **Legal**: each token scored BEFORE cache is updated (Issue #402 compliant)
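The entropy-adaptive alpha above can be sketched directly from the stated formula; the mixing rule is a hypothetical linear interpolation, since the PR gives only the alpha schedule:

```python
import math

def adaptive_alpha(entropy_bits, base=0.05, rng=0.55, slope=2.0, center=4.0):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)): lean on the n-gram cache
    # more when the neural model is uncertain (high predictive entropy H, in bits).
    return base + rng / (1.0 + math.exp(-slope * (entropy_bits - center)))

def mix_probs(model_probs, ngram_probs, entropy_bits):
    # Interpolate the two distributions (hypothetical combination rule; the
    # PR states the alpha schedule but not the exact mixing form).
    a = adaptive_alpha(entropy_bits)
    return [(1.0 - a) * pm + a * pn for pm, pn in zip(model_probs, ngram_probs)]
```

At low entropy (confident model) alpha stays near the 0.05 floor; at high entropy it approaches 0.60, handing most of the prediction to the backed-off n-gram statistics.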

## Compliance

- [x] Score-first: tokens scored before n-gram cache update
- [x] No pre-eval TTT or adaptation
- [x] No val tokens in artifact
- [x] Artifact under 16 MB (15.18 MB)
- [x] Training under 600s wall clock
- [x] Eval under 570s (failsafe)
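The score-first constraint in the checklist can be sketched as a toy eval loop in which each token is scored from counts of strictly earlier tokens and only then inserted into the cache; this bigram-only, add-one-smoothed version is a deliberate simplification of the real orders 2-7 backoff cache:

```python
import math
from collections import defaultdict

def score_first_bpb(tokens, vocab_size):
    # Score-first protocol (Issue #402 pattern): every token is scored using
    # only the cache built from strictly earlier tokens, THEN the cache is
    # updated with it. Bigram-only with add-one smoothing for illustration.
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i in range(1, len(tokens)):
        prev, cur = tokens[i - 1], tokens[i]
        p = (counts[prev][cur] + 1) / (totals[prev] + vocab_size)  # score first
        bits -= math.log2(p)
        counts[prev][cur] += 1                                     # update after
        totals[prev] += 1
    return bits / (len(tokens) - 1)
```

Because the cache never contains the token being scored, no validation token leaks into its own prediction, which is what makes the eval-time cache legal.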

## Reproduction

```bash
# 1xH100 proxy (validated):
MATRIX_LR=0.03 SEED=42 torchrun --standalone --nproc_per_node=1 train_gpt.py

# 8xH100 official (pending compute access):
MATRIX_LR=0.03 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Next Steps

- [ ] 8xH100 SXM verification run (3 seeds for statistical significance)
- [ ] Explore frozen n-gram oracle + learned gate (PR #834 approach)
- [ ] Higher-order n-grams (orders 2-9)
- [ ] Complementary training loss weighting

## Based On

- PR #828 (@bigbag): 10L + Multi-Order N-gram Backoff (0.9076 BPB)
- PR #802: Original n-gram backoff implementation
logs/c775137f-aa05-456a-ad13-4085aa0d4019.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:24730705
world_size:1 grad_accum_steps:8
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.03 matrix_lr:0.03 scalar_lr:0.02
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:5 max_wallclock_seconds:600.000
seed:42
warmup_step:1/5
warmup_step:2/5
warmup_step:3/5
warmup_step:4/5
warmup_step:5/5
step:1/20000 train_loss:6.9301 train_time:740ms step_avg:740.18ms
step:2/20000 train_loss:7.9979 train_time:1423ms step_avg:711.48ms
step:3/20000 train_loss:7.7429 train_time:2104ms step_avg:701.37ms
step:4/20000 train_loss:7.2865 train_time:2787ms step_avg:696.72ms
step:5/20000 train_loss:6.8208 train_time:3470ms step_avg:694.00ms
step:6/20000 train_loss:6.4930 train_time:4152ms step_avg:692.05ms
step:7/20000 train_loss:6.2159 train_time:4835ms step_avg:690.67ms
step:8/20000 train_loss:6.0152 train_time:5517ms step_avg:689.66ms
step:9/20000 train_loss:5.8839 train_time:6200ms step_avg:688.92ms
step:10/20000 train_loss:5.7774 train_time:6887ms step_avg:688.69ms
step:100/20000 train_loss:3.4297 train_time:68505ms step_avg:685.05ms
step:200/20000 train_loss:2.8498 train_time:137061ms step_avg:685.31ms
step:300/20000 train_loss:2.6534 train_time:205578ms step_avg:685.26ms
step:400/20000 train_loss:2.5760 train_time:274072ms step_avg:685.18ms
step:500/20000 train_loss:2.4309 train_time:342565ms step_avg:685.13ms
step:600/20000 train_loss:2.3835 train_time:411070ms step_avg:685.12ms
step:700/20000 train_loss:2.4191 train_time:479587ms step_avg:685.12ms
step:800/20000 train_loss:2.3649 train_time:548238ms step_avg:685.30ms
step:876/20000 val_loss:2.3293 val_bpb:1.3796 train_time:600321ms step_avg:685.30ms
stopping_early: wallclock_cap train_time:600321ms step:876/20000
peak memory allocated: 21118 MiB reserved: 21334 MiB
ema:applying shadow model
Serialized model: 96864555 bytes
Code size: 68444 bytes
Total submission size: 96932999 bytes
Serialized model int6+zstd: 15114383 bytes
Total submission size: 15182827 bytes (15.18 MB)
SIZE CHECK PASSED: 15.18 MB < 16.00 MB
final_eval_mode:sliding_ngram orders=2-7 alpha=0.4 entropy=True stride:64
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_eval [ 1.3%] bpb=1.435111 t=22s
ngram_eval [ 2.6%] bpb=1.402530 t=35s
ngram_eval [ 4.0%] bpb=1.361849 t=47s
ngram_eval [ 5.3%] bpb=1.320140 t=60s
ngram_eval [ 6.6%] bpb=1.279959 t=72s
ngram_eval [ 7.9%] bpb=1.241292 t=84s
ngram_eval [ 9.3%] bpb=1.207860 t=97s
ngram_eval [ 10.6%] bpb=1.176798 t=109s
ngram_eval [ 11.9%] bpb=1.147068 t=121s
ngram_eval [ 13.2%] bpb=1.119773 t=134s
ngram_eval [ 14.5%] bpb=1.094718 t=146s
ngram_eval [ 15.9%] bpb=1.070849 t=158s
ngram_eval [ 17.2%] bpb=1.049488 t=171s
ngram_eval [ 18.5%] bpb=1.029457 t=183s
ngram_eval [ 19.8%] bpb=1.012787 t=195s
ngram_eval [ 21.1%] bpb=0.996054 t=207s
ngram_eval [ 22.5%] bpb=0.980828 t=220s
ngram_eval [ 23.8%] bpb=0.966282 t=232s
ngram_eval [ 25.1%] bpb=0.953015 t=244s
ngram_eval [ 26.4%] bpb=0.941106 t=256s
ngram_eval [ 27.7%] bpb=0.930342 t=268s
ngram_eval [ 29.1%] bpb=0.920125 t=281s
ngram_eval [ 30.4%] bpb=0.910740 t=293s
ngram_eval [ 31.7%] bpb=0.902184 t=305s
ngram_eval [ 33.0%] bpb=0.894142 t=317s
ngram_eval [ 34.3%] bpb=0.886139 t=329s
ngram_eval [ 35.7%] bpb=0.878789 t=341s
ngram_eval [ 37.0%] bpb=0.871667 t=353s
ngram_eval [ 38.3%] bpb=0.865602 t=366s
ngram_eval [ 39.6%] bpb=0.859789 t=378s
ngram_eval [ 41.0%] bpb=0.854720 t=390s
ngram_eval [ 42.3%] bpb=0.849776 t=402s
ngram_eval [ 43.6%] bpb=0.845097 t=414s
ngram_eval [ 44.9%] bpb=0.840780 t=426s
ngram_eval [ 46.2%] bpb=0.836903 t=438s
ngram_eval [ 47.6%] bpb=0.833217 t=450s
ngram_eval [ 48.9%] bpb=0.829542 t=462s
ngram_eval [ 50.2%] bpb=0.826147 t=474s
ngram_eval [ 51.5%] bpb=0.822454 t=486s
ngram_eval [ 52.8%] bpb=0.819000 t=498s
ngram_eval [ 54.2%] bpb=0.815742 t=511s
ngram_eval [ 55.5%] bpb=0.812363 t=523s
ngram_eval [ 56.8%] bpb=0.809277 t=535s
ngram_eval [ 58.1%] bpb=0.806136 t=547s
ngram_eval [ 59.4%] bpb=0.802990 t=559s
FAILSAFE: ngram eval time 570s exceeds budget
ngram_eval DONE: bpb=0.800415 tokens=37619648 t=570s
WARNING: eval used 570s of 570.0s budget — results may be from partial coverage
final_int8_zlib_roundtrip val_loss:1.3482 val_bpb:0.8004 eval_time:570471ms
final_int8_zlib_roundtrip_exact val_loss:1.34824811 val_bpb:0.80041516