
Non-record: 1x H100 SXM5 Explorations #1608

Open

User123331 wants to merge 70 commits into openai:main from User123331:main

Conversation


User123331 commented Apr 14, 2026

Experiment Logs Score Sheet

Hardware: RunPod 1× NVIDIA H100 SXM5 80GB (Hopper SM90)
Dataset: FineWeb 10B tokens · SentencePiece BPE · seq_len=2048
Eval: Sliding window, stride=64, bits-per-byte (bpb) on val split (see the sketch below)
Budget: 600s wallclock per run (1×H100)
Files (.py scripts + logs): Google Drive
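
For readers who want to reproduce the numbers: a minimal sketch of the eval described above (sliding-window next-token loss with stride 64, converted to bits-per-byte). The exact script in the Drive folder may differ; `model`, `val_tokens`, and `bytes_per_token` are stand-in names.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, val_tokens, bytes_per_token, seq_len=2048, stride=64):
    """Sliding-window eval: each window scores only its last `stride` tokens,
    so every scored token sees (close to) the full seq_len of left context."""
    model.eval()
    nll_sum, tokens_scored = 0.0, 0
    for start in range(0, val_tokens.numel() - seq_len - 1, stride):
        window = val_tokens[start : start + seq_len + 1].unsqueeze(0)  # (1, seq_len+1)
        x, y = window[:, :-1], window[:, 1:]
        logits = model(x)                                              # (1, seq_len, vocab)
        nll_sum += F.cross_entropy(logits[0, -stride:], y[0, -stride:],
                                   reduction="sum").item()
        tokens_scored += stride
    nats_per_token = nll_sum / tokens_scored
    return nats_per_token / (math.log(2) * bytes_per_token)           # bits per byte
```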


Full Experiment Ledger

Sorted by val bpb ascending (best first). Δbpb computed against reference run 11L_qk4_wd1500 (1.2450).

| # | Run | Val bpb | Δbpb | Steps | Layers [enc/dec] | KV Heads | MLP Act | MLP Mult | Attention | RoPE | QK Gain | Warmdown | Regularization | Bigram Hash | Quant | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | parallel_residuals | 1.2219 | −0.023 | 2029 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | Asymmetric parallel attn+MLP from L7 (α_mlp=0.05, untied 2nd MLP) |
| 2 | ngram_cache | 1.2289 | −0.016 | 2268 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 2816×160 | INT5/INT6/brotli | Knuth hash, normal init |
| 3 | baseline_fa3_build | 1.2293 | −0.016 | 2265 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | LN Scale on norm input, per-optimizer WD (muon=0.095, adam=0.02, embed=0.085) |
| 4 | int6_qat | 1.2304 | −0.015 | 2258 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | GPTQ-lite+INT6+zstd-22 | Per-row 5-percentile MSE clip search |
| 5 | depth_recurrence | 1.2339 | −0.011 | 2041 | 11 [5/6] → 13 virt | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | L4–L5 looped ×2 shared weights at frac=0.50, 294ms/step |
| 6 | 11L_qk4_wd1500 | 1.2450 | REF | 2074 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | Reference |
| 7 | 11L_qk4_wd1500_swa25 | 1.2453 | +0.000 | 2062 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.25,50) | 10240×128 | INT5/INT6/brotli | Earlier SWA start |
| 8 | 11L_qk8_wd1500 | 1.2453 | +0.000 | 2069 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 8.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | QK gain=8.0 |
| 9 | 11L_qk6_wd1500 | 1.2460 | +0.001 | 2064 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 6.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | QK gain=6.0 |
| 10 | 11L_qk4_wd1000 | 1.2464 | +0.001 | 2066 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 11 | 11L_qk4_wd1500_mlr025 | 1.2464 | +0.001 | 2060 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | matrix_lr=0.025 |
| 12 | 10L_qk4_wd1500 | 1.2467 | +0.002 | 2282 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 13 | 12L_qk4_wd1200 | 1.2478 | +0.003 | 1889 | 12 [6/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1200 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 30.2M params |
| 14 | 10L_qk4_wd2000 | 1.2490 | +0.004 | 2273 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 2000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 15 | 11L_qk4_wd1500_tied005_mlr025 | 1.2493 | +0.004 | 2054 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | tied_lr=0.05, mlr=0.025 |
| 16 | 11L_qk4_wd1500_tied005 | 1.2496 | +0.005 | 2071 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | tied_embed_lr=0.05 |
| 17 | sp8192_full_stack | 1.2562 | +0.011 | 1580 | 11 [5/6] → 16 virt | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | EMA(0.9965) | 4096×64 | SDClip+INT6/brotli | SP8192 vocab, L3–5 recurrence ×3, projection XSA all layers, symmetric parallel resid L7, 380ms/step |
| 18 | 11L_qk4_wd1500_fa3 | 1.2593 | +0.014 | 1763 | 11 [5/6] | 4 | ReLU² | 3.0 | FA3 (autograd) | Full 64 | 4.0 | 1500 | SWA(0.4,50) | 10240×128 | INT5/INT6/zlib | Best FA3-only |
| 19 | qk_gain | 1.2625 | +0.018 | 2268 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | env override tuning |
| 20 | naive_baseline_9L_mlp2_seq1024 | 1.2660 | +0.021 | 2272 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 9L, mlp3.0, seq2048 |
| 21 | mlp35 | 1.2662 | +0.021 | 2185 | 10 [5/5] | 4 | ReLU² | 3.5 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 28.1M params |
| 22 | baseline_10L | 1.2666 | +0.022 | 2271 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | |
| 23 | 11L_qk4_070342 | 1.2686 | +0.024 | 2064 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 4.0 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | Rerun of crashed #10 |
| 24 | 11layers | 1.2694 | +0.024 | 2057 | 11 [5/6] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | +1 layer (5 enc, 6 dec, 5 skip) |
| 25 | 11L_mlp35 | 1.2740 | +0.029 | 1973 | 11 [5/6] | 4 | ReLU² | 3.5 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | 30.8M params |
| 26 | wd3500 | 1.2741 | +0.029 | 2275 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3500 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | Longer warmdown hurt |
| 27 | baseline_786k | 1.5187 | +0.274 | 887 | 10 [5/5] | 4 | ReLU² | 3.0 | SDPA(Flash) | Full 64 | 1.5 | 3000 | SWA(0.4,50) | 10240×128 | INT5/INT6/brotli | batch=786k, undertrained |
| 28 | mega_164231 | 1.8872 | +0.642 | 1854 | 10 [5/5] | 2 | LeakyReLU(0.5)² | 3.0 | FA3+XSA | Partial 16 | 1.5 | 3000 | EMA(0.997) | 2048×64 | INT5/INT6/brotli | Fully-stacked test build |
| – | xsa_ema | 0.9051 | INVALID | 1777 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Partial 16 | 5.25 | frac=0.72 | EMA(0.997) | 4096×64 | INT6/brotli | Causal leakage bug, XSA mean-sub L7–L10, LN Scale on attn output |
| – | 11L_qk4_wd1200_fa3 | ~1.285 | +0.040 | 1468 | 11 [5/6] | 4 | ReLU² | 3.0 | FA3 (autograd) | Full 64 | 4.0 | 1200 | SWA(0.4,50) | 10240×128 | INT5/INT6/zlib | Partial, no final eval |
| – | score_first_ttt | – | – | 2255 | 11 [5/6] | 4 | LeakyReLU(0.5)² | 4.0 | FA3/SDPA | Full 64 | 5.25 | frac=0.72 | SWA(0.4,50) | 4096×64 | INT5/INT6/brotli | TTT: 3-epoch SGD(lr=0.002, mom=0.9) per 32k chunk, crashed on eval |

Ablation Summary: Layer-Type Impact on Val bpb

Attention Mechanism

| Method | Avg bpb | Δbpb | Verdict |
|---|---|---|---|
| SDPA(Flash) via `F.scaled_dot_product_attention` | 1.245–1.274 | – | Baseline |
| FA3 (primary) / SDPA (fallback) | 1.222–1.234 | −0.016 | Better (confounded with other v2 changes) |
| FA3 raw op (no backward) | crash | – | Backward instability |
| FA3 autograd.Function (fwd+bwd) | 1.259–1.285 | +0.014 | Faster but converges worse |
| FA3 + XSA mean-sub (pre-mask) | 1.887 | +0.642 | Causal leak |
| FA3/SDPA + XSA mean-sub (pre-mask) | 0.905 | INVALID | Causal leak |
| FA3 + projection-based XSA (all layers) | 1.2562 | +0.011 | Correct but step-starved |
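The two "Causal leak" rows came from subtracting a sequence-wide mean from the attention path before the causal mask was applied, so every position saw statistics of future tokens. A minimal sketch of the bug and a leak-free cumulative-mean variant (the real XSA code is more involved; names here are illustrative):

```python
import torch

def mean_sub_pre_mask(x):
    # BUGGY: the mean is taken over the full sequence (dim=1), so position t
    # depends on tokens > t; with teacher forcing this leaks future information
    # and produces impossibly low eval bpb (the 0.905 "INVALID" run).
    return x - x.mean(dim=1, keepdim=True)

def mean_sub_causal(x):
    # Leak-free alternative: subtract the running mean of positions <= t only.
    csum = x.cumsum(dim=1)
    counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
    return x - csum / counts
```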

MLP Activation

| Method | Avg bpb | Δbpb | Verdict |
|---|---|---|---|
| ReLU²: `proj(ReLU(fc(x))²)` | 1.245–1.274 | – | Baseline |
| LeakyReLU(0.5)²: `proj(LeakyReLU(fc(x), 0.5)²)` | 1.222–1.234 | −0.016 | Better |
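A minimal sketch of the two MLP variants (squared ReLU baseline vs. the squared LeakyReLU(0.5) used in the v2 runs). The module name and the `bias=False` choice are illustrative; the width multiplier comes from the ledger.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredActMLP(nn.Module):
    """Squared-activation MLP: negative_slope=0.0 gives ReLU², 0.5 gives LeakyReLU(0.5)²."""
    def __init__(self, dim, mult=4.0, negative_slope=0.0):
        super().__init__()
        hidden = int(dim * mult)
        self.fc = nn.Linear(dim, hidden, bias=False)
        self.proj = nn.Linear(hidden, dim, bias=False)
        self.negative_slope = negative_slope

    def forward(self, x):
        x = F.leaky_relu(self.fc(x), self.negative_slope)  # slope 0.0 == plain ReLU
        return self.proj(x ** 2)                           # square, then project back
```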

RoPE Coverage

| Method | Avg bpb | Δbpb | Verdict |
|---|---|---|---|
| Full-dim RoPE (all 64 head dims) | 1.222–1.274 | – | Baseline |
| Partial RoPE (first 16 of 64 dims) | 0.905–1.887 | confounded | Always paired with XSA/bad configs |
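"Partial 16" means rotary embeddings are applied to only the first 16 of the 64 head dims, leaving the remaining dims position-agnostic. A hedged sketch of that split (the helper name and cos/sin layout are assumptions, not the repo's actual API):

```python
import torch

def apply_rope(x, cos, sin, rot_dim=64):
    """x: (B, H, T, D). Rotate the first rot_dim dims; pass the rest through unchanged.
    cos, sin: (T, rot_dim // 2), precomputed for rot_dim // 2 frequencies."""
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

# Full-dim RoPE:  q = apply_rope(q, cos64, sin64, rot_dim=64)
# Partial RoPE:   q = apply_rope(q, cos16, sin16, rot_dim=16)  # only the first 16 dims rotate
```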

Depth & Width

| Method | Val bpb | Δbpb | Steps | Verdict |
|---|---|---|---|---|
| 10L [5/5], MLP 2.0 | 1.2660 | +0.021 | 2272 | |
| 10L [5/5], MLP 3.0 | 1.247–1.267 | +0.002 | 2273 | |
| 10L [5/5], MLP 3.5 | 1.2662 | +0.021 | 2185 | |
| 11L [5/6], MLP 3.0 | 1.245–1.269 | REF | 2069 | Best zone |
| 11L [5/6], MLP 3.5 | 1.2740 | +0.029 | 1973 | MLP 3.5 hurts on 11L |
| 11L [5/6], MLP 4.0 | 1.222–1.234 | −0.016 | 2265 | Best with LeakyReLU |
| 12L [6/6], MLP 3.0 | 1.2478 | +0.003 | 1889 | −185 steps |

KV Heads (GQA Ratio)

| KV Heads | GQA Ratio | Val bpb | Δbpb | Verdict |
|---|---|---|---|---|
| 2 | 4:1 | 1.887 | +0.642 | Too sparse (confounded) |
| 4 | 2:1 | 1.222–1.274 | – | All runs |
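With a 2:1 GQA ratio, each pair of query heads shares one KV head. A minimal sketch of how that is commonly wired through SDPA; the head counts below are illustrative (head dim 64 matches the ledger):

```python
import torch
import torch.nn.functional as F

def gqa_sdpa(q, k, v):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv.
    Expand KV heads so each group of Hq // Hkv query heads shares one KV head."""
    groups = q.size(1) // k.size(1)
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads, 4 KV heads (2:1 GQA), head dim 64
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)
out = gqa_sdpa(q, k, v)  # (1, 8, 128, 64)
```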

Regularization

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| SWA (start=0.4, every=50) | 1.245–1.274 | – | Baseline |
| SWA (start=0.25, every=50) | 1.2453 | +0.000 | Neutral |
| EMA (decay=0.997) | 0.905–1.887 | – | Confounded |
| EMA (decay=0.9965) | 1.2562 | +0.011 | Step-starved |
| Weight decay=0.04 (Muon hardcoded 0.04, AdamW uses hyperparam) | 1.245 | – | |
| Per-optimizer WD (muon=0.095, adam=0.02, embed=0.085) | 1.229 | −0.016 | Better |
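SWA(start, every) here reads as stochastic weight averaging: after `start` fraction of the run, a uniform running average of the weights is updated every `every` steps and used for eval. A minimal sketch under that reading (the repo's implementation may differ in detail):

```python
import copy
import torch

class SWA:
    """Uniform running average of weights, enabled after `start_frac` of training
    and updated every `every` optimizer steps (e.g. SWA(0.4, 50) in the tables)."""
    def __init__(self, start_frac=0.4, every=50):
        self.start_frac, self.every = start_frac, every
        self.avg_model, self.n_avg = None, 0

    @torch.no_grad()
    def maybe_update(self, model, step, total_steps):
        if step < self.start_frac * total_steps or step % self.every != 0:
            return
        if self.avg_model is None:
            self.avg_model = copy.deepcopy(model).eval()
            self.n_avg = 1
            return
        self.n_avg += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            p_avg.mul_(1 - 1 / self.n_avg).add_(p, alpha=1 / self.n_avg)

# At eval time, score swa.avg_model (fall back to the live model while it is still None).
```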

Depth Recurrence

| Method | Val bpb | Δbpb | Steps | ms/step | Verdict |
|---|---|---|---|---|---|
| None | 1.2293 | – | 2265 | 265 | |
| L4–L5 loop ×2 at frac=0.50 | 1.2339 | +0.005 | 2041 | 294 | −224 steps for +29 ms/step |
| L3–L5 loop ×3 at frac=0.35 | 1.2562 | +0.027 | 1580 | 380 | Severe step starvation |
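A hedged sketch of the depth-recurrence trick as I read it from the ledger: blocks 4–5 are re-run a second time with the same (shared) weights once training passes 50% of the budget, adding "virtual" depth at the cost of extra time per step. The gating condition and block indices come from the table; the control flow is illustrative.

```python
def forward_blocks(blocks, x, step, total_steps,
                   loop_start=4, loop_end=5, extra_passes=1, enable_frac=0.50):
    """Run the block stack once; after `enable_frac` of training, re-run blocks
    [loop_start, loop_end] `extra_passes` additional times with shared weights."""
    recurrence_on = step >= enable_frac * total_steps
    for i, block in enumerate(blocks):
        x = block(x)
        if recurrence_on and i == loop_end:
            for _ in range(extra_passes):               # ×2 total passes over L4–L5
                for j in range(loop_start, loop_end + 1):
                    x = blocks[j](x)
    return x
```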

Parallel Residuals

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| Sequential (all layers) | 1.229 | – | |
| Parallel attn+MLP from L7 (untied MLP, α_mlp=0.05) | 1.2219 | −0.007 | Best bpb |
| Parallel attn+MLP from L7 (symmetric) | 1.2562 | +0.027 | Step-starved |
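A hedged sketch of the winning asymmetric variant: from layer 7 onward, a second, untied MLP runs in parallel with attention and is mixed in with a small coefficient (α_mlp=0.05 from the ledger), while the regular MLP still runs sequentially. The block structure and names are my reconstruction, not the exact code.

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, attn, mlp, parallel_mlp=None, alpha_mlp=0.05):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp = attn, mlp
        self.parallel_mlp = parallel_mlp      # untied second MLP, only on layers >= 7
        self.alpha_mlp = alpha_mlp

    def forward(self, x):
        h = self.norm1(x)
        a = self.attn(h)
        if self.parallel_mlp is not None:     # asymmetric parallel branch
            a = a + self.alpha_mlp * self.parallel_mlp(h)
        x = x + a
        return x + self.mlp(self.norm2(x))    # the usual sequential MLP still runs
```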

Embedding & Hash

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| BigramHash 10240×128 (xor hash, zero-init) | 1.245 | – | |
| BigramHash 2048×64 (xor hash, zero-init) | 1.887 | – | Confounded |
| BigramHash 4096×64 (xor hash, zero-init) | 1.222–1.234 | – | |
| BigramHash 2816×160 (Knuth hash, normal-init) | 1.2289 | −0.000 | Neutral, submittable |
| Tied embed_lr=0.03 | 1.229 | – | |
| Tied embed_lr=0.05 | 1.249 | +0.005 | Worse |
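BigramHash maps each (previous token, current token) pair into a fixed-size embedding table via a cheap hash; the 2816×160 run swapped the xor hash for a Knuth-style multiplicative hash and normal init. A minimal sketch, with the hash constants and mixing shift as assumptions:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, table_size=2816, dim=160, knuth=True, zero_init=False):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        if zero_init:
            nn.init.zeros_(self.table.weight)    # xor-hash runs used zero init
        self.table_size, self.knuth = table_size, knuth

    def forward(self, ids):                       # ids: (B, T) token ids
        prev = torch.roll(ids, 1, dims=1)
        prev[:, 0] = 0                            # no previous token at position 0
        if self.knuth:                            # Knuth multiplicative hash (constant is an assumption)
            key = (prev.long() * 2654435761 + ids.long()) % self.table_size
        else:                                     # xor-style hash (shift amount is an assumption)
            key = ((prev.long() << 16) ^ ids.long()) % self.table_size
        return self.table(key)                    # (B, T, dim) bigram feature added to token embeddings
```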

Quantization & Compression

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| INT5(MLP)+INT6(Attn)+3% prune+brotli-11 | 1.2293 | – | |
| GPTQ-lite+INT6+zstd-22 | 1.2304 | +0.001 | Neutral |
| SDClip+INT6+brotli-11 | 1.2562 | +0.027 | |
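The Quant column describes how checkpoints are shrunk to fit the artifact budget: low-bit quantization of the weights plus an entropy coder. A rough sketch of symmetric per-row INT6 quantization followed by brotli, as one plausible reading of "INT6+brotli-11" (clip search, pruning, and the INT5 MLP path are omitted):

```python
import brotli
import torch

def quantize_int6_rowwise(w: torch.Tensor):
    """Symmetric per-row quantization to the 6-bit signed range [-31, 31]."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 31.0
    q = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return q, scale

def compress_weight(w: torch.Tensor) -> bytes:
    q, scale = quantize_int6_rowwise(w)
    payload = q.numpy().tobytes() + scale.to(torch.float16).numpy().tobytes()
    return brotli.compress(payload, quality=11)   # brotli level 11, as in the ledger

blob = compress_weight(torch.randn(256, 256))
print(len(blob), "bytes after INT6 + brotli")
```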

Test-Time Training (TTT)

| Method | Val bpb | Δbpb | Verdict |
|---|---|---|---|
| Score-first TTT: SGD(lr=0.002, mom=0.9), 3 epochs, 32k chunk | ~1.249 | +0.020 | LR too aggressive, degrades the model |
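"Score-first" TTT here reads as: score each evaluation chunk with the current weights first, then take a few SGD steps on that same chunk before moving on, so adaptation never sees a token before it has been scored. A hedged sketch of that loop (the crash and the "lr too aggressive" verdict refer to the real run; this only shows the control flow):

```python
import copy
import torch
import torch.nn.functional as F

def ttt_eval(model, chunks, lr=0.002, momentum=0.9, epochs=3):
    """chunks: iterable of (x, y) token tensors of ~32k tokens each."""
    work = copy.deepcopy(model)                    # never touch the submitted weights
    opt = torch.optim.SGD(work.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tok = 0.0, 0
    for x, y in chunks:
        with torch.no_grad():                      # 1) score the chunk first
            logits = work(x)
            total_nll += F.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1), reduction="sum").item()
            total_tok += y.numel()
        for _ in range(epochs):                    # 2) then adapt on the same chunk
            opt.zero_grad(set_to_none=True)
            out = work(x)
            loss = F.cross_entropy(out.view(-1, out.size(-1)), y.view(-1))
            loss.backward()
            opt.step()
    return total_nll / total_tok                   # nats/token; convert to bpb as usual
```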

Personal Tests Leaderboard

| Rank | Run | Val bpb | Δ vs Ref | Key Delta |
|---|---|---|---|---|
| 1 | parallel_residuals | 1.2219 | −0.023 | Parallel attn+MLP from L7 (untied) |
| 2 | ngram_cache | 1.2289 | −0.016 | BigramHash 2816×160, Knuth hash |
| 3 | baseline_fa3_build | 1.2293 | −0.016 | LeakyReLU(0.5)², QK=5.25, per-optimizer WD, MLP 4.0 |
| 4 | int6_qat | 1.2304 | −0.015 | GPTQ-lite+zstd-22 |
| 5 | depth_recurrence | 1.2339 | −0.011 | L4–L5 recurrence ×2 |
| 6 | 11L_qk4_wd1500 | 1.2450 | REF | ReLU², QK=4.0, MLP 3.0, SWA |
| 7 | 11L_qk4_wd1500_swa25 | 1.2453 | +0.000 | SWA start 0.25 |
| 8 | 11L_qk8_wd1500 | 1.2453 | +0.000 | QK gain=8.0 |
| 9 | 11L_qk6_wd1500 | 1.2460 | +0.001 | QK gain=6.0 |
| 10 | 11L_qk4_wd1000 | 1.2464 | +0.001 | Warmdown 1000 |
| 11 | 11L_qk4_wd1500_mlr025 | 1.2464 | +0.001 | matrix_lr=0.025 |
| 12 | 10L_qk4_wd1500 | 1.2467 | +0.002 | 10 layers |
| 13 | 12L_qk4_wd1200 | 1.2478 | +0.003 | 12 layers |
| 14 | 10L_qk4_wd2000 | 1.2490 | +0.004 | 10L, wd=2000 |
| 15 | sp8192_full_stack | 1.2562 | +0.011 | Full PR#1493 clone, step-starved |
| 16 | 11L_qk4_wd1500_fa3 | 1.2593 | +0.014 | FA3 autograd |
| – | xsa_ema | 0.9051 | INVALID | Causal leakage bug |
| – | score_first_ttt | ~1.249 | FAILED | SGD lr too aggressive |

Billy Endson and others added 30 commits March 21, 2026 02:34
Fix critical bugs: MoS params now included in optimizer groups,
use NLL loss (not cross_entropy) since MoS returns log-probs,
skip logit softcap for MoS path, re-normalize after LoRA correction.
Low-rank factorization (MOS_RANK=64) keeps artifact under 16MB budget.

Enable via: USE_MOS=1 MOS_K=2 MOS_RANK=64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
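
For context on the MoS commits in this series: Mixture-of-Softmaxes replaces the single output softmax with K softmaxes mixed by a learned prior, which is why the head returns log-probabilities and must be trained with `F.nll_loss` rather than `F.cross_entropy` (and why the logit softcap is skipped on the MoS path). A sketch of one plausible K=2, rank-64 shape; the actual layout in `train_gpt_mos_sota.py` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, d_model, vocab_size, k=2, rank=64):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(d_model, k)                 # mixture weights per position
        self.proj = nn.Linear(d_model, k * rank)           # low-rank context per component
        self.unembed = nn.Linear(rank, vocab_size, bias=False)

    def forward(self, h):                                  # h: (B, T, d_model)
        B, T, _ = h.shape
        ctx = torch.tanh(self.proj(h)).view(B, T, self.k, -1)
        comp_logp = F.log_softmax(self.unembed(ctx), dim=-1)        # (B, T, K, V)
        mix_logp = F.log_softmax(self.prior(h), dim=-1).unsqueeze(-1)
        return torch.logsumexp(mix_logp + comp_logp, dim=2)         # log-probs, (B, T, V)

# Because the head already returns log-probs, train with NLL, not cross-entropy:
# loss = F.nll_loss(log_probs.view(-1, vocab_size), targets.view(-1))
```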
Clones fork, downloads dataset, runs baseline vs MoS K=2 rank=64
A/B comparison (10 min each on 1x H100).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Baseline bpb already known from prior runs (~1.2244).
Saves 10 min of GPU time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First MoS pilot run. 1113 steps on 1xH100 SXM, 12.8MB artifact.
Loss still dropping at wallclock cap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1hr run with MoS K=2 R=64 + WARMDOWN_ITERS=100 on 1xH100.
Target: beat vanilla baseline val_bpb=1.2540 from PR#111.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Training now runs in background — safe to close terminal.
Monitor with: tail -f /workspace/mos_1h_log.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…peed optimizations, SOTA plan

- techniques_encyclopedia.md: 39 techniques catalog with bpb impacts and PR references
- combination_matrix.md: Compatibility matrix (++/+/~/−) with stacking recommendations
- speed_optimizations.md: Triton/FA3/fused kernels research for throughput gains
- PLAN_beat_SOTA.md: Phase-by-phase implementation plan targeting <1.13 bpb

MoS rejected after experiments showed +0.057 bpb worse than baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace train_gpt.py with thwu1's openai#1 implementation:
  - 10 layers, 3x MLP, BigramHash(10240), SmearGate
  - Mixed int5/int6 quantization, SWA, sliding eval
  - zstd-22 compression, magnitude pruning

- Add custom tokenizer training pipeline:
  - run_custom_tokenizer_pipeline.sh: all-in-one script
  - data/train_tokenizer.py: SentencePiece trainer

- Add run scripts:
  - run_competitive.sh: SOTA stack with default tokenizer
  - run_competitive_custom_tok.sh: SOTA stack with custom tokenizer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mixture of Softmax (K=2) output layer integrated with full SOTA technique
stack: 11L Int6 + XSA4 + Partial RoPE + LN Scale + Tight SWA + VE128 +
U-Net skips + Late QAT + SmearGate + BigramHash + FA3.

- train_gpt_mos_sota.py: MoS class, FA3 soft fallback, nll_loss branch
- run_mos_sota.sh: MODE=baseline|mos|smoke, auto FA3 selective build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pings nvidia-smi every 60s in background to keep pod active during
FA3 build and other CPU-only phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
train_gpt_mos_sota.py imports sentencepiece as spm at the top level;
without it the script exits immediately on import. numpy is also used
directly. Both are now checked and installed before training starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pip copies the compiled .so into flash_attn_3/ relative to the hopper
dir, but that subdir doesn't exist after a fresh clone. All kernels
compiled successfully; only the final copy step was failing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default to disabled for stability on fresh environments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>