Non-record: 1x H100 SXM5 Explorations #1608
Open
User123331 wants to merge 70 commits into openai:main from
Conversation
Fix critical bugs: MoS params now included in optimizer groups; use NLL loss (not cross_entropy) since MoS returns log-probs; skip the logit softcap for the MoS path; re-normalize after LoRA correction. Low-rank factorization (MOS_RANK=64) keeps the artifact under the 16MB budget. Enable via: USE_MOS=1 MOS_K=2 MOS_RANK=64 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
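The fixes above imply an output head shaped roughly like the following. This is a minimal PyTorch sketch, not the PR's code: only K, the rank, and the "returns log-probs, so train with NLL loss" contract come from the commit message; the class and layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxHead(nn.Module):
    """Hypothetical MoS head with a low-rank context projection."""

    def __init__(self, d_model, vocab_size, k=2, rank=64):
        super().__init__()
        self.k = k
        # Low-rank factorization (d_model -> rank -> d_model per component)
        # keeps the added parameters small enough for the artifact budget.
        self.down = nn.Linear(d_model, k * rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        self.prior = nn.Linear(d_model, k, bias=False)         # mixture weights
        self.out = nn.Linear(d_model, vocab_size, bias=False)  # shared unembedding

    def forward(self, h):                                # h: (B, T, d_model)
        B, T, _ = h.shape
        z = self.down(h).view(B, T, self.k, -1)          # (B, T, K, rank)
        ctx = torch.tanh(self.up(z))                     # (B, T, K, d_model)
        comp = F.log_softmax(self.out(ctx), dim=-1)      # per-component log-probs
        prior = F.log_softmax(self.prior(h), dim=-1)     # (B, T, K)
        # Mix in probability space via logsumexp. The result is LOG-probs,
        # so the loss must be F.nll_loss, not F.cross_entropy (which would
        # apply a second log_softmax), and any logit softcap must be skipped.
        return torch.logsumexp(prior.unsqueeze(-1) + comp, dim=2)
```

Training would then use `F.nll_loss(head(h).view(-1, vocab_size), targets.view(-1))`, and the head's parameters must be registered in the optimizer's param groups explicitly, which is the first bug the commit fixes.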
Clones fork, downloads dataset, runs baseline vs MoS K=2 rank=64 A/B comparison (10 min each on 1x H100).
Baseline bpb already known from prior runs (~1.2244). Saves 10 min of GPU time.
First MoS pilot run. 1113 steps on 1x H100 SXM, 12.8MB artifact. Loss still dropping at wallclock cap.
1hr run with MoS K=2 R=64 + WARMDOWN_ITERS=100 on 1x H100. Target: beat vanilla baseline val_bpb=1.2540 from PR#111.
Training now runs in the background, so it is safe to close the terminal. Monitor with: tail -f /workspace/mos_1h_log.txt
…peed optimizations, SOTA plan:
- techniques_encyclopedia.md: catalog of 39 techniques with bpb impacts and PR references
- combination_matrix.md: compatibility matrix (++/+/~/−) with stacking recommendations
- speed_optimizations.md: Triton/FA3/fused-kernel research for throughput gains
- PLAN_beat_SOTA.md: phase-by-phase implementation plan targeting <1.13 bpb

MoS rejected after experiments showed it +0.057 bpb worse than baseline.
- Replace train_gpt.py with thwu1's openai#1 implementation:
  - 10 layers, 3x MLP, BigramHash(10240), SmearGate
  - Mixed int5/int6 quantization, SWA, sliding eval
  - zstd-22 compression, magnitude pruning
- Add custom tokenizer training pipeline:
  - run_custom_tokenizer_pipeline.sh: all-in-one script
  - data/train_tokenizer.py: SentencePiece trainer
- Add run scripts:
  - run_competitive.sh: SOTA stack with default tokenizer
  - run_competitive_custom_tok.sh: SOTA stack with custom tokenizer
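The SentencePiece trainer in data/train_tokenizer.py presumably reduces to a call like the one below. Every concrete value here (file names, vocab size) is an illustrative assumption, not the PR's actual configuration; only "SentencePiece BPE" comes from the score sheet.

```python
# Illustrative SentencePiece BPE training call; all values are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="fineweb_sample.txt",  # plain-text corpus, one document per line
    model_prefix="custom_tok",   # writes custom_tok.model / custom_tok.vocab
    vocab_size=8192,             # the sp8192 run name suggests an 8192 vocab
    model_type="bpe",            # score sheet specifies SentencePiece BPE
    character_coverage=1.0,
)
```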
Mixture of Softmax (K=2) output layer integrated with the full SOTA technique stack: 11L Int6 + XSA4 + Partial RoPE + LN Scale + Tight SWA + VE128 + U-Net skips + Late QAT + SmearGate + BigramHash + FA3.
- train_gpt_mos_sota.py: MoS class, FA3 soft fallback, nll_loss branch
- run_mos_sota.sh: MODE=baseline|mos|smoke, auto FA3 selective build
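An "FA3 soft fallback" usually means: use the FlashAttention-3 kernel when its extension imports, otherwise fall back to PyTorch SDPA. A sketch under stated assumptions follows; the module and function names (`flash_attn_interface.flash_attn_func`) and the (B, T, H, D) layout are what the FA3 hopper build conventionally exposes, not taken from this PR.

```python
import torch
import torch.nn.functional as F

# Soft import: prefer the compiled FA3 extension, tolerate its absence.
try:
    from flash_attn_interface import flash_attn_func
    HAVE_FA3 = True
except ImportError:
    HAVE_FA3 = False

def attention(q, k, v, causal=True):
    """q, k, v: (B, T, H, D). Uses FA3 on CUDA when available, else SDPA."""
    if HAVE_FA3 and q.is_cuda:
        out = flash_attn_func(q, k, v, causal=causal)
        if isinstance(out, tuple):  # some builds also return the softmax LSE
            out = out[0]
        return out
    # SDPA expects (B, H, T, D); transpose into and out of that layout.
    o = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal,
    )
    return o.transpose(1, 2)
```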
Pings nvidia-smi every 60s in a background loop to keep the pod active during the FA3 build and other CPU-only phases.
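The keep-alive above can be sketched as the loop below. The interval comes from the commit message; the PID-file path is an assumption for illustration.

```shell
# Poll nvidia-smi every 60s in the background so the provider's idle
# detector sees activity while the GPU sits unused (e.g. during the
# CPU-only FA3 build).
( while true; do nvidia-smi > /dev/null 2>&1; sleep 60; done ) &
echo "$!" > /tmp/keepalive.pid
# Stop it once training starts using the GPU for real:
#   kill "$(cat /tmp/keepalive.pid)"
```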
train_gpt_mos_sota.py imports sentencepiece as spm at the top level; without it the script exits immediately on import. numpy is also used directly. Both are now checked and installed before training starts.
pip copies the compiled .so into flash_attn_3/ relative to the hopper dir, but that subdir doesn't exist after a fresh clone. All kernels compiled successfully; only the final copy step was failing.
Default to disabled for stability on fresh environments.
Experiment Logs Score Sheet
Hardware: RunPod 1× NVIDIA H100 SXM5 80GB (Hopper SM90)
Dataset: FineWeb 10B tokens · SentencePiece BPE · seq_len=2048
Eval: Sliding window, stride=64, bits-per-byte (bpb) on val split
Budget: 600s wallclock per run (1×H100)
Files (.py scripts + logs): Google Drive
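The sliding-window eval described above can be sketched as follows. Only the stride (64), seq_len (2048), and the bits-per-byte conversion come from this header; the `nll_fn(window, n_scored)` callback, returning summed NLL in nats over the last `n_scored` positions, is a made-up interface for illustration.

```python
import math

def sliding_eval_bpb(nll_fn, tokens, n_bytes, seq_len=2048, stride=64):
    """Slide a seq_len window over the val tokens by `stride`, summing NLL
    in nats, then convert to bits-per-byte against the raw byte count."""
    total_nll = 0.0
    for start in range(0, max(len(tokens) - seq_len, 0) + 1, stride):
        window = tokens[start:start + seq_len]
        # Score all of the first window; afterwards score only the final
        # `stride` tokens so each scored token gets near-full left context.
        n_scored = len(window) if start == 0 else stride
        total_nll += nll_fn(window, n_scored)
    # nats -> bits, normalized by the val split's size in bytes.
    return total_nll / (math.log(2) * n_bytes)
```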
Full Experiment Ledger
Sorted by val bpb ascending (best first). Δbpb computed against reference run 11L_qk4_wd1500 (1.2450).

Runs: parallel_residuals, ngram_cache, baseline_fa3_build, int6_qat, depth_recurrence, 11L_qk4_wd1500, 11L_qk4_wd1500_swa25, 11L_qk8_wd1500, 11L_qk6_wd1500, 11L_qk4_wd1000, 11L_qk4_wd1500_mlr025, 10L_qk4_wd1500, 12L_qk4_wd1200, 10L_qk4_wd2000, 11L_qk4_wd1500_tied005_mlr025, 11L_qk4_wd1500_tied005, sp8192_full_stack, 11L_qk4_wd1500_fa3, qk_gain, naive_baseline_9L_mlp2_seq1024, mlp35, baseline_10L, 11L_qk4_070342, 11layers, 11L_mlp35, wd3500, baseline_786k, mega_164231, xsa_ema, 11L_qk4_wd1200_fa3, score_first_ttt

Ablation Summary: Layer-Type Impact on Val bpb
Attention Mechanism: F.scaled_dot_product_attention
MLP Activation: ReLU(fc(x))→proj(x²) vs LeakyReLU(0.5, fc(x))→proj(x²)
RoPE Coverage
Depth & Width
KV Heads (GQA Ratio)
Regularization
Depth Recurrence
Parallel Residuals
Embedding & Hash
Quantization & Compression
Test-Time Training (TTT)
Personal Tests Leaderboard
parallel_residuals, ngram_cache, baseline_fa3_build, int6_qat, depth_recurrence, 11L_qk4_wd1500, 11L_qk4_wd1500_swa25, 11L_qk8_wd1500, 11L_qk6_wd1500, 11L_qk4_wd1000, 11L_qk4_wd1500_mlr025, 10L_qk4_wd1500, 12L_qk4_wd1200, 10L_qk4_wd2000, 11L_qk4_wd1500_fa3, sp8192_full_stack, xsa_ema, score_first_ttt