
[Non-record] Experimentation Summary: Autopsy of 100+ Experiments — What Worked, What Didn’t, Mind Map for LLM Agents, etc.#1602

Open
SPThole wants to merge 14 commits into openai:main from SPThole:non_record_8

Conversation


SPThole commented Apr 13, 2026

This is kind of a summary of the experimentation I did (maybe not all of it, lol) — lots of learning. Thanks to OpenAI for the $525 in credits, plus a few hundred dollars of my own! I’d love to try more of the ideas I had but couldn’t, due to a shortage of credits.

In this journey, I tried not to get bogged down in leaderboard approaches as much as possible. In a few places, though, when I got stuck, I did take help from the community. My general approach was: train a model → analyze it → try to solve the issues observed in the analysis. This ended up costing me many experiments and dollars. I used LLMs/agents to a great extent but kept myself in the driver's seat to direct the experiments.

GIT REPO TO FIND ALL EXPERIMENTS:
https://github.com/SPThole/parameter-golf-experimentations and STRUCTURED_EXPSUM.md in PR

I have also made a cool mind map of all the experimentation — basically the path of what I did and why. I’ve also attached the relevant lineages from community discussions and leaderboard files.

I am planning to build on this:
https://github.com/SPThole/bpb_wtf or visit: https://bpb-wtf.vercel.app/

[image: experimentation mind map]

I’m also building a broader direction around this (mind map + experiments): convert a thinking pattern into a graph, then embed it in the LLM's context or parameters so that the model can follow that pattern. If this resonates with anyone or you’d like to collaborate, feel free to reach out — I’d love to explore this further together.

Competition: OpenAI Parameter Golf
Objective: Minimize validation loss (bits per byte, bpb) under a 16 MB artifact constraint and a 10-minute training budget on 8×H100
Total experiments: 119+
Most were run on 1×H100, a few on 8×H100 (see logs)
Date range: Early 2026 — 2026-04-13


TL;DR: Top Learnings Across All Phases

  1. Steps > everything else. More optimizer updates in the same wall clock matter a lot.
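As a toy illustration of this point (the timings below are made up, not the competition's actual step times), halving gradient accumulation doubles the optimizer updates that fit in a fixed wall-clock budget, which is what exp05's accum 8→4 change exploited:

```python
def updates_in_budget(budget_ms, microbatch_ms, grad_accum):
    """Optimizer updates that fit in a fixed wall-clock budget when each
    update costs `grad_accum` micro-batches of `microbatch_ms` ms each."""
    return budget_ms // (microbatch_ms * grad_accum)

# Halving accumulation doubles the update count in the same 10 minutes.
print(updates_in_budget(600_000, 100, 8))  # 750
print(updates_in_budget(600_000, 100, 4))  # 1500
```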

  2. Depth recurrence is the best parameter-efficiency trick (community). 3-layer recurrence (blocks 3-5, 2 extra passes) from the community SP8192 baseline gives 17 virtual layers from 11 physical — the single biggest architectural win. Only works within the encoder, NOT across encoder/decoder boundary.
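A minimal sketch of the virtual-depth arithmetic. Whether the baseline loops the whole block 3-5 span or each block individually, the count is the same; the span-loop form is assumed here, and the function name is illustrative:

```python
def depth_recurrent_schedule(n_blocks=11, recur_start=3, recur_end=5, loops=3):
    """Virtual execution order for depth recurrence: the span
    [recur_start, recur_end] runs `loops` times (1 normal pass + 2 extra),
    reusing the same weights on every pass."""
    pre = list(range(recur_start))
    span = list(range(recur_start, recur_end + 1)) * loops
    post = list(range(recur_end + 1, n_blocks))
    return pre + span + post

order = depth_recurrent_schedule()
print(len(order))  # 17 virtual layers from 11 physical blocks
print(order)       # [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Parameters scale with the 11 physical blocks while compute depth is 17, which is where the parameter-efficiency win comes from.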

  3. SP8192 tokenizer is transformative (community). Community's jump from SP1024 to SP8192 unlocked ~0.04 bpb improvement. But the larger embedding table (8192×512) needs GPTQ with SDClip — naive int8+brotli gives 10× worse quant degradation.

  4. Parallel residuals improve quantization for free (community). GPT-J-style two-lane routing (attn/MLP read same input) from the community baseline collapses the quant gap vs single-lane. Cross-lane accumulation (community ImprovedParallelResiduals, PR Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523) pushed this further to 1.0744.
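A scalar toy sketch of the routing difference (lane functions are placeholders, not the real attention/MLP): in the GPT-J-style parallel form both lanes read the same block input, so the sequential cross term mlp(attn(...)) disappears.

```python
def parallel_block(x, attn, mlp):
    # GPT-J-style two-lane routing: attn and MLP read the SAME input
    # and both add into the residual stream.
    return x + attn(x) + mlp(x)

def sequential_block(x, attn, mlp):
    # Conventional routing: the MLP reads the post-attention stream.
    h = x + attn(x)
    return h + mlp(h)

attn = lambda x: 0.5 * x   # toy lanes
mlp = lambda x: 0.25 * x
print(parallel_block(1.0, attn, mlp))    # 1 + 0.5 + 0.25 = 1.75
print(sequential_block(1.0, attn, mlp))  # 1.5 + 0.375 = 1.875
```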

  5. Meta-TTT has an architecture-limited ceiling. 4 experiments (exp101, 105a, 106, 107) show identical TTT delta ~0.023 bpb regardless of inner-loop optimizer (SGD, MetaSGD, SAM, none). The ceiling is set by bank architecture, not training.

  6. Auxiliary losses are fatal in compute-starved regimes. JEPA, focal loss, boundary boost, MTP — every auxiliary objective tested hurt. With 1200-4700 steps, every gradient must directly reduce CE loss.

  7. Don't fight the optimizer. Muon's orthogonal constraint is a feature. VR_INIT must be 0.5 (lower → negative alphas). Embed LR ratio is misleading because Muon normalizes gradient direction. Progressive unfreezing prevents co-adaptation.

  8. Quantization improvements are free BPB. Per-row clip search (-25% quant error), int6 for MLP proj (3.4× less error), GPTQ with SDClip — all zero training cost. Always sweep AWQ alpha for each new best model.
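A minimal sketch of the per-row clip search; the candidate grid and the symmetric int8 scheme are illustrative assumptions, not the exact pipeline (which also uses int6 and GPTQ with SDClip):

```python
import numpy as np

def quantize_int8(row, clip):
    """Symmetric per-row int8 quantization, clipping the range at
    `clip` * max|w| before computing the scale."""
    amax = np.abs(row).max() * clip
    if amax == 0:
        return row.copy()
    scale = amax / 127.0
    return np.clip(np.round(row / scale), -127, 127) * scale

def clip_search(W, candidates=(0.6, 0.7, 0.8, 0.9, 1.0)):
    """Per-row clip search: for each row, pick the clip ratio that
    minimizes reconstruction MSE. Pure post-processing, zero training cost."""
    out = np.empty_like(W)
    for i, row in enumerate(W):
        best = min(candidates,
                   key=lambda c: ((quantize_int8(row, c) - row) ** 2).mean())
        out[i] = quantize_int8(row, best)
    return out
```

Because clip=1.0 is among the candidates, the searched result is never worse than the naive scale; rows with outliers trade a few clipped values for a much finer grid on the bulk of the weights.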

  9. Simpler is better. Stripping token-type embedding and loss weighting from exp53b actually HELPED. Fewer competing objectives = better convergence in limited steps.

  10. QK_GAIN_INIT=5.25 is a free win (community). Monotonic improvement from 4.0→5.25 observed in the community SP8192 baseline. Per-head query gain initialization helps attention patterns specialize faster.

  11. Partial RoPE 16/64 is universally good. Frees 75% of head dims for semantic matching, reduces quantization outliers 3×, and improves word-start attention. Consistent across every experiment it was tested in.
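A sketch of partial RoPE with 16 of 64 head dims rotated; the half-split pairing convention and base=10000 are assumptions, not necessarily the exact implementation:

```python
import numpy as np

def partial_rope(q, rope_dims=16, base=10000.0):
    """Partial RoPE: rotate only the first `rope_dims` of the head dims;
    the remaining dims stay position-free for pure semantic matching.
    q: (seq_len, head_dim)."""
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(q.shape[0])[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=1)
```

The untouched 48 dims are identical at every position, which is exactly the "75% of head dims freed for semantic matching" property; rotation is norm-preserving, so no activation outliers are introduced by position.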

  12. Word-start tokens dominate total loss. 25-40% of tokens but 42-66% of total loss. Mean loss 3.6-5.1 vs 1.2-1.6 for continuations. The best fix is architectural (partial RoPE), not loss manipulation (focal, weighting).

  13. Layer sharing revives dead blocks. Block 9 was dead at 6.1% effective rank. Sharing block 3 at position 9 revived it to 10.3%. Fewer unique blocks = smaller artifact = more headroom for params.
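The sharing schedule and rank proxy below are hypothetical illustrations of the block-3-at-position-9 trick and the effective-rank diagnostic, not the actual model code:

```python
import numpy as np

# Hypothetical layer schedule: physical block 3's weights are reused at
# position 9, whose dedicated block had collapsed to ~6% effective rank.
SCHEDULE = [0, 1, 2, 3, 4, 5, 6, 7, 8, 3, 10]  # 10 unique blocks, 11 positions

def effective_rank_frac(W, tol=1e-3):
    """Fraction of singular values above tol * largest: a rough proxy for
    how much of a weight matrix's capacity is actually in use."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s > tol * s[0]).mean())

# A healthy random block uses its full spectrum; a collapsed one does not.
healthy = np.random.default_rng(0).standard_normal((64, 64))
collapsed = np.outer(np.ones(64), np.ones(64))  # rank 1
print(effective_rank_frac(healthy) > effective_rank_frac(collapsed))  # True
```

Dropping one unique block also shrinks the artifact, which is the "fewer unique blocks = more headroom" half of the win.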

  14. Resid-norm is redundant with warmdown. Adding RMSNorm after skip connections improves quant but costs ~7ms/step (19 fewer training steps). With proper LR warmdown, weights are already smooth enough.
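For reference, a minimal sketch of the resid-norm idea that was tested and dropped; the lane functions are placeholders:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm: rescale by root-mean-square; no mean subtraction, no bias.
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def block_with_resid_norm(x, attn, mlp):
    """Re-normalize the residual stream after each skip connection so its
    norm cannot grow unboundedly across layers (the 19.7 -> 89.5 growth
    diagnosed in exp27b). The extra norms cost ~7 ms/step."""
    x = rmsnorm(x + attn(x))
    return rmsnorm(x + mlp(x))
```

The point of learning 14 is that with a proper LR warmdown the weights end up smooth enough that this stabilizer buys quantization robustness you no longer need, at a step-count price you can't afford.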

  15. Block sharing fails across encoder/decoder boundary. Shared blocks at decoder positions converge to near-zero scales — effectively dead. Soft gates correctly diagnose the problem but can't override it (exp109).

  16. The model tells you what it wants. Block 0 attention dies (structural, MLP-dominant). Block 8 ve_scale grows to 0.88 (wants identity in deep-layer values). Bigram scale decays 0.26→0.10 (attention supersedes local patterns). Listen to the learned parameters.

  17. Co-occurrence QK initialization works. Initializing W_Q/W_K from bigram SVD gives meaningful step-0 attention patterns instead of random noise. Validated at 1.3525 bpb on 1×H100.
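A sketch of the co-occurrence QK initialization idea (PR #623); the log damping, shapes, and scaling here are illustrative assumptions:

```python
import numpy as np

def cooccurrence_qk_init(token_ids, vocab, d_head):
    """Initialize per-token query/key tables from the SVD of a (log-damped)
    bigram co-occurrence matrix, so step-0 attention logits already encode
    which tokens tend to follow which instead of being random noise."""
    C = np.zeros((vocab, vocab))
    for a, b in zip(token_ids[:-1], token_ids[1:]):
        C[a, b] += 1.0
    C = np.log1p(C)                # damp heavy-tailed counts
    U, S, Vt = np.linalg.svd(C)
    root_s = np.sqrt(S[:d_head])
    W_Q = U[:, :d_head] * root_s   # (vocab, d_head)
    W_K = Vt[:d_head].T * root_s   # (vocab, d_head)
    return W_Q, W_K
```

By construction, W_Q @ W_K.T is the best rank-d_head approximation of the log bigram matrix, so a query for token a scores highest against keys for the tokens that tend to follow a.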

  18. Warmdown timing is critical. warmdown=400 steps (start at step 900) gives 4 SWA checkpoints and proper LR decay. Too late (warmdown=200) → only 2 checkpoints. Community uses 3500-4000 iters on longer runs.
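A toy sketch of the schedule arithmetic, with numbers chosen to match the 1300-iteration, warmdown=400 (start at step 900), SWA_EVERY=100 setup described; the linear decay shape and base LR are assumptions:

```python
def lr_at(step, base_lr=0.02, total=1300, warmdown=400):
    """Constant LR, then linear warmdown to zero over the final
    `warmdown` steps. base_lr is illustrative."""
    start = total - warmdown
    if step < start:
        return base_lr
    return base_lr * (total - step) / warmdown

def swa_checkpoints(total=1300, warmdown=400, swa_every=100):
    # SWA snapshots taken every `swa_every` steps once warmdown starts.
    start = total - warmdown
    return list(range(start + swa_every, total + 1, swa_every))

print(len(swa_checkpoints(warmdown=400)))  # 4 checkpoints
print(len(swa_checkpoints(warmdown=200)))  # only 2
```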

  19. Size budget is a hard constraint — check BEFORE celebrating. embed_dim=448 achieved great BPB (1.0877) but at 16.28MB — over the 16MB limit. embed_dim=416 similar story at 16.44MB. Multiple experiments wasted on approaches that couldn't fit.


Complete Experiment Index

Every experiment across all phases in one table.

# Experiment Base Motivation Result Learning
Phase 3a (exp00–exp18)
1 exp00 (baseline-rerun) exp27 Establish baseline on A100 Baseline Quant bpb 1.3389; bigram.proj has worst quant error
2 exp01b (ln-scale-only) exp27 Test layer-norm damping Not run
3 exp01c (ema-only) exp27 Test EMA weight averaging Not run
4 exp01d (xsa-only) exp27 Test cross-sequence attention Negative XSA slows steps without quality gain
5 exp02 (speed-bigramfp16-awq) exp00 FP16 bigram + per-category AWQ Negative FP16 bigram blows artifact to 17.3MB
6 exp03 (qat-ste) exp00 Quantization-aware training via STE Negative QAT-STE destabilizes training; worst result
7 exp04 (no-cyclic-momentum) exp00 Test fixed momentum=0.95 Negative Cyclic momentum is slightly helpful as regularization
8 exp05 (grad-accum4) exp00 Double step count via accum 8->4 Positive Major breakthrough: 2x more steps = first sub-1.32 quant bpb
9 exp06 (swa-awq-accum2) exp05 Push accum to 2; SWA+AWQ tuning Positive Raw bpb breaks below 1.30 for first time
10 exp07 (tighter-swa-awq) exp06 SWA_EVERY=150, AWQ=0.7 Neutral Smaller artifact, quant bpb identical; sweet spot is SWA_EVERY=100
11 exp08 (ctx-freq-bias) exp05 Learned token burstiness bias (+1 param) Neutral Redundant with attention; smallest artifact at 15.0MB
12 exp09 (padignore-wordboost) exp06 Skip pad tokens + word-start boost Positive Best quant bpb (1.3145); 0+1 params beat 688K trigram params
13 exp10 (trigram-unigram) exp09 Trigram hash table + unigram bias Neutral Best raw bpb but quant bpb regresses — extra params compress poorly
14 exp11 (trigram-slim-awq07) exp10 Slim trigram dim=48, AWQ=0.7 Negative dim=48 too small, AWQ too aggressive; double regression
15 exp12 (trigram64-awq06) exp10 Middle-ground trigram dim=64 Neutral Better than exp11 but worse than exp09; hash collisions too frequent
16 exp13 (multihead-gate-bigram) exp09 K=2 hash heads + context gate Positive Tied best quant bpb; collision reduction real but impact negligible
17 exp14 (engram-multiorder) exp13 1-5gram, 10 lookups/position Negative Shared n-gram embeddings cause destructive interference
18 exp15 (engram-3order) exp14 1-3gram with orthogonal subspaces Neutral Better isolation but each subspace too small (~42 dims)
19 exp16 (jepa-aux) exp15 JEPA predictor MLP, MSE loss Negative Biggest regression; fixed hash targets provide adversarial gradient
20 exp17 (byte-engram) exp16 Byte boundary features Negative No gain; base too weak to evaluate
21 exp18 (separate-trigram64) exp13 Separate 64-dim trigram + projection Neutral 688K extra params don't survive quantization
Phase 3b-Part1 (exp27b–exp33b)
22 exp27b (resid-norm) exp09 RMSNorm after skip connections Positive High-leverage: attacks root cause of quant error (norm growth 19.7->89.5)
23 exp28b (perlayer-quant) exp09 Variable bitwidth per layer Negative 16% MSE reduction but over 16MB budget
24 exp29b (lossweight-typemb) exp09 1.5x word-start loss + token-type embed Positive Gradient redistribution + structural signal both help
25 exp30b (combo) exp09 Stack all 3 validated improvements Positive Phase 3b-Part1 SOTA (1.3156); sub-additive but substantial
26 exp31b (rope-50k) exp30b RoPE base 10k->50k Negative Best raw bpb but quant gap widens; net negative after quantization
27 exp32b (aux-boundary) exp30b Auxiliary word-boundary classifier Negative Gradient waste; token-type already provides structural signal
28 exp33b (alt-rope-ntk) exp30b Alternating RoPE bases + NTK Neutral Marginal; positional loss still flat after 256 tokens
Phase 3b-Part2 (exp34b–exp48b)
29 exp34b (lr-schedule-fix) exp30b Fix ITERATIONS 20000->1300 so warmdown fires Positive Single biggest improvement (-0.0166 bpb); warmdown was never firing
30 exp35b (focal-loss) exp30b Focal loss gamma=2 Negative Too aggressive; suppresses easy token gradients
31 exp36b (cappedact-labelsmooth) exp30b Activation cap + label smoothing Negative Both changes hurt independently and together
32 exp37b (fused-cap) exp34b Activation cap only Negative Cap hurts raw quality more than it helps quant
33 exp38b (speed-opt) exp34b Speed optimization Neutral Failed (OOM)
34 exp39b (swa-tuning) exp34b SWA parameter sweep Positive SWA_EVERY=100 confirmed optimal
35 exp42b (revive-block9) exp34b Share block 3 at position 9 Positive Dead block 9 (6.1% rank) revived to 10.3%
36 exp43b (boundary-boost) exp42b Boundary loss boost Neutral Too sparse (2.5% of positions) to matter in 1200 steps
37 exp44b (seqlen-curriculum) exp42b Sequence length curriculum Negative Speed regression
38 exp45b (awq-alpha07) exp42b AWQ alpha sweep (post-train) Neutral Alpha=0.7 gave -0.007 bpb free on exp42b
39 exp46b (full-mha) exp42b 8 KV heads (double from 4) Neutral Extra params but slower; depth > width
40 exp47b (warmdown200) exp42b Shorter warmdown=200 Negative Too late; only 2 SWA checkpoints vs 4 with warmdown=400
41 exp48b (10blocks-depth) exp42b Add 10th unique block Positive Depth > width confirmed; quant bpb 1.2930
Phase 3b-Part3 (exp53b–clean_54b)
42 exp53b (lean-combo) exp48b Strip token-type + loss weighting Positive Removing features HELPED; quant bpb 1.2720 (-0.021!)
43 exp54b (xsa-zstd-ckfix) exp53b XSA last 2 layers + c_k fix + zstd Positive 1xH100 SOTA: quant bpb 1.2708
44 exp55b (scaled-xsa-all) exp54b Learned XSA alpha on all layers Neutral Model wants XSA everywhere (alpha=0.75-0.99) but 20ms overhead
45 exp56b (fast-cosine-xsa) exp55b Cosine-scale XSA approximation Negative GQA head expansion is bottleneck, not XSA math
46 exp57b (lora-ttt) exp54b LoRA-based TTT Negative Failed
47 exp58b (resid-norm-on) exp54b Re-enable resid-norm Negative Redundant with warmdown; 7ms/step overhead not worth it
48 exp59b (pre-norm-skip) exp54b Pre-skip normalization Negative Same overhead as full resid-norm, no quality difference
49 clean_54b (final-arch) exp54b Clean submission version + TTT Positive Quant bpb 1.2723; clean baseline
50 clean_54b_v2 (bf16-roundtrip) clean_54b BF16 roundtrip test Negative Destroyed quality
Phase 3.5 (exp60–exp80)
51 exp60 (8xh100-sim) exp54b EMA + flash_attn3 + 8xH100 simulation Neutral Infrastructure for scaling; not a bpb experiment
52 exp61b (xsa-all-warmdown) exp60 XSA all blocks + cosine warmdown Positive Pre-quant 1.1504; XSA-all works at scale
53 exp63 (cascade-vr) exp61b Cascading value residual + adaptive warmdown Positive Pre-quant 1.1377; discovered deep-layer value highway
54 exp64 (mlp-int6) exp63 MLP int6 quantization Not run Superseded by exp69
55 exp65 (quant-overhaul) exp63 Full quantization overhaul Not run Ideas flowed into exp69
56 exp66 (mile-nope) exp65 MiLe loss + partial NoPE Negative MiLe hurts early convergence
57 exp67 (ws-semantic-attn) exp66 Word-start semantic attention Negative Failed
58 exp68 (ws-mtp) exp66 Next-word-start MTP head Not run TTT data leakage concern
59 exp69 (better-quant) exp63 MLP proj->int6, attn->int5, LZMA, prune 5% Positive Closed quant gap 0.035->0.015; free improvements
60 exp70 (speed-opt) exp69 Batched NS5, EMA/10, set_to_none, deferred .item() Positive ~1.15 bpb; speed-optimized foundation for all subsequent
61 exp71 (output-bias) exp70 Output bias + label smooth + Z-loss Not run Needs too many steps to build momentum
62 exp72 (jepa-concept) exp70 JEPA concept loss Negative Added overhead, not enough steps even at 7K
63 exp73 (warmdown-focal) exp70 Warmdown focal + TTT weight Not run Safe late-training intervention (designed)
64 exp74 (prope-qgain-wbigram) exp70 Partial RoPE 16/64 + diverse q_gain + word bigram Positive Sliding bpb 1.1456; heads specialized (sharp+soft)
65 exp75 (word-pool) exp74 Inject previous word-start embedding Negative Model suppressed it (scale 0.1->0.002); redundant with attention
66 exp76 (dual-word-attn) exp74 Dual token + word attention Negative Failed
67 exp77old (late-warmdown) exp70 Late warmdown only Neutral Superseded by exp77
68 exp77 (progressive-batch) exp70 Progressive batch + seq_len curriculum Not run Theoretically sound but non-standard
69 exp78 (ws-loss-curriculum) exp70 Word-start loss curriculum 0.1->1.0 Positive Best embedding quality; WS rank improved
70 exp79 (position-ramp) exp70 Position ramp 1.0->1.2 + late WS boost Negative Premise wrong: late positions are EASIER (90% repeats)
71 exp80 (best-stack) exp70 Combine pRoPE + bigram-after-norm + pos ramp + clamp Negative Bigram-after-norm destabilized attention
Phase 3.6 (exp81–exp87)
72 exp81 (prope-ws-curriculum) exp78 Partial RoPE + WS curriculum Neutral Failed
73 exp82 (drop-layer10) exp81 Drop layer 10 + diverse q_gain Not run Designed only
74 exp83 (diagnostics) exp70 Full diagnostic run: grad norms, VR health, block analysis Positive 7 actionable insights; premature warmdown, dead blocks identified
76 exp84 (diagnostic-tuned) exp83 Apply diagnostics: VR_init=0.3, embed_lr=0.015 Negative VR went negative; embed_lr ratio misleading with Muon
77 exp85 (community-derived) exp83 pRoPE + x0-to-V + LN scale + clip search + small bigram Positive Best pre-quant (1.1517); ve_scale revealed model preferences
78 exp86 (deep-opt) exp85 Fused QKV + int8 critical + TF32 Not run Designed
79 exp87 (fast-convergence) exp85 Embed preinit SVD + progressive unfreeze + block9 AdamW Negative All 3 hurt; don't fight Muon's orthogonal constraint
Phase 3b-Muon (parallel optimizer)
80 exp70_parallel_muon exp70 Parallel Muon via reduce-scatter/all-gather overlap Positive 12% speed (658ms vs 750ms); same final bpb
81 exp70_vram_opt exp70_parallel_muon Double-buffer data loader Negative Insufficient buffers for grad_accum
82 exp70_cuda_fused exp70_parallel_muon CUDA Graphs + Triton fusion Negative No improvement
83 exp90 (copy-head) exp70_parallel_muon TopicCopyHead (hybrid freq+attn) Neutral Concept validated; 40ms overhead
84 reverted_exp70 exp70_parallel_muon Clean base with all fixes Positive Clean foundation; 656ms/step
85 exp91 (smooth-v0residual) reverted_exp70 V0 residual + label smoothing Neutral Pending validation
Phase 3c (exp92–exp109)
86 exp92 (banks-asyncmuon) exp70 Major rewrite: bank tensors + async Muon + partial RoPE + QAT + VE Positive ~1.131 bpb; paradigm shift in architecture
87 exp93 (meta-ttt) exp92 Meta-TTT inner/outer FOMAML Positive Legal_ttt ~1.116; first meta-TTT integration
88 exp95 (size-opt-metattt2x) exp93 Size optimization + meta-TTT 2x Positive Legal_ttt 1.1169; SOTA at the time
89 exp96 (warmdown-trigram) exp95 Warmdown fix + trigram hash Neutral ~1.135 bpb; marginal
90 exp97 (fp8-pipeline) exp96 FP8 pipeline + compile Not run Designed
91 exp98 (metattt-randomsplit) exp96 Random-split FOMAML + momentum LR match Neutral ~1.135 bpb; no improvement
92 exp99 (tripleloop) exp98 Triple loop + parallel residuals Not run Community merged first
93 exp100 (half-metattt) exp95 Half meta-TTT variant Neutral Not tracked in detail
94 exp101 (poscond-bigram) exp95 Position-conditional bigram hash by token class Positive Legal_ttt 1.11588; zero-param trick splitting hash by word-start
95 exp105a (no-metattt ablation) exp101 Remove meta-TTT to measure its contribution Neutral Meta-TTT = +0.00036 bpb (noise); ceiling is architectural
96 exp106 (metasgd-crosschunk) exp101 MetaSGD + cross-chunk FOMAML Neutral TTT delta invariant at ~0.023; ceiling confirmed
97 exp107 (sam-inner) exp106 SAM inner loop for TTT Negative SAM hurts; TTT delta still ~0.023 regardless of optimizer
98 exp108 (sp8192-brotli) exp106 SP8192 tokenizer + Brotli compression Neutral No stored results
99 exp109 (shared-blocks-softgate) exp101 Block sharing K=8 + soft gates + SP8192 Negative Decoder positions dead (near-zero scales); 10x worse quant
Community SOTA (SP8192+)
100 SP8192_3LayerRecur (community) Community SP8192 + 3-layer recurrence (blocks 3-5) + parallel residuals + QK_GAIN=5.25 Positive Legal_ttt 1.0808; paradigm shift — 17 virtual layers from 11 physical
101 WiderEmb_TapInV6_TTT (community) Community Wider loop (3x3) + per-pass embeddings + Tap-In V6 + legal TTT Positive Legal_ttt 1.0788 (3-seed mean 1.078825)
102 ImprovedParallelResiduals (community PR #1523) Community Cross-lane attn/MLP accumulation + CUTLASS EVT fusion Positive Legal_ttt 1.0744 — CURRENT BEST; 71 bytes headroom
103 RecurStepFiLM_PooledRetrieval (community) Community FiLM conditioning + pooled retrieval Neutral No improvement over base
104 10L_RecurStepFiLM_PooledRetrieval (community) Community 10L variant of FiLM+retrieval Neutral No improvement
105 newSota (community) Community Community SOTA integration Positive Integration checkpoint
106 11L_RecurStep3_loopedonly Community 11L, recurrence step 3, looped-only Neutral No improvement over ImprovedParallelResiduals
107 11L_RecurStep3_loops3 Community 11L with 3 loops Neutral No improvement
108 11L_RecurStep_StochDepth_ProgLoop Community Stochastic depth + progressive loop Neutral No improvement
109 11L_RecurStep_StochDepth_ProgLoop_KVCache Community + KV cache for recurrence Neutral No improvement
110 11L_Block10MLPHalf_RecurStepFiLM Community Block 10 MLP halved + FiLM + retrieval Neutral No improvement
111 loop_in_SP8192_3LayerRecur Community Loop detection: timestep embed + re-injection + per-loop RMSNorm Neutral Not yet trained
Frontier (exp110–exp119)
112 exp110 (perlayer-quant-trigram) ImprovedParallelResiduals Per-layer quant + trigram + PARALLEL_START=7 Neutral No improvement
113 exp111 (lora-ttt-shrunk) ImprovedParallelResiduals LoRA TTT rank=8 + shrunk block 10 MLP Neutral No improvement
114 exp112 (grad-rescaling) ImprovedParallelResiduals Gradient rescaling on weak blocks Negative Doesn't fix structural tied-embedding bottleneck
115 exp113 (drop-l0-mtp) ImprovedParallelResiduals Drop L0 MLP + batch schedule + MTP Neutral Truncated logs
116 exp114 (embed384-decouple) ImprovedParallelResiduals embed_dim=384 to decouple boundary blocks Negative 655K param loss -> BPB regression (1.0950)
117 exp115 (embed384-asymmetric) ImprovedParallelResiduals embed_dim=384 + drop boundary MLPs Neutral Truncated
118 exp116 (embed384-no-x0) ImprovedParallelResiduals embed_dim=384 + remove x0 pathway Negative No stored results
119 exp117 (embed448-tuned) ImprovedParallelResiduals embed_dim=448 to activate boundary blocks Negative Good BPB (1.0877) but 16.28MB — over budget
120 exp118 (embed416-parstart7) ImprovedParallelResiduals embed_dim=416 + parallel_start=7 + tighter clip Negative Good BPB (1.0915) but 16.44MB — over budget
121 exp119 (residual-lowrank-proj) ImprovedParallelResiduals Residual low-rank projection (rank=32) Neutral Theoretically correct fix; not run to completion
Misc
122 CooccurrenceQKInit PR #623 Init W_Q/W_K from bigram co-occurrence SVD Positive Val_bpb 1.3525 on 1xH100; meaningful step-0 attention patterns
