[Non-record] Experimentation Summary: Autopsy of 100+ Experiments — What Worked, What Didn’t, Mind Map for LLM Agents, etc.#1602
Open
SPThole wants to merge 14 commits into openai:main
This is kind of a summary of (maybe not all, lol) the experimentation I did — lots of learning. Thanks to OpenAI for the $525 + a few hundred dollars of my own credits! I’d love to try more ideas I had but couldn’t due to a shortage of credits.
In this journey, I tried not to get bogged down by leaderboard approaches as much as possible. In a few places, though, when I got stuck, I did take help from the community. My general approach was: train a model → analyze it → try to solve the issues observed in the analysis. This ended up costing me many experiments and dollars. I used LLMs/agents to a great extent but kept myself in the driver's seat to direct the experiments.
GIT REPO TO FIND ALL EXPERIMENTS:
https://github.com/SPThole/parameter-golf-experimentations along with STRUCTURED_EXPSUM.md in this PR
I have also made a cool mind map of all the experimentation — basically the path of what I did and why. I've also attached the relevant lineages from community discussions and leaderboard files.
I am planning to build on this:
https://github.com/SPThole/bpb_wtf or visit: https://bpb-wtf.vercel.app/
I'm also building a broader direction around this (mind map + experiments): basically, convert a thinking pattern into a graph, then embed it in the LLM's context or parameters so the model can follow that pattern. If this resonates with anyone or you'd like to collaborate, feel free to reach out — I'd love to explore this further together.
TL;DR: Top Learnings Across All Phases
Steps > everything else. More optimizer updates in the same wallclock matter a lot.
Depth recurrence is the best parameter-efficiency trick (community). 3-layer recurrence (blocks 3-5, 2 extra passes) from the community SP8192 baseline gives 17 virtual layers from 11 physical — the single biggest architectural win. Only works within the encoder, NOT across encoder/decoder boundary.
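To make the "17 virtual layers from 11 physical" arithmetic concrete, here is a minimal sketch of the execution schedule for depth recurrence. The function name and structure are my own illustration, not the PR's actual code; it only shows how re-running blocks 3-5 two extra times yields 17 block applications.

```python
# Hypothetical sketch of depth recurrence: blocks 3-5 are re-run 2 extra
# times, so 11 physical encoder blocks give an effective depth of 17.
PHYSICAL_BLOCKS = list(range(11))   # 11 physical encoder blocks
RECURRENT_SPAN = (3, 5)             # inclusive block indices to repeat
EXTRA_PASSES = 2                    # additional passes over the span

def forward_schedule(blocks, span, extra_passes):
    """Return the order in which physical blocks are executed."""
    lo, hi = span
    schedule = list(blocks[:hi + 1])          # blocks 0..5, first pass
    for _ in range(extra_passes):
        schedule += blocks[lo:hi + 1]         # re-run blocks 3..5
    schedule += blocks[hi + 1:]               # remaining blocks 6..10
    return schedule

sched = forward_schedule(PHYSICAL_BLOCKS, RECURRENT_SPAN, EXTRA_PASSES)
print(len(sched))   # 17 virtual layers
print(sched)
```

The same scheduling trick cannot cross the encoder/decoder boundary, per the finding above.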
SP8192 tokenizer is transformative (community). The community's jump from SP1024 to SP8192 unlocked ~0.04 bpb improvement. But the larger embedding table (8192×512) needs GPTQ with SDClip — naive int8+brotli gives 10× worse quant degradation.
Parallel residuals improve quantization for free (community). GPT-J-style two-lane routing (attn/MLP read the same input) from the community baseline collapses the quant gap vs single-lane. Cross-lane accumulation (community ImprovedParallelResiduals; PR #1523: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97, val_bpb 1.0778, 3-seed mean) pushed this further to 1.0744.
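The two-lane routing above can be sketched with linear stand-ins for the sublayers (this is my illustration of the GPT-J pattern, not the PR's implementation): in the parallel form, attention and MLP both read the same residual input and their outputs are summed, instead of the MLP reading the post-attention state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_attn = rng.standard_normal((d, d)) * 0.01   # stand-in for the attention sublayer
W_mlp = rng.standard_normal((d, d)) * 0.01    # stand-in for the MLP sublayer

def sequential_block(x):
    # Standard ordering: the MLP reads the post-attention residual.
    x = x + x @ W_attn
    x = x + x @ W_mlp
    return x

def parallel_block(x):
    # GPT-J-style two-lane routing: attention and MLP read the SAME input,
    # and both outputs are summed into one residual update.
    return x + x @ W_attn + x @ W_mlp

x = rng.standard_normal((4, d))
print(parallel_block(x).shape)  # (4, 8)
```

The two forms differ only by the second-order term (attention output fed through the MLP), which is one intuition for why the parallel form is friendlier to quantization: both lanes see the same, un-perturbed input distribution.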
Meta-TTT has an architecture-limited ceiling. 4 experiments (exp101, 105a, 106, 107) show identical TTT delta ~0.023 bpb regardless of inner-loop optimizer (SGD, MetaSGD, SAM, none). The ceiling is set by bank architecture, not training.
Auxiliary losses are fatal in compute-starved regimes. JEPA, focal loss, boundary boost, MTP — every auxiliary objective tested hurt. With 1200-4700 steps, every gradient must directly reduce CE loss.
Don't fight the optimizer. Muon's orthogonal constraint is a feature. VR_INIT must be 0.5 (lower → negative alphas). Embed LR ratio is misleading because Muon normalizes gradient direction. Progressive unfreezing prevents co-adaptation.
Quantization improvements are free BPB. Per-row clip search (-25% quant error), int6 for MLP proj (3.4× less error), GPTQ with SDClip — all zero training cost. Always sweep AWQ alpha for each new best model.
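A minimal sketch of the per-row clip search idea (my own toy version — the names, candidate grid, and MSE objective are illustrative, not the PR's code): for each weight row, try several clip fractions of max|w| and keep the one minimizing round-trip int8 error. Clipping outliers shrinks the scale, so inlier weights quantize more finely.

```python
import numpy as np

def quantize_int8(w, scale):
    """Symmetric int8 round-trip: quantize then dequantize."""
    return np.clip(np.round(w / scale), -127, 127) * scale

def per_row_clip_search(W, candidates=np.linspace(0.5, 1.0, 11)):
    """For each row, pick the clip fraction of max|w| minimizing MSE."""
    W_q = np.empty_like(W)
    for i, row in enumerate(W):
        amax = np.abs(row).max()
        best_err, best = np.inf, row
        for c in candidates:
            scale = max(c * amax, 1e-12) / 127.0
            rec = quantize_int8(np.clip(row, -c * amax, c * amax), scale)
            err = np.square(rec - row).sum()
            if err < best_err:
                best_err, best = err, rec
        W_q[i] = best
    return W_q

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 32))
W[:, 0] *= 20.0   # inject outliers so clipping matters
naive_scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
err_naive = np.square(quantize_int8(W, naive_scale) - W).sum()
err_clip = np.square(per_row_clip_search(W) - W).sum()
print(err_clip <= err_naive)  # True
```

Since the candidate grid includes the full range (c = 1.0), the search can never do worse than naive quantization — it is free BPB in exactly the sense the bullet describes.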
Simpler is better. Stripping token-type embedding and loss weighting from exp53b actually HELPED. Fewer competing objectives = better convergence in limited steps.
QK_GAIN_INIT=5.25 is a free win (community). Monotonic improvement from 4.0→5.25 observed in the community SP8192 baseline. Per-head query gain initialization helps attention patterns specialize faster.
Partial RoPE 16/64 is universally good. Frees 75% of head dims for semantic matching, reduces quantization outliers 3×, and improves word-start attention. Consistent across every experiment it was tested in.
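A sketch of what "partial RoPE 16/64" means mechanically (a standard RoPE formulation applied to only the first 16 of 64 head dims — my illustration, not the PR's kernel): the remaining 48 dims carry no positional rotation and stay free for pure content matching.

```python
import numpy as np

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of each 64-dim head; leave the
    remaining dims position-free for semantic matching."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = q[..., :half], q[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dims rot_dims..63 pass through untouched (no positional signal).
    return np.concatenate([rotated, q[..., rot_dims:]], axis=-1)

q = np.ones((1, 64))
out = partial_rope(q, pos=7)
print(out.shape)  # (1, 64)
```

The rotation is norm-preserving on the rotated pairs, and the untouched 75% of dims is what frees capacity for semantic matching and reduces positional outliers at quantization time.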
Word-start tokens dominate total loss. 25-40% of tokens but 42-66% of total loss. Mean loss 3.6-5.1 vs 1.2-1.6 for continuations. The best fix is architectural (partial RoPE), not loss manipulation (focal, weighting).
Layer sharing revives dead blocks. Block 9 was dead at 6.1% effective rank. Sharing block 3 at position 9 revived it to 10.3%. Fewer unique blocks = smaller artifact = more headroom for params.
Resid-norm is redundant with warmdown. Adding RMSNorm after skip connections improves quant but costs ~7ms/step (19 fewer training steps). With proper LR warmdown, weights are already smooth enough.
Block sharing fails across encoder/decoder boundary. Shared blocks at decoder positions converge to near-zero scales — effectively dead. Soft gates correctly diagnose the problem but can't override it (exp109).
The model tells you what it wants. Block 0 attention dies (structural, MLP-dominant). Block 8 ve_scale grows to 0.88 (wants identity in deep-layer values). Bigram scale decays 0.26→0.10 (attention supersedes local patterns). Listen to the learned parameters.
Co-occurrence QK initialization works. Initializing W_Q/W_K from bigram SVD gives meaningful step-0 attention patterns instead of random noise. Validated at 1.3525 bpb on 1×H100.
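A toy sketch of the co-occurrence QK initialization idea (hypothetical reconstruction — the log damping, rank choice, and least-squares step are my assumptions, not necessarily the PR's recipe): factor a bigram count matrix with SVD, then solve for projections so that Q·K scores at step 0 approximate bigram statistics instead of random noise.

```python
import numpy as np

def qk_init_from_bigrams(token_ids, vocab, d_head, embed):
    """Initialize W_Q / W_K from an SVD of the bigram co-occurrence matrix,
    so (E @ W_Q) @ (E @ W_K).T roughly matches log bigram counts."""
    C = np.zeros((vocab, vocab))
    for a, b in zip(token_ids[:-1], token_ids[1:]):
        C[a, b] += 1.0
    C = np.log1p(C)                       # dampen heavy-tailed counts
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    A = U[:, :d_head] * np.sqrt(S[:d_head])      # query-side factor (vocab, d_head)
    B = Vt[:d_head].T * np.sqrt(S[:d_head])      # key-side factor (vocab, d_head)
    # Least-squares projections from the embedding table onto the factors.
    W_Q, *_ = np.linalg.lstsq(embed, A, rcond=None)
    W_K, *_ = np.linalg.lstsq(embed, B, rcond=None)
    return W_Q, W_K

rng = np.random.default_rng(0)
vocab, d_model, d_head = 16, 8, 4
token_ids = rng.integers(0, vocab, size=500)
embed = rng.standard_normal((vocab, d_model))
W_Q, W_K = qk_init_from_bigrams(token_ids, vocab, d_head, embed)
print(W_Q.shape, W_K.shape)  # (8, 4) (8, 4)
```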
Warmdown timing is critical. warmdown=400 steps (start at step 900) gives 4 SWA checkpoints and proper LR decay. Too late (warmdown=200) → only 2 checkpoints. Community uses 3500-4000 iters on longer runs.
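For concreteness, a minimal linear-warmdown schedule matching the numbers above (warmdown=400 starting at step 900 of a 1300-step run — the linear shape is my assumption; only the step counts come from the text):

```python
def lr_at(step, base_lr=0.02, total=1300, warmdown=400):
    """Constant LR, then linear warmdown to zero over the last
    `warmdown` steps (here: steps 900..1300)."""
    start = total - warmdown
    if step < start:
        return base_lr
    return base_lr * (total - step) / warmdown

print(lr_at(0), lr_at(900), lr_at(1100), lr_at(1300))
```

With warmdown=200 the decay starts at step 1100 instead, leaving room for only 2 SWA checkpoints instead of 4 on the same checkpoint cadence.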
Size budget is a hard constraint — check BEFORE celebrating. embed_dim=448 achieved great BPB (1.0877) but at 16.28MB — over the 16MB limit. embed_dim=416 similar story at 16.44MB. Multiple experiments wasted on approaches that couldn't fit.
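The size check above is cheap to run before training; here is a back-of-envelope sketch (hypothetical helper names; it ignores compression such as brotli and quantization metadata, so it is a lower bound on nothing and an estimate only):

```python
def artifact_mb(vocab, embed_dim, other_params, bytes_per_param=1):
    """Rough serialized size in MiB, assuming int8 (1 byte/param) and
    no compression -- an estimate, not the real artifact size."""
    total_params = vocab * embed_dim + other_params
    return total_params * bytes_per_param / (1024 ** 2)

def fits_budget(size_mb, limit_mb=16.0):
    return size_mb <= limit_mb

# e.g. an 8192-entry embedding table at dim 512, int8:
print(round(artifact_mb(8192, 512, other_params=0), 2))  # 4.0
```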
Complete Experiment Index
Every experiment across all phases in one table.