Non-record: Comprehensive Negative Results — What Doesn't Work on Strong Models#1272
Open
andrewbaggio1 wants to merge 1 commit into openai:main
Conversation
…ong models ~30 experiments showing n-grams, OLB, prime adapters, complementary training, and TTT all fail to improve well-trained GPTQ'd models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What this is
A collection of things that don't work on well-trained GPTQ'd models.
My coding agents and I ran ~30 experiments over the past two weeks trying
to push below the 1.11 BPP frontier. Most of them failed. This documents
the failures so others don't repeat them.
Updates and extends my earlier negative results PR (#1186).
Eval-time techniques that don't work on strong models
These all work on weak/undertrained models but provide zero benefit once
your base model is well-trained with Full Hessian GPTQ + sliding window
eval:
The n-gram normalization proof
I built the best possible legal n-gram cache: Kneser-Ney smoothing with
exact trie counts (zero hashing, zero collisions), order 7, full
normalized distribution over all 1024 tokens at every position.
Results on 500K positions: the entire 0.09-0.97 BPP improvement from hashed
n-gram caches was a measurement artifact of unnormalized distributions. The
real signal from properly normalized n-grams is 0.001-0.003 BPP, which is not
worth the complexity.
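For reference, the normalization check above can be sketched as follows. This is a simplified stand-in (absolute discounting with backoff rather than full order-7 Kneser-Ney), but it has the property that mattered for the measurement: at every position the distribution over the whole vocabulary sums to exactly 1, with exact trie-style counts and no hashing. Class and parameter names are hypothetical, not from the PR's code.

```python
from collections import defaultdict

class NGramBackoffModel:
    """Exact-count n-gram model with absolute discounting and backoff.

    Simplified sketch: full Kneser-Ney would use continuation counts
    for the lower orders, but the normalization argument is identical.
    """

    def __init__(self, order=3, vocab_size=1024, discount=0.75):
        self.order = order
        self.vocab_size = vocab_size
        self.discount = discount
        # counts[context_tuple][token] = exact count (trie-like, no hashing)
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, tokens):
        # record counts for every context length 0 .. order-1
        for n in range(1, self.order + 1):
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[ctx][tokens[i + n - 1]] += 1

    def prob_dist(self, context):
        """Full normalized distribution over all vocab_size tokens."""
        ctx = tuple(context[-(self.order - 1):]) if self.order > 1 else ()
        return self._dist(ctx)

    def _dist(self, ctx):
        if ctx not in self.counts:
            # unseen context: back off; empty context falls back to uniform
            if not ctx:
                return [1.0 / self.vocab_size] * self.vocab_size
            return self._dist(ctx[1:])
        c = self.counts[ctx]
        total = sum(c.values())
        # mass freed by discounting each seen type, redistributed via backoff
        freed = self.discount * len(c) / total
        lower = self._dist(ctx[1:]) if ctx else [1.0 / self.vocab_size] * self.vocab_size
        dist = [freed * p for p in lower]
        for tok, cnt in c.items():
            dist[tok] += (cnt - self.discount) / total
        return dist
```

Because the discounted mass is redistributed through the backoff distribution (which itself sums to 1), every returned distribution is properly normalized; an unnormalized cache silently inflates the apparent per-position probability, which is exactly the artifact described above.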
SLOT violates causal dependence
Detailed in my PR #1240: a 100% violation rate across 240 tested pairs, with
a self-prediction advantage of +0.24 nats (shared delta) and +0.73 nats
(per-sample). Every SLOT-based result on the leaderboard is suspect.
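To make the self-prediction leak concrete, here is a toy illustration (not the SLOT implementation, and all names are hypothetical): a model adapted on the very sample it is then scored on shows a NLL advantage even when it learns nothing generalizable, which is why any metric assuming tokens are predicted from the past only is violated.

```python
import math

def mean_nll_nats(probs, tokens):
    """Average negative log-likelihood of tokens under probs, in nats."""
    return sum(-math.log(probs[t]) for t in tokens) / len(tokens)

# Toy "model": a uniform unigram distribution over 4 token ids.
base = {t: 0.25 for t in range(4)}
sample = [0, 0, 0, 1]

# SLOT-style step: adapt on the eval sample itself (add-one smoothing).
# This is the causal-dependence violation: the model is fit on the very
# tokens it is then asked to "predict".
counts = {t: sum(1 for s in sample if s == t) for t in range(4)}
adapted = {t: (counts[t] + 1) / (len(sample) + 4) for t in range(4)}

advantage = mean_nll_nats(base, sample) - mean_nll_nats(adapted, sample)
# advantage > 0 purely because the model has already seen the answer
```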
Scylla tokenizer doesn't help (with correct accounting)
Covered in my other PR. With corrected byte accounting, Scylla gets 1.1289
BPP, roughly on par with SP1024's 1.1157. The entire sub-1.0 claim was a
byte accounting bug in candidate.meta.npz.

What actually matters
After all these experiments, model quality is dominated by:
Files included
ngram_test.py — Kneser-Ney trie with full normalization proof
online_logit_bias.py — Online logit bias implementation + synthetic test
correct_meta.npz — Corrected Scylla byte accounting
retokenize_proper.py — Proper retokenization with official train/val split
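Since two of the results above hinge on byte accounting, here is a minimal sketch of the conversion involved. The function name and variables are hypothetical; the point is that BPP comparisons across tokenizers are only valid when the denominator is the byte length of the original text, never the token count.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed token NLL (in nats) to bits per byte.

    Dividing by the byte length of the original text, rather than the
    number of tokens, is what makes models with different tokenizers
    (e.g. Scylla vs SP1024) comparable on the same corpus.
    """
    return total_nll_nats / (math.log(2) * total_bytes)

# Same text under two tokenizers: the byte denominator must not change,
# even though the token counts (and per-token NLLs) differ.
text = "hello world"
n_bytes = len(text.encode("utf-8"))
nll_a = 24.0  # hypothetical summed NLL under tokenizer A, in nats
bpp_a = bits_per_byte(nll_a, n_bytes)
```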