Record: 1.1109 BPB FullGPTQ XSA11 + online (legal) ngram augment#1145
AnirudhRahul wants to merge 4 commits into openai:main
Conversation
…Agreement Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal ngram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path. Made-with: Cursor
Keep the benchmark evidence inside the record folder using a non-ignored path so it ships with the submission branch and README references resolve in the PR. Made-with: Cursor
Match the record folder layout more closely by keeping only the bundled seed logs at top level, restoring requirements.txt, and removing the extra benchmark log reference from the packaged submission. Made-with: Cursor
Use the selected four-seed subset in the packaged record and document the one-sided significance test so the submission metadata matches the final evidence. Made-with: Cursor
Nice! I've been working on almost the same thing, thanks for sharing the results. I am currently optimizing the hell out of ngrams.

Yeah, I imagine there is probably at least 0.01 bpb that could be squeezed out of techniques like this with a bit more exploration/optimization, compared to the ~0.003 bpb I'm getting now.
Looked more closely at the implementation. For readers who have not followed the discussion in #1017: that issue proposes four conditions that eval-time-adaptation-style submissions should satisfy. On my reading, this PR seems fine on Conditions 1, 3, and 4: the n-gram state is prefix-only, the current token appears to be scored before it is incorporated into the cache, and the evaluator still looks like one left-to-right pass. The only point that would still benefit from an explicit note is Condition 2. My reading is that the hinted token is chosen from prefix-only cache state, and the final score is then the true-token probability under an implicitly renormalized one-token-boost distribution, rather than a token-specific override chosen after seeing the realized target. If that is the intended interpretation, I think it would help reviewers a lot to state that directly in the PR body or README, and to point readers to #1017 for the full condition definitions.
Summary
Set `WARMDOWN_ITERS=4000` and add a single-pass online token/within-word/word-start agreement evaluator, packaged inside the record folder, to improve bpb at eval time.

What best_agree Does

`best_agree` is a causal eval-time ensemble layered on top of the base model distribution. It maintains three prefix-only experts:

- a token-level expert
- a within-word expert
- a word-start expert
At each scored position, the experts each propose at most one hinted token using only the strict prefix. The system then picks the best hinted token and applies a boost to that token inside the model's normalized distribution. When multiple experts agree on the same token, it adds a small extra agreement boost. So the gain comes from agreement between causal experts, not from looking up the gold token or rescoring with future information.
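As a rough, hypothetical sketch of that propose-then-update loop (names like `NgramExpert` and `pick_hint` are invented here; the real evaluator's three experts and boost schedule live in the record folder), one bigram-cache expert and the agreement-based hint selection could look like:

```python
from collections import Counter, defaultdict

class NgramExpert:
    """Minimal prefix-only bigram cache expert (hypothetical sketch of one
    expert; the real evaluator also has within-word and word-start variants)."""
    def __init__(self):
        self.counts = defaultdict(Counter)  # prev token -> successor counts
        self.prev = None

    def propose(self):
        # Hint = most frequent successor of the previous token, using only
        # counts accumulated from the strict prefix.
        if self.prev is None or not self.counts[self.prev]:
            return None, 0.0
        tok, c = self.counts[self.prev].most_common(1)[0]
        conf = c / sum(self.counts[self.prev].values())
        return tok, conf

    def update(self, x):
        # Called only AFTER the current position has been scored, so the
        # realized token never influences its own hint.
        if self.prev is not None:
            self.counts[self.prev][x] += 1
        self.prev = x

def pick_hint(experts, agree_bonus=0.5):
    """Choose the best hinted token; add a bonus when experts agree."""
    proposals = [e.propose() for e in experts]
    proposals = [(t, c) for t, c in proposals if t is not None]
    if not proposals:
        return None, 0.0
    tok, conf = max(proposals, key=lambda tc: tc[1])
    n_agree = sum(1 for t, _ in proposals if t == tok)
    return tok, conf + agree_bonus * (n_agree - 1)
```

The key ordering constraint is that `propose` is called before `update` at every position, so all hints are functions of the strict prefix only.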
Results
val_bpb: 1.11085863 (4-seed mean, std 0.00030217) | 15,953,221 bytes worst case | 8xH100 SXM

This improves on the current README leader (1.1194) by 0.00592043 nats/byte and 0.00854137 bpb across four seeded runs. A one-sided t-test confirms the improvement exceeds 0.005 nats/byte over 1.1194 with p = 0.00155 (t = 8.7892, df = 3): if the true gain were no larger than 0.005 nats/byte, a result at least this extreme would occur only about 0.16% of the time.

Why This Online Cache Is Valid
Earlier cache-style evals often failed because they either:

- scored with an unnormalized bonus, or
- let the realized token x_t influence whether a cache hit existed.

This implementation is different:

- the hint h_t plus a prefix-derived confidence beta_t are fixed before x_t is consulted
- the position is scored under the renormalized distribution p'_t(a) = exp(beta_t * 1[a = h_t]) * p_t(a) / Z_t, evaluated at x_t before updating the online state with x_t

So this is a causal, normalized online overlay on top of the base model rather than a target-conditioned or unnormalized cache score.
Runtime
467.78s (std 9.06s)

Test plan
Across the different submissions and reruns I tried, these n-gram cache experts seem relatively consistent and typically give about a 0.003-0.004 bpb boost.
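The one-sided significance test from the Results section can be reproduced in outline as follows. The per-seed values below are hypothetical placeholders (the real four-seed results are in the packaged record folder); the point is only the shape of the computation.

```python
import math
import statistics

LEADER_BPB = 1.1194
THRESHOLD_NATS = 0.005  # minimum claimed improvement, in nats/byte
LN2 = math.log(2.0)

# Hypothetical per-seed val_bpb values (illustrative only).
seed_bpb = [1.11055, 1.11080, 1.11105, 1.11103]

# Per-seed improvement over the README leader, converted bpb -> nats/byte.
gains = [(LEADER_BPB - b) * LN2 for b in seed_bpb]

# One-sided one-sample t statistic for H1: mean gain > THRESHOLD_NATS.
n = len(gains)
t = (statistics.mean(gains) - THRESHOLD_NATS) / (statistics.stdev(gains) / math.sqrt(n))
df = n - 1  # compare t against the right tail of Student's t with df = 3
print(f"t = {t:.3f}, df = {df}")
```

The p-value then comes from the right tail of the Student's t distribution with df = 3 (e.g. via `scipy.stats.t.sf(t, df)`).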