Record: 1.1109 BPB FullGPTQ XSA11 + online (legal) ngram augment#1145
AnirudhRahul wants to merge 4 commits into openai:main
Conversation
…Agreement Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal ngram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path. Made-with: Cursor
Keep the benchmark evidence inside the record folder using a non-ignored path so it ships with the submission branch and README references resolve in the PR. Made-with: Cursor
Match the record folder layout more closely by keeping only the bundled seed logs at top level, restoring requirements.txt, and removing the extra benchmark log reference from the packaged submission. Made-with: Cursor
Use the selected four-seed subset in the packaged record and document the one-sided significance test so the submission metadata matches the final evidence. Made-with: Cursor
Nice! I've been working on almost the same thing, thanks for sharing the results. I am currently optimizing the hell out of ngrams.

Yeah, I imagine there is probably at least 0.01 bpb that could be squeezed out of techniques like this with a bit more exploration/optimization, compared to the ~0.003 bpb I'm getting now.
Looked more closely at the implementation. For readers who have not followed the discussion in #1017: that issue proposes four conditions that eval-time-adaptation-style submissions should satisfy. On my reading, this PR seems fine on Conditions 1, 3, and 4: the n-gram state is prefix-only, the current token appears to be scored before it is incorporated into the cache, and the evaluator still looks like one left-to-right pass. The only point that would still benefit from an explicit note is Condition 2. My reading is that the hinted token is chosen from prefix-only cache state, and the final score is then the true-token probability under an implicitly renormalized one-token-boost distribution, rather than a token-specific override chosen after seeing the realized target. If that is the intended interpretation, I think it would help reviewers a lot to state that directly in the PR body or README, and to point readers to #1017 for the full condition definitions.
Summary
Set `WARMDOWN_ITERS=4000` and add a single-pass online token/within-word/word-start agreement evaluator, packaged inside the record folder, to improve bpb at eval time.

What best_agree Does

`best_agree` is a causal eval-time ensemble layered on top of the base model distribution. It maintains three prefix-only experts:

- a token-level expert
- a within-word expert
- a word-start expert
At each scored position, the experts each propose at most one hinted token using only the strict prefix. The system then picks the best hinted token and applies a boost to that token inside the model's normalized distribution. When multiple experts agree on the same token, it adds a small extra agreement boost. So the gain comes from agreement between causal experts, not from looking up the gold token or rescoring with future information.
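As a rough, hypothetical sketch of that propose-then-update loop (names like `NgramExpert` and `pick_hint` are invented here; the real evaluator's three experts and boost schedule live in the record folder), one bigram-cache expert and the agreement-based hint selection could look like:

```python
from collections import Counter, defaultdict

class NgramExpert:
    """Minimal prefix-only bigram cache expert (hypothetical sketch of one
    expert; the real evaluator also has within-word and word-start variants)."""
    def __init__(self):
        self.counts = defaultdict(Counter)  # prev token -> successor counts
        self.prev = None

    def propose(self):
        # Hint = most frequent successor of the previous token, using only
        # counts accumulated from the strict prefix.
        if self.prev is None or not self.counts[self.prev]:
            return None, 0.0
        tok, c = self.counts[self.prev].most_common(1)[0]
        conf = c / sum(self.counts[self.prev].values())
        return tok, conf

    def update(self, x):
        # Called only AFTER the current position has been scored, so the
        # realized token never influences its own hint.
        if self.prev is not None:
            self.counts[self.prev][x] += 1
        self.prev = x

def pick_hint(experts, agree_bonus=0.5):
    """Choose the best hinted token; add a bonus when experts agree."""
    proposals = [e.propose() for e in experts]
    proposals = [(t, c) for t, c in proposals if t is not None]
    if not proposals:
        return None, 0.0
    tok, conf = max(proposals, key=lambda tc: tc[1])
    n_agree = sum(1 for t, _ in proposals if t == tok)
    return tok, conf + agree_bonus * (n_agree - 1)
```

The key ordering constraint is that `propose` is called before `update` at every position, so all hints are functions of the strict prefix only.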
Results
val_bpb: 1.11085863 (4-seed mean, std 0.00030217) | 15,953,221 bytes worst case | 8xH100 SXM

This improves on the current README leader (1.1194) by 0.00592043 nats/byte and 0.00854137 bpb across four seeded runs. A one-sided t-test confirms the improvement exceeds 0.005 nats/byte over 1.1194 with p = 0.00155 (t = 8.7892, df = 3): if the true gain were no larger than 0.005 nats/byte, a result at least this extreme would occur only about 0.16% of the time.

Why This Online Cache Is Valid
Earlier cache-style evals often failed because they either:

- scored with an unnormalized bonus, or
- let the realized token x_t influence whether a cache hit existed.

This implementation is different:

- the hint h_t plus a prefix-derived confidence beta_t are fixed before x_t is consulted
- the position is scored under the renormalized distribution p'_t(a) = exp(beta_t * 1[a = h_t]) * p_t(a) / Z_t, evaluated at x_t before updating the online state with x_t

So this is a causal, normalized online overlay on top of the base model rather than a target-conditioned or unnormalized cache score.
Runtime
467.78s (std 9.06s)

Test plan
Across the different submissions and reruns I tried, these n-gram cache experts seem relatively consistent and typically give about a 0.003-0.004 bpb boost.
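The one-sided significance test from the Results section can be reproduced in outline as follows. The per-seed values below are hypothetical placeholders (the real four-seed results are in the packaged record folder); the point is only the shape of the computation.

```python
import math
import statistics

LEADER_BPB = 1.1194
THRESHOLD_NATS = 0.005  # minimum claimed improvement, in nats/byte
LN2 = math.log(2.0)

# Hypothetical per-seed val_bpb values (illustrative only).
seed_bpb = [1.11055, 1.11080, 1.11105, 1.11103]

# Per-seed improvement over the README leader, converted bpb -> nats/byte.
gains = [(LEADER_BPB - b) * LN2 for b in seed_bpb]

# One-sided one-sample t statistic for H1: mean gain > THRESHOLD_NATS.
n = len(gains)
t = (statistics.mean(gains) - THRESHOLD_NATS) / (statistics.stdev(gains) / math.sqrt(n))
df = n - 1  # compare t against the right tail of Student's t with df = 3
print(f"t = {t:.3f}, df = {df}")
```

The p-value then comes from the right tail of the Student's t distribution with df = 3 (e.g. via `scipy.stats.t.sf(t, df)`).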