Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized) by immartian · Pull Request #1024 · openai/parameter-golf

immartian · 2026-03-28T14:30:03Z

Evidence-Aware Dirichlet Concentration for N-gram Mixing

The problem

Hierarchical Dirichlet CTW mixing uses a fixed concentration c for all contexts:

p = (c * p_backup + count) / (c + ctx_count)

But contexts with 10,000 observations are much more reliable than contexts with 3 observations. Fixed c treats them the same.

The fix (one line, properly normalized)

c_eff = c_base / (1 + beta * log1p(ctx_count))

Critical: c_eff depends ONLY on ctx_count (total context observations), NOT on the target token's count. This ensures probabilities sum to 1 across all possible tokens. Using target-dependent concentration (e.g. fc/cc) breaks normalization — the same bug that invalidated many n-gram cache submissions (see #677).

How it works

              ctx_count    c_eff (b=10)    behavior
              ---------    ------------    --------
New context          3         0.083       ~fixed c (smooth toward backup)
Moderate           100         0.021       trust counts somewhat
Well-seen       10,000         0.011       trust counts strongly

More evidence -> lower concentration -> trust the observed counts.

Results on real FineWeb data

Causal scoring (no training pre-fill), 100K validation tokens:

Method	All (bpt)	Late 50K+ (bpt)	Valid?
Fixed c=0.5	2.4023	0.6525	yes
Fixed c=0.1	2.3047	0.5763	yes
Fixed c=0.05	2.2928	0.5684	yes
Evidence c=0.1 b=10	2.2840	0.5630	yes
~~Certainty c=0.1 b=10~~	~~2.2838~~	~~0.5623~~	NO (target-dependent)

+0.38% overall, +0.94% on warmer cache vs best fixed concentration.

The improvement is small but honest:

Properly normalized (target-independent concentration)
Validated on real FineWeb data
Consistent across multiple hyperparameter settings

What we learned (honest accounting)

Our initial IDF-based "binding energy" signal did NOT help on real data — token rarity is not the right signal for concentration
Prediction certainty (fc/cc) helped dramatically (+35% on synthetic) but turns out to break normalization — same bug as PR Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean) #986
The valid signal is simply evidence quantity: log(ctx_count). More observations = more reliable = lower concentration

The theoretically correct self-model is weaker than we hoped: "I know how much evidence I have" rather than "I know how informative my context is." But it's the claim that actually holds.

Files

File	Description
binding_ctw.py	Evidence-aware CTW module (properly normalized)
test/test_binding_ctw.py	19 tests
test/proof_fineweb_causal.py	FineWeb benchmark script
test/proof_binding_beats_fixed.py	Synthetic benchmark

Test plan

19 unit tests passing
Synthetic benchmark (35% improvement, but uses invalid certainty signal)
FineWeb benchmark (+0.38% with valid evidence-only formula)
Normalization verified (c_eff is target-independent)

One-line change to hierarchical Dirichlet CTW mixing: c_eff = c_base / (1 + β × log(ctx_count) × avg_idf(context)) Instead of fixed c=5.0 for all contexts, adapt concentration based on evidence strength (ctx_count) and context specificity (IDF): - High counts + rare context → low c → trust n-gram counts - Low counts + common context → c ≈ c_base → smooth toward backup Results (synthetic two-regime corpus, 200K tokens): Fixed CTW (c=5.0): 1.0511 bits/token Binding CTW (c=c(B)): 0.6868 bits/token (35% better) Wins on both regimes: Rare deterministic: 0.976 vs 1.519 (+0.543 bpt) Common ambiguous: 0.720 vs 1.087 (+0.366 bpt) 19 tests + reproducible proof script included.

immartian · 2026-03-28T19:37:08Z

Note on normalization: We're aware of the ongoing discussion in #677 regarding n-gram cache normalization issues. Our binding CTW module (binding_ctw.py) is agnostic to the normalization backend — the evidence-aware concentration formula works with any properly normalized probability computation, not just hash-based caches.

The 35% improvement shown here compares fixed vs adaptive concentration under identical conditions (same cache, same normalization). The relative improvement should transfer to any valid n-gram scoring method.

We're happy to adapt the implementation to whatever evaluation rules the maintainers settle on.

…nce) The certainty-based formula (fc/cc) created target-dependent concentration, which breaks probability normalization — the same bug that invalidated PR openai#986's n-gram caches. Fixed formula: c_eff = c_base / (1 + beta * log1p(ctx_count)) This depends ONLY on ctx_count, identical for all possible next tokens. Validated on real FineWeb data (causal, no training pre-fill): Best fixed (c=0.05): 2.2928 bpt Evidence-aware (c=0.1 b=10): 2.2840 bpt (+0.38%) Late positions: 0.5630 vs 0.5684 (+0.94%) Small but honest improvement, properly normalized.

immartian changed the title ~~Evidence-aware Dirichlet concentration — 35% improvement over fixed c=5.0~~ Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 Mar 28, 2026

immartian changed the title ~~Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0~~ Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized) Mar 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized)#1024

Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized)#1024
immartian wants to merge 2 commits intoopenai:mainfrom
immartian:binding-ctw-improvement

immartian commented Mar 28, 2026 •

edited

Loading

Uh oh!

immartian commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

immartian commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Evidence-Aware Dirichlet Concentration for N-gram Mixing

The problem

The fix (one line, properly normalized)

How it works

Results on real FineWeb data

What we learned (honest accounting)

Files

Test plan

Uh oh!

immartian commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

immartian commented Mar 28, 2026 •

edited

Loading