Skip to content

Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized)#1024

Open
immartian wants to merge 2 commits intoopenai:mainfrom
immartian:binding-ctw-improvement
Open

Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized)#1024
immartian wants to merge 2 commits intoopenai:mainfrom
immartian:binding-ctw-improvement

Conversation

@immartian
Copy link
Copy Markdown

@immartian immartian commented Mar 28, 2026

Evidence-Aware Dirichlet Concentration for N-gram Mixing

The problem

Hierarchical Dirichlet CTW mixing uses a fixed concentration c for all contexts:

p = (c * p_backup + count) / (c + ctx_count)

But contexts with 10,000 observations are much more reliable than contexts with 3 observations. Fixed c treats them the same.

The fix (one line, properly normalized)

c_eff = c_base / (1 + beta * log1p(ctx_count))

Critical: c_eff depends ONLY on ctx_count (total context observations), NOT on the target token's count. This ensures probabilities sum to 1 across all possible tokens. Using target-dependent concentration (e.g. fc/cc) breaks normalization — the same bug that invalidated many n-gram cache submissions (see #677).

How it works

              ctx_count    c_eff (b=10)    behavior
              ---------    ------------    --------
New context          3         0.083       ~fixed c (smooth toward backup)
Moderate           100         0.021       trust counts somewhat
Well-seen       10,000         0.011       trust counts strongly

More evidence -> lower concentration -> trust the observed counts.

Results on real FineWeb data

Causal scoring (no training pre-fill), 100K validation tokens:

Method All (bpt) Late 50K+ (bpt) Valid?
Fixed c=0.5 2.4023 0.6525 yes
Fixed c=0.1 2.3047 0.5763 yes
Fixed c=0.05 2.2928 0.5684 yes
Evidence c=0.1 b=10 2.2840 0.5630 yes
Certainty c=0.1 b=10 2.2838 0.5623 NO (target-dependent)

+0.38% overall, +0.94% on warmer cache vs best fixed concentration.

The improvement is small but honest:

  • Properly normalized (target-independent concentration)
  • Validated on real FineWeb data
  • Consistent across multiple hyperparameter settings

What we learned (honest accounting)

  1. Our initial IDF-based "binding energy" signal did NOT help on real data — token rarity is not the right signal for concentration
  2. Prediction certainty (fc/cc) helped dramatically (+35% on synthetic) but turns out to break normalization — same bug as PR Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean) #986
  3. The valid signal is simply evidence quantity: log(ctx_count). More observations = more reliable = lower concentration

The theoretically correct self-model is weaker than we hoped: "I know how much evidence I have" rather than "I know how informative my context is." But it's the claim that actually holds.

Files

File Description
binding_ctw.py Evidence-aware CTW module (properly normalized)
test/test_binding_ctw.py 19 tests
test/proof_fineweb_causal.py FineWeb benchmark script
test/proof_binding_beats_fixed.py Synthetic benchmark

Test plan

  • 19 unit tests passing
  • Synthetic benchmark (35% improvement, but uses invalid certainty signal)
  • FineWeb benchmark (+0.38% with valid evidence-only formula)
  • Normalization verified (c_eff is target-independent)

One-line change to hierarchical Dirichlet CTW mixing:
  c_eff = c_base / (1 + β × log(ctx_count) × avg_idf(context))

Instead of fixed c=5.0 for all contexts, adapt concentration based on
evidence strength (ctx_count) and context specificity (IDF):
  - High counts + rare context → low c → trust n-gram counts
  - Low counts + common context → c ≈ c_base → smooth toward backup

Results (synthetic two-regime corpus, 200K tokens):
  Fixed CTW (c=5.0):    1.0511 bits/token
  Binding CTW (c=c(B)): 0.6868 bits/token  (35% better)

Wins on both regimes:
  Rare deterministic:  0.976 vs 1.519 (+0.543 bpt)
  Common ambiguous:    0.720 vs 1.087 (+0.366 bpt)

19 tests + reproducible proof script included.
@immartian immartian changed the title Evidence-aware Dirichlet concentration — 35% improvement over fixed c=5.0 Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 Mar 28, 2026
@immartian
Copy link
Copy Markdown
Author

Note on normalization: We're aware of the ongoing discussion in #677 regarding n-gram cache normalization issues. Our binding CTW module (binding_ctw.py) is agnostic to the normalization backend — the evidence-aware concentration formula works with any properly normalized probability computation, not just hash-based caches.

The 35% improvement shown here compares fixed vs adaptive concentration under identical conditions (same cache, same normalization). The relative improvement should transfer to any valid n-gram scoring method.

We're happy to adapt the implementation to whatever evaluation rules the maintainers settle on.

…nce)

The certainty-based formula (fc/cc) created target-dependent concentration,
which breaks probability normalization — the same bug that invalidated
PR openai#986's n-gram caches.

Fixed formula: c_eff = c_base / (1 + beta * log1p(ctx_count))
This depends ONLY on ctx_count, identical for all possible next tokens.

Validated on real FineWeb data (causal, no training pre-fill):
  Best fixed (c=0.05):         2.2928 bpt
  Evidence-aware (c=0.1 b=10): 2.2840 bpt (+0.38%)
  Late positions:              0.5630 vs 0.5684 (+0.94%)

Small but honest improvement, properly normalized.
@immartian immartian changed the title Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized) Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant