Partition Function Inflation: Why Hashed N-Gram Caches Produce Invalid BPB Scores (Non-Record, Analytical)#1147
Shows Z* > 1000 at 1M buckets under Dirichlet-multinomial scoring. Full normalization (λ = 1) yields 4.10 BPB, far worse than the 1.130 neural baseline. 115+ experimental configurations across 8 experiment types.
Links to full paper and data at the public repo.
A few clarifications up front, since I expect the same questions to come up in review:
Summary
This is a non-record analytical submission (following the precedent of PR #363) providing the mathematical explanation for why hashed n-gram caches appear to dramatically improve BPB but actually produce invalid probability distributions.
We derive the partition function Z of the recursive Dirichlet-multinomial scoring function used by cache-augmented submissions (#777, #796, #900, #986, and others). The key result:
Z* = (n_c + V · L) / (n_c + L)
where V = 1024 (vocab size), L = N/B (load factor), and n_c is the true context count. At 1M buckets (L ≈ 59), Z* exceeds 1,000 — meaning the scoring function's outputs sum to ~1,000 instead of 1.
A partial normalization sweep (dividing by Z^λ for λ ∈ [0, 1]) directly measures the penalty: at λ = 1 (full normalization), 1M buckets scores 4.10 BPB — far above the 1.130 neural baseline. Even 25% normalization (λ ≈ 0.25) erases the apparent gain.
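The closed form above is easy to sanity-check numerically. A minimal sketch, with illustrative context counts (not values taken from the experimental CSVs):

```python
V = 1024  # vocab size, as defined above

def z_star(n_c, load):
    # Z* = (n_c + V * L) / (n_c + L)
    return (n_c + V * load) / (n_c + load)

# At 1M buckets the load factor is L = N/B ~ 59. For an unseen
# context (n_c = 0) the formula collapses to Z* = V = 1024,
# consistent with the claim that Z* exceeds 1,000.
print(z_star(0, 59.0))     # 1024.0
print(z_star(1000, 59.0))  # inflation shrinks as n_c grows, but stays >> 1
```

Note the two limits: n_c → 0 gives Z* → V, and n_c → ∞ gives Z* → 1, so inflation is worst exactly where smoothing matters most.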
Key Findings
Monotonic bucket ranking explained: Smaller buckets → higher L → larger Z* → more inflation → lower (invalid) BPB. Confirmed across 36 configurations (4 bucket sizes × 9 concentration values) and 27 per-order concentration profiles.
Random collisions = real collisions: Rate-matched random collision partners replicate 99.4% of the observed BPB gap (residual gap of 0.009 BPB on a 1.619 BPB range). Across 8 independent remap seeds, std = 0.00004 BPB. The mechanism is count-inflation magnitude, not collision-partner structure.
Synthetic floor: Adding uniform fake counts to clean 64M-bucket tables achieves 0.020 BPB — beating all real cache configurations — despite zero linguistic content.
Normalization eliminates apparent gains: Under post-hoc normalization (λ = 1), all four tested bucket sizes produce BPB ≈ 4.1, all far above the 1.130 neural baseline. Step-wise normalization yields 4.15 BPB at 1M buckets. PR #978 ("Rerun of #972 with actual full-vocab normalization", AnirudhRahul) independently confirmed this: normalization degraded performance from 0.39 to 1.51 BPB.
Partial normalization sweep: BPB increases linearly with λ (R² > 0.9999 across 7 measured points). Slope ≈ 3.94 BPB/unit λ at 1M. Validated at a second bucket size (64M, λ = 0.5).
Z diagnostic: The recurrence Z_n = (S_n + α · Z_{n-1}) / (c_n + α) provides a check. If Z ≫ 1, unnormalized scores are unreliable. Z* > 1 is unavoidable for any L > 0 and V > 1.
Diagnostic Code
The submission includes diagnostic tooling, toggled via environment variables, for computing Z on any cache configuration.
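As a sketch of what such a check can look like, here is the recurrence from the Key Findings evaluated bottom-up from the order-0 base case. The α value and the counts are made-up illustrations; in a collision-free table S_n = c_n at every order and Z stays at 1:

```python
def partition_z(s_n, c_n, alpha, z_prev):
    # Z_n = (S_n + alpha * Z_{n-1}) / (c_n + alpha)
    # s_n: bucketed count mass attributed to this context's bucket;
    # c_n: the true context count. Hash collisions push s_n above c_n.
    return (s_n + alpha * z_prev) / (c_n + alpha)

z = 1.0  # order-0 backoff is a proper distribution, so Z_0 = 1
for s_n, c_n in [(240, 4), (60, 1)]:  # illustrative collision-inflated counts
    z = partition_z(s_n, c_n, alpha=0.5, z_prev=z)

print(z)  # Z >> 1 flags that unnormalized scores are unreliable
```

With uninflated counts (s_n = c_n) the same loop returns exactly 1.0, which is the property the diagnostic exploits.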
Context
I contributed the Dirichlet smoothing framework to the competition (PR #796, March 26, 2026) and several cache submissions (#777, #796, #900), all of which were closed along with the other cache-based entries in the late March ruling. This analysis grew out of investigating why my own submissions produced scores that were too good to be true. PR #978 (AnirudhRahul) independently identified the same normalization gap; PR #1114 (minh-stakc) credits PR #900 for "Dirichlet posterior mixing theory."
Experimental Data (115+ configurations)
Full paper (24 pages), all CSVs, and reproducible figure generation: https://github.com/Robby955/partition-function-inflation
Score
Not applicable — this is an analytical contribution. The neural-only baseline achieves 1.130 BPB. All cache configurations perform worse after normalization.