Record: Custom Casefold Tokenizer — 1.0668 BPB#1578
mikeapedia wants to merge 1 commit into openai:main
Conversation
Casefold v2 vocabulary on top of PR openai#1529's parallel-residuals architecture. Eliminates case-duplicate tokens (21.1% of the SP8192 vocab) and refills with BPB-optimized subwords for 10.38% better compression. Byte counting verified correct on 15.4M FineWeb docs (0 mismatches).
@msisovic I didn't try to get your custom kernel to work, but wanted to tag you to (1) say thanks for PR 1529, and (2) say I'm out of compute credits, so I'm not going after the record with custom kernels. It's all yours if you want to go for it!
…0639 Systems-level optimizations (fused Muon, EMA foreach, loader prealloc) on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals. Identical ML; faster step time yields extra training steps. 3-seed mean: 1.0639 BPB / 3.0705 nats. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I think there is one important rules/comparability question here that needs maintainer clarification. The PR says the full FineWeb train and val sets were retokenized after applying case folding. My concern is that this looks like changing the benchmark text itself, not just using a different tokenizer on the same validation set. The README describes the challenge as compression on the FineWeb validation set, tokenizer-agnostic via BPB, and I am not sure that lowercasing / case-folding the validation corpus is comparable to standard submissions on the original text. Can maintainers clarify whether this kind of validation-text normalization is allowed? If yes, it would be helpful to state explicitly that submissions may transform the validation text itself (for example via case folding) as long as byte accounting is then done on the transformed text. If not, I think this PR would need to be treated differently from a tokenizer-only change.
Disclosure: I have a dependent submission (PR #1585, a systems optimization on top of this PR). I believe the rules support the validity of the approach in PR #1578. Would appreciate a maintainer ruling here.
@dexhunter - Great question! I investigated this exact concern during development. The short answer: every SentencePiece submission already transforms the validation text before byte counting, in exactly the same way.

OpenAI's baseline `train_gpt.py` counts bytes via `build_sentencepiece_luts()` (line 180), which calls `sp.id_to_piece(token_id)` and measures `len(piece.encode("utf-8"))`. These pieces come from the tokenizer's internal vocabulary, which reflects text after SentencePiece's `nmt_nfkc` normalization. SP's normalizer silently applies NFKC decomposition (½ → 1/2, ² → 2), which changes byte counts.

My case folding adds `.lower()` after NFKC: one additional normalization step in the same pipeline. The byte counting mechanism is identical: a LUT built from `sp.id_to_piece()`, same as every other submission. I verified this produces exact byte matches on all 15.4M FineWeb documents.

The competition explicitly allows custom tokenizers (Issues #43, #897), and the rules state: "Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set."

That said, I agree this is worth explicit maintainer clarification, since the normalization question applies to ALL submissions, not just mine. Happy to discuss further.
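For concreteness, here is a minimal stdlib-only sketch of the LUT-style byte accounting described above; `piece_byte_len` and the space-marker handling are illustrative assumptions, not the baseline's actual code:

```python
import unicodedata

SP_SPACE = "\u2581"  # SentencePiece's word-boundary marker ("▁")

def piece_byte_len(piece: str) -> int:
    # Hypothetical LUT entry: map the SP marker back to a plain space,
    # then count UTF-8 bytes, mirroring len(piece.encode("utf-8")).
    return len(piece.replace(SP_SPACE, " ").encode("utf-8"))

def normalize(text: str, casefold: bool = False) -> str:
    # Baseline pieces reflect NFKC (as in SP's nmt_nfkc);
    # this PR adds .lower() as one extra step on top.
    out = unicodedata.normalize("NFKC", text)
    return out.lower() if casefold else out

# NFKC alone already changes byte counts:
assert normalize("\u00b2") == "2"              # "²" (2 bytes) -> "2" (1 byte)
assert len(normalize("\u00bd").encode()) == 5  # "½" (2 bytes) -> "1⁄2" (5 bytes)
# Case folding is one more many-to-one step in the same pipeline:
assert normalize("The", casefold=True) == "the"
assert piece_byte_len("\u2581the") == 4        # "▁the" counts as " the"
```

The point of the sketch is that byte accounting is always done against the normalized pieces, whichever normalization produced them.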
Echoing @dexhunter, I think this normalization is problematic. As mentioned, standard tokenization does apply some normalization, but there are clearly degenerate normalizations that destroy information. As an extreme example, I could "normalize" the sequence with `lambda char: "a"`, destroying all information. Similarly, but much less extreme, making everything lowercase makes the prediction task easier.

As some napkin math to show this, take the baseline. From your readme, the baseline has tok/byte = 0.268 and ~40M validation tokens, so ~149M validation bytes. At BPB = 1.0784, that's ~4.025 bits/token. Let's estimate 5% of tokens are at uppercase positions, so ~2M tokens. Now suppose casefolding doesn't change the probability of lowercase tokens (conservative), but doubles the probability of the correct token at each of the uppercase positions (halving uncertainty because the model no longer has to guess between upper/lowercase, albeit an aggressive estimate). Since doubling the probability saves 1 bit per token, that's ~2M bits saved over ~149M bytes, or a BPB reduction of ~0.013.

This is very similar to the reported improvement here, and I think it is where the gain is coming from. So, yes, BPB is lossless-compression agnostic, but this PR makes the target distribution different / easier to model.
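The napkin math above can be checked directly; the 5% uppercase share and the probability-doubling are the comment's own stated assumptions, not measurements:

```python
tokens = 40e6                   # ~40M validation tokens (from the readme)
tok_per_byte = 0.268            # baseline tokens/byte
bytes_total = tokens / tok_per_byte        # ~149M validation bytes
upper_frac = 0.05                          # assumed share of uppercase positions
bits_saved_per_token = 1.0                 # doubling p(correct) saves log2(2) = 1 bit
delta_bpb = (upper_frac * tokens * bits_saved_per_token) / bytes_total
print(f"{delta_bpb:.4f}")  # 0.0134
```

Note the token count cancels out: the estimate reduces to `upper_frac * tok_per_byte * bits_saved_per_token`.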
Hey @samacqua - I agree the math is directionally right. Case folding does reduce prediction entropy, and the estimate (-0.013) is in the right ballpark, though as you noted the "doubling probability" assumption is aggressive. If we assume the model already predicts case correctly 80-90% of the time from context (sentence-initial caps, proper nouns, and acronyms are all highly predictable), it would give -0.004 to -0.010.

But NFKC does the exact same thing. NFKC collapses 4,964 distinct Unicode codepoints into other representations, including 662 many-to-one mappings where the model no longer has to distinguish between alternatives. For example, 20 different Unicode codepoints all normalize to D (fullwidth, mathematical bold/italic/script, double-struck, etc.). SP also collapses whitespace variants and converts newlines. Each of these reduces prediction entropy by eliminating distinctions the model would otherwise need to predict. The difference is degree, not kind.

The "degenerate normalization" argument cuts both ways. Yes, `lambda char: "a"` is an extreme that destroys all information. But NFKC already sits on the same spectrum: it destroys the distinction between the ligature ﬁ and fi, between ½ and 1⁄2, between ² and 2. The competition allows these normalizations because they collapse semantically equivalent representations. Case folding does the same: "The" and "the" carry the same semantic content; the case distinction is positional/conventional, not semantic.

That said, I acknowledge this is a rules question, not a technical one, so I look forward to the maintainers weighing in.
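The many-to-one behavior is easy to verify with the stdlib; the codepoints below are just a few of the variants NFKC folds into plain "D" (and "fi"), not an exhaustive list:

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# A few of the codepoints that all collapse to "D" under NFKC:
variants = [
    "\uFF24",      # fullwidth D
    "\U0001D403",  # mathematical bold capital D
    "\U0001D49F",  # mathematical script capital D
    "\U0001D53B",  # double-struck capital D
]
assert all(nfkc(c) == "D" for c in variants)

# The ligature example: NFKC folds "ﬁ" into the two letters "fi".
assert nfkc("\uFB01") == "fi"

# Case folding is the analogous extra step this PR adds on top:
assert nfkc("\uFF24").lower() == "d"
```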
…ai#1586 per-layer GPTQ highest-EV
- PR openai#758 n-gram effectively dead: MatoTeziTanka (Apr 12) flagged XOR hash includes target token, same illegality as openai#727/openai#741
- GDN-Hybrid BPB bug confirmed: PR openai#1576 space-token double-count inflates denominator ~14%; actual score ~1.16-1.18, not 1.01671
- PR openai#1586 (dexhunter, 1.07493): Per-Layer Adaptive GPTQ MLP=12σ/Attn=13σ + int7 Emb (saves 530KB) + MLR=0.026; -0.0127 nats vs SOTA; implement now
- PR openai#1584: systems-only (fused Muon, batched EMA, loader prealloc) ~+20 steps
- Casefold Tokenizer (openai#1578/openai#1585): legality debated; await organizer ruling
- New paper: arXiv:2604.06169 In-Place TTT (Apr 7) NTP-aligned score-first TTT
- Merged SOTA 1.0810 unchanged (4-day stable streak); target ≤1.0760; 17 days

https://claude.ai/code/session_01BE8wc8zxvZAo52QBXSNiL8
@0hq, @valerio-oai, & @cocohearts - I know you are all incredibly busy, but when you have time I would very much appreciate a review of the legality of casefolding.
Hey @mikeapedia - I think there's a clear difference between NFKC normalization and case folding, although the line for valid normalization will always be somewhat arbitrary. Ultimately the point of this competition is to model language accurately, and capitalization, to me, seems a core part of accurately modeling the syntax and semantics of a language. First, I think it's incorrect to assert that capitalization has nothing to do with semantic meaning: for example, any proper noun that is no longer capitalized loses semantic meaning. Second, it seems intuitively incorrect to me that accurately modeling language is only about semantic meaning, as being syntactically accurate is also part of accurately modeling a language.
Hey @romeerp - I appreciate the thoughtful response, and I actually agree with your core argument: accurately modeling language involves both semantics and syntax, and capitalization carries syntactic (and sometimes semantic) signal. The proper noun example is a good one.

But I think this argument proves too much. If the standard is that normalization shouldn't discard syntactically or semantically meaningful distinctions, then NFKC shouldn't be allowed either. NFKC collapses 4,964 distinct Unicode codepoints, including distinctions that are both syntactically and semantically meaningful (e.g. ligatures, mathematical notation, superscripts, fractions). It also collapses consecutive whitespace and converts newlines to spaces, which destroys paragraph structure, and I would argue that is more syntactically meaningful than case.

I personally don't think NFKC should be banned; I think it's a reasonable normalization. But the line between "acceptable normalization" and "changing the benchmark" is, as you said, somewhat arbitrary. I believe case folding sits on the same spectrum as NFKC: both collapse distinct representations that carry some signal, both make the prediction task easier, and both are motivated by the idea that the collapsed representations are "close enough" to be treated as equivalent. If the competition had used NFC from the start, I would have argued that any lossy normalization shouldn't be allowed. But since this competition has always used lossy normalization, I believe case folding is within both the spirit and the rules of the competition.

Ultimately, it's up to the maintainers where to draw the line, and I'll respect whatever they decide. But regardless of the decision, I hope they explicitly document which normalizations are permitted, because right now the only normalization in the codebase (nmt_nfkc) was inherited from SentencePiece defaults, not from an explicit policy decision.
I opened an issue where we can have the normalization discussion, since I feel it is bigger than this single PR.
The default tokenizer already applies fairly destructive normalization. Applying further destructive normalization to the validation set means that what is being measured for this submission is not comparable to what is being measured for other submissions. |
Summary
`verify_bytes.py`

What Changed (Only the Tokenizer)
The only difference from PR #1529 is the tokenizer and data. Architecture, optimizer, hyperparameters, and training budget are identical.
Normalization: `NFKC(text).lower()`. Both train and val use the same normalized representation.

Results
On Case Normalization
Case folding reduces the entropy of the input — the model no longer predicts capitalization. We believe this is a legal normalization for the same reason NFKC normalization (applied by all SentencePiece submissions) is legal: it maps semantically equivalent representations to a canonical form without changing the meaning of the text. BPB correctly measures how well the model predicts the normalized text it actually sees. We welcome judges' feedback on this point.
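As a toy illustration of the entropy claim (the 0.3/0.7 split is made up for illustration, not measured from FineWeb): if "The" vs "the" at some position had this distribution, folding removes exactly that binary choice's entropy:

```python
from math import log2

p = {"The": 0.3, "the": 0.7}  # hypothetical pre-folding case distribution
h_case = -sum(q * log2(q) for q in p.values())  # entropy of the case choice
print(round(h_case, 3))  # 0.881 bits per such position
# After case folding, both surface forms map to "the", so this term drops to 0.
```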
Byte Accounting — Verified Correct
Custom tokenizers have historically caused byte counting bugs (PRs #1143, #755). Our byte counting is verified correct: the LUT byte count exactly matches the ground-truth bytes on every document in the full 15.4M-document FineWeb corpus (0 mismatches). Judges can spot-check with the bundled 200-doc sample using `verify_bytes.py` (~30s, no GPU).
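The invariant being verified can be sketched with a toy vocabulary; `VOCAB`, `BYTE_LUT`, and `toy_encode` below are illustrative stand-ins, not the PR's actual tokenizer:

```python
import unicodedata

def normalize(text: str) -> str:
    # The PR's normalization: NFKC, then case folding.
    return unicodedata.normalize("NFKC", text).lower()

VOCAB = ["he", "llo", " wor", "ld"]                 # toy subword vocab
BYTE_LUT = [len(p.encode("utf-8")) for p in VOCAB]  # bytes per token id

def toy_encode(text: str) -> list[int]:
    # Greedy left-to-right matching; enough to demonstrate the invariant.
    ids, i = [], 0
    while i < len(text):
        for tid, piece in enumerate(VOCAB):
            if text.startswith(piece, i):
                ids.append(tid)
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return ids

doc = "Hello World"
norm = normalize(doc)
ids = toy_encode(norm)
# The per-document check: LUT byte count over token ids must equal the
# ground-truth UTF-8 byte length of the normalized document.
assert sum(BYTE_LUT[t] for t in ids) == len(norm.encode("utf-8"))
```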
Credits
Test plan
- `python verify_bytes.py --docs verify_docs.jsonl` passes