
Record: Custom Casefold Tokenizer — 1.0668 BPB#1578

Open
mikeapedia wants to merge 1 commit into openai:main from mikeapedia:casefold-v2-tokenizer

Conversation

@mikeapedia

Summary

  • val_bpb: 1.06681663 (3-seed mean, std 0.00128512) | 8xH100 SXM, 600s | Legal TTT
  • Combines the PR #1529 parallel residuals architecture with a custom casefold v2 vocabulary
  • Eliminates case-duplicate tokens (21.1% of SP8192 vocab), refills 374 freed slots with BPB-optimized subwords
  • 10.38% better compression, yielding a -0.0116 BPB improvement vs the no-CUTLASS baseline (apples-to-apples)
  • Byte counting verified correct on 15.4M FineWeb docs (0 mismatches) — see verify_bytes.py

What Changed (Only the Tokenizer)

The only difference from PR #1529 is the tokenizer and data. Architecture, optimizer, hyperparameters, and training budget are identical.

  1. Casefold v2 vocabulary: Starting from SP8192 (PR #1334), retrained SentencePiece BPE on NFKC + lowercased text. 374 freed case-duplicate slots refilled with BPB-optimized subwords (numerical tokens, contractions, bare punctuation).
  2. Retokenized dataset: Full FineWeb 10B retokenized with NFKC(text).lower(). Both train and val use the same normalized representation.
  3. No code changes: Architecture, Muon optimizer, EMA, GPTQ quantization, and TTT evaluation are all identical to PR #1529.
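The normalization behind steps 1 and 2 is a one-line transform; a minimal sketch (the function name is illustrative, not taken from the repo):

```python
import unicodedata

def casefold_v2_normalize(text: str) -> str:
    """NFKC-normalize, then lowercase: the preprocessing applied to both
    train and val text before retokenization. Illustrative helper; the
    actual pipeline lives in the retokenization scripts."""
    return unicodedata.normalize("NFKC", text).lower()

# NFKC folds compatibility variants; .lower() then folds case.
assert casefold_v2_normalize("The Dog") == "the dog"
assert casefold_v2_normalize("\u00b2") == "2"   # superscript two -> "2"
assert casefold_v2_normalize("\uff24") == "d"   # fullwidth D -> "d"
```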

Results

| Seed | Legal TTT BPB | Artifact   |
|------|---------------|------------|
| 1337 | 1.06507576    | 15,993,577 |
| 2024 | 1.06813909    | 15,992,914 |
| 42   | 1.06723505    | 15,996,484 |
| Mean | 1.06681663    | 15,994,325 |

On Case Normalization

Case folding reduces the entropy of the input — the model no longer predicts capitalization. We believe this is a legal normalization for the same reason NFKC normalization (applied by all SentencePiece submissions) is legal: it maps semantically equivalent representations to a canonical form without changing the meaning of the text. BPB correctly measures how well the model predicts the normalized text it actually sees. We welcome judges' feedback on this point.

Byte Accounting — Verified Correct

Custom tokenizers have historically caused byte counting bugs (PRs #1143, #755). Our byte counting is verified correct: LUT byte count exactly matches ground-truth bytes on every document in the full 15.4M-document FineWeb corpus (0 mismatches). Judges can spot-check with the bundled 200-doc sample (~30s, no GPU):

```
pip install sentencepiece
python verify_bytes.py --docs verify_docs.jsonl
```
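For intuition, here is a toy version of the LUT check on a hypothetical three-piece vocabulary. The bundled verify_bytes.py builds its LUT from sp.id_to_piece on the real vocabulary; treating the U+2581 word-boundary marker as the one-byte space it stands for is an assumption about that script (it is exactly the accounting that issue #897 showed is required):

```python
def build_byte_lut(pieces: list[str]) -> dict[int, int]:
    """Token id -> UTF-8 byte length of the piece's surface form,
    counting the SentencePiece word-boundary marker U+2581 as the
    single space it represents (assumption; see issue #897)."""
    return {tid: len(p.replace("\u2581", " ").encode("utf-8"))
            for tid, p in enumerate(pieces)}

# Toy vocabulary and one "document":
pieces = ["\u2581the", "\u2581dog", "s"]
ids = [0, 1, 2]                       # tokenization of " the dogs"
lut = build_byte_lut(pieces)

# LUT total must exactly match the ground-truth byte count.
assert sum(lut[i] for i in ids) == len(" the dogs".encode("utf-8"))
```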

Credits

Test plan

  • 3-seed training on 8xH100 SXM (seeds 1337, 2024, 42)
  • Byte verification on full 15.4M-doc FineWeb corpus (0 mismatches)
  • Bundled 200-doc spot-check sample for judges
  • Judges verify python verify_bytes.py --docs verify_docs.jsonl passes
  • Judges confirm case normalization is legal under competition rules

Casefold v2 vocabulary on PR openai#1529 parallel residuals architecture.
Eliminates case-duplicate tokens (21.1% of SP8192 vocab), refills with
BPB-optimized subwords for 10.38% better compression. Byte counting
verified correct on 15.4M FineWeb docs (0 mismatches).
@mikeapedia
Author

@msisovic I didn't try to get your custom kernel to work, but wanted to tag you to (1) say thanks for PR #1529, and (2) note that I'm out of compute credits, so I'm not going after the record with custom kernels. It's all yours if you want to go for it!

codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request Apr 13, 2026
…0639

Systems-level optimizations (fused Muon, EMA foreach, loader prealloc)
on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals.
Identical ML; faster step time yields extra training steps. 3-seed mean:
1.0639 BPB / 3.0705 nats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dexhunter
Contributor

I think there is one important rules/comparability question here that needs maintainer clarification.

The PR says the full FineWeb train and val were retokenized after applying unicodedata.normalize("NFKC", text).lower(), and the README / verify_bytes.py then validate bytes against the post-normalization text rather than the original validation text.

My concern is that this looks like changing the benchmark text itself, not just using a different tokenizer on the same validation set. The README describes the challenge as compression on the FineWeb validation set, tokenizer-agnostic in BPB, and I am not sure that lowercasing / case-folding the validation corpus is comparable to the standard submissions on the original text.

Can maintainers clarify whether this kind of validation-text normalization is allowed? If yes, it would be helpful to state explicitly that submissions may transform the validation text itself (for example via case folding) as long as byte accounting is then done on the transformed text. If not, I think this PR would need to be treated differently from a tokenizer-only change.

@codemath3000

codemath3000 commented Apr 13, 2026

Disclosure: I have a dependent submission (PR #1585, systems optimization on top of this PR).

The rules support the validity of the approach in PR #1578:

  1. The challenge is explicitly "tokenizer-agnostic, bits per byte." Rule 2 specifically addresses tokenizer modifications: "If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated." This sets a verification bar for custom tokenizers, not a prohibition. PR #1578 does exactly that, with 15.4M-document exact byte verification.

  2. Issues #43 ([Question] Tokenizer is not counted in submission size) and #897 (BUG: bpb underestimated when tokenizer does not contain the U+2581 space token) confirm custom tokenizers are allowed: "Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set."

  3. Every existing submission already uses lossy text preprocessing. SentencePiece's built-in nmt_nfkc normalization collapses whitespace, decomposes Unicode characters, and discards formatting information. If lossy preprocessing were grounds for disqualification, virtually every submission on the leaderboard would be invalid. Case folding is a normalization step in the same category — part of the tokenizer pipeline, applied identically to train and val.

  4. BPB normalizes for tokenizer compression by construction — fewer tokens means higher per-token loss, and these effects cancel in the (tokens/bytes) scaling. The BPB improvement here comes from genuine learning advantages: 10% more text coverage per training step and elimination of wasted vocabulary slots on case duplicates.
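The cancellation in point 4 can be checked with a toy calculation (the numbers are illustrative, not from any submission): if a coarser vocabulary covers the same bytes in fewer tokens and per-token loss rises proportionally, BPB is unchanged.

```python
import math

def bpb(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """bits-per-byte = (nats/token) * (tokens/bytes) / ln 2."""
    return nats_per_token * n_tokens / (n_bytes * math.log(2))

# 30 tokens over 100 bytes vs 25 tokens over the same 100 bytes,
# with per-token loss scaling up as the token count shrinks:
a = bpb(2.8, 30, 100)
b = bpb(2.8 * 30 / 25, 25, 100)
assert abs(a - b) < 1e-12   # the (tokens/bytes) factor cancels
```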

Would appreciate a maintainer ruling here.

@mikeapedia
Author

@dexhunter - Great question! I actually investigated this exact concern during development. The short answer is: every SentencePiece submission already transforms the validation text before byte counting, in exactly the same way.

OpenAI's baseline train_gpt.py counts bytes via build_sentencepiece_luts() (line 180), which calls sp.id_to_piece(token_id) and measures len(piece.encode("utf-8")). These pieces come from the tokenizer's internal vocabulary, which reflects text after SentencePiece's nmt_nfkc normalization. SP's normalizer silently:

  • Applies NFKC decomposition (½ → 1⁄2, ² → 2) — changes byte count
  • Collapses consecutive whitespace — changes byte count
  • Converts newlines to spaces — changes character identity

No submission counts bytes from the original raw validation text. They all count bytes from the tokenizer's post-normalization pieces via the LUT. This is by design — BPB measures how well the model predicts the text it actually sees, which is the normalized form.
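The byte-count effects are easy to verify with the stdlib (codepoints chosen for illustration):

```python
import unicodedata

for ch in ["\u00bd", "\u00b2"]:            # ½ , ²
    norm = unicodedata.normalize("NFKC", ch)
    before = len(ch.encode("utf-8"))
    after = len(norm.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {before} bytes -> {norm!r}, {after} bytes")

# ½ (2 bytes) -> "1⁄2" (5 bytes: FRACTION SLASH U+2044 alone is 3 bytes);
# ² (2 bytes) -> "2" (1 byte). Normalization moves byte counts both ways.
```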

My case folding adds .lower() after NFKC — one additional normalization step in the same pipeline. The byte counting mechanism is identical: LUT built from sp.id_to_piece(), same as every other submission. I verified this produces exact byte matches on all 15.4M FineWeb documents.

The competition explicitly allows custom tokenizers (Issues #43, #897), and the rules state: "Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set."

That said, I agree this is worth explicit maintainer clarification — the normalization question applies to ALL submissions, not just mine. Happy to discuss further.

@samacqua
Contributor

Echoing @dexhunter, I think this normalization is problematic. As mentioned, standard tokenization does have some normalization, but there are clearly degenerate normalizations that destroy information. As an extreme example, I could "normalize" the sequence with lambda char: "a", which would be trivial to predict.

Similarly, but much less extreme, making everything lowercase makes the prediction task easier.

As some napkin math to show this, take the baseline. From your readme, the baseline has tok/byte = 0.268 and ~40M validation tokens. At BPB = 1.0784, that's ~4.025 bits/token. Let's estimate 5% of tokens are at uppercase positions, so ~2M tokens.

Now suppose casefolding doesn't change the probability of lowercase tokens (conservative), but doubles the probability of the correct token at each of the uppercase positions (halving uncertainty bc no longer have to guess between upper/lowercase, albeit an aggressive estimate). Since −log_2(2p) = −log_2(p) − 1, each uppercase position saves exactly 1 bit. That's 2M bits saved over 40M / 0.268 ~= 150M bytes, giving delta BPB ~= −0.013.

This is very similar to the reported improvement here and I think where the gain is coming from. So, yes, BPB is lossless-compression agnostic, but this PR makes the target distribution different / easier to model.
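The napkin math above reproduces directly (the 5% uppercase fraction and the exact one-bit saving per position are the stated assumptions):

```python
TOK_PER_BYTE = 0.268                # baseline tokens/byte, from the README
VAL_TOKENS = 40e6
BASELINE_BPB = 1.0784
UPPER_FRAC = 0.05                   # assumed share of uppercase positions

bits_per_token = BASELINE_BPB / TOK_PER_BYTE   # ~4.02 bits/token
val_bytes = VAL_TOKENS / TOK_PER_BYTE          # ~149M bytes
# Doubling p at each uppercase position saves exactly 1 bit there,
# since -log2(2p) = -log2(p) - 1.
bits_saved = UPPER_FRAC * VAL_TOKENS
delta_bpb = -bits_saved / val_bytes            # ~ -0.0134
```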

@mikeapedia
Author

Hey @samacqua - I agree the math is directionally right. Case folding does reduce prediction entropy, and the estimate (-0.013) is in the right ballpark, though as you noted, the "doubling probability" assumption is aggressive. If we assume the model already predicts case correctly 80-90% of the time from context (sentence-initial caps, proper nouns, acronyms are all highly predictable), it would give -0.004 to -0.010.

But NFKC does the exact same thing. NFKC maps 4,964 distinct Unicode codepoints to other representations, including 662 many-to-one mappings where the model no longer has to distinguish between alternatives. For example, 20 different Unicode codepoints all normalize to D (fullwidth, mathematical bold/italic/script, double-struck, etc.). SP also collapses whitespace variants and converts newlines. Each of these reduces prediction entropy by eliminating distinctions the model would otherwise need to predict. The difference is degree, not kind.

The "degenerate normalization" argument cuts both ways. Yes, lambda char: "a" is an extreme that destroys all information. But NFKC already sits on the same spectrum — it destroys the distinction between ﬁ and fi, between ½ and 1⁄2, between ² and 2. The competition allows these normalizations because they collapse semantically equivalent representations. Case folding does the same: The and the carry the same semantic content; the case distinction is positional/conventional, not semantic.
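The many-to-one collapses are easy to demonstrate (three of the ~20 codepoints that normalize to D, chosen for illustration):

```python
import unicodedata

# Fullwidth D, mathematical bold D, double-struck italic D
variants = ["\uFF24", "\U0001D403", "\u2145"]
assert all(unicodedata.normalize("NFKC", v) == "D" for v in variants)

# The ligature collapse mentioned above:
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"   # ﬁ -> "fi"

# Case folding then merges one more pair into the same bucket:
assert unicodedata.normalize("NFKC", "D").lower() == "d"
```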

That said, I acknowledge this is a rules question, not a technical one. So I look forward to the maintainers weighing in.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 13, 2026
…ai#1586 per-layer GPTQ highest-EV

- PR openai#758 n-gram effectively dead: MatoTeziTanka (Apr 12) flagged XOR hash
  includes target token, same illegality as openai#727/openai#741
- GDN-Hybrid BPB bug confirmed: PR openai#1576 space-token double-count inflates
  denominator ~14%; actual score ~1.16-1.18, not 1.01671
- PR openai#1586 (dexhunter, 1.07493): Per-Layer Adaptive GPTQ MLP=12σ/Attn=13σ +
  int7 Emb (saves 530KB) + MLR=0.026; -0.0127 nats vs SOTA; implement now
- PR openai#1584: systems-only (fused Muon, batched EMA, loader prealloc) ~+20 steps
- Casefold Tokenizer (openai#1578/openai#1585): legality debated; await organizer ruling
- New paper: arXiv:2604.06169 In-Place TTT (Apr 7) NTP-aligned score-first TTT
- Merged SOTA 1.0810 unchanged (4-day stable streak); target ≤1.0760; 17 days

https://claude.ai/code/session_01BE8wc8zxvZAo52QBXSNiL8
@mikeapedia
Author

@0hq, @valerio-oai, & @cocohearts - I know you are all incredibly busy but when you have time I would very much appreciate and welcome a review of the legality of casefolding.

@romeerp

romeerp commented Apr 13, 2026

Hey @mikeapedia - I think there's a clear difference between NFKC normalization and case folding, although the line for valid normalization will always be somewhat arbitrary. Ultimately the point of this competition is to model language accurately, and capitalization, to me, seems a core part of accurately modeling the syntax and semantics of a language. First, I think it's an incorrect assertion that capitalization has nothing to do with semantic meaning. For example, any proper noun that is no longer capitalized loses semantic meaning. Second, it intuitively seems incorrect to me that accurately modeling language only has to do with semantic meaning, as being syntactically accurate is also part of accurately modeling a language.

@mikeapedia
Author

Hey @romeerp - I appreciate the thoughtful response, and I actually agree with your core argument that accurately modeling language involves both semantics and syntax, and capitalization carries syntactic (and sometimes semantic) signal. The proper noun example is a good one.

But I think this argument proves too much. If the standard is that normalization shouldn't discard syntactically or semantically meaningful distinctions, then NFKC shouldn't be allowed either. NFKC collapses 4,964 distinct Unicode codepoints, including distinctions that are both syntactically and semantically meaningful (e.g. ligatures, mathematical notation, superscripts, and fractions). It also collapses consecutive whitespace and converts newlines to spaces, which destroys paragraph structure. And I would argue that is more syntactically meaningful than case.

I personally don't think NFKC should be banned. I think it's a reasonable normalization. But the line between "acceptable normalization" and "changing the benchmark" is, as you said, somewhat arbitrary. I believe case folding sits on the same spectrum as NFKC: both collapse distinct representations that carry some signal, both make the prediction task easier, and both are motivated by the idea that the collapsed representations are "close enough" to be treated as equivalent.

If the competition had used NFC from the start, I would have argued that any lossy normalization shouldn't be allowed. But since this competition has always used lossy normalization, I believe that case folding is both in the spirit and the rules of the competition.

Ultimately, it's up to the maintainers on where to draw the line, and I'll respect whatever they decide. But regardless of what they decide, I hope they explicitly document which normalizations are permitted, because right now the only normalization in the codebase (nmt_nfkc) was inherited from SentencePiece defaults, not from an explicit policy decision.

@mikeapedia
Author

I opened an issue (#1604) where we can have the normalization discussion, since I feel it is bigger than this single PR.

@sharpobject

The default tokenizer already applies fairly destructive normalization. Applying further destructive normalization to the validation set means that what is being measured for this submission is not comparable to what is being measured for other submissions.
