Non-record: Scylla Tokenizer Byte Accounting Audit — Sub-1.0 Was a Measurement Error #1271
Open
andrewbaggio1 wants to merge 1 commit into openai:main from
Conversation
…asurement error PR openai#1184's 0.9485 BPB becomes 1.1289 with corrected byte accounting. 93% of the gap is byte denominator inflation, not model quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What this is
An audit of PR #1184's Scylla tokenizer byte accounting. I ran their exact code with a corrected `candidate.meta.npz` and proper val data. The result: 1.1289 BPB, not 0.9485. The sub-1.0 claim was a measurement error.
The bug
PR #1184's `candidate.meta.npz` has 27 byte-fallback tokens (IDs 75-101) with `base_bytes=3` instead of 1. These tokens represent single raw bytes but are counted as 3 bytes each. This inflates the byte denominator in the BPB formula, making the score look ~4% better than it actually is.
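A minimal sketch of the mechanism, with made-up token and byte counts (the real frequencies come from the val shard): each byte-fallback token should contribute 1 byte to the denominator, but the buggy meta contributes 3, and a bigger denominator means a smaller BPB for the same NLL.

```python
import math

def bpb(nll_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Bits-per-byte: convert per-token NLL (nats) to bits, scale by tokens/bytes."""
    return (nll_nats / math.log(2)) * (total_tokens / total_bytes)

# Hypothetical val shard: 1M tokens, 100K of them byte-fallback tokens.
nll = 1.92
n_tokens = 1_000_000
n_fallback = 100_000
other_bytes = 3_500_000  # bytes from non-fallback tokens (made up)

correct_bytes = other_bytes + n_fallback * 1  # fallback tokens are single raw bytes
buggy_bytes = other_bytes + n_fallback * 3    # buggy meta counts them as 3 bytes each

print(bpb(nll, n_tokens, correct_bytes))  # honest BPB (higher)
print(bpb(nll, n_tokens, buggy_bytes))    # inflated-denominator BPB (lower)
```

The numbers are illustrative only; the point is the direction of the error, not its magnitude.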
This was originally flagged by @dexhunter on PR #1143 (the earlier Scylla submission, which was closed for exactly this reason). PR #1184 reuses the same buggy `candidate.meta.npz`.
My test
I ran PR #1184's exact, unmodified `train_gpt.py` with every env var matching their README.
The only change: a corrected `candidate.meta.npz` that fixes the 27 byte-fallback tokens from `base_bytes=3` to `base_bytes=1`. Everything else is identical — same architecture, same optimizer, same GPTQ, same data.
I also retokenized the val shard directly from `docs_selected.jsonl` (no SP1024 roundtrip) using the official split: shuffle with seed 1337, last 50K docs = val. This produced 62.6M val tokens, close to their 62.4M.
Results
The NLL is nearly identical (1.928 vs 1.916 — my model is actually slightly better). The entire 0.18 BPB gap comes from the different byte accounting.
Decomposing the gap
I decomposed the BPB formula, `(NLL / ln 2) × (tokens / bytes)`, to isolate what's driving the difference:
93% of the gap is byte accounting, not model quality. The Scylla tokenizer doesn't make the model predict better; the buggy meta just makes the denominator bigger, which makes BPB smaller.
What this means
With corrected accounting, the Scylla stack lands at ~1.13 BPB. This is
essentially the same as the SP1024 stack at ~1.11-1.12. The tokenizer
itself provides no meaningful advantage.
I'd like to flag PR #1184 for review.
Corrected files included
- `correct_meta.npz` — fixes only the 27 byte-fallback tokens, leaves everything else unchanged (`has_leading_space=0`, `is_boundary=0`)
- `retokenize_proper.py` — retokenizes from raw `docs_selected.jsonl` with the proper train/val split (shuffle seed 1337, last 50K = val)
Reproducing
Request for review
@0hq @valerio-oai PR #1184 should be re-evaluated with corrected byte
accounting before being merged.