Scylla: Corrected Byte-Exact Tokenizer Path #1314

simon-marcus wants to merge 2 commits into openai:main
This PR packages the corrected, official revision of Scylla, our TokenMonster-derived tokenizer line for Parameter Golf.
We were pleased to see Scylla open what appears to be the competition's first substantial custom-tokenizer line. We were even more pleased, in the end, that people read it closely enough to break it. The critique from @NoesisGenesis, @dexhunter, and later @andrewbaggio1 on byte accounting and exactness was correct and genuinely helpful. It forced a deeper audit than we had originally performed, and the result is better for it.
We were also delighted to see other "golfers" swiftly start building with Scylla in PRs like #1184, #1242, #1274, and #1289. But once the byte-accounting issue had been correctly surfaced, it was clear that the responsible thing to do was not to defend the old path harder, but to rebuild it properly.
What we present here is Scylla, revised: a robust, byte-exact tokenizer path for the fixed FineWeb validation text, together with the metadata and audit artifacts needed to review it.
What Was Wrong Before
The original 998-token Scylla path from PR #1143 had two separate correctness problems: one in byte accounting and one in round-trip exactness. Those are distinct failures, and both matter for a tokenizer-agnostic `val_bpb` benchmark.

The repair path was not obvious at first. In the first byte-native audit lane, a converted Scylla-family vocabulary round-tripped 187/200 sampled validation documents exactly, while 13 remained stubbornly wrong. Those failures clustered almost entirely in non-ASCII / UTF-8 cases. The first clue was incomplete high-byte fallback coverage; fixing that collapsed the failure surface dramatically. The remaining holdouts included the Turkish dotted İ, which exposed a deeper capcode interaction. That was the moment the shape of the real fix became clear: not another local patch, but a genuinely byte-native tokenizer regime.

What Changed
The corrected Scylla presented here uses a byte-native TokenMonster regime:

- `capcode = 0`
- `charset = none`
- `normalization = none`
- full `0x00..0xFF` byte fallback coverage

The bundle/export path also needed two additional corrections:

- `charset: none` must be carried through to the export
- TokenMonster decoded strings must be interpreted as raw bytes via `latin-1`, not `utf-8`

The resulting tokenizer metadata and dataset bundle now admit exact, reviewable byte accounting.
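The `latin-1` point can be made concrete: `latin-1` maps each byte `0x00..0xFF` to the Unicode code point with the same value, so decoding and re-encoding through it is lossless for arbitrary bytes, while `utf-8` rejects isolated high bytes outright. A minimal demonstration:

```python
all_bytes = bytes(range(256))

# latin-1 round-trips every byte value 0x00..0xFF exactly: it is a
# bijection between bytes and the first 256 Unicode code points.
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# utf-8 cannot even decode an isolated high byte, so using it to
# reinterpret raw token bytes either fails or silently mangles them.
try:
    bytes([0xFF]).decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: 0xFF is not valid utf-8 on its own
```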
Full-Validation Exactness
We ran a strict full-validation audit against the fixed SP1024 FineWeb validation source. The corrected Scylla bundle yields:
- source_val_docs = 50000
- bundle_val_docs = 50000
- source_bytes = 151080891
- meta_bytes = 151080891
- decoded_bytes = 151080891
- bad_docs = 0
- meta_overcount_frac = 0.0
- decoded_drift_frac = 0.0

That is the whole point of this revision. The source text, the decoded tokenizer stream, and the metadata-derived denominator now agree exactly on the full validation shard.
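As a sketch, the exactness invariant the audit enforces can be checked mechanically. The key names mirror the figures quoted above; treating the audit results as a flat dictionary with these keys is an assumption about FULL_VAL_AUDIT.json, not its verified schema.

```python
def check_exactness(audit: dict) -> None:
    """Raise AssertionError unless the bundle is byte-exact end to end."""
    # Assumed flat-dict schema; key names follow the reported figures.
    assert audit["source_val_docs"] == audit["bundle_val_docs"], "doc count drift"
    assert audit["source_bytes"] == audit["meta_bytes"] == audit["decoded_bytes"], \
        "byte accounting disagrees"
    assert audit["bad_docs"] == 0, "some documents failed round-trip"
    assert audit["meta_overcount_frac"] == 0.0
    assert audit["decoded_drift_frac"] == 0.0

# The figures reported for the corrected bundle pass as-is:
check_exactness({
    "source_val_docs": 50000, "bundle_val_docs": 50000,
    "source_bytes": 151080891, "meta_bytes": 151080891,
    "decoded_bytes": 151080891, "bad_docs": 0,
    "meta_overcount_frac": 0.0, "decoded_drift_frac": 0.0,
})
```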
Included Artifacts
- `scylla.yaml`: the corrected Scylla tokenizer artifact.
- `scylla.meta.npz`: the corrected byte-accounting metadata.
- `manifest.json`: bundle manifest for the corrected full-data export.
- `BUILD_NOTES.md`: construction notes, invariants, and the exact audit path for future Scylla-based work.
- `FULL_VAL_AUDIT.json`: full-validation exactness audit results.
Why We Are Publishing This
We think novel tokenizer work belongs in this competition. It changes the shape of the problem in an interesting way, and it deserves to be explored in public rather than in a private thicket of half-verified local hacks.
So this PR is meant as a community contribution: we hope others extend it, stress it, improve it, and, ideally, beat it.
Thanks
We are indebted to @NoesisGenesis, @dexhunter, and @andrewbaggio1 for pressing on the exactness and byte-accounting questions. Their scrutiny materially improved this work.