**`BUILD_NOTES.md`**
# Using Scylla
Scylla is a byte-exact TokenMonster-derived tokenizer path for Parameter Golf.

The packaged tokenizer artifact in this folder is `scylla.yaml`, with companion metadata `scylla.meta.npz`.

## Bundle And Runtime Requirements

Scylla depends on two pipeline requirements beyond the tokenizer artifact itself:

1. `charset:none` decoded strings must be interpreted as raw bytes via `latin-1`, not `utf-8`
2. flat binary shards need an explicit synthetic zero-byte BOS token so document boundaries survive export and exactness auditing

Any future Scylla-based dataset or eval path should preserve those requirements.
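A minimal sketch of both requirements, assuming the decoded string comes from a `charset:none` TokenMonster vocab. The helper names are illustrative, not repo APIs; the `BOS_ID` constant matches `manifest.json`:

```python
BOS_ID = 1253  # synthetic zero-byte BOS token id, from manifest.json

def decoded_to_bytes(decoded: str) -> bytes:
    # A charset:none vocab carries raw bytes as code points 0x00..0xFF,
    # so latin-1 round-trips them exactly. utf-8 would re-encode every
    # code point above 0x7F into two bytes and break byte accounting.
    return decoded.encode("latin-1")

def export_doc(token_ids: list[int]) -> list[int]:
    # Prepend the synthetic BOS so document boundaries survive the
    # flat binary shard format.
    return [BOS_ID] + token_ids
```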

## Exactness Audit

The strict full-validation audit result is recorded in `FULL_VAL_AUDIT.json`.

Audit command used in the main repo workspace:

```bash
.venv/bin/python3 data/audit_tokenmonster_bundle.py \
  --source-root data \
  --bundle-root /Users/simon/Code/parameter-golf-local/scylla_v2_cap0_competition_export \
  --bundle-dataset fineweb10B_scylla_v2_cap0_fullbyte \
  --bundle-tokenizer tokenizers/scylla_v2_cap0_fullbyte.yaml \
  --bundle-meta tokenizers/scylla_v2_cap0_fullbyte.meta.npz \
  --strict
```

How to read those arguments:

- `--source-root`
  Root of the canonical SP1024 challenge dataset and tokenizer. In a standard repo checkout, first run:

  ```bash
  python3 data/cached_challenge_fineweb.py --variant sp1024
  ```

  This populates:

  - `data/datasets/fineweb10B_sp1024/`
  - `data/tokenizers/fineweb_1024_bpe.model`

  In that standard layout, `--source-root` is simply `data`.
- `--bundle-root`
Root of the Scylla bundle export.
- `--bundle-dataset`
Dataset name inside the bundle manifest. You can read this from `manifest.json` under `datasets[0].name`.
- `--bundle-tokenizer`
Relative tokenizer artifact path inside the bundle. You can read this from `manifest.json` under `tokenizers[0].path`.
- `--bundle-meta`
Relative metadata path inside the bundle. You can read this from `manifest.json` under `tokenizers[0].meta_path`.

If you repack or relocate Scylla, `manifest.json` is the source of truth for the last three values.
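Since the manifest is the source of truth, those three values can be read mechanically. A sketch (the function name is ours; the keys follow the `manifest.json` in this bundle):

```python
import json

def audit_args_from_manifest(manifest_path: str) -> dict:
    # Pull the three bundle-relative audit arguments out of the manifest.
    with open(manifest_path) as f:
        m = json.load(f)
    return {
        "bundle_dataset": m["datasets"][0]["name"],
        "bundle_tokenizer": m["tokenizers"][0]["path"],
        "bundle_meta": m["tokenizers"][0]["meta_path"],
    }
```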

Example full-validation result:

- `source_val_docs = 50000`
- `bundle_val_docs = 50000`
- `source_bytes = 151080891`
- `meta_bytes = 151080891`
- `decoded_bytes = 151080891`
- `bad_docs = 0`
- `meta_overcount_frac = 0.0`
- `decoded_drift_frac = 0.0`

So Scylla is byte-exact on the fixed FineWeb validation text.

## Invariants For Future Scylla Work

Any future Scylla-based submission should be treated as invalid unless it preserves all of the following:

- exact validation bytes
- exact metadata denominator
- explicit document-boundary handling
- full-val equality: `source_bytes == meta_bytes == decoded_bytes`
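Those invariants reduce to a mechanical check over the fields recorded in `FULL_VAL_AUDIT.json`. A sketch (the function name is illustrative):

```python
def audit_is_exact(audit: dict) -> bool:
    # Full-val equality plus zero failure counters, using the field
    # names recorded in FULL_VAL_AUDIT.json.
    return (
        audit["source_bytes"] == audit["meta_bytes"] == audit["decoded_bytes"]
        and audit["bad_docs"] == 0
        and audit["meta_overcount_frac"] == 0.0
        and audit["decoded_drift_frac"] == 0.0
    )
```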

## Artifact Checksums

- `scylla.yaml`
  - `sha256 = a0177241aca1871f861fec49b7f1ee737d029e8e09e320b0efd5d5ea7bee5517`
- `scylla.meta.npz`
  - `sha256 = 849652277e70b378468194b9b6d40ddc574a980522443421e1dce1016721ed72`
- `manifest.json`
  - `sha256 = 418170f7c5ccab7dcfe51e59b185f4fd6fc64c285239e635298347cd6eaff63f`
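The checksums above can be verified with a short streaming hash (standard-library only; the helper name is ours):

```python
import hashlib

def sha256_of(path: str) -> str:
    # Stream the file in 1 MiB chunks so large artifacts don't need to
    # fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```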
---

**`FULL_VAL_AUDIT.json`**
{
  "source_val_tokens": 62021846,
  "bundle_val_tokens": 64893341,
  "source_val_docs": 50000,
  "bundle_val_docs": 50000,
  "bos_id": 1253,
  "source_bytes": 151080891,
  "meta_bytes": 151080891,
  "decoded_bytes": 151080891,
  "bad_docs": 0,
  "meta_overcount_frac": 0.0,
  "decoded_drift_frac": 0.0,
  "normalization": "None",
  "charset_encoding": "latin-1"
}
---
# Scylla: Corrected Byte-Exact Tokenizer Path

This PR packages the corrected, official revision of **Scylla**, our TokenMonster-derived tokenizer line for Parameter Golf.

We were pleased to see [Scylla](https://github.com/openai/parameter-golf/pull/1143) open what appears to be the competition's first substantial custom-tokenizer line. We were even more pleased, in the end, that people read it closely enough to break it. The critique from @NoesisGenesis, @dexhunter, and later @andrewbaggio1 on byte accounting and exactness was correct and genuinely helpful. It forced a deeper audit than we had originally performed, and the result is better for it.

We were also delighted to see other "golfers" swiftly start building with Scylla in PRs like #1184, #1242, #1274, and #1289. But once the byte-accounting issue had been correctly surfaced, it was clear that the responsible thing to do was not to defend the old path harder, but to rebuild it properly.

What we present here is **Scylla, revised**: a robust, byte-exact tokenizer path for the fixed FineWeb validation text, together with the metadata and audit artifacts needed to review it.

> This is **not** a leaderboard claim. It is a tokenizer contribution and a corrected reference path for future Scylla-based work. For clarity: in this folder, **Scylla** means the corrected official revision. The original `998`-token path from PR `#1143` is superseded by the artifact set here.

## What Was Wrong Before

The original `998`-token Scylla path from PR #1143 had two separate correctness problems:

1. Its byte-accounting metadata treated TokenMonster tokens as if their decoded byte lengths were context-free.
2. Its retokenized validation stream was not byte-identical to the fixed FineWeb validation text.

Those are distinct failures, and both matter for a tokenizer-agnostic `val_bpb` benchmark.
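To see why both matter, recall the shape of a bits-per-byte score (a standard formulation, not code from the repo):

```python
import math

def val_bpb(total_nll_nats: float, val_bytes: int) -> float:
    # Bits-per-byte: summed token negative log-likelihood converted
    # from nats to bits, divided by the byte count of the validation
    # text. An overcounted byte denominator (failure 1) silently
    # deflates the score; a drifted decoded stream (failure 2) changes
    # which text is being scored at all.
    return total_nll_nats / (math.log(2) * val_bytes)
```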

The repair path was not obvious at first. In the first byte-native audit lane, a converted Scylla-family vocabulary round-tripped `187/200` sampled validation documents exactly, while `13` remained stubbornly wrong. Those failures clustered almost entirely in non-ASCII / UTF-8 cases. The first clue was incomplete high-byte fallback coverage; fixing that collapsed the failure surface dramatically. The remaining holdouts included Turkish dotted `İ`, which exposed a deeper capcode interaction. That was the moment the shape of the real fix became clear: not another local patch, but a genuinely byte-native tokenizer regime.
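A small illustration of why `İ` is hostile to case-folding schemes like capcode (this is standard Unicode behavior, not code from the audit): lowercasing U+0130 yields two code points, the round trip does not restore the original character, and the byte length changes.

```python
dotted_I = "\u0130"                        # Turkish dotted capital İ

# Lowercasing expands to 'i' + U+0307 (combining dot above).
assert len(dotted_I.lower()) == 2

# The case-fold round trip does not return the original code point.
assert dotted_I.lower().upper() != dotted_I

# The UTF-8 byte length changes under folding: 2 bytes -> 3 bytes,
# which is fatal for context-free byte accounting.
assert len(dotted_I.encode("utf-8")) == 2
assert len(dotted_I.lower().encode("utf-8")) == 3
```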

## What Changed

The Corrected Scylla presented here uses a byte-native TokenMonster regime:

- `capcode = 0`
- `charset = none`
- `normalization = none`
- explicit `0x00..0xFF` byte fallback coverage
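The last bullet can be checked mechanically. A sketch, assuming the vocabulary is available as a list of per-token byte strings (the helper name is ours):

```python
def has_full_byte_fallback(token_bytes: list[bytes]) -> bool:
    # A byte-native vocab can tokenize arbitrary input exactly only if
    # every single byte 0x00..0xFF has a token of its own.
    singletons = {tb[0] for tb in token_bytes if len(tb) == 1}
    return singletons == set(range(256))
```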

The bundle/export path also needed two additional corrections:

- `charset:none` TokenMonster decoded strings must be interpreted as raw bytes via `latin-1`, not `utf-8`
- a synthetic zero-byte BOS token must be inserted at dataset/export time so the flat shard format preserves document boundaries exactly

The resulting tokenizer metadata and dataset bundle now admit exact, reviewable byte accounting.

## Full-Validation Exactness

We ran a strict full-validation audit against the fixed SP1024 FineWeb validation source. The corrected Scylla bundle yields:

- `source_val_docs = 50000`
- `bundle_val_docs = 50000`
- `source_bytes = 151080891`
- `meta_bytes = 151080891`
- `decoded_bytes = 151080891`
- `bad_docs = 0`
- `meta_overcount_frac = 0.0`
- `decoded_drift_frac = 0.0`

That is the whole point of this revision. The source text, the decoded tokenizer stream, and the metadata-derived denominator now agree exactly on the full validation shard.

## Included Artifacts

- `scylla.yaml`
The corrected Scylla tokenizer artifact.
- `scylla.meta.npz`
The corrected byte-accounting metadata.
- `manifest.json`
Bundle manifest for the corrected full-data export.
- `BUILD_NOTES.md`
Construction notes, invariants, and the exact audit path for future Scylla-based work.
- `FULL_VAL_AUDIT.json`
Full-validation exactness audit results.

## Why We Are Publishing This

We think novel tokenizer work belongs in this competition. It changes the shape of the problem in an interesting way, and it deserves to be explored in public rather than in a private thicket of half-verified local hacks.

So this PR is meant as a community contribution:

- a corrected Scylla reference path
- an explicit accounting story
- a cleaner base for future tokenizer experimentation

We hope others extend it, stress it, improve it, and, ideally, beat it.

## Thanks

We are indebted to @NoesisGenesis, @dexhunter, and @andrewbaggio1 for pressing on the exactness and byte-accounting questions. Their scrutiny materially improved this work.
---

**`manifest.json`**
{
  "version": "10B",
  "num_docs": 835771,
  "num_val_docs": 50000,
  "shuffle_seed": 1337,
  "dataset_revision": "9bb295ddab0e05d785b879661af7260fed5140fc",
  "shard_size": 100000000,
  "append_eos": false,
  "docs_jsonl": "docs_selected.jsonl",
  "docs_meta": {
    "remote_name": "external_cache",
    "num_docs": 15368808,
    "docs_sha256": null,
    "dataset_fingerprint": null
  },
  "tokenizer_specs": [],
  "tokenizers": [
    {
      "name": "scylla",
      "kind": "tokenmonster",
      "vocab_size": 1254,
      "logical_vocab_size": 1178,
      "max_token_id": 1252,
      "bos_id": 1253,
      "eos_id": -1,
      "recommended_bigram_vocab_size": 6400,
      "path": "scylla.yaml",
      "meta_path": "scylla.meta.npz",
      "source_spec": {
        "kind": "tokenmonster",
        "source_model": "scylla.yaml",
        "charset": "None",
        "capcode": 0,
        "normalization": "None",
        "logical_vocab_size": 1178,
        "max_token_id": 1252
      }
    }
  ],
  "datasets": [
    {
      "name": "fineweb10B_scylla",
      "tokenizer_name": "scylla",
      "tokenizer_kind": "tokenmonster",
      "path": "datasets/fineweb10B_scylla",
      "train_glob": "datasets/fineweb10B_scylla/fineweb_train_*.bin",
      "val_glob": "datasets/fineweb10B_scylla/fineweb_val_*.bin",
      "vocab_size": 1254,
      "logical_vocab_size": 1178,
      "max_token_id": 1252,
      "bos_id": 1253,
      "eos_id": -1,
      "recommended_bigram_vocab_size": 6400,
      "stats": {
        "docs_total": 835771,
        "docs_val": 50000,
        "docs_train": 785771,
        "files_total": 12,
        "files_val": 1,
        "files_train": 11,
        "tokens_total": 1110765476,
        "tokens_val": 64893341,
        "tokens_train": 1045872135
      }
    }
  ]
}
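One quick sanity check on the `stats` block above: the split counts should add up. A sketch (an illustrative helper, not part of the repo tooling):

```python
def stats_consistent(stats: dict) -> bool:
    # Val + train should equal the totals for docs, files, and tokens.
    return (
        stats["docs_val"] + stats["docs_train"] == stats["docs_total"]
        and stats["files_val"] + stats["files_train"] == stats["files_total"]
        and stats["tokens_val"] + stats["tokens_train"] == stats["tokens_total"]
    )
```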