Add vocabulary extraction option to tokenization tools #681
base: master
Conversation
Pull Request Overview
This pull request adds vocabulary extraction functionality to the tokenization tools, allowing users to export tokenizer vocabularies to JSON files for inspection and debugging. The implementation adds a new --extract-vocab CLI flag and implements vocabulary retrieval methods across all tokenizer classes.
Key changes:
- Added an abstract `get_vocabulary()` method to the base `Tokenizer` class with implementations for all tokenizer subclasses (SentencePiece, Tiktoken, Custom, Byte, Char, and byte fallback variants)
- Introduced the `--extract-vocab` command-line argument to enable vocabulary export functionality
- Implemented vocabulary extraction logic in `prepare.py` that serializes vocabularies to JSON files with custom sorting
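As a sketch of the intended workflow, the exported file could be inspected like this. The filename follows the `f"{args.method}_vocab.json"` pattern used in `prepare.py`; `"tiktoken"` is only an example method name, not something this PR prescribes:

```python
import json

# Inspect a vocabulary exported by `prepare.py` with the new --extract-vocab flag.
# The filename pattern comes from prepare.py; the concrete method name is illustrative.
with open("tiktoken_vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

print(f"{len(vocab)} tokens")
print(vocab[:10])  # longest tokens first, per the custom sort applied in prepare.py
```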
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| data/template/tokenizers.py | Added abstract get_vocabulary() method to base Tokenizer class and implemented it for all tokenizer subclasses (SentencePieceTokenizer, TiktokenTokenizer, CustomTokenizer, ByteTokenizer, CharTokenizer, CustomCharTokenizerWithByteFallback, JsonByteTokenizerWithByteFallback) |
| data/template/prepare.py | Added --extract-vocab CLI argument and vocabulary extraction logic that calls get_vocabulary(), converts tokens to strings, sorts by length/alphabetically, and exports to JSON |
```python
def get_vocabulary(self):
    vocab = []
    for token in self.itos.values():
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```
Copilot AI · Nov 15, 2025
The vocabulary is built by iterating over self.itos.values(), which returns dictionary values in arbitrary order (though insertion order is preserved in Python 3.7+). Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to ensure a consistent and predictable vocabulary order:
```python
def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.
```python
def get_vocabulary(self):
    vocab = []
    for token in self.itos.values():
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```
Copilot AI · Nov 15, 2025
The vocabulary is built by iterating over self.itos.values(), which returns dictionary values in arbitrary order (though insertion order is preserved in Python 3.7+). Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to ensure a consistent and predictable vocabulary order:
```python
def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.
```python
def get_vocabulary(self):
    """Return the list of string representations that make up the tokenizer's vocabulary."""
    raise NotImplementedError("Vocabulary extraction is not implemented for this tokenizer.")
```
Copilot AI · Nov 15, 2025
The new `get_vocabulary()` method lacks test coverage. Since the repository has comprehensive tests for other tokenizer methods, consider adding tests to verify that:
- `get_vocabulary()` returns the expected format for each tokenizer type
- The vocabulary can be successfully serialized to JSON
- Special cases (empty vocabularies, special tokens, byte representations) are handled correctly
- The vocabulary extraction flag in `prepare.py` works end-to-end
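A minimal sketch of one such test, assuming `data/template/tokenizers.py` is importable as a package module and that `ByteTokenizer` takes no constructor arguments; both are assumptions about this repository, not facts from the diff:

```python
import json

# Assumption: the tokenizer classes are importable like this and ByteTokenizer
# needs no constructor arguments; adjust to the repository's actual layout/API.
from data.template.tokenizers import ByteTokenizer


def test_byte_tokenizer_vocabulary_is_json_serializable():
    tokenizer = ByteTokenizer()
    vocab = tokenizer.get_vocabulary()

    assert len(vocab) == 256                                # one entry per byte value
    assert all(isinstance(token, str) for token in vocab)   # strings only
    # The exported vocabulary must round-trip cleanly through JSON.
    assert json.loads(json.dumps(vocab, ensure_ascii=False)) == vocab
```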
```python
# Additional options
parser.add_argument("-T", "--track_token_counts", action="store_true", help="Track how often each token appears and store in meta.pkl")
parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
```
Copilot AI · Nov 15, 2025
[nitpick] The command-line argument uses a hyphen (--extract-vocab) which is inconsistent with Python conventions. The argument will be accessible as args.extract_vocab (with underscore) due to argparse's automatic conversion, but for consistency with other arguments in the file (e.g., --track_token_counts), consider using underscores in the flag name: --extract_vocab.
Suggested change:
```diff
-parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
+parser.add_argument("-E", "--extract_vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
```
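For reference, a small standalone example of the dash-to-underscore conversion argparse performs, independent of `prepare.py`:

```python
import argparse

# argparse converts dashes in long option names to underscores in the namespace,
# so "--extract-vocab" is still accessed as args.extract_vocab.
parser = argparse.ArgumentParser()
parser.add_argument("--extract-vocab", action="store_true")
args = parser.parse_args(["--extract-vocab"])
print(args.extract_vocab)  # True
```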
```python
if not hasattr(tokenizer, "get_vocabulary"):
    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

vocab = tokenizer.get_vocabulary()
# Ensure string representations for all tokens
vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
vocab_strings.sort(key=lambda token: (-len(token), token))

vocab_filename = f"{args.method}_vocab.json"
with open(vocab_filename, 'w', encoding='utf-8') as f:
    json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
```
Copilot AI · Nov 15, 2025
The SineWaveTokenizer class doesn't inherit from the Tokenizer base class and therefore doesn't have a get_vocabulary method. When --extract-vocab is used with --method sinewave, the code will fail at line 160 when calling tokenizer.get_vocabulary() with an AttributeError.
Either SineWaveTokenizer should inherit from Tokenizer and implement get_vocabulary, or the vocabulary extraction logic in prepare.py should explicitly handle the sinewave case.
Suggested change:
```diff
-if not hasattr(tokenizer, "get_vocabulary"):
-    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
-vocab = tokenizer.get_vocabulary()
-# Ensure string representations for all tokens
-vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
-vocab_strings.sort(key=lambda token: (-len(token), token))
-vocab_filename = f"{args.method}_vocab.json"
-with open(vocab_filename, 'w', encoding='utf-8') as f:
-    json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
+if args.method == "sinewave":
+    # SineWaveTokenizer does not have get_vocabulary; generate vocabulary as 0-255
+    vocab = [str(i) for i in range(256)]
+else:
+    if not hasattr(tokenizer, "get_vocabulary"):
+        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
+    vocab = tokenizer.get_vocabulary()
+# Ensure string representations for all tokens
+vocab = [token if isinstance(token, str) else str(token) for token in vocab]
+vocab.sort(key=lambda token: (-len(token), token))
+vocab_filename = f"{args.method}_vocab.json"
+with open(vocab_filename, 'w', encoding='utf-8') as f:
+    json.dump(vocab, f, ensure_ascii=False, indent=2)
```
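The first alternative mentioned above, letting `SineWaveTokenizer` implement the method itself, could look roughly like the sketch below; the class body is purely illustrative, since `SineWaveTokenizer`'s real definition is not part of this diff:

```python
# Sketch only: how SineWaveTokenizer could expose a vocabulary itself (inside data/template/tokenizers.py).
# The real class has encode/decode logic not shown here; this assumes inheriting from Tokenizer is feasible.
class SineWaveTokenizer(Tokenizer):
    def get_vocabulary(self):
        # Token IDs are raw byte values 0-255, matching the fallback proposed in the suggestion above.
        return [str(i) for i in range(256)]
```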
```python
    return bytes(ids).decode('utf-8', errors='replace')

def get_vocabulary(self):
    return [chr(i) for i in range(256)]
```
Copilot AI · Nov 15, 2025
[nitpick] The ByteTokenizer.get_vocabulary() method returns characters using chr(i) for byte values 0-255. However, many byte values in this range (0-31, 127-159) are control characters that don't have printable representations. Some values like 0-31 and 127 are non-printable ASCII control characters, and attempting to use chr() on values 128-255 may produce unexpected Unicode characters.
This could lead to issues when the vocabulary is serialized to JSON. Consider using a more appropriate representation for non-printable bytes, such as their hexadecimal or escaped form:
```python
def get_vocabulary(self):
    vocab = []
    for i in range(256):
        if 32 <= i < 127:  # Printable ASCII range
            vocab.append(chr(i))
        else:
            vocab.append(f"\\x{i:02x}")
    return vocab
```

Suggested change:
```diff
-    return [chr(i) for i in range(256)]
+    vocab = []
+    for i in range(256):
+        if 32 <= i < 127:  # Printable ASCII range
+            vocab.append(chr(i))
+        else:
+            vocab.append(f"\\x{i:02x}")
+    return vocab
```
```python
vocab = []
seen = set()
for token_id in range(self.enc.n_vocab):
    token_bytes = self.enc.decode_single_token_bytes(token_id)
    token_str = token_bytes.decode('utf-8', errors='replace')
    if token_str not in seen:
        seen.add(token_str)
        vocab.append(token_str)

# Include known special tokens (base and additional)
special_tokens = {}
if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
    special_tokens.update(self.enc._special_tokens)
special_tokens.update(self.special_tokens)

for token in special_tokens.keys():
    if token not in seen:
        seen.add(token)
        vocab.append(token)
```
Copilot AI · Nov 15, 2025
[nitpick] The deduplication logic using seen set may hide issues where multiple token IDs decode to the same string (due to errors='replace'). This could result in a vocabulary that doesn't accurately represent all token IDs in the tokenizer.
Additionally, this means the returned vocabulary list length may not match self.enc.n_vocab, which could be confusing for users who expect a 1:1 mapping. Consider documenting this behavior in the method's docstring or removing the deduplication to preserve all token representations.
Suggested change:
```diff
-vocab = []
-seen = set()
-for token_id in range(self.enc.n_vocab):
-    token_bytes = self.enc.decode_single_token_bytes(token_id)
-    token_str = token_bytes.decode('utf-8', errors='replace')
-    if token_str not in seen:
-        seen.add(token_str)
-        vocab.append(token_str)
-# Include known special tokens (base and additional)
-special_tokens = {}
-if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
-    special_tokens.update(self.enc._special_tokens)
-special_tokens.update(self.special_tokens)
-for token in special_tokens.keys():
-    if token not in seen:
-        seen.add(token)
-        vocab.append(token)
+"""
+Returns the vocabulary as a list of strings, where each entry corresponds to the decoded string for each token ID.
+The length of the returned list matches self.enc.n_vocab, so vocab[i] is the string for token ID i.
+Special tokens are not included in this list.
+"""
+vocab = []
+for token_id in range(self.enc.n_vocab):
+    token_bytes = self.enc.decode_single_token_bytes(token_id)
+    token_str = token_bytes.decode('utf-8', errors='replace')
+    vocab.append(token_str)
```
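To make the collision concern concrete, here is a small standalone example, independent of tiktoken, of how `errors='replace'` maps distinct byte sequences to the same string:

```python
# Two different invalid UTF-8 byte sequences both decode to the replacement character,
# so a deduplicating vocabulary would keep only one entry for two distinct token IDs.
a = b"\xfe".decode("utf-8", errors="replace")
b = b"\xff".decode("utf-8", errors="replace")
print(a, b, a == b)  # � � True
```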
```python
vocab = tokenizer.get_vocabulary()
# Ensure string representations for all tokens
vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
vocab_strings.sort(key=lambda token: (-len(token), token))
```
Copilot AI · Nov 15, 2025
[nitpick] The sorting strategy sort(key=lambda token: (-len(token), token)) sorts by descending length first, then alphabetically. While this may be intentional for some use cases, the choice of sorting order is not documented.
For typical vocabulary files, users might expect alphabetical sorting or sorting by token ID order. Consider documenting why this particular sorting order was chosen, or making it configurable via a command-line option.
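One possible shape for the configurability suggestion, using `--vocab_sort` as a hypothetical flag name that is not part of this PR:

```python
import argparse

# Hypothetical option (not in this PR) showing how the sort order could be made configurable.
parser = argparse.ArgumentParser()
parser.add_argument("--vocab_sort", choices=["length", "alpha", "id"], default="length",
                    help="Ordering of the exported vocabulary JSON")
args = parser.parse_args(["--vocab_sort", "alpha"])

vocab_strings = ["the", "a", "an"]  # stand-in vocabulary for illustration
if args.vocab_sort == "length":
    vocab_strings.sort(key=lambda token: (-len(token), token))  # current PR behavior: longest first
elif args.vocab_sort == "alpha":
    vocab_strings.sort()
# "id" would simply keep the order returned by get_vocabulary()
print(vocab_strings)  # ['a', 'an', 'the']
```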
```python
if args.extract_vocab:
    if not hasattr(tokenizer, "get_vocabulary"):
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

    vocab = tokenizer.get_vocabulary()
```
Copilot AI · Nov 15, 2025
[nitpick] Vocabulary extraction occurs after full tokenization of the training and validation data (lines 121-154), which is inefficient if users only want to inspect the vocabulary without tokenizing their entire dataset.
For large datasets, this could result in unnecessary processing time. Consider allowing vocabulary extraction to occur before tokenization when the --extract-vocab flag is set and --skip_tokenization is specified, or provide a dedicated vocabulary-only mode that skips data tokenization entirely.
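A rough sketch of the reordering suggested here; `run`, `extract_vocabulary`, and `tokenize_dataset` are placeholder names for illustration, not functions from `prepare.py`, and `--skip_tokenization` is assumed to behave as described above:

```python
# Hypothetical control flow for prepare.py; all names are placeholders, not the PR's actual code.
def extract_vocabulary(tokenizer, method):
    print(f"would export {method}_vocab.json")

def tokenize_dataset(tokenizer, data):
    print("would tokenize the full dataset (the expensive step)")

def run(args, tokenizer, data):
    if args.extract_vocab:
        extract_vocabulary(tokenizer, args.method)  # vocabulary export does not need the data
    if args.skip_tokenization:
        return  # vocabulary-only mode: skip tokenization entirely
    tokenize_dataset(tokenizer, data)
```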
```python
if not hasattr(tokenizer, "get_vocabulary"):
    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

vocab = tokenizer.get_vocabulary()
```
Copilot AI · Nov 15, 2025
The hasattr check will always return True because get_vocabulary is defined as an abstract method in the base Tokenizer class. This means the error will never be raised even for tokenizers that don't support vocabulary extraction.
Consider checking if the method raises NotImplementedError instead, or check for specific tokenizer classes. For example:
```python
if args.extract_vocab:
    try:
        vocab = tokenizer.get_vocabulary()
    except NotImplementedError:
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
```

Suggested change:
```diff
-if not hasattr(tokenizer, "get_vocabulary"):
-    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
-vocab = tokenizer.get_vocabulary()
+try:
+    vocab = tokenizer.get_vocabulary()
+except NotImplementedError:
+    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
```
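A tiny standalone illustration of why the `hasattr` check can never fail: a method that merely raises `NotImplementedError` is still an attribute of every subclass, so only calling it exposes unsupported tokenizers.

```python
class Tokenizer:
    def get_vocabulary(self):
        raise NotImplementedError("Vocabulary extraction is not implemented for this tokenizer.")

class SomeTokenizerWithoutVocab(Tokenizer):  # stand-in for a subclass that never overrides the method
    pass

tok = SomeTokenizerWithoutVocab()
print(hasattr(tok, "get_vocabulary"))  # True: the base-class definition satisfies hasattr
try:
    tok.get_vocabulary()
except NotImplementedError as e:
    print(f"only calling it reveals the lack of support: {e}")
```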
This pull request adds support for extracting and exporting the vocabulary of different tokenizer methods to a JSON file, improving transparency and debugging capabilities for tokenization. The main changes introduce a new command-line option, implement `get_vocabulary` methods for each tokenizer class, and update the main script logic to handle vocabulary extraction.

Vocabulary extraction feature:
- Added a new command-line option `--extract-vocab` to `prepare.py`, allowing users to export the tokenizer vocabulary to a JSON file.
- Updated `prepare.py` to check for the `extract_vocab` flag, invoke the tokenizer's `get_vocabulary` method, and write the sorted vocabulary to a JSON file. Includes error handling for unsupported tokenizers.

Tokenizer class updates:
- Added an abstract `get_vocabulary` method to the base `Tokenizer` class, and implemented it for all tokenizer subclasses. Each implementation returns the vocabulary in a format appropriate to the tokenizer (e.g., list of strings, decoded bytes, or character sets). [1] [2] [3] [4] [5] [6] [7] [8]