Add vocabulary extraction option to tokenization tools #681
base: master
Conversation
Pull Request Overview
This pull request adds vocabulary extraction functionality to the tokenization tools, allowing users to export tokenizer vocabularies to JSON files for inspection and debugging. The implementation adds a new --extract-vocab CLI flag and implements vocabulary retrieval methods across all tokenizer classes.
Key changes:
- Added an abstract `get_vocabulary()` method to the base `Tokenizer` class with implementations for all tokenizer subclasses (SentencePiece, Tiktoken, Custom, Byte, Char, and byte fallback variants)
- Introduced the `--extract-vocab` command-line argument to enable vocabulary export functionality
- Implemented vocabulary extraction logic in `prepare.py` that serializes vocabularies to JSON files with custom sorting
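As a sketch of the intended workflow, the exported file could be inspected like this. The filename follows the `f"{args.method}_vocab.json"` pattern used in `prepare.py`; `"tiktoken"` is only an example method name, not something this PR prescribes:

```python
import json

# Inspect a vocabulary exported by `prepare.py` with the new --extract-vocab flag.
# The filename pattern comes from prepare.py; the concrete method name is illustrative.
with open("tiktoken_vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

print(f"{len(vocab)} tokens")
print(vocab[:10])  # longest tokens first, per the custom sort applied in prepare.py
```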
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| data/template/tokenizers.py | Added abstract get_vocabulary() method to base Tokenizer class and implemented it for all tokenizer subclasses (SentencePieceTokenizer, TiktokenTokenizer, CustomTokenizer, ByteTokenizer, CharTokenizer, CustomCharTokenizerWithByteFallback, JsonByteTokenizerWithByteFallback) |
| data/template/prepare.py | Added --extract-vocab CLI argument and vocabulary extraction logic that calls get_vocabulary(), converts tokens to strings, sorts by length/alphabetically, and exports to JSON |
```python
def get_vocabulary(self):
    vocab = []
    for token in self.itos.values():
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```
Copilot AI · Nov 15, 2025
The vocabulary is built by iterating over self.itos.values(), which returns dictionary values in arbitrary order (though insertion order is preserved in Python 3.7+). Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to ensure a consistent and predictable vocabulary order:
```python
def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.
```python
def get_vocabulary(self):
    vocab = []
    for token in self.itos.values():
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```
Copilot AI · Nov 15, 2025
The vocabulary is built by iterating over self.itos.values(), which returns dictionary values in arbitrary order (though insertion order is preserved in Python 3.7+). Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to ensure a consistent and predictable vocabulary order:
```python
def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
```

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.
```python
def get_vocabulary(self):
    """Return the list of string representations that make up the tokenizer's vocabulary."""
    raise NotImplementedError("Vocabulary extraction is not implemented for this tokenizer.")
```
Copilot AI · Nov 15, 2025
The new `get_vocabulary()` method lacks test coverage. Since the repository has comprehensive tests for other tokenizer methods, consider adding tests to verify that:
- `get_vocabulary()` returns the expected format for each tokenizer type
- The vocabulary can be successfully serialized to JSON
- Special cases (empty vocabularies, special tokens, byte representations) are handled correctly
- The vocabulary extraction flag in `prepare.py` works end-to-end
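A minimal sketch of one such test, assuming `data/template/tokenizers.py` is importable as a package module and that `ByteTokenizer` takes no constructor arguments; both are assumptions about this repository, not facts from the diff:

```python
import json

# Assumption: the tokenizer classes are importable like this and ByteTokenizer
# needs no constructor arguments; adjust to the repository's actual layout/API.
from data.template.tokenizers import ByteTokenizer


def test_byte_tokenizer_vocabulary_is_json_serializable():
    tokenizer = ByteTokenizer()
    vocab = tokenizer.get_vocabulary()

    assert len(vocab) == 256                                # one entry per byte value
    assert all(isinstance(token, str) for token in vocab)   # strings only
    # The exported vocabulary must round-trip cleanly through JSON.
    assert json.loads(json.dumps(vocab, ensure_ascii=False)) == vocab
```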
```python
# Additional options
parser.add_argument("-T", "--track_token_counts", action="store_true", help="Track how often each token appears and store in meta.pkl")
parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
```
Copilot AI · Nov 15, 2025
[nitpick] The command-line argument uses a hyphen (--extract-vocab) which is inconsistent with Python conventions. The argument will be accessible as args.extract_vocab (with underscore) due to argparse's automatic conversion, but for consistency with other arguments in the file (e.g., --track_token_counts), consider using underscores in the flag name: --extract_vocab.
Suggested change:
```diff
-parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
+parser.add_argument("-E", "--extract_vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
```
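For reference, a small standalone example of the dash-to-underscore conversion argparse performs, independent of `prepare.py`:

```python
import argparse

# argparse converts dashes in long option names to underscores in the namespace,
# so "--extract-vocab" is still accessed as args.extract_vocab.
parser = argparse.ArgumentParser()
parser.add_argument("--extract-vocab", action="store_true")
args = parser.parse_args(["--extract-vocab"])
print(args.extract_vocab)  # True
```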
```python
if not hasattr(tokenizer, "get_vocabulary"):
    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

vocab = tokenizer.get_vocabulary()
# Ensure string representations for all tokens
vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
vocab_strings.sort(key=lambda token: (-len(token), token))

vocab_filename = f"{args.method}_vocab.json"
with open(vocab_filename, 'w', encoding='utf-8') as f:
    json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
```
Copilot AI · Nov 15, 2025
The SineWaveTokenizer class doesn't inherit from the Tokenizer base class and therefore doesn't have a get_vocabulary method. When --extract-vocab is used with --method sinewave, the code will fail at line 160 when calling tokenizer.get_vocabulary() with an AttributeError.
Either SineWaveTokenizer should inherit from Tokenizer and implement get_vocabulary, or the vocabulary extraction logic in prepare.py should explicitly handle the sinewave case.
Suggested change:
```diff
-if not hasattr(tokenizer, "get_vocabulary"):
-    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
-vocab = tokenizer.get_vocabulary()
-# Ensure string representations for all tokens
-vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
-vocab_strings.sort(key=lambda token: (-len(token), token))
-vocab_filename = f"{args.method}_vocab.json"
-with open(vocab_filename, 'w', encoding='utf-8') as f:
-    json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
+if args.method == "sinewave":
+    # SineWaveTokenizer does not have get_vocabulary; generate vocabulary as 0-255
+    vocab = [str(i) for i in range(256)]
+else:
+    if not hasattr(tokenizer, "get_vocabulary"):
+        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
+    vocab = tokenizer.get_vocabulary()
+# Ensure string representations for all tokens
+vocab = [token if isinstance(token, str) else str(token) for token in vocab]
+vocab.sort(key=lambda token: (-len(token), token))
+vocab_filename = f"{args.method}_vocab.json"
+with open(vocab_filename, 'w', encoding='utf-8') as f:
+    json.dump(vocab, f, ensure_ascii=False, indent=2)
```
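The first alternative mentioned above, letting `SineWaveTokenizer` implement the method itself, could look roughly like the sketch below; the class body is purely illustrative, since `SineWaveTokenizer`'s real definition is not part of this diff:

```python
# Sketch only: how SineWaveTokenizer could expose a vocabulary itself (inside data/template/tokenizers.py).
# The real class has encode/decode logic not shown here; this assumes inheriting from Tokenizer is feasible.
class SineWaveTokenizer(Tokenizer):
    def get_vocabulary(self):
        # Token IDs are raw byte values 0-255, matching the fallback proposed in the suggestion above.
        return [str(i) for i in range(256)]
```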
```python
    return bytes(ids).decode('utf-8', errors='replace')

def get_vocabulary(self):
    return [chr(i) for i in range(256)]
```
Copilot AI · Nov 15, 2025
[nitpick] The ByteTokenizer.get_vocabulary() method returns characters using chr(i) for byte values 0-255. However, many byte values in this range (0-31, 127-159) are control characters that don't have printable representations. Some values like 0-31 and 127 are non-printable ASCII control characters, and attempting to use chr() on values 128-255 may produce unexpected Unicode characters.
This could lead to issues when the vocabulary is serialized to JSON. Consider using a more appropriate representation for non-printable bytes, such as their hexadecimal or escaped form:
```python
def get_vocabulary(self):
    vocab = []
    for i in range(256):
        if 32 <= i < 127:  # Printable ASCII range
            vocab.append(chr(i))
        else:
            vocab.append(f"\\x{i:02x}")
    return vocab
```

Suggested change:
```diff
-    return [chr(i) for i in range(256)]
+    vocab = []
+    for i in range(256):
+        if 32 <= i < 127:  # Printable ASCII range
+            vocab.append(chr(i))
+        else:
+            vocab.append(f"\\x{i:02x}")
+    return vocab
```
```python
vocab = []
seen = set()
for token_id in range(self.enc.n_vocab):
    token_bytes = self.enc.decode_single_token_bytes(token_id)
    token_str = token_bytes.decode('utf-8', errors='replace')
    if token_str not in seen:
        seen.add(token_str)
        vocab.append(token_str)

# Include known special tokens (base and additional)
special_tokens = {}
if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
    special_tokens.update(self.enc._special_tokens)
special_tokens.update(self.special_tokens)

for token in special_tokens.keys():
    if token not in seen:
        seen.add(token)
        vocab.append(token)
```
Copilot AI · Nov 15, 2025
[nitpick] The deduplication logic using seen set may hide issues where multiple token IDs decode to the same string (due to errors='replace'). This could result in a vocabulary that doesn't accurately represent all token IDs in the tokenizer.
Additionally, this means the returned vocabulary list length may not match self.enc.n_vocab, which could be confusing for users who expect a 1:1 mapping. Consider documenting this behavior in the method's docstring or removing the deduplication to preserve all token representations.
Suggested change:
```diff
-vocab = []
-seen = set()
-for token_id in range(self.enc.n_vocab):
-    token_bytes = self.enc.decode_single_token_bytes(token_id)
-    token_str = token_bytes.decode('utf-8', errors='replace')
-    if token_str not in seen:
-        seen.add(token_str)
-        vocab.append(token_str)
-# Include known special tokens (base and additional)
-special_tokens = {}
-if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
-    special_tokens.update(self.enc._special_tokens)
-special_tokens.update(self.special_tokens)
-for token in special_tokens.keys():
-    if token not in seen:
-        seen.add(token)
-        vocab.append(token)
+"""
+Returns the vocabulary as a list of strings, where each entry corresponds to the decoded string for each token ID.
+The length of the returned list matches self.enc.n_vocab, so vocab[i] is the string for token ID i.
+Special tokens are not included in this list.
+"""
+vocab = []
+for token_id in range(self.enc.n_vocab):
+    token_bytes = self.enc.decode_single_token_bytes(token_id)
+    token_str = token_bytes.decode('utf-8', errors='replace')
+    vocab.append(token_str)
```
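To make the collision concern concrete, here is a small standalone example, independent of tiktoken, of how `errors='replace'` maps distinct byte sequences to the same string:

```python
# Two different invalid UTF-8 byte sequences both decode to the replacement character,
# so a deduplicating vocabulary would keep only one entry for two distinct token IDs.
a = b"\xfe".decode("utf-8", errors="replace")
b = b"\xff".decode("utf-8", errors="replace")
print(a, b, a == b)  # � � True
```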
```python
vocab = tokenizer.get_vocabulary()
# Ensure string representations for all tokens
vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
vocab_strings.sort(key=lambda token: (-len(token), token))
```
Copilot AI · Nov 15, 2025
[nitpick] The sorting strategy sort(key=lambda token: (-len(token), token)) sorts by descending length first, then alphabetically. While this may be intentional for some use cases, the choice of sorting order is not documented.
For typical vocabulary files, users might expect alphabetical sorting or sorting by token ID order. Consider documenting why this particular sorting order was chosen, or making it configurable via a command-line option.
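One possible shape for the configurability suggestion, using `--vocab_sort` as a hypothetical flag name that is not part of this PR:

```python
import argparse

# Hypothetical option (not in this PR) showing how the sort order could be made configurable.
parser = argparse.ArgumentParser()
parser.add_argument("--vocab_sort", choices=["length", "alpha", "id"], default="length",
                    help="Ordering of the exported vocabulary JSON")
args = parser.parse_args(["--vocab_sort", "alpha"])

vocab_strings = ["the", "a", "an"]  # stand-in vocabulary for illustration
if args.vocab_sort == "length":
    vocab_strings.sort(key=lambda token: (-len(token), token))  # current PR behavior: longest first
elif args.vocab_sort == "alpha":
    vocab_strings.sort()
# "id" would simply keep the order returned by get_vocabulary()
print(vocab_strings)  # ['a', 'an', 'the']
```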
```python
if args.extract_vocab:
    if not hasattr(tokenizer, "get_vocabulary"):
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

    vocab = tokenizer.get_vocabulary()
```
Copilot AI · Nov 15, 2025
[nitpick] Vocabulary extraction occurs after full tokenization of the training and validation data (lines 121-154), which is inefficient if users only want to inspect the vocabulary without tokenizing their entire dataset.
For large datasets, this could result in unnecessary processing time. Consider allowing vocabulary extraction to occur before tokenization when the --extract-vocab flag is set and --skip_tokenization is specified, or provide a dedicated vocabulary-only mode that skips data tokenization entirely.
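A rough sketch of the reordering suggested here; `run`, `extract_vocabulary`, and `tokenize_dataset` are placeholder names for illustration, not functions from `prepare.py`, and `--skip_tokenization` is assumed to behave as described above:

```python
# Hypothetical control flow for prepare.py; all names are placeholders, not the PR's actual code.
def extract_vocabulary(tokenizer, method):
    print(f"would export {method}_vocab.json")

def tokenize_dataset(tokenizer, data):
    print("would tokenize the full dataset (the expensive step)")

def run(args, tokenizer, data):
    if args.extract_vocab:
        extract_vocabulary(tokenizer, args.method)  # vocabulary export does not need the data
    if args.skip_tokenization:
        return  # vocabulary-only mode: skip tokenization entirely
    tokenize_dataset(tokenizer, data)
```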
```python
if not hasattr(tokenizer, "get_vocabulary"):
    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

vocab = tokenizer.get_vocabulary()
```
Copilot AI · Nov 15, 2025
The hasattr check will always return True because get_vocabulary is defined as an abstract method in the base Tokenizer class. This means the error will never be raised even for tokenizers that don't support vocabulary extraction.
Consider checking if the method raises NotImplementedError instead, or check for specific tokenizer classes. For example:
```python
if args.extract_vocab:
    try:
        vocab = tokenizer.get_vocabulary()
    except NotImplementedError:
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
```

Suggested change:
```diff
-if not hasattr(tokenizer, "get_vocabulary"):
-    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
-vocab = tokenizer.get_vocabulary()
+try:
+    vocab = tokenizer.get_vocabulary()
+except NotImplementedError:
+    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
```
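A tiny standalone illustration of why the `hasattr` check can never fail: a method that merely raises `NotImplementedError` is still an attribute of every subclass, so only calling it exposes unsupported tokenizers.

```python
class Tokenizer:
    def get_vocabulary(self):
        raise NotImplementedError("Vocabulary extraction is not implemented for this tokenizer.")

class SomeTokenizerWithoutVocab(Tokenizer):  # stand-in for a subclass that never overrides the method
    pass

tok = SomeTokenizerWithoutVocab()
print(hasattr(tok, "get_vocabulary"))  # True: the base-class definition satisfies hasattr
try:
    tok.get_vocabulary()
except NotImplementedError as e:
    print(f"only calling it reveals the lack of support: {e}")
```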
This pull request adds support for extracting and exporting the vocabulary of different tokenizer methods to a JSON file, improving transparency and debugging capabilities for tokenization. The main changes introduce a new command-line option, implement `get_vocabulary` methods for each tokenizer class, and update the main script logic to handle vocabulary extraction.

Vocabulary extraction feature:
- Added a new command-line option `--extract-vocab` to `prepare.py`, allowing users to export the tokenizer vocabulary to a JSON file.
- Updated `prepare.py` to check for the `extract_vocab` flag, invoke the tokenizer's `get_vocabulary` method, and write the sorted vocabulary to a JSON file. Includes error handling for unsupported tokenizers.

Tokenizer class updates:
- Added an abstract `get_vocabulary` method to the base `Tokenizer` class, and implemented it for all tokenizer subclasses. Each implementation returns the vocabulary in a format appropriate to the tokenizer (e.g., list of strings, decoded bytes, or character sets). [1] [2] [3] [4] [5] [6] [7] [8]