Conversation

@klei22 (Collaborator) commented Nov 15, 2025

This pull request adds support for extracting and exporting the vocabulary of different tokenizer methods to a JSON file, improving transparency and debugging capabilities for tokenization. The main changes introduce a new command-line option, implement get_vocabulary methods for each tokenizer class, and update the main script logic to handle vocabulary extraction.

Vocabulary extraction feature:

  • Added a new command-line argument --extract-vocab to prepare.py, allowing users to export the tokenizer vocabulary to a JSON file.
  • Updated the main logic in prepare.py to check for the extract_vocab flag, invoke the tokenizer’s get_vocabulary method, and write the sorted vocabulary to a JSON file. Includes error handling for unsupported tokenizers.

Tokenizer class updates:

  • Added an abstract get_vocabulary method to the base Tokenizer class and implemented it for all tokenizer subclasses. Each implementation returns the vocabulary in a format appropriate to the tokenizer (e.g., list of strings, decoded bytes, or character sets). A usage sketch follows.
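
A hedged end-to-end sketch of the new flow (the "char" method name and the absence of other required prepare.py arguments are assumptions; the {method}_vocab.json filename pattern comes from the extraction logic reviewed below):

import json
import subprocess

# Hypothetical invocation of the preparation script with the new flag.
subprocess.run(
    ["python", "prepare.py", "--method", "char", "--extract-vocab"],
    check=True,
)

# The exported file can then be inspected directly.
with open("char_vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)
print(len(vocab), vocab[:5])  # longest tokens first (see the sorting discussion below)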

Copilot AI left a comment

Pull Request Overview

This pull request adds vocabulary extraction functionality to the tokenization tools, allowing users to export tokenizer vocabularies to JSON files for inspection and debugging. The implementation adds a new --extract-vocab CLI flag and implements vocabulary retrieval methods across all tokenizer classes.

Key changes:

  • Added abstract get_vocabulary() method to the base Tokenizer class with implementations for all tokenizer subclasses (SentencePiece, Tiktoken, Custom, Byte, Char, and byte fallback variants)
  • Introduced --extract-vocab command-line argument to enable vocabulary export functionality
  • Implemented vocabulary extraction logic in prepare.py that serializes vocabularies to JSON files with custom sorting

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

  • data/template/tokenizers.py: Added abstract get_vocabulary() method to base Tokenizer class and implemented it for all tokenizer subclasses (SentencePieceTokenizer, TiktokenTokenizer, CustomTokenizer, ByteTokenizer, CharTokenizer, CustomCharTokenizerWithByteFallback, JsonByteTokenizerWithByteFallback)
  • data/template/prepare.py: Added --extract-vocab CLI argument and vocabulary extraction logic that calls get_vocabulary(), converts tokens to strings, sorts by descending length and then alphabetically, and exports to JSON


Comment on lines +489 to +496
def get_vocabulary(self):
    vocab = []
    for token in self.itos.values():
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
Copilot AI Nov 15, 2025

The vocabulary is built by iterating over self.itos.values(), which yields values in insertion order (guaranteed since Python 3.7) rather than in token-ID order. Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to ensure a consistent and predictable vocabulary order:

def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.
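
A standalone illustration of the ordering concern (not the PR's actual itos):

# A dict built out of ID order keeps insertion order, not key order.
itos = {2: "c", 0: "a", 1: "b"}
print(list(itos.values()))              # ['c', 'a', 'b'] (insertion order)
print([itos[i] for i in sorted(itos)])  # ['a', 'b', 'c'] (token-ID order)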

Comment on lines +630 to +637
def get_vocabulary(self):
    vocab = []
    for token in self.itos.values():
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab
Copilot AI Nov 15, 2025

The vocabulary is built by iterating over self.itos.values(), which yields values in insertion order (guaranteed since Python 3.7) rather than in token-ID order. Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to ensure a consistent and predictable vocabulary order:

def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.

Comment on lines +38 to +41
def get_vocabulary(self):
    """Return the list of string representations that make up the tokenizer's vocabulary."""
    raise NotImplementedError("Vocabulary extraction is not implemented for this tokenizer.")

Copilot AI Nov 15, 2025

The new get_vocabulary() method lacks test coverage. Since the repository has comprehensive tests for other tokenizer methods, consider adding tests to verify that:

  1. get_vocabulary() returns the expected format for each tokenizer type
  2. The vocabulary can be successfully serialized to JSON
  3. Special cases (empty vocabularies, special tokens, byte representations) are handled correctly
  4. The vocabulary extraction flag in prepare.py works end-to-end
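
A minimal sketch of such a test, covering points 1 and 2 (the import path and a no-argument ByteTokenizer constructor are assumptions about this repository's layout):

import json
from tokenizers import ByteTokenizer  # assumed local module: data/template/tokenizers.py

def test_byte_vocabulary_serializes_to_json():
    vocab = ByteTokenizer().get_vocabulary()
    assert len(vocab) == 256  # one entry per byte value
    # Round-trip through JSON mirrors what prepare.py writes to disk.
    assert json.loads(json.dumps(vocab, ensure_ascii=False)) == vocab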


# Additional options
parser.add_argument("-T", "--track_token_counts", action="store_true", help="Track how often each token appears and store in meta.pkl")
parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
Copilot AI Nov 15, 2025

[nitpick] The command-line argument uses a hyphen (--extract-vocab) which is inconsistent with Python conventions. The argument will be accessible as args.extract_vocab (with underscore) due to argparse's automatic conversion, but for consistency with other arguments in the file (e.g., --track_token_counts), consider using underscores in the flag name: --extract_vocab.

Suggested change
- parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
+ parser.add_argument("-E", "--extract_vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")

Comment on lines +157 to +167
if not hasattr(tokenizer, "get_vocabulary"):
    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

vocab = tokenizer.get_vocabulary()
# Ensure string representations for all tokens
vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
vocab_strings.sort(key=lambda token: (-len(token), token))

vocab_filename = f"{args.method}_vocab.json"
with open(vocab_filename, 'w', encoding='utf-8') as f:
    json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
Copilot AI Nov 15, 2025

The SineWaveTokenizer class doesn't inherit from the Tokenizer base class and therefore doesn't have a get_vocabulary method. When --extract-vocab is used with --method sinewave, the hasattr guard fails and the code raises a ValueError, so no vocabulary can be exported for that method.

Either SineWaveTokenizer should inherit from Tokenizer and implement get_vocabulary, or the vocabulary extraction logic in prepare.py should explicitly handle the sinewave case.

Suggested change
- if not hasattr(tokenizer, "get_vocabulary"):
-     raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
- vocab = tokenizer.get_vocabulary()
- # Ensure string representations for all tokens
- vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
- vocab_strings.sort(key=lambda token: (-len(token), token))
- vocab_filename = f"{args.method}_vocab.json"
- with open(vocab_filename, 'w', encoding='utf-8') as f:
-     json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
+ if args.method == "sinewave":
+     # SineWaveTokenizer does not have get_vocabulary; generate vocabulary as 0-255
+     vocab = [str(i) for i in range(256)]
+ else:
+     if not hasattr(tokenizer, "get_vocabulary"):
+         raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
+     vocab = tokenizer.get_vocabulary()
+     # Ensure string representations for all tokens
+     vocab = [token if isinstance(token, str) else str(token) for token in vocab]
+ vocab.sort(key=lambda token: (-len(token), token))
+ vocab_filename = f"{args.method}_vocab.json"
+ with open(vocab_filename, 'w', encoding='utf-8') as f:
+     json.dump(vocab, f, ensure_ascii=False, indent=2)

    return bytes(ids).decode('utf-8', errors='replace')

def get_vocabulary(self):
    return [chr(i) for i in range(256)]
Copilot AI Nov 15, 2025

[nitpick] The ByteTokenizer.get_vocabulary() method returns characters using chr(i) for byte values 0-255. However, values 0-31 and 127 are non-printable ASCII control characters, chr() on 128-159 yields non-printable C1 control codepoints, and chr() on 160-255 yields Latin-1 codepoints rather than the raw bytes themselves, which can be misleading.

This could lead to issues when the vocabulary is serialized to JSON. Consider using a more appropriate representation for non-printable bytes, such as their hexadecimal or escaped form:

def get_vocabulary(self):
    vocab = []
    for i in range(256):
        if 32 <= i < 127:  # Printable ASCII range
            vocab.append(chr(i))
        else:
            vocab.append(f"\\x{i:02x}")
    return vocab
Suggested change
- return [chr(i) for i in range(256)]
+ vocab = []
+ for i in range(256):
+     if 32 <= i < 127:  # Printable ASCII range
+         vocab.append(chr(i))
+     else:
+         vocab.append(f"\\x{i:02x}")
+ return vocab
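
A quick standalone illustration of the concern (not the PR's code): control characters survive JSON only as escape sequences, and chr() on high values yields Unicode codepoints rather than raw bytes:

import json

print(json.dumps(chr(0)))    # "\u0000": legal JSON, but unreadable in a vocab file
print(json.dumps("\\x00"))   # "\x00": the explicit hex form stays readable
print(chr(200))              # 'È': a Unicode codepoint, not the raw byte 0xC8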

Comment on lines +231 to +250
vocab = []
seen = set()
for token_id in range(self.enc.n_vocab):
    token_bytes = self.enc.decode_single_token_bytes(token_id)
    token_str = token_bytes.decode('utf-8', errors='replace')
    if token_str not in seen:
        seen.add(token_str)
        vocab.append(token_str)

# Include known special tokens (base and additional)
special_tokens = {}
if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
    special_tokens.update(self.enc._special_tokens)
special_tokens.update(self.special_tokens)

for token in special_tokens.keys():
    if token not in seen:
        seen.add(token)
        vocab.append(token)

Copilot AI Nov 15, 2025

[nitpick] The deduplication logic using seen set may hide issues where multiple token IDs decode to the same string (due to errors='replace'). This could result in a vocabulary that doesn't accurately represent all token IDs in the tokenizer.

Additionally, this means the returned vocabulary list length may not match self.enc.n_vocab, which could be confusing for users who expect a 1:1 mapping. Consider documenting this behavior in the method's docstring or removing the deduplication to preserve all token representations.
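
A standalone demonstration of how errors='replace' can collapse distinct tokens:

# Two different invalid byte sequences decode to the same replacement character,
# so a seen-set would silently drop one of them.
print(b"\xff".decode("utf-8", errors="replace"))  # '�'
print(b"\xfe".decode("utf-8", errors="replace"))  # '�' (identical string)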

Suggested change
- vocab = []
- seen = set()
- for token_id in range(self.enc.n_vocab):
-     token_bytes = self.enc.decode_single_token_bytes(token_id)
-     token_str = token_bytes.decode('utf-8', errors='replace')
-     if token_str not in seen:
-         seen.add(token_str)
-         vocab.append(token_str)
- # Include known special tokens (base and additional)
- special_tokens = {}
- if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
-     special_tokens.update(self.enc._special_tokens)
- special_tokens.update(self.special_tokens)
- for token in special_tokens.keys():
-     if token not in seen:
-         seen.add(token)
-         vocab.append(token)
+ """
+ Returns the vocabulary as a list of strings, where each entry corresponds to the decoded string for each token ID.
+ The length of the returned list matches self.enc.n_vocab, so vocab[i] is the string for token ID i.
+ Special tokens are not included in this list.
+ """
+ vocab = []
+ for token_id in range(self.enc.n_vocab):
+     token_bytes = self.enc.decode_single_token_bytes(token_id)
+     token_str = token_bytes.decode('utf-8', errors='replace')
+     vocab.append(token_str)

vocab = tokenizer.get_vocabulary()
# Ensure string representations for all tokens
vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
vocab_strings.sort(key=lambda token: (-len(token), token))
Copilot AI Nov 15, 2025

[nitpick] The sorting strategy sort(key=lambda token: (-len(token), token)) sorts by descending length first, then alphabetically. While this may be intentional for some use cases, the choice of sorting order is not documented.

For typical vocabulary files, users might expect alphabetical sorting or sorting by token ID order. Consider documenting why this particular sorting order was chosen, or making it configurable via a command-line option.
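
For reference, the current key sorts longest-first, breaking ties alphabetically:

tokens = ["b", "abc", "a", "ab"]
tokens.sort(key=lambda t: (-len(t), t))
print(tokens)  # ['abc', 'ab', 'a', 'b']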

Comment on lines +156 to +160
if args.extract_vocab:
    if not hasattr(tokenizer, "get_vocabulary"):
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

    vocab = tokenizer.get_vocabulary()
Copilot AI Nov 15, 2025

[nitpick] Vocabulary extraction occurs after full tokenization of the training and validation data (lines 121-154), which is inefficient if users only want to inspect the vocabulary without tokenizing their entire dataset.

For large datasets, this could result in unnecessary processing time. Consider allowing vocabulary extraction to occur before tokenization when the --extract-vocab flag is set and --skip_tokenization is specified, or provide a dedicated vocabulary-only mode that skips data tokenization entirely.
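
One possible shape for such an early exit (a sketch only; the helper name and its placement before the tokenization loop are assumptions, while the JSON logic mirrors the code under review):

import json
import sys

def maybe_export_vocab_and_exit(tokenizer, args):
    """Hypothetical early path: export the vocabulary and stop before any data is tokenized."""
    if not (args.extract_vocab and args.skip_tokenization):
        return
    vocab = [t if isinstance(t, str) else str(t) for t in tokenizer.get_vocabulary()]
    vocab.sort(key=lambda token: (-len(token), token))
    with open(f"{args.method}_vocab.json", "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    sys.exit(0)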

Comment on lines +157 to +160
if not hasattr(tokenizer, "get_vocabulary"):
    raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

vocab = tokenizer.get_vocabulary()
Copilot AI Nov 15, 2025

The hasattr check will always return True for any subclass of Tokenizer, because get_vocabulary is defined on the base class (it raises NotImplementedError rather than being absent). This means the error will never be raised for Tokenizer subclasses that don't support vocabulary extraction.

Consider checking if the method raises NotImplementedError instead, or check for specific tokenizer classes. For example:

if args.extract_vocab:
    try:
        vocab = tokenizer.get_vocabulary()
    except NotImplementedError:
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
Suggested change
- if not hasattr(tokenizer, "get_vocabulary"):
-     raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
- vocab = tokenizer.get_vocabulary()
+ try:
+     vocab = tokenizer.get_vocabulary()
+ except NotImplementedError:
+     raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
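
A standalone illustration of why the hasattr guard cannot fire for Tokenizer subclasses:

class Tokenizer:
    def get_vocabulary(self):
        raise NotImplementedError

class OpaqueTokenizer(Tokenizer):
    pass  # never overrides get_vocabulary

# The attribute is inherited from the base class, so hasattr is always True.
print(hasattr(OpaqueTokenizer(), "get_vocabulary"))  # True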
