data/template/prepare.py (15 additions, 0 deletions)
@@ -64,6 +64,7 @@ def parse_arguments():

    # Additional options
    parser.add_argument("-T", "--track_token_counts", action="store_true", help="Track how often each token appears and store in meta.pkl")
    parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")

Copilot AI Nov 15, 2025

[nitpick] The new flag uses a hyphen (--extract-vocab), which is inconsistent with the other flags in this file. argparse's automatic dash-to-underscore conversion still exposes it as args.extract_vocab, but for consistency with the existing arguments (e.g., --track_token_counts), consider using an underscore in the flag name: --extract_vocab.

Suggested change
-    parser.add_argument("-E", "--extract-vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")
+    parser.add_argument("-E", "--extract_vocab", action="store_true", help="Export the tokenizer vocabulary to a JSON file")


    return parser.parse_args()

@@ -152,6 +153,20 @@ def main():
    if val_ids is not None:
        save_tokens(val_ids, args.val_output, dtype)

    if args.extract_vocab:
        if not hasattr(tokenizer, "get_vocabulary"):
            raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

        vocab = tokenizer.get_vocabulary()
Comment on lines +156 to +160

Copilot AI Nov 15, 2025

[nitpick] Vocabulary extraction occurs after full tokenization of the training and validation data (lines 121-154), which is inefficient if users only want to inspect the vocabulary without tokenizing their entire dataset.

For large datasets, this could result in unnecessary processing time. Consider allowing vocabulary extraction to occur before tokenization when the --extract-vocab flag is set and --skip_tokenization is specified, or provide a dedicated vocabulary-only mode that skips data tokenization entirely.
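
For illustration, a minimal sketch of such a vocabulary-only path. The export helper mirrors the code added in this PR; build_tokenizer is a hypothetical stand-in for however prepare.py actually constructs the tokenizer, and skip_tokenization is the flag named above:

import json

def export_vocabulary(tokenizer, method):
    """Dump the tokenizer vocabulary to <method>_vocab.json and return the path."""
    vocab = [t if isinstance(t, str) else str(t) for t in tokenizer.get_vocabulary()]
    vocab.sort(key=lambda token: (-len(token), token))
    path = f"{method}_vocab.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    return path

# In main(), the export could then run before the expensive tokenization pass:
#
#     tokenizer = build_tokenizer(args)             # hypothetical helper
#     if args.extract_vocab:
#         export_vocabulary(tokenizer, args.method)
#         if args.skip_tokenization:                # skip the data pass entirely
#             return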

Comment on lines +157 to +160

Copilot AI Nov 15, 2025

The hasattr check will always return True for any tokenizer that inherits from the base Tokenizer class, because get_vocabulary is defined there (it raises NotImplementedError). This means the error will never be raised for Tokenizer subclasses that don't support vocabulary extraction.

Consider checking if the method raises NotImplementedError instead, or check for specific tokenizer classes. For example:

if args.extract_vocab:
    try:
        vocab = tokenizer.get_vocabulary()
    except NotImplementedError:
        raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
Suggested change
-        if not hasattr(tokenizer, "get_vocabulary"):
-            raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
-        vocab = tokenizer.get_vocabulary()
+        try:
+            vocab = tokenizer.get_vocabulary()
+        except NotImplementedError:
+            raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")

        # Ensure string representations for all tokens
        vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
        vocab_strings.sort(key=lambda token: (-len(token), token))

Copilot AI Nov 15, 2025

[nitpick] The sorting strategy sort(key=lambda token: (-len(token), token)) sorts by descending length first, then alphabetically. While this may be intentional for some use cases, the choice of sorting order is not documented.

For typical vocabulary files, users might expect alphabetical sorting or sorting by token ID order. Consider documenting why this particular sorting order was chosen, or making it configurable via a command-line option.
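
As a sketch only, the configurable option could look like the following; the --vocab_sort flag and the stand-in vocabulary are assumptions, not part of this PR:

import argparse

# Hypothetical --vocab_sort flag; "length" reproduces the PR's current behaviour.
parser = argparse.ArgumentParser()
parser.add_argument("--vocab_sort", choices=["length", "alpha", "id"], default="length",
                    help="Ordering of tokens in the exported vocabulary file")
args = parser.parse_args(["--vocab_sort", "alpha"])  # example invocation

vocab_strings = ["<eos>", "the", "tokenization", "a"]  # stand-in vocabulary
if args.vocab_sort == "length":
    vocab_strings.sort(key=lambda token: (-len(token), token))  # longest first, then alphabetical
elif args.vocab_sort == "alpha":
    vocab_strings.sort()
# "id" keeps the list in token-ID order, i.e. no sort at all.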


        vocab_filename = f"{args.method}_vocab.json"
        with open(vocab_filename, 'w', encoding='utf-8') as f:
            json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
Comment on lines +157 to +167

Copilot AI Nov 15, 2025

The SineWaveTokenizer class doesn't inherit from the Tokenizer base class and doesn't define a get_vocabulary method. When --extract-vocab is used with --method sinewave, the hasattr check therefore fails and the ValueError is raised, so no vocabulary can be exported for this method (and if the check is replaced by a bare call, it would fail with an AttributeError instead).

Either SineWaveTokenizer should inherit from Tokenizer and implement get_vocabulary, or the vocabulary extraction logic in prepare.py should explicitly handle the sinewave case.

Suggested change
-        if not hasattr(tokenizer, "get_vocabulary"):
-            raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
-        vocab = tokenizer.get_vocabulary()
-        # Ensure string representations for all tokens
-        vocab_strings = [token if isinstance(token, str) else str(token) for token in vocab]
-        vocab_strings.sort(key=lambda token: (-len(token), token))
-        vocab_filename = f"{args.method}_vocab.json"
-        with open(vocab_filename, 'w', encoding='utf-8') as f:
-            json.dump(vocab_strings, f, ensure_ascii=False, indent=2)
+        if args.method == "sinewave":
+            # SineWaveTokenizer does not have get_vocabulary; generate vocabulary as 0-255
+            vocab = [str(i) for i in range(256)]
+        else:
+            if not hasattr(tokenizer, "get_vocabulary"):
+                raise ValueError(f"Vocabulary extraction is not supported for tokenizer method '{args.method}'.")
+            vocab = tokenizer.get_vocabulary()
+        # Ensure string representations for all tokens
+        vocab = [token if isinstance(token, str) else str(token) for token in vocab]
+        vocab.sort(key=lambda token: (-len(token), token))
+        vocab_filename = f"{args.method}_vocab.json"
+        with open(vocab_filename, 'w', encoding='utf-8') as f:
+            json.dump(vocab, f, ensure_ascii=False, indent=2)

        print(f"Saved vocabulary to {vocab_filename}")

    if args.method == "sinewave":
        meta = {
            "tokenizer": "sinewave",
data/template/tokenizers.py (59 additions, 0 deletions)
@@ -35,6 +35,10 @@ def finalize_meta(self, meta):
        meta["token_counts"] = dict(self.token_counts)
        self.save_meta(meta)

    def get_vocabulary(self):
        """Return the list of string representations that make up the tokenizer's vocabulary."""
        raise NotImplementedError("Vocabulary extraction is not implemented for this tokenizer.")

Comment on lines +38 to +41

Copilot AI Nov 15, 2025

The new get_vocabulary() method lacks test coverage. Since the repository has comprehensive tests for other tokenizer methods, consider adding tests (a sketch follows after this list) to verify that:

  1. get_vocabulary() returns the expected format for each tokenizer type
  2. The vocabulary can be successfully serialized to JSON
  3. Special cases (empty vocabularies, special tokens, byte representations) are handled correctly
  4. The vocabulary extraction flag in prepare.py works end-to-end
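
A minimal pytest sketch of the first two points, using ByteTokenizer because its 256-entry vocabulary is fully determined by this diff; the import path and the make_args() namespace are assumptions about how the repository's tests construct tokenizers:

# test_vocab_extraction.py -- illustrative only
import json
from types import SimpleNamespace

from tokenizers import ByteTokenizer  # module name assumed from data/template/tokenizers.py


def make_args(**overrides):
    """Hypothetical stand-in for the argparse namespace prepare.py passes to tokenizers."""
    return SimpleNamespace(track_token_counts=False, **overrides)


def test_byte_vocabulary_format_and_json_round_trip():
    tok = ByteTokenizer(make_args())
    vocab = tok.get_vocabulary()

    # 1. expected format: one string entry per byte value
    assert len(vocab) == 256
    assert all(isinstance(token, str) for token in vocab)

    # 2. serializes to JSON and back without loss
    assert json.loads(json.dumps(vocab, ensure_ascii=False)) == vocab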

    @staticmethod
    def get_key_from_meta(keyname):
        meta_path = 'meta.pkl'
@@ -117,6 +121,11 @@ def detokenize(self, ids):
            raise ValueError("SentencePiece model is not loaded.")
        return self.sp.decode_ids(ids)

    def get_vocabulary(self):
        if not self.sp:
            raise ValueError("SentencePiece model is not loaded.")
        return [self.sp.id_to_piece(i) for i in range(self.sp.GetPieceSize())]

class TiktokenTokenizer(Tokenizer):
    def __init__(self, args):
        super().__init__(args)
@@ -218,6 +227,29 @@ def detokenize(self, token_ids):

        return ''.join(result)

    def get_vocabulary(self):
        vocab = []
        seen = set()
        for token_id in range(self.enc.n_vocab):
            token_bytes = self.enc.decode_single_token_bytes(token_id)
            token_str = token_bytes.decode('utf-8', errors='replace')
            if token_str not in seen:
                seen.add(token_str)
                vocab.append(token_str)

        # Include known special tokens (base and additional)
        special_tokens = {}
        if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
            special_tokens.update(self.enc._special_tokens)
        special_tokens.update(self.special_tokens)

        for token in special_tokens.keys():
            if token not in seen:
                seen.add(token)
                vocab.append(token)

Comment on lines +231 to +250

Copilot AI Nov 15, 2025

[nitpick] The deduplication logic using seen set may hide issues where multiple token IDs decode to the same string (due to errors='replace'). This could result in a vocabulary that doesn't accurately represent all token IDs in the tokenizer.

Additionally, this means the returned vocabulary list length may not match self.enc.n_vocab, which could be confusing for users who expect a 1:1 mapping. Consider documenting this behavior in the method's docstring or removing the deduplication to preserve all token representations.

Suggested change
-        vocab = []
-        seen = set()
-        for token_id in range(self.enc.n_vocab):
-            token_bytes = self.enc.decode_single_token_bytes(token_id)
-            token_str = token_bytes.decode('utf-8', errors='replace')
-            if token_str not in seen:
-                seen.add(token_str)
-                vocab.append(token_str)
-        # Include known special tokens (base and additional)
-        special_tokens = {}
-        if hasattr(self.enc, "_special_tokens") and isinstance(self.enc._special_tokens, dict):
-            special_tokens.update(self.enc._special_tokens)
-        special_tokens.update(self.special_tokens)
-        for token in special_tokens.keys():
-            if token not in seen:
-                seen.add(token)
-                vocab.append(token)
+        """
+        Returns the vocabulary as a list of strings, where each entry corresponds to the decoded string for each token ID.
+        The length of the returned list matches self.enc.n_vocab, so vocab[i] is the string for token ID i.
+        Special tokens are not included in this list.
+        """
+        vocab = []
+        for token_id in range(self.enc.n_vocab):
+            token_bytes = self.enc.decode_single_token_bytes(token_id)
+            token_str = token_bytes.decode('utf-8', errors='replace')
+            vocab.append(token_str)

        return vocab


class CustomTokenizer(Tokenizer):
    def __init__(self, args):
@@ -261,6 +293,9 @@ def tokenize(self, data):
    def detokenize(self, ids):
        return ''.join([self.itos[id] for id in ids])

    def get_vocabulary(self):
        return list(self.tokens)

class ByteTokenizer(Tokenizer):
    def __init__(self, args):
        super().__init__(args)
@@ -281,6 +316,9 @@ def tokenize(self, data):
    def detokenize(self, ids):
        return bytes(ids).decode('utf-8', errors='replace')

    def get_vocabulary(self):
        return [chr(i) for i in range(256)]

Copilot AI Nov 15, 2025

[nitpick] The ByteTokenizer.get_vocabulary() method returns chr(i) for byte values 0-255. Values 0-31 and 127 are non-printable ASCII control characters, 128-159 map to equally non-printable C1 control characters, and 160-255 map to Latin-1 characters rather than to anything that obviously represents the raw byte.

This makes the serialized JSON awkward to read, since the control characters end up as escape sequences. Consider using a more explicit representation for non-printable bytes, such as their hexadecimal escaped form:

def get_vocabulary(self):
    vocab = []
    for i in range(256):
        if 32 <= i < 127:  # Printable ASCII range
            vocab.append(chr(i))
        else:
            vocab.append(f"\\x{i:02x}")
    return vocab
Suggested change
-        return [chr(i) for i in range(256)]
+        vocab = []
+        for i in range(256):
+            if 32 <= i < 127:  # Printable ASCII range
+                vocab.append(chr(i))
+            else:
+                vocab.append(f"\\x{i:02x}")
+        return vocab



class CharTokenizer(Tokenizer):
    def __init__(self, args, train_data, val_data):
@@ -315,6 +353,9 @@ def tokenize(self, data):
    def detokenize(self, ids):
        return ''.join([self.itos[id] for id in ids])

    def get_vocabulary(self):
        return list(self.chars)

class CustomCharTokenizerWithByteFallback(Tokenizer):
    """
    In this version, we assign IDs 0..255 to raw bytes,
@@ -445,6 +486,15 @@ def detokenize(self, ids):

        return ''.join(out_pieces)

    def get_vocabulary(self):
        vocab = []
        for token in self.itos.values():
            if isinstance(token, bytes):
                vocab.append(token.decode('utf-8', errors='replace'))
            else:
                vocab.append(token)
        return vocab
Comment on lines +489 to +496

Copilot AI Nov 15, 2025

The vocabulary is built by iterating over self.itos.values(), which yields values in insertion order (guaranteed since Python 3.7) rather than explicitly in token-ID order. Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to make the vocabulary order consistent and predictable:

def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.


class JsonByteTokenizerWithByteFallback(Tokenizer):
    """
    Similar to CustomCharTokenizerWithByteFallback, but loads tokens from a JSON array.
@@ -577,6 +627,15 @@ def detokenize(self, ids):

        return ''.join(out_pieces)

    def get_vocabulary(self):
        vocab = []
        for token in self.itos.values():
            if isinstance(token, bytes):
                vocab.append(token.decode('utf-8', errors='replace'))
            else:
                vocab.append(token)
        return vocab
Comment on lines +630 to +637

Copilot AI Nov 15, 2025

The vocabulary is built by iterating over self.itos.values(), which yields values in insertion order (guaranteed since Python 3.7) rather than explicitly in token-ID order. Since itos is a mapping from token IDs to tokens, consider iterating over sorted token IDs to make the vocabulary order consistent and predictable:

def get_vocabulary(self):
    vocab = []
    for token_id in sorted(self.itos.keys()):
        token = self.itos[token_id]
        if isinstance(token, bytes):
            vocab.append(token.decode('utf-8', errors='replace'))
        else:
            vocab.append(token)
    return vocab

This ensures that the vocabulary reflects the actual token ID ordering, which is more useful for debugging and understanding the tokenizer.



class SineWaveTokenizer:
    """Generate a deterministic sequence of sine wave samples."""