@GeorgiosSmyrnis
Collaborator

This enables mixing of pretokenized data with the tokenize_shuffle.py script. This is enabled by the --pretok_tars flag, which assumes that the tarfiles the script reads contain already tokenized data.

@GeorgiosSmyrnis
Collaborator Author

This now also fixes a rare issue where the dataset produced by tokenize_shuffle.py becomes broken due to duplicate file names within the tarfiles. This could only happen when the same sequence of tokens was tokenized more than once. To avoid it, files within a tarfile are now named with a simple scheme of the format {shard_index}_{iterator}.
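A minimal sketch of the naming idea, not the actual code from this PR: member names depend only on the shard index and a running counter, so two samples with identical tokens can no longer collide. The helpers `member_name`, `write_shard`, and `list_names`, the JSON payload, and the `.json` extension are all assumptions made for illustration.

```python
import io
import json
import tarfile


def member_name(shard_index: int, iterator: int, ext: str = "json") -> str:
    # Hypothetical helper: the name depends only on position, never on content.
    # The extension is an assumption; the PR only states {shard_index}_{iterator}.
    return f"{shard_index}_{iterator}.{ext}"


def write_shard(shard_index: int, samples: list[list[int]]) -> bytes:
    # Write every sample into an in-memory tar, one member per sample.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for iterator, tokens in enumerate(samples):
            payload = json.dumps(tokens).encode("utf-8")
            info = tarfile.TarInfo(name=member_name(shard_index, iterator))
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(shard_bytes: bytes) -> list[str]:
    # Read the member names back in the order they were written.
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        return tar.getnames()


# Two identical token sequences still get distinct member names.
samples = [[1, 2, 3], [1, 2, 3], [4, 5]]
names = list_names(write_shard(7, samples))
```

With this scheme, uniqueness within a shard follows from the counter alone, regardless of what the samples contain.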

@GeorgiosSmyrnis force-pushed the gsmyrnis/shuffle_pretokenized branch from 0f77f45 to c5c2b1b on March 10, 2024 19:39