Allow mixing for pretokenized data. #230

GeorgiosSmyrnis · 2024-03-08T20:31:57Z

This enables mixing of pretokenized data with the tokenize_shuffle.py script. Mixing is enabled by the --pretok_tars flag, which tells the script to assume that the tarfiles it receives already contain tokenized data.
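For reviewers, here is a rough sketch of the behavior the flag is meant to select. It is not the code in this PR: the boolean form of `--pretok_tars`, the JSON encoding of the stored tokens, and the helper names `build_parser` and `sample_to_tokens` are all assumptions made for illustration.

```python
import argparse
import json


def build_parser():
    """Hypothetical parser; only the flag name comes from this PR."""
    parser = argparse.ArgumentParser()
    # Assumption: the flag is a plain boolean switch.
    parser.add_argument("--pretok_tars", action="store_true")
    return parser


def sample_to_tokens(payload, pretok_tars, tokenize):
    """Return a token list for one sample read from a tarfile.

    Assumption: pretokenized samples are stored as a JSON list of ints.
    """
    if pretok_tars:
        # Already tokenized: decode and pass through without re-tokenizing.
        return json.loads(payload.decode("utf-8"))
    # Raw text: run the caller-supplied tokenizer.
    return tokenize(payload.decode("utf-8"))
```

With the flag set, a payload such as `b"[1, 2, 3]"` is decoded directly. Without it, the text goes through whichever tokenizer the caller supplies.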

GeorgiosSmyrnis · 2024-03-10T19:15:14Z

This now also fixes a rare issue where the dataset produced by tokenize shuffle could be broken by duplicate file names within the tarfiles. This could only happen when the same sequence of tokens was tokenized more than once. The naming scheme within a tarfile is now a simple name of the format {shard_index}_{iterator}, which is independent of the sequence content.
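To illustrate why the new scheme avoids collisions, here is a minimal sketch. It assumes the earlier names were derived from the token content (a hash is used here as a stand-in), which the description above implies but does not state. The `.json` extension and the helper names are also hypothetical.

```python
import hashlib
import io
import json
import tarfile


def content_key(tokens):
    """Hypothetical content-derived name: identical sequences collide."""
    return hashlib.sha256(json.dumps(tokens).encode("utf-8")).hexdigest()


def index_key(shard_index, iterator):
    """Name in the {shard_index}_{iterator} format."""
    return f"{shard_index}_{iterator}"


def write_shard(sequences, shard_index, use_index_names=True):
    """Write sequences to an in-memory tar and return its member names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for iterator, tokens in enumerate(sequences):
            if use_index_names:
                key = index_key(shard_index, iterator)
            else:
                key = content_key(tokens)
            payload = json.dumps(tokens).encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    # Re-open the archive to list the names exactly as a reader would see them.
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r") as tar:
        return tar.getnames()
```

Writing the same sequence twice with content-derived names yields two members sharing one name, whereas index-based names stay unique within the shard.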

GeorgiosSmyrnis requested review from Vaishaal and jeffreywpli March 8, 2024 20:31

GeorgiosSmyrnis self-assigned this Mar 8, 2024

GeorgiosSmyrnis added 7 commits March 10, 2024 14:39

Add support for pretokenized tars.

85a718f

Formatting.

96a9748

Add automated test.

69bbf4b

Debugging attempts.

e60874e

Debugging.

d200b85

Formatting.

a8282c8

Fix duplicate file names.

c5c2b1b

GeorgiosSmyrnis force-pushed the gsmyrnis/shuffle_pretokenized branch from 0f77f45 to c5c2b1b Compare March 10, 2024 19:39
