Allow for tokenizers/preprocessors to change batch size #866
Aphoh wants to merge 2 commits into marin-community:main
Conversation
dlwh
thanks for this. I think it's not quite right in the case of preemption. In particular, we use open_shard_at_row with this value when we resume tokenization. What we would need to do is save both "rows_in" and "rows_out" and only use rows_in for open_shard_at_row. Does that make sense?
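A minimal sketch of the bookkeeping being suggested, assuming a dataclass-style ledger; `ShardLedgerSketch` and `resume_shard` are hypothetical names, and only `open_shard_at_row` comes from the comment above:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ShardLedgerSketch:
    # Rows *read* from each input shard ("rows_in"); resumption needs
    # this value, since open_shard_at_row seeks within the input data.
    shard_rows: Dict[str, int] = field(default_factory=dict)
    # Rows *written* to the cache per shard ("rows_out"); can differ
    # from shard_rows when the processor merges or splits examples.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)

def resume_shard(source, shard_name: str, ledger: ShardLedgerSketch):
    # After preemption, seek by rows consumed from the input, never by
    # rows produced, because the two counts can diverge.
    row = ledger.shard_rows.get(shard_name, 0)
    return source.open_shard_at_row(shard_name, row)
```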
Aphoh
@dlwh ah yeah totally! I'll try to get that worked in.
dlwh
left a comment
sorry, one small rename then i'm happy
```python
total_num_rows: int
shard_rows: Dict[str, int]
"""Number of outputted rows in the cache"""
shard_rows_in: Dict[str, int]
```
sorry, can we rename this one back to just shard_rows (leave comment) so that we don't invalidate all other caches
Aphoh
Ah gotcha yeah... Do I need to do some other logic to make shard_rows_out an optional key during deserialization?
dlwh
i'm pretty sure if you give it a default value it will be fine. I'll check it against a cache before merging
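With a plain dataclass that round-trips through dicts, the default-value approach looks roughly like this (a sketch; `CacheLedgerSketch` is a hypothetical stand-in for the real ledger class, whose serialization logic may differ):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CacheLedgerSketch:
    total_num_rows: int = 0
    shard_rows: Dict[str, int] = field(default_factory=dict)
    # New field with a default, so ledgers serialized before this change
    # deserialize cleanly instead of failing on the missing key.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)

# An old ledger dict without "shard_rows_out" still loads:
old = {"total_num_rows": 100, "shard_rows": {"shard-0": 100}}
ledger = CacheLedgerSketch(**old)
assert ledger.shard_rows_out == {}
```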
Small change to how the counting for the number of batches is done. Instead of assuming $n$ examples from the shard iterator means $n$ examples after tokenizing/processing, it gets the length from the output of the tokenizer. I had a use case that involved combining multiple samples during the processing stage, and ran into a bug where the number of batches stored in the offsets was incorrect.
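In spirit the change is just this; all names here are illustrative, not the PR's actual code:

```python
def process_and_count(processor, input_batch):
    """Hypothetical helper returning the processed batch plus row counts."""
    output = processor(input_batch)
    # The old logic assumed len(output) == len(input_batch); counting the
    # processor's actual output handles processors that merge or split examples.
    rows_in = len(input_batch)
    rows_out = len(output)
    return output, rows_in, rows_out
```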