Allow for tokenizers/preprocessors to change batch size #866
Aphoh wants to merge 2 commits into marin-community:main
Conversation
dlwh
thanks for this. I think it's not quite right in the case of preemption. In particular, we use open_shard_at_row with this value when we resume tokenization. What we would need to do is save both "rows_in" and "rows_out" and only use rows_in for open_shard_at_row. Does that make sense?
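A minimal sketch of the bookkeeping being suggested, assuming a dataclass-style ledger; `ShardLedgerSketch` and `resume_shard` are hypothetical names, and only `open_shard_at_row` comes from the comment above:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ShardLedgerSketch:
    # Rows *read* from each input shard ("rows_in"); resumption needs
    # this value, since open_shard_at_row seeks within the input data.
    shard_rows: Dict[str, int] = field(default_factory=dict)
    # Rows *written* to the cache per shard ("rows_out"); can differ
    # from shard_rows when the processor merges or splits examples.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)

def resume_shard(source, shard_name: str, ledger: ShardLedgerSketch):
    # After preemption, seek by rows consumed from the input, never by
    # rows produced, because the two counts can diverge.
    row = ledger.shard_rows.get(shard_name, 0)
    return source.open_shard_at_row(shard_name, row)
```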
Aphoh
@dlwh ah yeah totally! I'll try to get that worked in.
dlwh
left a comment
sorry, one small rename then i'm happy
```python
total_num_rows: int
shard_rows: Dict[str, int]
"""Number of outputted rows in the cache"""
shard_rows_in: Dict[str, int]
```
sorry, can we rename this one back to just shard_rows (leave comment) so that we don't invalidate all other caches
Aphoh
Ah gotcha yeah... Do I need to do some other logic to make shard_rows_out an optional key during deserialization?
dlwh
i'm pretty sure if you give it a default value it will be fine. I'll check it against a cache before merging
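With a plain dataclass that round-trips through dicts, the default-value approach looks roughly like this (a sketch; `CacheLedgerSketch` is a hypothetical stand-in for the real ledger class, whose serialization logic may differ):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CacheLedgerSketch:
    total_num_rows: int = 0
    shard_rows: Dict[str, int] = field(default_factory=dict)
    # New field with a default, so ledgers serialized before this change
    # deserialize cleanly instead of failing on the missing key.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)

# An old ledger dict without "shard_rows_out" still loads:
old = {"total_num_rows": 100, "shard_rows": {"shard-0": 100}}
ledger = CacheLedgerSketch(**old)
assert ledger.shard_rows_out == {}
```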
Small change to how the counting for the number of batches is done. Instead of assuming $n$ examples from the shard iterator means $n$ examples after tokenizing/processing, it gets the length from the output of the tokenizer. I had a use case that involved combining multiple samples during the processing stage, and ran into a bug where the number of batches stored in the offsets was incorrect.
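In spirit the change is just this; all names here are illustrative, not the PR's actual code:

```python
def process_and_count(processor, input_batch):
    """Hypothetical helper returning the processed batch plus row counts."""
    output = processor(input_batch)
    # The old logic assumed len(output) == len(input_batch); counting the
    # processor's actual output handles processors that merge or split examples.
    rows_in = len(input_batch)
    rows_out = len(output)
    return output, rows_in, rows_out
```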