The submission size calculation doesn't count persisting the tokenizer itself - is this right? Two challenges to this: first, the size currently includes everything necessary (modulo generic Python requirements) to fully specify and run inference with the model, so it seems a shame to lose this property by omitting the tokenizer. Second, not counting it gives large-vocabulary models an artificial advantage, since they get the strings "for free".
(However, I appreciate this may be a pragmatic decision: although the fineweb_1024_bpe.vocab file is small, fineweb_1024_bpe.model is large - e.g. 150 kB compressed - and is presumably larger than it needs to be. I presume the model isn't uniquely reconstructible from the vocab(?))
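For concreteness, here is a minimal sketch of how one might measure what the tokenizer artifacts would add to the submission size if they were counted. The file names are taken from the issue for illustration; the actual submission-size calculation and compression scheme used by the repo are assumptions here, and gzip is just a stand-in proxy.

```python
import gzip
import os


def compressed_size(path: str) -> int:
    """Gzip-compressed size of a file in bytes - a rough proxy for the
    marginal cost of counting a tokenizer artifact in the submission size."""
    with open(path, "rb") as f:
        return len(gzip.compress(f.read(), compresslevel=9))


if __name__ == "__main__":
    # Hypothetical: sum the tokenizer files alongside the model weights.
    for name in ("fineweb_1024_bpe.vocab", "fineweb_1024_bpe.model"):
        if os.path.exists(name):
            print(f"{name}: {compressed_size(name)} bytes compressed")
```

A text .vocab file (short, repetitive lines) typically compresses far better than an opaque binary .model blob, which is consistent with the asymmetry noted above.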