The submission size calculation doesn't count persisting the tokenizer itself - is this right? Two challenges to this: first, the size currently includes everything necessary (modulo generic Python requirements) to fully specify and run inference with the model, so it seems a shame to lose this property by omitting the tokenizer. Second, not counting it gives large-vocabulary models an artificial advantage, since they get the strings "for free".
(However, I appreciate this may be a pragmatic decision: although the fineweb_1024_bpe.vocab file is small, fineweb_1024_bpe.model is large - e.g. 150 kB compressed - and is presumably larger than it needs to be. I presume the model isn't uniquely reconstructible from the vocab(?))
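For concreteness, here is a minimal sketch of how one might measure what the tokenizer artifacts would add to the submission size if they were counted. The file names are taken from the issue for illustration; the actual submission-size calculation and compression scheme used by the repo are assumptions here, and gzip is just a stand-in proxy.

```python
import gzip
import os


def compressed_size(path: str) -> int:
    """Gzip-compressed size of a file in bytes - a rough proxy for the
    marginal cost of counting a tokenizer artifact in the submission size."""
    with open(path, "rb") as f:
        return len(gzip.compress(f.read(), compresslevel=9))


if __name__ == "__main__":
    # Hypothetical: sum the tokenizer files alongside the model weights.
    for name in ("fineweb_1024_bpe.vocab", "fineweb_1024_bpe.model"):
        if os.path.exists(name):
            print(f"{name}: {compressed_size(name)} bytes compressed")
```

A text .vocab file (short, repetitive lines) typically compresses far better than an opaque binary .model blob, which is consistent with the asymmetry noted above.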