[Question] Tokenizer is not counted in submission size #43

@DouglasOrr

Description

The submission size calculation doesn't count the persisted tokenizer itself - is this intentional? Two challenges to this: first, the submission size should include everything necessary (modulo generic Python requirements) to fully specify the model and run inference with it, so it seems a shame to lose this property by omitting the tokenizer. Second, not counting it gives large-vocabulary models an artificial advantage, since they get the vocabulary strings "for free".

(However, I appreciate this may be a pragmatic decision: although the fineweb_1024_bpe.vocab file is small, fineweb_1024_bpe.model is large, e.g. ~150 kB compressed, and is presumably larger than it needs to be. I presume the model isn't uniquely reconstructable from the vocab(?))
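For concreteness, one way to estimate what counting the tokenizer would add is to measure the compressed size of its artifacts directly. A minimal sketch (the helper below is hypothetical, not part of the repo; the fineweb_1024_bpe.* filenames are the ones mentioned above):

```python
import gzip
from pathlib import Path


def compressed_size(path: str, level: int = 9) -> int:
    """Gzip-compressed size of a file in bytes - a rough proxy for
    what the file would contribute to a compressed submission."""
    return len(gzip.compress(Path(path).read_bytes(), compresslevel=level))


# Hypothetical usage against the tokenizer artifacts discussed above:
# for name in ("fineweb_1024_bpe.vocab", "fineweb_1024_bpe.model"):
#     print(name, compressed_size(name), "bytes compressed")
```

This would make it easy to compare the .vocab and .model files and see how much of the ~150 kB is recoverable overhead versus essential vocabulary data.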
