kefranabg
Contributor

Context:

Many DeepSeek models use an unconventional number of digits for the total shard count in their filenames, which causes filename parsing to fail. This breaks file info navigation on the Hub:

Kapture.2025-10-21.at.11.35.17.mp4

Here, for instance, model-00001-of-000163.safetensors should be model-00001-of-00163.safetensors (there is one extra "0" in the total shard count, "000163").

Most DeepSeek models with a lot of shards have this issue. Since they are important models, I suggest we make the regex more flexible so that it accepts a variable number of digits for the shard numbers in the filename.
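To make the failure concrete, here is a minimal sketch that runs the two patterns from this pull request's change against a conventional filename and the DeepSeek-style one. The constant names `STRICT` and `FLEXIBLE` and the `results` object are hypothetical and only used for illustration.

```typescript
// Pattern before the change: exactly 5 digits for both the shard and the total.
const STRICT =
  /^(?<prefix>(?<basePrefix>.*?)[_-])(?<shard>\d{5})-of-(?<total>\d{5})\.safetensors$/;
// Pattern after the change: any number of digits for both.
const FLEXIBLE =
  /^(?<prefix>(?<basePrefix>.*?)[_-])(?<shard>\d+)-of-(?<total>\d+)\.safetensors$/;

const conventional = "model-00001-of-00163.safetensors";
const deepseekStyle = "model-00001-of-000163.safetensors"; // 6-digit total

const results = {
  strictConventional: STRICT.test(conventional),
  strictDeepseek: STRICT.test(deepseekStyle), // false: the total has 6 digits
  flexibleConventional: FLEXIBLE.test(conventional),
  flexibleDeepseek: FLEXIBLE.test(deepseekStyle),
};

console.log(results);
```

Only the strict pattern applied to the 6-digit total fails to match; the other three combinations match.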

Would that be acceptable?

@kefranabg kefranabg requested a review from coyotte508 as a code owner October 21, 2025 09:45
@kefranabg kefranabg requested review from mishig25 and removed request for coyotte508 October 21, 2025 09:45
@kefranabg kefranabg force-pushed the improve-parse-safetensor-shard-name branch from 889215b to 5eec567 Compare October 21, 2025 09:49
export const RE_SAFETENSORS_INDEX_FILE = /\.safetensors\.index\.json$/;
export const RE_SAFETENSORS_SHARD_FILE =
-	/^(?<prefix>(?<basePrefix>.*?)[_-])(?<shard>\d{5})-of-(?<total>\d{5})\.safetensors$/;
+	/^(?<prefix>(?<basePrefix>.*?)[_-])(?<shard>\d+)-of-(?<total>\d+)\.safetensors$/;
Member

at least enforce the same number of digits 🙏 (not sure how to do this in the regex, maybe an OR of two patterns, one for 5 digits and one for 6 digits)
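One way to enforce equal widths without encoding each width in the regex is to match with `\d+` and compare the captured lengths afterwards. This is only a sketch of that idea, not what the pull request implements; `RE_SHARD`, `ShardInfo`, and `parseShardSameWidth` are hypothetical names.

```typescript
// Flexible pattern: any number of digits for shard and total.
const RE_SHARD =
  /^(?<prefix>(?<basePrefix>.*?)[_-])(?<shard>\d+)-of-(?<total>\d+)\.safetensors$/;

interface ShardInfo {
  prefix: string;
  basePrefix: string;
  shard: string;
  total: string;
}

// Hypothetical helper: returns null when the name does not match
// or when shard and total are written with different digit counts.
function parseShardSameWidth(filename: string): ShardInfo | null {
  const groups = RE_SHARD.exec(filename)?.groups;
  if (!groups) return null;
  const { prefix, basePrefix, shard, total } = groups;
  if (shard.length !== total.length) return null;
  return { prefix, basePrefix, shard, total };
}

console.log(parseShardSameWidth("model-00001-of-00163.safetensors"));
console.log(parseShardSameWidth("model-00001-of-000163.safetensors"));
```

Note that under this rule the filename from the description, model-00001-of-000163.safetensors, is still rejected, because its shard number has 5 digits while its total has 6.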

Member

or we just don't care about the 5-digit convention and allow any number of digits, to avoid this issue again in the future?

Member

i would keep some convention

Contributor Author

In that case, maybe it's better to handle this edge case directly on the Hub side and fix the shard digit count before passing the filename to parseSafetensorsShardFilename, no?

Like transforming 000163 to 00163 before parsing it
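A rough sketch of what such a pre-parsing normalization could look like, assuming the goal is to rewrite both numbers to the conventional 5-digit width whenever that loses no significant digits. `RE_LOOSE`, `WIDTH`, `toWidth`, and `normalizeShardFilename` are hypothetical names and not part of the library.

```typescript
// Loose match, used only to locate the two numbers in the filename.
const RE_LOOSE = /^(?<prefix>.*?[_-])(?<shard>\d+)-of-(?<total>\d+)\.safetensors$/;
const WIDTH = 5; // conventional digit count, taken from the original \d{5} pattern

// Strip leading zeros (keeping at least one digit), then pad back to WIDTH.
// Numbers that need more than WIDTH digits are left at their natural length.
function toWidth(digits: string): string {
  return digits.replace(/^0+(?=\d)/, "").padStart(WIDTH, "0");
}

// Hypothetical helper: returns the filename unchanged if it is not a shard name.
function normalizeShardFilename(filename: string): string {
  const groups = RE_LOOSE.exec(filename)?.groups;
  if (!groups) return filename;
  const { prefix, shard, total } = groups;
  return `${prefix}${toWidth(shard)}-of-${toWidth(total)}.safetensors`;
}

console.log(normalizeShardFilename("model-00001-of-000163.safetensors"));
```

The normalized name could then be handed to the strict parser. The trade-off is that the normalized name no longer equals the name of the file in the repository, so callers would need to keep the original around.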

Member

hmm no i think here is fine (we use this hub-side anyways)

3 participants