Brief analysis of different tokenizers #79
bigwolfeman started this conversation in Show and tell
3 comments · 4 replies
This is an interesting idea. However, those 8M tokens could happen to favor a certain tokenizer, while the full 1B might be better suited to a different tokenizer. This would need more research, but it's a good start. Thank you for the contribution.



Starcoder2 and Phi-3 are strong standouts.

This was conducted on my local setup using the current training script, run to 8M tokens. The run-to-run variance I have been seeing is about 1.5 s.
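
For anyone who wants a quick way to sanity-check the comparison, here is a minimal sketch (not the script used above; the checkpoint names and the sample file are just examples) that measures how many tokens each tokenizer needs for the same text sample. Fewer tokens per byte means an 8M-token budget covers more raw text.

```python
# Minimal tokenizer comparison sketch. Checkpoint names and sample.txt
# are placeholders; swap in whatever you are actually evaluating.
from transformers import AutoTokenizer

SAMPLE = open("sample.txt", encoding="utf-8").read()  # representative corpus slice

CANDIDATES = [
    "bigcode/starcoder2-3b",              # Starcoder2 tokenizer
    "microsoft/Phi-3-mini-4k-instruct",   # Phi-3 tokenizer
    "gpt2",                               # baseline for reference
]

for name in CANDIDATES:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(SAMPLE, add_special_tokens=False)["input_ids"]
    # Bytes per token: higher means the tokenizer compresses the text better,
    # so a fixed token budget (e.g. 8M) sees more of the underlying data.
    bytes_per_token = len(SAMPLE.encode("utf-8")) / len(ids)
    print(f"{name}: {len(ids)} tokens, "
          f"{bytes_per_token:.2f} bytes/token, vocab size {tok.vocab_size}")
```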