Standardizing Benchmarking to 1b tokens #85
bigwolfeman started this conversation in Ideas
Replies: 0 comments
- Problem: Loss is not a reliable indicator of final model performance.
- Solution: Benchmarks are a better metric of final model performance.
Can we benchmark models trained on 500M or 1B tokens in a meaningful way, so that we can compare changes more iteratively? Vuk is seeing almost 3 hours for a 1B-token training run, which is still quite long for an iterative workflow.
Experiment:
I will train several models out to 1B tokens, benchmarking every 100M tokens. I will likely repeat this 3 times per architecture with 3 different seeds to get a rough idea of the spread we are looking at. If the spread is low enough, and each model settles into its own distinct regime within that space, this gives us a reliable way to benchmark changes while ignoring loss. Hopefully 500M tokens is enough for clear patterns to emerge.
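To make the spread check concrete, here is a minimal sketch of how I'd aggregate scores across seeds at each 100M-token checkpoint. Everything in it is a placeholder, not a settled choice: the architecture name, the task list, the checkpoint paths, and the `run_benchmarks` stub, which would be swapped for a real harness call (e.g. lm-evaluation-harness) once we pick one.

```python
import random
from statistics import mean, stdev

SEEDS = [0, 1, 2]                                               # 3 seeds per architecture
EVAL_POINTS = range(100_000_000, 1_000_000_001, 100_000_000)    # every 100M tokens up to 1B
TASKS = ["hellaswag", "arc_easy"]                               # hypothetical benchmark suite

def run_benchmarks(ckpt_path: str, tasks: list[str]) -> dict[str, float]:
    # Stand-in for the real evaluation call; returns fake scores
    # so this sketch runs end to end.
    return {t: random.random() for t in tasks}

def seed_spread(arch: str) -> None:
    """Print mean +/- stdev across seeds at each checkpoint for one architecture."""
    for tokens in EVAL_POINTS:
        # One score dict per seed at this token count.
        per_seed = [
            run_benchmarks(f"checkpoints/{arch}/seed{s}/tokens{tokens}", TASKS)
            for s in SEEDS
        ]
        for task in TASKS:
            scores = [d[task] for d in per_seed]
            print(f"{arch} @ {tokens:>13,} tokens | {task}: "
                  f"{mean(scores):.3f} +/- {stdev(scores):.3f}")

seed_spread("baseline")  # hypothetical architecture name
```

If the +/- band for a single architecture stays small relative to the gap between architectures at the same token count, then benchmark scores at these checkpoints are a usable signal for comparing changes.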
I will log each run to the same wandb project using standard metrics.
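A minimal sketch of that logging convention, using the standard wandb Python API; the project name, run naming scheme, metric keys, and score values are all placeholders we'd need to agree on:

```python
import wandb

# Hypothetical identifiers; the point is that every run uses the
# same project and the same metric keys so the charts line up.
run = wandb.init(
    project="1b-token-benchmarks",           # shared project (placeholder name)
    name="baseline-seed0",                   # <arch>-<seed> naming convention
    config={"arch": "baseline", "seed": 0},  # so runs can be grouped/filtered
)

# At each 100M-token checkpoint, log scores keyed by token count so
# runs with different wall-clock speeds stay comparable on one x-axis.
scores = {"hellaswag": 0.31, "arc_easy": 0.45}  # placeholder values
tokens = 100_000_000
wandb.log({f"benchmark/{task}": s for task, s in scores.items()} | {"tokens": tokens})

run.finish()
```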