Standardizing Benchmarking to 1b tokens #85
bigwolfeman started this conversation in Ideas
Replies: 0 comments
- Problem: Loss is not a reliable indicator of final model performance.
- Solution: Benchmarks are a better metric of final model performance.
Can we benchmark models trained on 500M or 1B tokens in a meaningful way, so that we can compare changes more iteratively? Vuk is seeing almost 3 hours for a 1B-token training run, which is still quite long for an iterative workflow.
Experiment:
I will train several models out to 1B tokens, benchmarking every 100M tokens. I will likely repeat this 3 times per architecture with 3 different seeds to get a rough idea of the spread we are looking at. If the spread is low enough, and each model settles into its own distinct regime within that space, this gives us a reliable way to benchmark changes while ignoring loss. Hopefully 500M tokens is enough for clear patterns to emerge.
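To make the spread check concrete, here is a minimal sketch of how I'd aggregate scores across seeds at each 100M-token checkpoint. Everything in it is a placeholder, not a settled choice: the architecture name, the task list, the checkpoint paths, and the `run_benchmarks` stub, which would be swapped for a real harness call (e.g. lm-evaluation-harness) once we pick one.

```python
import random
from statistics import mean, stdev

SEEDS = [0, 1, 2]                                               # 3 seeds per architecture
EVAL_POINTS = range(100_000_000, 1_000_000_001, 100_000_000)    # every 100M tokens up to 1B
TASKS = ["hellaswag", "arc_easy"]                               # hypothetical benchmark suite

def run_benchmarks(ckpt_path: str, tasks: list[str]) -> dict[str, float]:
    # Stand-in for the real evaluation call; returns fake scores
    # so this sketch runs end to end.
    return {t: random.random() for t in tasks}

def seed_spread(arch: str) -> None:
    """Print mean +/- stdev across seeds at each checkpoint for one architecture."""
    for tokens in EVAL_POINTS:
        # One score dict per seed at this token count.
        per_seed = [
            run_benchmarks(f"checkpoints/{arch}/seed{s}/tokens{tokens}", TASKS)
            for s in SEEDS
        ]
        for task in TASKS:
            scores = [d[task] for d in per_seed]
            print(f"{arch} @ {tokens:>13,} tokens | {task}: "
                  f"{mean(scores):.3f} +/- {stdev(scores):.3f}")

seed_spread("baseline")  # hypothetical architecture name
```

If the +/- band for a single architecture stays small relative to the gap between architectures at the same token count, then benchmark scores at these checkpoints are a usable signal for comparing changes.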
I will log each run to the same wandb project using standard metrics.
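A minimal sketch of that logging convention, using the standard wandb Python API; the project name, run naming scheme, metric keys, and score values are all placeholders we'd need to agree on:

```python
import wandb

# Hypothetical identifiers; the point is that every run uses the
# same project and the same metric keys so the charts line up.
run = wandb.init(
    project="1b-token-benchmarks",           # shared project (placeholder name)
    name="baseline-seed0",                   # <arch>-<seed> naming convention
    config={"arch": "baseline", "seed": 0},  # so runs can be grouped/filtered
)

# At each 100M-token checkpoint, log scores keyed by token count so
# runs with different wall-clock speeds stay comparable on one x-axis.
scores = {"hellaswag": 0.31, "arc_easy": 0.45}  # placeholder values
tokens = 100_000_000
wandb.log({f"benchmark/{task}": s for task, s in scores.items()} | {"tokens": tokens})

run.finish()
```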