# Fixing Open LLM Leaderboard with Math-Verify

Three weeks ago, we showed how hard it is to correctly evaluate LLM performance on math problems, and introduced [Math-Verify](https://github.com/huggingface/Math-Verify), a better way to validate models on math (read more in the [announcement](https://x.com/HKydlicek/status/1881734376696041659))!
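For readers who haven't seen it yet, here is a minimal sketch of what Math-Verify does, based on the `parse` and `verify` entry points from its README (a sketch, not the leaderboard's evaluation code; check the repository for the exact signatures and extraction configs):

```python
from math_verify import parse, verify

# Parse the gold answer and a model's raw generation.
# parse() extracts candidate answers (LaTeX or plain expressions) from free-form
# text; exact behavior depends on the extraction config you pass.
gold = parse("$\\frac{1}{3}$")
answer = parse("The final probability is 1/3.")

# verify() checks mathematical equivalence rather than string equality,
# so "1/3" and "\frac{1}{3}" count as the same answer.
# Note: the argument order (gold first) matters.
print(verify(gold, answer))
```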


Today, we’re thrilled to share that we’ve used Math-Verify to thoroughly re-evaluate all 3,751 models ever submitted to the Open LLM Leaderboard, for even fairer and more robust model comparisons!
**All of these issues are now completely fixed with the new Math-Verify parser!**
## Which model is the best at math? A complete reshuffling of the cards thanks to fairer evaluations

As all these issues tend to accumulate, some models suffered deeply from them and their performance was strongly underestimated… so we removed the previous evaluator and added Math-Verify, which was as simple as changing only 3 lines of code! (You can try it on your own math evals too!)
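As a rough illustration of what that swap can look like in an evaluation loop (a hedged sketch, not the leaderboard's actual code; the `legacy_extract_answer` helper is hypothetical):

```python
from math_verify import parse, verify

def score_math_sample(model_output: str, gold_answer: str) -> bool:
    """Score a single MATH sample as correct or incorrect."""
    # Before (roughly): a regex-based extractor plus a string comparison,
    # something like this hypothetical helper:
    #   extracted = legacy_extract_answer(model_output)
    #   return extracted.strip() == gold_answer.strip()

    # After: the handful of lines using Math-Verify.
    gold = parse(gold_answer)    # parse the reference answer
    pred = parse(model_output)   # extract candidate answer(s) from the generation
    return verify(gold, pred)    # mathematical equivalence, not string matching
```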

This therefore meant re-evaluating all models submitted since June… and it completely overhauled the top 20 models on the MATH subset of the leaderboard.

Following is the complete table comparing the old and new Top 20 leaderboard rankings:
![math_hard_leaderboard_change](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/math_verify_leaderboard/math-hard-change.png)

### Changes in the Leaderboard
Finally, we examined how the overall Leaderboard results have evolved. While the top four positions remain unchanged, the rest have undergone significant shifts. Due to the rise of multiple Qwen derivatives in the MATH subset, the presence of Qwen-derived models among the top 20 has grown even further in the overall results.
![leaderboard_change](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/math_verify_leaderboard/overal-change.png)

Many other models also made huge jumps in the rankings, gaining 200 places or more! You can check out the results in more detail on the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/).

## Wrapping Up
The introduction of Math-Verify has significantly improved the accuracy and fairness of our evaluations on the Open LLM Leaderboard. This has led to a reshuffling of the leaderboard, with many models showing substantial improvements in their scores.
