# Fixing Open LLM Leaderboard with Math-Verify

Three weeks ago, we showed how hard it is to correctly evaluate LLM performance on math problems, and introduced [Math-Verify](https://github.com/huggingface/Math-Verify), a better way to validate models on math (read more in the [announcement](https://x.com/HKydlicek/status/1881734376696041659))!
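For readers who haven't seen it yet, here is a minimal sketch of what Math-Verify does, based on the `parse` and `verify` entry points from its README (a sketch, not the leaderboard's evaluation code; check the repository for the exact signatures and extraction configs):

```python
from math_verify import parse, verify

# Parse the gold answer and a model's raw generation.
# parse() extracts candidate answers (LaTeX or plain expressions) from free-form
# text; exact behavior depends on the extraction config you pass.
gold = parse("$\\frac{1}{3}$")
answer = parse("The final probability is 1/3.")

# verify() checks mathematical equivalence rather than string equality,
# so "1/3" and "\frac{1}{3}" count as the same answer.
# Note: the argument order (gold first) matters.
print(verify(gold, answer))
```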


Today, we’re thrilled to share that we’ve used Math-Verify to thoroughly re-evaluate all 3,751 models ever submitted to the Open LLM Leaderboard, for even fairer and more robust model comparisons!
**All of these issues are now completely fixed with the new Math-Verify parser!**
## Which model is the best at math? A complete reshuffling of the cards thanks to fairer evaluations

As all these issues tend to accumulate, some models suffered deeply from them and their performance was strongly underestimated… so we removed the previous evaluator and added Math-Verify, which was as simple as changing only 3 lines of code! (You can try it on your own math evals too!)
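As a rough illustration of what that swap can look like in an evaluation loop (a hedged sketch, not the leaderboard's actual code; the `legacy_extract_answer` helper is hypothetical):

```python
from math_verify import parse, verify

def score_math_sample(model_output: str, gold_answer: str) -> bool:
    """Score a single MATH sample as correct or incorrect."""
    # Before (roughly): a regex-based extractor plus a string comparison,
    # something like this hypothetical helper:
    #   extracted = legacy_extract_answer(model_output)
    #   return extracted.strip() == gold_answer.strip()

    # After: the handful of lines using Math-Verify.
    gold = parse(gold_answer)    # parse the reference answer
    pred = parse(model_output)   # extract candidate answer(s) from the generation
    return verify(gold, pred)    # mathematical equivalence, not string matching
```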

This therefore meant re-evaluating all models submitted since June… and it completely overhauled the top 20 models on the MATH subset of the leaderboard.

Following is the complete table comparing the old and new Top 20 leaderboard rankings:
![math_hard_leaderboard_change](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/math_verify_leaderboard/math-hard-change.png)

### Changes in the Leaderboard
Finally, we examined how the overall Leaderboard results have evolved. While the top four positions remain unchanged, the rest have undergone significant shifts. Due to the rise of multiple Qwen derivatives in the MATH subset, the presence of Qwen-derived models among the top 20 has grown even further in the overall results.
![leaderboard_change](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/math_verify_leaderboard/overal-change.png)

Many other models also made huge jumps in the rankings, gaining 200 places or more! You can check out the results in more detail on the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/).

## Wrapping Up
The introduction of Math-Verify has significantly improved the accuracy and fairness of our evaluations on the Open LLM Leaderboard. This has led to a reshuffling of the leaderboard, with many models showing substantial improvements in their scores.
