
Add llm benchmark #881

Merged · 16 commits merged into main from add-llm-benchmark on Mar 19, 2025
Conversation

haixuanTao (Collaborator) commented Mar 17, 2025

This PR makes it possible to benchmark inference speed per inference engine.

In my case, I was able to detect that with GGUF the model runs roughly twice as fast as with the transformers-based config for qwen2.5 0.5B:

| path | date | average_duration (s) | max_duration (s) | min_duration (s) | median_duration (s) | median_frequency | average_speed (tokens/s) | max_speed (tokens/s) | min_speed (tokens/s) | median_speed (tokens/s) | total_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dora-llama-cpp-python | 2025-03-17 15:45:25 | 0.03 | 0.09 | 0.03 | 0.03 | 37.76 | 222.73 | 233.59 | 69.38 | 226.54 | 6 |
| dora-transformers | 2025-03-17 16:20:33 | 0.07 | 0.40 | 0.05 | 0.06 | 16.15 | 96.37 | 111.81 | 15.14 | 96.90 | 6 |
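
For context, here is a minimal sketch of how per-engine throughput numbers like these can be collected. The `generate` callable, prompt list, and metric names are assumptions chosen to mirror the table columns above, not the actual benchmark node added in this PR:

```python
import statistics
import time


def benchmark(generate, prompts, runs=10):
    """Measure per-call duration and token throughput for a generate() callable.

    `generate(prompt)` is assumed to return the list of generated tokens;
    each engine under test (e.g. a llama-cpp or transformers backend) would
    supply its own implementation.
    """
    durations, speeds, total_tokens = [], [], 0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            tokens = generate(prompt)
            elapsed = time.perf_counter() - start
            durations.append(elapsed)
            speeds.append(len(tokens) / elapsed)
            total_tokens += len(tokens)
    return {
        "average_duration (s)": sum(durations) / len(durations),
        "max_duration (s)": max(durations),
        "min_duration (s)": min(durations),
        "median_duration (s)": statistics.median(durations),
        "average_speed (tokens/s)": sum(speeds) / len(speeds),
        "max_speed (tokens/s)": max(speeds),
        "min_speed (tokens/s)": min(speeds),
        "median_speed (tokens/s)": statistics.median(speeds),
        "total_tokens": total_tokens,
    }
```

Each engine (for example, the dora-llama-cpp-python and dora-transformers nodes) would be wrapped in its own `generate` callable, and the resulting dictionaries can be written out to produce a comparison like the table above.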

haixuanTao (Collaborator, Author) commented
@MunishMummadi FYI, there are some minor breaking changes within transformers to make it slightly easier to debug. I ran into a weird bug along the way with inconsistent responses.

I think we can re-add your optimizations later on, once we can test them consistently.


haixuanTao merged commit dfb5942 into main on Mar 19, 2025
125 checks passed
haixuanTao deleted the add-llm-benchmark branch on Mar 19, 2025 at 11:05
MunishMummadi (Contributor) commented

Noted. Happy to do so when you want me to.
