[Performance] Multithreading for DequantizeLinear #23395
Comments
Go ahead and PR it.
@tarekziade I'm not working on it. You are very welcome to open a PR for it.
@fajin-corp @yuslepukhin a lot of our models are used as int8. Would it make sense to first implement something similar to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp for int8?
If you mean "run an int8 model using optimized int8 operators", ONNX has MatMulInteger, which is specifically designed for int8 matmul.
@fajin-corp I am confused; I thought dequantization was about converting ints to floats, not the other way around. A few things I have noticed on my Qwen run:
Looking at the quantize_linear.cc file, the op that gets called in my case is https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc#L287-L298, so should I try to use https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp#L1656? Thanks
It looks like you are trying to speed up your int8 model by making int8 dequantization faster. https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp#L1656 currently only supports int4 -> float and is designed specifically for MatMulNBits.
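For reference, per-tensor DequantizeLinear maps int8 back to float as y = (x - zero_point) * scale. A minimal scalar sketch of that element-wise loop (the function name and signature below are illustrative, not the actual onnxruntime kernel):

```cpp
#include <cstddef>
#include <cstdint>

// Reference per-tensor DequantizeLinear: y = (x - zero_point) * scale.
// This is the element-wise int8 -> float loop that the CPU operator
// currently runs on a single thread.
void DequantizeLinearInt8(const std::int8_t* x, float* y, std::size_t n,
                          float scale, std::int8_t zero_point) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = static_cast<float>(static_cast<std::int32_t>(x[i]) - zero_point) * scale;
  }
}
```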
Describe the issue
The current DequantizeLinear CPU operator does not use threads.
I have implemented a quick prototype that shows a 4x speedup on that operator when used with a Qwen 2.5 0.5B model.
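A rough sketch of the idea, splitting the element-wise loop into contiguous chunks with plain std::thread (the actual prototype would dispatch to the session's thread pool inside onnxruntime rather than spawning threads; all names below are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Split the DequantizeLinear loop (y = (x - zero_point) * scale) into
// contiguous chunks and run each chunk on its own thread. Inside onnxruntime
// the chunks would go to the session's thread pool instead of std::thread,
// but the partitioning is the same idea.
void ParallelDequantizeInt8(const std::int8_t* x, float* y, std::size_t n,
                            float scale, std::int8_t zero_point,
                            unsigned num_threads = 4) {
  num_threads = std::max(1u, num_threads);  // guard against a zero thread count

  auto work = [&](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
      y[i] = static_cast<float>(static_cast<std::int32_t>(x[i]) - zero_point) * scale;
    }
  };

  const std::size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t begin = 0; begin < n; begin += chunk) {
    workers.emplace_back(work, begin, std::min(begin + chunk, n));
  }
  for (auto& t : workers) {
    t.join();
  }
}
```

Chunking into contiguous ranges keeps each worker's memory access sequential, which is what lets this element-wise loop scale with the number of threads.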
I do see a comment about this:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc#L302
@fajin-corp is this something you were planning to implement? I'd be happy to help under your guidance.
To reproduce
n/a
Urgency
No response
Platform
Windows
OS Version
any
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
main
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes