
[Performance] Multithreading for DequantizeLinear #23395

Open
tarekziade opened this issue Jan 16, 2025 · 6 comments
Labels
performance (issues related to performance regressions), quantization (issues related to quantization)

Comments

@tarekziade

Describe the issue

The current DequantizeLinear CPU operator does not use threads.

I have implemented a quick prototype that shows a 4x speed-up on that operator when used with a Qwen 2.5 0.5B model.
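
Roughly, the idea is to split the flat output range across worker threads. A simplified, self-contained sketch (plain std::thread here rather than ONNX Runtime's internal thread pool, and only the per-tensor scale/zero-point case, so not the actual prototype code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// DequantizeLinear on a contiguous uint8 buffer: y[i] = (x[i] - zero_point) * scale.
// The element range is split into roughly equal chunks, one per worker thread.
void DequantizeLinearParallel(const uint8_t* x, float* y, size_t n,
                              float scale, uint8_t zero_point,
                              unsigned num_threads) {
  if (num_threads == 0) num_threads = 1;
  const size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    const size_t begin = static_cast<size_t>(t) * chunk;
    if (begin >= n) break;
    const size_t end = std::min(n, begin + chunk);
    workers.emplace_back([=]() {
      for (size_t i = begin; i < end; ++i) {
        y[i] = (static_cast<int32_t>(x[i]) - zero_point) * scale;
      }
    });
  }
  for (auto& w : workers) w.join();
}
```

In the real kernel the work would be scheduled on ONNX Runtime's thread pool and would also have to handle per-axis and blocked scales; this only covers the element-wise case.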

I do see a comment about this:

https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc#L302

@fajin-corp is this something you were planning to implement? I'd be happy to help under your guidance

To reproduce

n/a

Urgency

No response

Platform

Windows

OS Version

any

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

main

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

@tarekziade added the performance label on Jan 16, 2025
@github-actions bot added the quantization label on Jan 16, 2025
@yuslepukhin (Member)

Go ahead and PR it.

@fajin-corp (Contributor)

@tarekziade I'm not working on it. You are very welcome to open a PR for it.

@tarekziade (Author)

@fajin-corp @yuslepukhin many of the models we use are int8.

Would it make sense to first implement something similar to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp for int8?

@fajin-corp (Contributor)

fajin-corp commented Jan 27, 2025

If you mean "run an int8 model using optimized int8 operators", ONNX has MatMulInteger, which is specifically designed for int8 matmul.
q4_dq.cpp contains the kernels that convert a floating-point tensor to an int4 tensor. Conversion from fp to int8, however, is much more widely available, and you can find many libraries that do it.

@tarekziade (Author)

@fajin-corp I'm confused; I thought dequantization was about converting ints to floats, not the other way around.
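
To spell out the directions I have in mind (scalar versions of the two ONNX ops; QuantizeLinear also saturates to the target range):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// QuantizeLinear: float -> uint8 (round-to-nearest-even, then saturate to [0, 255]).
uint8_t QuantizeOne(float x, float scale, uint8_t zero_point) {
  const float q = std::nearbyint(x / scale) + zero_point;
  return static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f));
}

// DequantizeLinear: uint8 -> float.
float DequantizeOne(uint8_t q, float scale, uint8_t zero_point) {
  return (static_cast<int32_t>(q) - zero_point) * scale;
}
```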

A few things I noticed on my Qwen run:

  • MatMulInteger is not used when running an int8 model like Qwen, because the optimizer replaces it with MatMulIntegerToFloat.
  • DequantizeLinear is called in that model on uint8_t inputs, and I got a good performance boost (8%) by making that one multithreaded.

So looking at quantize_linear.cc, the implementation that gets called in my case is https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc#L287-L298

Should I try to use https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp#L1656? That seems to call https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp#L574.

thanks

@fajin-corp (Contributor)

fajin-corp commented Jan 28, 2025

It looks like you are trying to speed up your int8 model by speeding up int8 dequantize. https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4_dq.cpp#L1656 currently only supports int4 -> float and is designed specifically for MatMulNBits.
Your multithreaded dequantize should go into quantize_linear.cc, similar to the multithreading in QuantizeLinear.
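
Rough shape of where that could go, assuming ONNX Runtime's concurrency::ThreadPool::TryParallelFor helper from core/platform/threadpool.h (helper name and signature recalled from memory, so double-check against the current tree; this is a sketch, not a patch):

```cpp
// Sketch: parallelize the per-tensor DequantizeLinear inner loop over output elements.
// `tp` would come from the kernel's OpKernelContext inside Compute().
#include "core/platform/threadpool.h"

namespace onnxruntime {

void DequantizeLinearParallel(concurrency::ThreadPool* tp,
                              const uint8_t* input, float* output,
                              std::ptrdiff_t count, float scale, uint8_t zero_point) {
  concurrency::ThreadPool::TryParallelFor(
      tp, count,
      /*cost_per_unit=*/2.0,  // rough per-element cost estimate, tune as needed
      [&](std::ptrdiff_t first, std::ptrdiff_t last) {
        for (std::ptrdiff_t i = first; i < last; ++i) {
          output[i] = (static_cast<int32_t>(input[i]) - zero_point) * scale;
        }
      });
}

}  // namespace onnxruntime
```

The per-axis and blocked variants would need to chunk along the quantization axis instead of the flat element range so each chunk sees the right scale/zero-point.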
