
Why can't there be parallel inference when the batch size is greater than 1 in C++ OpenCV CUDA DNN? #3799

Open
kj2314 opened this issue Sep 24, 2024 · 2 comments

@kj2314 commented Sep 24, 2024

  • OpenCV => 4.9
  • Operating System / Platform => Windows 64 Bit
  • Compiler => Visual Studio 2022
  • CUDA => 11.6
  • cuDNN => 8.6.0
  • Driver Version => 536.45
  • GPU => RTX 4050 6 GB

Detailed description:
I am running model inference with the C++ version of OpenCV, using a simple convolutional network on the GPU. In release mode, the inference time is 40 ms when the batch size is 1, but approximately 160 ms when the batch size is 4. I expected the inference time to stay around 40 ms whether the batch size is 1 or 4. Why is the inference not parallel across the batch? In debug mode, the following output is produced:

[ INFO:[email protected]] global registry_parallel.impl.hpp:96 cv::parallel::ParallelBackendRegistry::ParallelBackendRegistry core(parallel): Enabled backends(3, sorted by priority): ONETBB(1000); TBB(990); OPENMP(980)
[ INFO:[email protected]] global plugin_loader.impl.hpp:67 cv::plugin::impl::DynamicLib::libraryLoad load D:\code\ISImgDetect\demo\opencv_core_parallel_onetbb490_64d.dll => FAILED
[ INFO:[email protected]] global plugin_loader.impl.hpp:67 cv::plugin::impl::DynamicLib::libraryLoad load opencv_core_parallel_onetbb490_64d.dll => FAILED
[ INFO:[email protected]] global plugin_loader.impl.hpp:67 cv::plugin::impl::DynamicLib::libraryLoad load D:\code\ISImgDetect\demo\opencv_core_parallel_tbb490_64d.dll => FAILED
[ INFO:[email protected]] global plugin_loader.impl.hpp:67 cv::plugin::impl::DynamicLib::libraryLoad load opencv_core_parallel_tbb490_64d.dll => FAILED
[ INFO:[email protected]] global plugin_loader.impl.hpp:67 cv::plugin::impl::DynamicLib::libraryLoad load D:\code\ISImgDetect\demo\opencv_core_parallel_openmp490_64d.dll => FAILED
[ INFO:[email protected]] global plugin_loader.impl.hpp:67 cv::plugin::impl::DynamicLib::libraryLoad load opencv_core_parallel_openmp490_64d.dll => FAILED
[ INFO:[email protected]] global op_cuda.cpp:80 cv::dnn::dnn4_v20231225::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "_input" of type NetInputLayer

The layer "_input" of type NetInputLayer is not accelerated on the GPU and falls back to the CPU instead.
Why can't the model perform parallel inference, and how can this problem be solved?
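
For reference, here is a minimal sketch of the kind of setup described above, assuming a hypothetical ONNX model ("model.onnx"), a hypothetical input image ("image.jpg"), and a 224x224 input size; the issue does not show the actual network or preprocessing:

```cpp
// Minimal sketch of batched inference with OpenCV DNN on the CUDA backend.
// "model.onnx", "image.jpg", and the 224x224 input size are placeholders.
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
    cv::dnn::Net net = cv::dnn::readNet("model.onnx");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    // Pack four images into one 4-D NCHW blob; N (here 4) is the batch size.
    std::vector<cv::Mat> images(4, cv::imread("image.jpg"));
    cv::Mat blob = cv::dnn::blobFromImages(images, 1.0 / 255.0, cv::Size(224, 224));

    net.setInput(blob);
    cv::Mat out = net.forward(); // one forward pass over the whole batch
    return 0;
}
```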

@cudawarped (Contributor) commented

> I expected the inference time to stay around 40 ms whether the batch size is 1 or 4.

Would you expect the inference time to be 40 ms if the batch size were 1,000,000?

I would guess your GPU is already saturated with a batch size of 1.
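
A useful sanity check when comparing such timings is to exclude the one-time CUDA initialization cost that the first forward pass pays. A minimal sketch using cv::TickMeter, assuming the net and blob are set up as in the sketch above:

```cpp
// Timing sketch: warm up once, then average several forward passes,
// since the first CUDA forward call pays one-time initialization costs.
#include <opencv2/dnn.hpp>
#include <opencv2/core/utility.hpp> // cv::TickMeter

double meanForwardMs(cv::dnn::Net& net, const cv::Mat& blob, int runs = 10) {
    net.setInput(blob);
    net.forward(); // warm-up pass (not timed)

    cv::TickMeter tm;
    tm.start();
    for (int i = 0; i < runs; ++i) {
        net.setInput(blob);
        net.forward();
    }
    tm.stop();
    return tm.getTimeMilli() / runs; // mean time per forward pass in ms
}
```

If the per-image time (batch time divided by batch size) stays roughly constant as the batch grows, the GPU is effectively saturated; if it drops, batching is helping.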

@kj2314 (Author) commented Sep 24, 2024

@cudawarped

When the batch size is 1, the GPU usage is 24%; when the batch size is 4, the GPU usage is 28%.
