Description
Original report by Mikael Simberg (Bitbucket: [Mikael Simberg](https://bitbucket.org/Mikael Simberg)).
I’m comparing the performance of dplasma with other libraries on GPUs, and I’m particularly looking at trsm at the moment. I see performance initially increase with the block size until it reaches a good fraction of the peak flops of the GPU, but after that performance drops significantly. More concretely, I’m running dplasma on a single node with a P100 GPU, 12 core Haswell CPU (Piz Daint GPU partition), built in release mode with GCC 8.3, CUDA 10.2 (I pass no additional options to the CMake configuration except the build type) and get the following results:
$ for block_exp in {6..14}; do block_size=$((2 ** block_exp)); srun -n 1 tests/testing_strsm -c 12 -M 16384 -N 16384 --NB ${block_size} --MB ${block_size} -p 1 -q 1 -g 1; done
[****] TIME(s) 828.03722 : strsm PxQ= 1 1 NB= 64 N= 16384 : 5.311736 gflops
[****] TIME(s) 14.29047 : strsm PxQ= 1 1 NB= 128 N= 16384 : 307.779589 gflops
[****] TIME(s) 2.02718 : strsm PxQ= 1 1 NB= 256 N= 16384 : 2169.675000 gflops
[****] TIME(s) 0.95916 : strsm PxQ= 1 1 NB= 512 N= 16384 : 4585.610550 gflops
[****] TIME(s) 1.36078 : strsm PxQ= 1 1 NB= 1024 N= 16384 : 3232.196247 gflops
[****] TIME(s) 2.25861 : strsm PxQ= 1 1 NB= 2048 N= 16384 : 1947.351262 gflops
[****] TIME(s) 5.96737 : strsm PxQ= 1 1 NB= 4096 N= 16384 : 737.061370 gflops
[****] TIME(s) 14.53428 : strsm PxQ= 1 1 NB= 8192 N= 16384 : 302.616577 gflops
[****] TIME(s) 56.22007 : strsm PxQ= 1 1 NB= 16384 N= 16384 : 78.233892 gflops
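As a side note, the reported gflops figures appear consistent with the conventional trsm flop count of M·N² operations (here M = N = 16384, i.e. 16384³ ≈ 4.4·10¹² flops). A minimal check of that arithmetic, assuming this is the formula the test harness uses:

```python
# Assumption: dplasma's testing_strsm reports gflops as M * N^2 flops
# divided by the wall-clock time (the numbers below match the runs above
# to within rounding of the printed times).

M = N = 16384

def trsm_gflops(time_s, m=M, n=N):
    """Gflop/s for a trsm with an m x n right-hand side: m * n^2 flops."""
    return m * n * n / time_s / 1e9

# Two of the measured timings from the run above:
print(trsm_gflops(828.03722))  # ~5.31 gflops (NB = 64)
print(trsm_gflops(0.95916))    # ~4585 gflops (NB = 512)
```

So the numbers themselves are internally consistent; the question is only about the shape of the curve over NB.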
Is this expected behaviour? What could explain the drop? Is there something in the configuration that could be causing it? I don’t actually expect to run with huge block sizes (especially NB == N), but I wouldn’t have expected performance to start dropping as early as NB = MB = 1024.