Description
Original report by Mikael Simberg (Bitbucket: [Mikael Simberg](https://bitbucket.org/Mikael Simberg)).
I’m comparing the performance of dplasma with other libraries on GPUs, and I’m particularly looking at trsm at the moment. I see performance initially increase with the block size until it reaches a good fraction of the peak flops of the GPU, but after that performance drops significantly. More concretely, I’m running dplasma on a single node with a P100 GPU, 12 core Haswell CPU (Piz Daint GPU partition), built in release mode with GCC 8.3, CUDA 10.2 (I pass no additional options to the CMake configuration except the build type) and get the following results:
$ for block_exp in {6..14}; do block_size=$((2 ** block_exp)); srun -n 1 tests/testing_strsm -c 12 -M 16384 -N 16384 --NB ${block_size} --MB ${block_size} -p 1 -q 1 -g 1; done
[****] TIME(s) 828.03722 : strsm PxQ= 1 1 NB= 64 N= 16384 : 5.311736 gflops
[****] TIME(s) 14.29047 : strsm PxQ= 1 1 NB= 128 N= 16384 : 307.779589 gflops
[****] TIME(s) 2.02718 : strsm PxQ= 1 1 NB= 256 N= 16384 : 2169.675000 gflops
[****] TIME(s) 0.95916 : strsm PxQ= 1 1 NB= 512 N= 16384 : 4585.610550 gflops
[****] TIME(s) 1.36078 : strsm PxQ= 1 1 NB= 1024 N= 16384 : 3232.196247 gflops
[****] TIME(s) 2.25861 : strsm PxQ= 1 1 NB= 2048 N= 16384 : 1947.351262 gflops
[****] TIME(s) 5.96737 : strsm PxQ= 1 1 NB= 4096 N= 16384 : 737.061370 gflops
[****] TIME(s) 14.53428 : strsm PxQ= 1 1 NB= 8192 N= 16384 : 302.616577 gflops
[****] TIME(s) 56.22007 : strsm PxQ= 1 1 NB= 16384 N= 16384 : 78.233892 gflops
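As a side note, the reported gflops figures appear consistent with the conventional trsm flop count of M·N² operations (here M = N = 16384, i.e. 16384³ ≈ 4.4·10¹² flops). A minimal check of that arithmetic, assuming this is the formula the test harness uses:

```python
# Assumption: dplasma's testing_strsm reports gflops as M * N^2 flops
# divided by the wall-clock time (the numbers below match the runs above
# to within rounding of the printed times).

M = N = 16384

def trsm_gflops(time_s, m=M, n=N):
    """Gflop/s for a trsm with an m x n right-hand side: m * n^2 flops."""
    return m * n * n / time_s / 1e9

# Two of the measured timings from the run above:
print(trsm_gflops(828.03722))  # ~5.31 gflops (NB = 64)
print(trsm_gflops(0.95916))    # ~4585 gflops (NB = 512)
```

So the numbers themselves are internally consistent; the question is only about the shape of the curve over NB.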
Is this expected behaviour? What could explain the drop? Is there something in the configuration that could be causing it? I don’t actually expect to run with huge block sizes (especially NB == N), but I wouldn’t have expected performance to start dropping as early as NB = MB = 1024.