GPU trsm performance drops with large block sizes #30

@abouteiller

Description

Original report by Mikael Simberg (Bitbucket: Mikael Simberg).


I’m comparing the performance of dplasma with other libraries on GPUs, and I’m particularly looking at trsm at the moment. I see performance initially increase with the block size until it reaches a good fraction of the peak flops of the GPU, but after that performance drops significantly. More concretely, I’m running dplasma on a single node with a P100 GPU and a 12-core Haswell CPU (Piz Daint GPU partition), built in release mode with GCC 8.3 and CUDA 10.2 (I pass no additional options to the CMake configuration except the build type), and I get the following results:

```
$ for block_exp in {6..14}; do block_size=$((2 ** block_exp)); srun -n 1 tests/testing_strsm -c 12 -M 16384 -N 16384 --NB ${block_size} --MB ${block_size} -p 1 -q 1 -g 1; done
[****] TIME(s)    828.03722 : strsm     PxQ=   1 1   NB=   64 N=   16384 :       5.311736 gflops
[****] TIME(s)     14.29047 : strsm     PxQ=   1 1   NB=  128 N=   16384 :     307.779589 gflops
[****] TIME(s)      2.02718 : strsm     PxQ=   1 1   NB=  256 N=   16384 :    2169.675000 gflops
[****] TIME(s)      0.95916 : strsm     PxQ=   1 1   NB=  512 N=   16384 :    4585.610550 gflops
[****] TIME(s)      1.36078 : strsm     PxQ=   1 1   NB= 1024 N=   16384 :    3232.196247 gflops
[****] TIME(s)      2.25861 : strsm     PxQ=   1 1   NB= 2048 N=   16384 :    1947.351262 gflops
[****] TIME(s)      5.96737 : strsm     PxQ=   1 1   NB= 4096 N=   16384 :     737.061370 gflops
[****] TIME(s)     14.53428 : strsm     PxQ=   1 1   NB= 8192 N=   16384 :     302.616577 gflops
[****] TIME(s)     56.22007 : strsm     PxQ=   1 1   NB= 16384 N=   16384 :      78.233892 gflops
```
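
For reference, the reported gflops figures appear consistent with the standard trsm operation count of M*M*N flops. A minimal sanity check in C (my own arithmetic, using the NB = 512 row above; not part of the benchmark itself):

```c
/* Sanity check of the reported gflops numbers, assuming the benchmark
 * computes them as M*M*N / time for a left-sided trsm with M = N = 16384.
 * Using the NB = 512 row above this prints ~4585 gflops, matching the output. */
#include <stdio.h>

int main(void)
{
    const double M = 16384.0, N = 16384.0;
    const double time_s = 0.95916;       /* NB = 512 row from the output above */
    const double flops  = M * M * N;     /* standard trsm operation count */

    printf("%.2f gflops\n", flops / time_s / 1e9);
    return 0;
}
```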

Is this expected behaviour? What could explain the drop? Is there something in the configuration that could be causing this? I don’t actually expect to be running with huge block sizes (especially NB == N), but I wouldn’t have expected performance to start dropping as early as NB = MB = 1024.
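
For reference, a single cublasStrsm call on the full 16384 x 16384 problem can be timed along these lines as a single-kernel comparison point (a rough sketch of my own, assuming cuBLAS as the comparison library; matrix initialization and error checking omitted):

```c
/* Rough sketch: time one cublasStrsm call on the full 16384 x 16384 problem
 * as a single-kernel reference point. Assumes CUDA + cuBLAS; the matrices
 * are only allocated here, initialization and error checking are omitted. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 16384;
    const float alpha = 1.0f;
    float *dA, *dB;

    cudaMalloc((void **)&dA, (size_t)n * n * sizeof(float));
    cudaMalloc((void **)&dB, (size_t)n * n * sizeof(float));
    /* ... fill dA with a well-conditioned lower-triangular matrix, dB with the RHS ... */

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cublasStrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                n, n, &alpha, dA, n, dB, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* Same M*M*N flop count as above (M = N = n here). */
    double gflops = (double)n * n * n / (ms * 1e-3) / 1e9;
    printf("cublasStrsm  N=%d  time=%.3f s  %.1f gflops\n", n, ms * 1e-3, gflops);

    cublasDestroy(handle);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```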

Metadata

Labels

bug (Something isn't working), high priority (This is an important feature)
