replace switch with if constexpr #1301

Open · wants to merge 9 commits into develop

Conversation

@jxy (Contributor) commented Jul 22, 2022

This PR replaces the switch statements in color_spinor.h with if constexpr. To supply constexpr int values to the template parameters, we introduce static_for for compile-time unrolling of the for loop over integer dimensions, along with a macro static_for_var to adapt to C++17 or C++20.

Preliminary tests on a single A100-PCIE-40GB show a mild improvement (~2%) in Wilson Dslash performance.

From the tunecache.tsv, before:

     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  160     1       1       1037    1       1       41729   -1      -1      -1      -1      0.000172203     # 1271.60 Gflop/s, 2219.51 GB/s, tuning took 0.204862 seconds at Thu Jul 21 21:29:26 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  224     1       1       741     1       1       20865   -1      -1      -1      -1      0.000215859     # 1014.42 Gflop/s, 2065.73 GB/s, tuning took 0.254195 seconds at Thu Jul 21 21:28:15 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  384     1       1       432     1       1       83457   -1      -1      -1      -1      0.000157696     # 1388.57 Gflop/s, 2154.39 GB/s, tuning took 0.216228 seconds at Thu Jul 21 21:30:28 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  32      1       1       5184    1       1       15174   -1      -1      -1      -1      8.61013e-05     # 2543.19 Gflop/s, 2219.51 GB/s, tuning took 0.141542 seconds at Thu Jul 21 21:26:43 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  512     1       1       324     1       1       83457   -1      -1      -1      -1      0.000111275     # 1967.85 Gflop/s, 2003.63 GB/s, tuning took 0.189092 seconds at Thu Jul 21 21:25:51 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  32      1       1       5184    1       1       8785    -1      -1      -1      -1      7.20457e-05     # 3039.35 Gflop/s, 2357.80 GB/s, tuning took 0.123566 seconds at Thu Jul 21 21:27:29 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  64      1       1       2592    1       1       6420    -1      -1      -1      -1      5.12512e-05     # 4272.53 Gflop/s, 1967.95 GB/s, tuning took 0.106338 seconds at Thu Jul 21 21:24:33 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  256     1       1       648     1       1       27819   -1      -1      -1      -1      6.6752e-05      # 3280.38 Gflop/s, 1749.54 GB/s, tuning took 0.126687 seconds at Thu Jul 21 21:23:51 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  128     1       1       1296    1       1       20865   -1      -1      -1      -1      5.18656e-05     # 4221.92 Gflop/s, 1739.94 GB/s, tuning took 0.110828 seconds at Thu Jul 21 21:25:12 2022

and after the changes:

     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  320     1       1       519     1       1       41729   -1      -1      -1      -1      0.000168448     # 1299.94 Gflop/s, 2268.99 GB/s, tuning took 0.199168 seconds at Fri Jul 22 03:38:56 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  320     1       1       519     1       1       0       -1      -1      -1      -1      0.000215245     # 1017.32 Gflop/s, 2071.63 GB/s, tuning took 0.251373 seconds at Fri Jul 22 03:37:46 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  384     1       1       432     1       1       0       -1      -1      -1      -1      0.000152722     # 1433.79 Gflop/s, 2224.55 GB/s, tuning took 0.209446 seconds at Fri Jul 22 03:39:58 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  32      1       1       5184    1       1       11128   -1      -1      -1      -1      8.48213e-05     # 2581.57 Gflop/s, 2253.01 GB/s, tuning took 0.138615 seconds at Fri Jul 22 03:36:12 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  544     1       1       305     1       1       55638   -1      -1      -1      -1      0.000111502     # 1963.84 Gflop/s, 1999.54 GB/s, tuning took 0.187761 seconds at Fri Jul 22 03:35:19 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  256     1       1       648     1       1       23845   -1      -1      -1      -1      7.16069e-05     # 3057.98 Gflop/s, 2372.25 GB/s, tuning took 0.123879 seconds at Fri Jul 22 03:36:59 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  192     1       1       864     1       1       27819   -1      -1      -1      -1      5.12512e-05     # 4272.53 Gflop/s, 1967.95 GB/s, tuning took 0.105329 seconds at Fri Jul 22 03:34:02 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  384     1       1       432     1       1       33383   -1      -1      -1      -1      6.528e-05       # 3354.35 Gflop/s, 1788.99 GB/s, tuning took 0.124695 seconds at Fri Jul 22 03:33:19 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  128     1       1       1296    1       1       10433   -1      -1      -1      -1      4.97371e-05     # 4402.59 Gflop/s, 1814.40 GB/s, tuning took 0.107688 seconds at Fri Jul 22 03:34:40 2022

@jxy jxy requested a review from a team as a code owner July 22, 2022 04:00
@mathiaswagner (Member) commented:
Jenkins: Can one of the admins verify this patch?

@@ -1207,7 +1207,7 @@ lhs.real()*rhs.imag()+lhs.imag()*rhs.real());
template <typename real> __host__ __device__ inline complex<real> i_(const complex<real> &a)
{
// FIXME compiler generates worse code with "optimal" code
#if 1
Member:

Why did you make this change?

Contributor Author (@jxy):

I was testing it and forgot to remove it from this branch; I will squash it later. Can you show me how to view the CUDA-generated assembly so that I can understand the impact of this switch?

Member:

Sure, the assembly can be inspected using cuobjdump

cuobjdump --dump-sass file.o

will dump the GPU assembly to stdout. There's a variety of options for cuobjdump that can be inspected with --help, e.g., to limit which kernels are dumped.

Another interesting option is --dump-resource-usage which will just dump a summary of the resource usage for each kernel (registers, shared memory, stack frame, etc.)

@@ -7,15 +7,42 @@
#include <tune_key.h>
#include <malloc_quda.h>

#ifndef QUDA_MUSTTAIL
#ifdef __clang__
#define QUDA_MUSTTAIL __attribute__((musttail))
Member:

What does this attribute do with clang? Is it safe to include for all clang-based compilation targets, e.g., ROCm, clang-cuda, etc.?

Contributor Author (@jxy):

Here, https://clang.llvm.org/docs/AttributeReference.html#musttail. On the host, instead of setting up a new stack frame, passing arguments, and issuing a call, the compiler generates a jmp that reuses the existing stack frame. So it is guaranteed that this recursive template expansion will not grow the stack even if the functions somehow are not inlined. I'm not entirely sure what it does on a target device; it may not matter if the recursive function is always inlined or the compiler is smart enough. We should probably check the assembly to be sure.

namespace quda
{
// strip path from __FILE__
constexpr const char *str_end(const char *str) { return *str ? str_end(str + 1) : str; }
constexpr bool str_slant(const char *str) { return *str == '/' ? true : (*str ? str_slant(str + 1) : false); }
constexpr const char *r_slant(const char *str) { return *str == '/' ? (str + 1) : r_slant(str - 1); }
constexpr const char *file_name(const char *str) { return str_slant(str) ? r_slant(str_end(str)) : str; }

template<int A, int B, int D=1, typename F>
__attribute__((always_inline)) __host__ __device__ inline void static_for(F&&f)
Member:

Maybe this function should be target dependent? On CUDA, for example, there's no reason why this shouldn't just correspond to an unrolled loop.

Contributor Author (@jxy):

It's a general utility that does not really care where it's used.

Member:

My point is that we may want the freedom to change the static_for implementation for different targets. For CUDA, we may want this to correspond to a simple unrolled for loop, and with OMP we may want it to be what you have above.

Contributor Author (@jxy):

Makes sense. I'm not sure how to do it without introducing more macros, though.

@maddyscientist (Member) left a comment:

Thanks for the contribution @jxy. I've left some questions already, and I understand why a static_for is helpful for non-CUDA architectures.

I don't think the changes to the reconstruct / project interface that make the direction a template parameter are required. Can you explain why you've made this change? Having the direction as a regular argument is more generic: the compiler can still optimize when the value is known at compile time, but the code also works with a run-time variable. I think any performance improvement you are seeing is not due to this change; it is likely due to changes in instruction ordering from the static_for versus a regular pragma unroll for loop.

namespace quda
{
// strip path from __FILE__
constexpr const char *str_end(const char *str) { return *str ? str_end(str + 1) : str; }
constexpr bool str_slant(const char *str) { return *str == '/' ? true : (*str ? str_slant(str + 1) : false); }
constexpr const char *r_slant(const char *str) { return *str == '/' ? (str + 1) : r_slant(str - 1); }
constexpr const char *file_name(const char *str) { return str_slant(str) ? r_slant(str_end(str)) : str; }

template<int A, int B, int D=1, typename F>
Member:

Probably better to rename these variables to make them more intuitive (lower instead of A, upper instead of B) and add comments on how this works.

Contributor Author (@jxy):

Yes. Once we understand the full impact on the code gen.

}
}
} // nDim
}); // nDim
Member:

Do you know if there is an easy way to fall back to pragma unroll for (easy meaning a macro define or something like that)? It seems that to make the templates work we need to add this additional ); at the end of each for loop, and there is no easy way to turn the template parameters back into regular C++ variables.

Contributor Author (@jxy):

Blame C++ for obnoxious syntax. I can't figure out a way to deal with the ending of the block, apart from defining a pair of macros like BEGIN_STATIC_FOR/END_STATIC_FOR.

The lambda in the C++17 version of static_for just uses auto. Instead of template parameters, the functions in color_spinor.h could also be defined with auto arguments, though the function bodies would become more complicated depending on how much we want to help the compiler optimize the switch statements.

@jxy (Contributor Author) commented Jul 27, 2022

Let me look at the generated code in more detail.
