replace switch with if constexpr #1301

Open · wants to merge 9 commits into develop

Conversation

@jxy (Contributor) commented Jul 22, 2022

This PR replaces the switch statements in color_spinor.h with if constexpr. To supply constexpr int values to the template parameters, we introduce static_for for compile-time unrolling of the for loop over integer dimensions, along with a macro static_for_var to adapt to C++17 or C++20.

Preliminary tests on a single A100-PCIE-40GB show a mild improvement (~2%) in Wilson Dslash performance.

From the tunecache.tsv, before:

     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  160     1       1       1037    1       1       41729   -1      -1      -1      -1      0.000172203     # 1271.60 Gflop/s, 2219.51 GB/s, tuning took 0.204862 seconds at Thu Jul 21 21:29:26 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  224     1       1       741     1       1       20865   -1      -1      -1      -1      0.000215859     # 1014.42 Gflop/s, 2065.73 GB/s, tuning took 0.254195 seconds at Thu Jul 21 21:28:15 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  384     1       1       432     1       1       83457   -1      -1      -1      -1      0.000157696     # 1388.57 Gflop/s, 2154.39 GB/s, tuning took 0.216228 seconds at Thu Jul 21 21:30:28 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  32      1       1       5184    1       1       15174   -1      -1      -1      -1      8.61013e-05     # 2543.19 Gflop/s, 2219.51 GB/s, tuning took 0.141542 seconds at Thu Jul 21 21:26:43 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  512     1       1       324     1       1       83457   -1      -1      -1      -1      0.000111275     # 1967.85 Gflop/s, 2003.63 GB/s, tuning took 0.189092 seconds at Thu Jul 21 21:25:51 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  32      1       1       5184    1       1       8785    -1      -1      -1      -1      7.20457e-05     # 3039.35 Gflop/s, 2357.80 GB/s, tuning took 0.123566 seconds at Thu Jul 21 21:27:29 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  64      1       1       2592    1       1       6420    -1      -1      -1      -1      5.12512e-05     # 4272.53 Gflop/s, 1967.95 GB/s, tuning took 0.106338 seconds at Thu Jul 21 21:24:33 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  256     1       1       648     1       1       27819   -1      -1      -1      -1      6.6752e-05      # 3280.38 Gflop/s, 1749.54 GB/s, tuning took 0.126687 seconds at Thu Jul 21 21:23:51 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  128     1       1       1296    1       1       20865   -1      -1      -1      -1      5.18656e-05     # 4221.92 Gflop/s, 1739.94 GB/s, tuning took 0.110828 seconds at Thu Jul 21 21:25:12 2022

and after the changes:

     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  320     1       1       519     1       1       41729   -1      -1      -1      -1      0.000168448     # 1299.94 Gflop/s, 2268.99 GB/s, tuning took 0.199168 seconds at Fri Jul 22 03:38:56 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  320     1       1       519     1       1       0       -1      -1      -1      -1      0.000215245     # 1017.32 Gflop/s, 2071.63 GB/s, tuning took 0.251373 seconds at Fri Jul 22 03:37:46 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIdLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  384     1       1       432     1       1       0       -1      -1      -1      -1      0.000152722     # 1433.79 Gflop/s, 2224.55 GB/s, tuning took 0.209446 seconds at Fri Jul 22 03:39:58 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  32      1       1       5184    1       1       11128   -1      -1      -1      -1      8.48213e-05     # 2581.57 Gflop/s, 2253.01 GB/s, tuning took 0.138615 seconds at Fri Jul 22 03:36:12 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  544     1       1       305     1       1       55638   -1      -1      -1      -1      0.000111502     # 1963.84 Gflop/s, 1999.54 GB/s, tuning took 0.187761 seconds at Fri Jul 22 03:35:19 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIfLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  256     1       1       648     1       1       23845   -1      -1      -1      -1      7.16069e-05     # 3057.98 Gflop/s, 2372.25 GB/s, tuning took 0.123879 seconds at Fri Jul 22 03:36:59 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s12EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  192     1       1       864     1       1       27819   -1      -1      -1      -1      5.12512e-05     # 4272.53 Gflop/s, 1967.95 GB/s, tuning took 0.105329 seconds at Fri Jul 22 03:34:02 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0000,commDim=0000,parity=1  384     1       1       432     1       1       33383   -1      -1      -1      -1      6.528e-05       # 3354.35 Gflop/s, 1788.99 GB/s, tuning took 0.124695 seconds at Fri Jul 22 03:33:19 2022
     12x24x24x24        N4quda6WilsonINS_9WilsonArgIsLi3ELi4EL21QudaReconstructType_s8EEEEE     policy_kernel=interior,comm=0000,commDim=0000,parity=1  128     1       1       1296    1       1       10433   -1      -1      -1      -1      4.97371e-05     # 4402.59 Gflop/s, 1814.40 GB/s, tuning took 0.107688 seconds at Fri Jul 22 03:34:40 2022

@jxy jxy requested a review from a team as a code owner July 22, 2022 04:00
@mathiaswagner (Member) commented:
Jenkins: Can one of the admins verify this patch?

@@ -1207,7 +1207,7 @@ lhs.real()*rhs.imag()+lhs.imag()*rhs.real());
template <typename real> __host__ __device__ inline complex<real> i_(const complex<real> &a)
{
// FIXME compiler generates worse code with "optimal" code
#if 1
Member:

Why did you make this change?

Contributor Author (@jxy):

I was testing it and forgot to remove it from this branch; I will squash it later. Can you show me how to view the CUDA-generated assembly so that I can understand the impact of this switch?

Member:

Sure, the assembly can be inspected using cuobjdump

cuobjdump --dump-sass file.o

will dump the GPU assembly to stdout. There's a variety of options for cuobjdump that can be inspected with --help, e.g., to limit which kernels are dumped.

Another interesting option is --dump-resource-usage which will just dump a summary of the resource usage for each kernel (registers, shared memory, stack frame, etc.)

@@ -7,15 +7,42 @@
#include <tune_key.h>
#include <malloc_quda.h>

#ifndef QUDA_MUSTTAIL
#ifdef __clang__
#define QUDA_MUSTTAIL __attribute__((musttail))
Member:

What does this attribute do with clang? Is it safe to include for all clang-based compilation targets, e.g., ROCm, clang-cuda, etc.?

Contributor Author (@jxy):

Here, https://clang.llvm.org/docs/AttributeReference.html#musttail. On the host, instead of setting up a new stack frame, passing arguments, and issuing a call, the compiler generates a jmp that reuses the existing stack frame. So it is guaranteed that this recursive template expansion will not grow the stack even if the functions somehow are not inlined. I'm not entirely sure what it does on a target device; it may not matter if the recursive function is always inlined or the compiler is smart enough. We should probably check the assembly to be sure.

namespace quda
{
// strip path from __FILE__
constexpr const char *str_end(const char *str) { return *str ? str_end(str + 1) : str; }
constexpr bool str_slant(const char *str) { return *str == '/' ? true : (*str ? str_slant(str + 1) : false); }
constexpr const char *r_slant(const char *str) { return *str == '/' ? (str + 1) : r_slant(str - 1); }
constexpr const char *file_name(const char *str) { return str_slant(str) ? r_slant(str_end(str)) : str; }

template<int A, int B, int D=1, typename F>
__attribute__((always_inline)) __host__ __device__ inline void static_for(F&&f)
Member:

Maybe this function should be target dependent? On CUDA, for example, there's no reason why this shouldn't just correspond to an unrolled loop.

Contributor Author (@jxy):

It's a general utility that does not really care where it's used.

Member:

My point is that we may want the freedom to change the static_for implementation for different targets. For CUDA, we may want this to correspond to a simple unrolled for loop, and with OMP we may want it to be what you have above.

Contributor Author (@jxy):

Makes sense. I'm not sure how to do it without introducing more macros, though.

@maddyscientist (Member) left a comment:

Thanks for the contribution @jxy. I've left some questions already, and I understand why a static_for is helpful for non-CUDA architectures.

I don't think the changes to the reconstruct / project interface that make the direction a template parameter are required. Can you explain why you've made this change? Having the direction as a regular argument is more generic: the compiler can still optimize when the value is known at compile time, but the code also works with a run-time variable. I think any performance improvement you are seeing is not due to this change; it is likely due to changes in instruction ordering from the static_for versus a regular pragma unroll for loop.

namespace quda
{
// strip path from __FILE__
constexpr const char *str_end(const char *str) { return *str ? str_end(str + 1) : str; }
constexpr bool str_slant(const char *str) { return *str == '/' ? true : (*str ? str_slant(str + 1) : false); }
constexpr const char *r_slant(const char *str) { return *str == '/' ? (str + 1) : r_slant(str - 1); }
constexpr const char *file_name(const char *str) { return str_slant(str) ? r_slant(str_end(str)) : str; }

template<int A, int B, int D=1, typename F>
Member:

Probably better to rename these variables to make them more intuitive (lower instead of A, upper instead of B) and add comments on how this works.

Contributor Author (@jxy):

Yes. Once we understand the full impact on the code gen.

}
}
} // nDim
}); // nDim
Member:

Do you know if there is an easy way to fall back to pragma unroll for (easy meaning a macro define or something like that)? It seems that to make the templates work we need to add this additional ); at the end of each for loop, and there is no easy way to turn the template parameters back into regular C++ variables.

Contributor Author (@jxy):

Blame C++ for obnoxious syntax. I can't figure out a way to deal with the ending of the block, apart from defining a pair of macros like BEGIN_STATIC_FOR/END_STATIC_FOR.

The lambda in the C++17 version of static_for just uses auto. Instead of template parameters, the functions in color_spinor.h could also be defined with auto arguments, though the function bodies would become more complicated depending on how much we want to help the compiler optimize the switch statements.

@jxy (Contributor Author) commented Jul 27, 2022

Let me look at the generated code in more detail.
