-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace switch with if constexpr #1301
base: develop
Are you sure you want to change the base?
Conversation
It accepts a functor with a templated operator().
Jenkins: Can one of the admins verify this patch? |
include/complex_quda.h
Outdated
@@ -1207,7 +1207,7 @@ lhs.real()*rhs.imag()+lhs.imag()*rhs.real()); | |||
template <typename real> __host__ __device__ inline complex<real> i_(const complex<real> &a) | |||
{ | |||
// FIXME compiler generates worse code with "optimal" code | |||
#if 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you make this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was testing it and forgot to remove it from this branch. I will squash it later. Can you teach me how to see the cuda generated assembly so that I can understand the impact of this switch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the assembly can be inspected using cuobjdump
cuobjdump --dump-sass file.o
will dump the GPU assembly to stdout. There's a variety of options for cuobjdump
that can be inspected with --help
, e.g., to limit which kernels are dumped.
Another interesting option is --dump-resource-usage
which will just dump a summary of the resource usage for each kernel (registers, shared memory, stack frame, etc.)
include/util_quda.h
Outdated
@@ -7,15 +7,42 @@ | |||
#include <tune_key.h> | |||
#include <malloc_quda.h> | |||
|
|||
#ifndef QUDA_MUSTTAIL | |||
#ifdef __clang__ | |||
#define QUDA_MUSTTAIL __attribute__((musttail)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this attribute do with clang? Is this is safe to include for all clang-based compilation targets, e.g., ROCm, clang-cuda, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, https://clang.llvm.org/docs/AttributeReference.html#musttail. On the host, instead of preparing the stack pointer, argument passing, and call
a function, the compiler should generate a jmp
, reusing the existing stack frame. So it's guaranteed that this recursive template expansion would not grow the stack even some how the functions are not inlined. I'm not entirely sure what it would do on a target device. I guess it may not matter if the recursive function is always inlined, or the compiler is smart enough. We should probably check the assembly to be sure.
namespace quda | ||
{ | ||
// strip path from __FILE__ | ||
constexpr const char *str_end(const char *str) { return *str ? str_end(str + 1) : str; } | ||
constexpr bool str_slant(const char *str) { return *str == '/' ? true : (*str ? str_slant(str + 1) : false); } | ||
constexpr const char *r_slant(const char *str) { return *str == '/' ? (str + 1) : r_slant(str - 1); } | ||
constexpr const char *file_name(const char *str) { return str_slant(str) ? r_slant(str_end(str)) : str; } | ||
|
||
template<int A, int B, int D=1, typename F> | ||
__attribute__((always_inline)) __host__ __device__ inline void static_for(F&&f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this function should be target dependent? On CUDA for example, there's no reason why this shouldn't just correspond to an unroll loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a general utility that does not really care where it's used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that we may want the freedom to change the static_for
implementation for different targets. For CUDA, we may want this to correspond to a simple unrolled for loop, and with OMP we may want it to be what you have above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I'm not sure how to do it without introducing more macros, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @jxy. I've left some questions already, and I understand why a static_for
is helpful for non-CUDA architectures.
I don't think the changes to the reconstruct / project interface to make the direction a template parameter are required. Can you explain why you've made this change? Having the direction as a regular direction is more generic, since it allows the compiler to optimize if the variable is known at compile time, but will work with a run-time variable. I think any improvement in performance that you are seeing is not due to this change, and is likely due to changes in instruction ordering from the static_for
versus a regular pragma unroll for
.
namespace quda | ||
{ | ||
// strip path from __FILE__ | ||
constexpr const char *str_end(const char *str) { return *str ? str_end(str + 1) : str; } | ||
constexpr bool str_slant(const char *str) { return *str == '/' ? true : (*str ? str_slant(str + 1) : false); } | ||
constexpr const char *r_slant(const char *str) { return *str == '/' ? (str + 1) : r_slant(str - 1); } | ||
constexpr const char *file_name(const char *str) { return str_slant(str) ? r_slant(str_end(str)) : str; } | ||
|
||
template<int A, int B, int D=1, typename F> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to rename these variables to make them more intuitive (lower
instead of A
, upper
instead of B
) and add comments on how this works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Once we understand the full impact on the code gen.
} | ||
} | ||
} // nDim | ||
}); // nDim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if there is an easy way to fall back to pragma unroll for
(easy meaning having a macro define or something like that)? Seems to make the templates work we need to add this additional );
to the end of each for loop, and there is no easy way to turn the template variables into C++ variables easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blame C++ for obnoxious syntax. I can't figure out a way to deal with the ending of the block, apart from defining a pair of macros like BEGIN_STATIC_FOR
/END_STATIC_FOR
.
The lambda in the C++17 version of the static_for
just uses auto
. Instead of template parameters, the functions in color_spinor.h
could also be defined with auto
arguments, though the function body would be more complicated depending on how much we want to help the compiler in optimizing the switch statements.
Let me look at the generated code in more details. |
This PR replaces the switch statements in
color_spinor.h
withif constexpr
. In order to supplyconstexpr int
to the template parameters, we introducestatic_for
for compile time unrolling the for loop over integer dimensions, along with a macrostatic_for_var
to adapt to C++17 or C++20.Preliminary tests on a single A100-PCIE-40GB shows mild improvement (~2%) in Wilson Dslash performance.
From the
tunecache.tsv
, before,after the changes