-
Notifications
You must be signed in to change notification settings - Fork 795
[SYCL] Optimize NDRDescT by removing sycl::range, sycl::id and padding #18851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Optimize NDRDescT by removing sycl::range, sycl::id and padding #18851
Conversation
sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the throw
from these SYCL classes instead?
@@ -3154,13 +3162,11 @@ _ZN4sycl3_V15queue20wait_and_throw_proxyERKNS0_6detail13code_locationE | |||
_ZN4sycl3_V15queue22memcpyFromDeviceGlobalEPvPKvbmmRKSt6vectorINS0_5eventESaIS6_EE | |||
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_14SubmissionInfoERKNS2_13code_locationEb | |||
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb | |||
_ZNK4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it wasn't you who messed up the sorting here, but please either remove unnecessary changes or clean it up with a preceding PR to just restore the sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have regenerated the linux symbols on a clean branch and it does not appear to change the order at all. I have also double checked by regenerating the symbols on this branch and it made no difference.
Seems that for whatever reason, abi_check.py
just decided to reorder the symbols in this way due to the internal symbols I added/changed.
sycl/source/detail/cg.hpp
Outdated
NDRDescT(sycl::range<Dims_> N, bool SetNumWorkGroups) : Dims{size_t(Dims_)} { | ||
if (SetNumWorkGroups) { | ||
for (size_t I = 0; I < Dims_; ++I) { | ||
NumWorkGroups[I] = N[I]; | ||
} | ||
} else { | ||
for (size_t I = 0; I < Dims_; ++I) { | ||
GlobalSize[I] = N[I]; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really weird to me. I know you didn't introduce this SetNumWorkGroups
thing, but it's odd.
From a quick glance, it looks like:
- We always store the
range
passed to the constructor, but potentially in different places. NumWorkGroups
is only used by hierarchical parallelism (parallel_for_work_group
, specifically).
Could we flip the logic here, so that the constructor always unconditionally stores into GlobalSize
, and the parallel_for_work_group
code knows to read GlobalSize
instead of NumWorkGroups
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been looking into this. There is some confusing stuff going on in the unmodified version of handler.cpp:
Lines 1037 to 1079 in 22c8d2f
case kernel_param_kind_t::kind_stream: { | |
// Stream contains several accessors inside. | |
stream *S = static_cast<stream *>(Ptr); | |
detail::AccessorBaseHost *GBufBase = | |
static_cast<detail::AccessorBaseHost *>(&S->GlobalBuf); | |
detail::AccessorImplPtr GBufImpl = detail::getSyclObjImpl(*GBufBase); | |
detail::Requirement *GBufReq = GBufImpl.get(); | |
addArgsForGlobalAccessor( | |
GBufReq, Index, IndexShift, Size, IsKernelCreatedFromSource, | |
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD); | |
++IndexShift; | |
detail::AccessorBaseHost *GOffsetBase = | |
static_cast<detail::AccessorBaseHost *>(&S->GlobalOffset); | |
detail::AccessorImplPtr GOfssetImpl = detail::getSyclObjImpl(*GOffsetBase); | |
detail::Requirement *GOffsetReq = GOfssetImpl.get(); | |
addArgsForGlobalAccessor( | |
GOffsetReq, Index, IndexShift, Size, IsKernelCreatedFromSource, | |
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD); | |
++IndexShift; | |
detail::AccessorBaseHost *GFlushBase = | |
static_cast<detail::AccessorBaseHost *>(&S->GlobalFlushBuf); | |
detail::AccessorImplPtr GFlushImpl = detail::getSyclObjImpl(*GFlushBase); | |
detail::Requirement *GFlushReq = GFlushImpl.get(); | |
size_t GlobalSize = impl->MNDRDesc.GlobalSize.size(); | |
// If work group size wasn't set explicitly then it must be recieved | |
// from kernel attribute or set to default values. | |
// For now we can't get this attribute here. | |
// So we just suppose that WG size is always default for stream. | |
// TODO adjust MNDRDesc when device image contains kernel's attribute | |
if (GlobalSize == 0) { | |
// Suppose that work group size is 1 for every dimension | |
GlobalSize = impl->MNDRDesc.NumWorkGroups.size(); | |
} | |
addArgsForGlobalAccessor(GFlushReq, Index, IndexShift, Size, | |
IsKernelCreatedFromSource, GlobalSize, impl->MArgs, | |
IsESIMD); | |
++IndexShift; | |
addArg(kernel_param_kind_t::kind_std_layout, &S->FlushBufferSize, | |
sizeof(S->FlushBufferSize), Index + IndexShift); | |
break; |
It looks like to me that it is expected that it can be the case the GlobalSize
is zero. This means that addArgsForGlobalAccessor
are called with the size argument set to zero and then later on GlobalSize
is checked if it is zero and it it is it is set to the size of NumWorkGroups
.
I am not quite sure how this is working. I would have expected there to be issues passing size of zero to addArgsForGlobalAccessor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only other place that NumWorkGroups
is used is in adjectNDRangePerKernel
in sycl/source/detail/scheduler/commands.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because AccImpl->PerWI
just happens to be false so GlobalSize
is not used. Not sure what that variable is meant to signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try using a single variable, and see if anything breaks?
If the other parts of the code are checking for zero GlobalSize
and then reading NumWorkGroups
instead, it seems like you could just remove the check and read GlobalSize
unconditionally,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem to break things. Likely what is happening is if GlobalSize
is all zeros then it is implicitly implied that GlobalSize
needs to be modifed. Such as in adjustNDRangePerKernel
where GlobalSize
is checked if zero and if so, GlobalSize
is set to work group size * NumWorkGroups
.
A lot of very annoying side effects going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also this comment near the bottom of this class:
/// Number of workgroups, used to record the number of workgroups from the
/// simplest form of parallel_for_work_group. If set, all other fields must be
/// zero
std::array<size_t, 3> NumWorkGroups{0, 0, 0};
std::array<size_t, 3> ClusterDimensions{1, 1, 1};
… extra dimensions to zero or one respectively weather LocalSizes is zero or not respectively
} | ||
|
||
for (int I = Dims_; I < 3; ++I) { | ||
LocalSize[I] = LocalSizes[0] ? 1 : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a number of tests that depend on extra LocalSize dimensions higher than Dims_ being set to zero or one depending on whether LocalSizes[I] is zero or not respectively. RequiredWGSize.NoRequiredSize and RequiredWGSize.HasRequiredSize always fail if extra LocalSize dimensions are always set to 1 and various tests such as work_group_size_prop.cpp
and six others fail if extra LocalSize dimensions are always set to zero. This preserves the old behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to me that this was introduced in the first place. It really should not matter what the value of dimensions higher than Dims_ are and should just be ignored. But now a number of tests depend on this behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a TODO to revisit this?
This sort of complexity will have a (small) impact on runtime, but it's also going to make it harder to make changes to NDRDescT
later on. Making sure NDRDescT
returns values we can't explain just to satisfy existing tests is one way to proceed -- but we could also look into whether those tests are actually useful, or rewrite them (and related functionality) to do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the TODO. I can also create a github issue if you think it is a good idea to keep track of this.
… by setting extra dimension values to zero when using spercific constructor
@aelovikov-intel @Pennycook friendly ping. Double checking Aelovikov, you would like me to remove the throw in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't realize I was holding this up. This looks like a definite improvement, and I'm satisfied that there's a TODO to come back and revisit the weird values.
@aelovikov-intel Friendly ping |
Ah sorry, I did not notice you are on vacation. I did not see until hovering over your username and missed it on your profile page. |
I'm not sure how to understand this reply. The PR seems to be using |
Ah, I realise now that I misunderstood you. Your original post was asking about removing the throw from sycl;:range/sycl::id themselves and replacing the operator[] dimension check with an assert. I am currently asking a co-worker about this idea. Looking at the sycl spec it does not seem to say anything about if these classes should throw. I am a bit uncertain as my immediate gut thought is that anything that a user could conceivably do easily should throw and asserts should be used more for sanity checks. |
I have gotten an answer from Gordon Brown. So turns out the spec deliberately does not mandate any kind of error checking for things like range/id subscript operator and leaves it up to the implementation. So it is OK to assert rather than throw. I am unsure though as it is always good to catch errors that the use could make easily. I do think this PR removes the main performance issue of having throws in sycl::range and sycl::id as before it was performing a bunch of subscript operations on them at once. It definitely does increase performance but I have not compared against removing the throw with assert. I think there is also a separate argument that sycl specific objects such as sycl::range/sycl::id should not be used so far into the runtime and instead std containers should be used instead if there is not a good reason. This far into the runtime sycl::range/sycl::id are basically being used as std::arrays with extra restrictions/baggage and less tested. |
I don't think we should be replacing Also friendly ping @intel/llvm-reviewers-runtime for reviews. |
void setNDRangeDescriptor(sycl::range<3> N, bool SetNumWorkGroups); | ||
void setNDRangeDescriptor(sycl::range<3> NumWorkItems, sycl::id<3> Offset); | ||
void setNDRangeDescriptor(sycl::range<3> NumWorkItems, | ||
sycl::range<3> LocalSize, sycl::id<3> Offset); | ||
|
||
void setNDRangeDescriptor(sycl::range<2> N, bool SetNumWorkGroups); | ||
void setNDRangeDescriptor(sycl::range<2> NumWorkItems, sycl::id<2> Offset); | ||
void setNDRangeDescriptor(sycl::range<2> NumWorkItems, | ||
sycl::range<2> LocalSize, sycl::id<2> Offset); | ||
|
||
void setNDRangeDescriptor(sycl::range<1> N, bool SetNumWorkGroups); | ||
void setNDRangeDescriptor(sycl::range<1> NumWorkItems, sycl::id<1> Offset); | ||
void setNDRangeDescriptor(sycl::range<1> NumWorkItems, | ||
sycl::range<1> LocalSize, sycl::id<1> Offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these necessary to be exported? Both "new" and "old".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So these functions are used in handler.cpp
, handler.hpp
and also SchedulerTestUtils.hpp
.
There does not seem to be a straight forward way to stop these functions from being exported?
After chatting with a co-worker, I don't think there is an easy way to fix this. There are attributes to set visibility, but I strongly suspect doing so will break things. Complicated by the fact that setNDRangeDescriptor
is used in SchedulerTestUtils.hpp
.
I would have expected that only functions that are marked as __SYCL_EXPORT
would actually be exported, but that appears to not matter? sycl_symbols_linux.dump
contains loads of symbols that are in the detail namespace and should not be used directly by the user. Windows and linux have different visibility defaults but it appears both have been set to export all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look elegant, but I'm not spending time myself trying to improve, so no justification for me to block this PR.
Thanks @aelovikov-intel. I have replied to your comment above. Just want to make sure you have the chance to read it before this PR gets merged. Otherwise I plan to get this PR merged pretty soon. |
@DBDuncan - This PR was an ABI break as it removed symbols. We need those symbols back ASAP or we will have to revert this patch. |
These symbols were accidentily removed in intel#18851
These symbols were accidentality removed in #18851
sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.