Skip to content

[SYCL] Optimize NDRDescT by removing sycl::range, sycl::id and padding #18851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jul 1, 2025

Conversation

DBDuncan
Copy link
Contributor

@DBDuncan DBDuncan commented Jun 6, 2025

sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.

sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.
@DBDuncan DBDuncan requested a review from a team as a code owner June 6, 2025 15:29
@DBDuncan DBDuncan requested a review from aelovikov-intel June 6, 2025 15:29
Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the throw from these SYCL classes instead?

@@ -3154,13 +3162,11 @@ _ZN4sycl3_V15queue20wait_and_throw_proxyERKNS0_6detail13code_locationE
_ZN4sycl3_V15queue22memcpyFromDeviceGlobalEPvPKvbmmRKSt6vectorINS0_5eventESaIS6_EE
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_14SubmissionInfoERKNS2_13code_locationEb
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb
_ZNK4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it wasn't you who messed up the sorting here, but please either remove unnecessary changes or clean it up with a preceding PR to just restore the sorting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have regenerated the linux symbols on a clean branch and it does not appear to change the order at all. I have also double checked by regenerating the symbols on this branch and it made no difference.

Seems that for whatever reason, abi_check.py just decided to reorder the symbols in this way due to the internal symbols I added/changed.

Comment on lines 73 to 82
NDRDescT(sycl::range<Dims_> N, bool SetNumWorkGroups) : Dims{size_t(Dims_)} {
if (SetNumWorkGroups) {
for (size_t I = 0; I < Dims_; ++I) {
NumWorkGroups[I] = N[I];
}
} else {
for (size_t I = 0; I < Dims_; ++I) {
GlobalSize[I] = N[I];
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really weird to me. I know you didn't introduce this SetNumWorkGroups thing, but it's odd.

From a quick glance, it looks like:

  • We always store the range passed to the constructor, but potentially in different places.
  • NumWorkGroups is only used by hierarchical parallelism (parallel_for_work_group, specifically).

Could we flip the logic here, so that the constructor always unconditionally stores into GlobalSize, and the parallel_for_work_group code knows to read GlobalSize instead of NumWorkGroups?

Copy link
Contributor Author

@DBDuncan DBDuncan Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been looking into this. There is some confusing stuff going on in the unmodified version of handler.cpp:

llvm/sycl/source/handler.cpp

Lines 1037 to 1079 in 22c8d2f

case kernel_param_kind_t::kind_stream: {
// Stream contains several accessors inside.
stream *S = static_cast<stream *>(Ptr);
detail::AccessorBaseHost *GBufBase =
static_cast<detail::AccessorBaseHost *>(&S->GlobalBuf);
detail::AccessorImplPtr GBufImpl = detail::getSyclObjImpl(*GBufBase);
detail::Requirement *GBufReq = GBufImpl.get();
addArgsForGlobalAccessor(
GBufReq, Index, IndexShift, Size, IsKernelCreatedFromSource,
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD);
++IndexShift;
detail::AccessorBaseHost *GOffsetBase =
static_cast<detail::AccessorBaseHost *>(&S->GlobalOffset);
detail::AccessorImplPtr GOfssetImpl = detail::getSyclObjImpl(*GOffsetBase);
detail::Requirement *GOffsetReq = GOfssetImpl.get();
addArgsForGlobalAccessor(
GOffsetReq, Index, IndexShift, Size, IsKernelCreatedFromSource,
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD);
++IndexShift;
detail::AccessorBaseHost *GFlushBase =
static_cast<detail::AccessorBaseHost *>(&S->GlobalFlushBuf);
detail::AccessorImplPtr GFlushImpl = detail::getSyclObjImpl(*GFlushBase);
detail::Requirement *GFlushReq = GFlushImpl.get();
size_t GlobalSize = impl->MNDRDesc.GlobalSize.size();
// If work group size wasn't set explicitly then it must be recieved
// from kernel attribute or set to default values.
// For now we can't get this attribute here.
// So we just suppose that WG size is always default for stream.
// TODO adjust MNDRDesc when device image contains kernel's attribute
if (GlobalSize == 0) {
// Suppose that work group size is 1 for every dimension
GlobalSize = impl->MNDRDesc.NumWorkGroups.size();
}
addArgsForGlobalAccessor(GFlushReq, Index, IndexShift, Size,
IsKernelCreatedFromSource, GlobalSize, impl->MArgs,
IsESIMD);
++IndexShift;
addArg(kernel_param_kind_t::kind_std_layout, &S->FlushBufferSize,
sizeof(S->FlushBufferSize), Index + IndexShift);
break;

It looks like to me that it is expected that it can be the case the GlobalSize is zero. This means that addArgsForGlobalAccessor are called with the size argument set to zero and then later on GlobalSize is checked if it is zero and it it is it is set to the size of NumWorkGroups.

I am not quite sure how this is working. I would have expected there to be issues passing size of zero to addArgsForGlobalAccessor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other place that NumWorkGroups is used is in adjectNDRangePerKernel in sycl/source/detail/scheduler/commands.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because AccImpl->PerWI just happens to be false so GlobalSize is not used. Not sure what that variable is meant to signal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try using a single variable, and see if anything breaks?

If the other parts of the code are checking for zero GlobalSize and then reading NumWorkGroups instead, it seems like you could just remove the check and read GlobalSize unconditionally,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem to break things. Likely what is happening is if GlobalSize is all zeros then it is implicitly implied that GlobalSize needs to be modifed. Such as in adjustNDRangePerKernel where GlobalSize is checked if zero and if so, GlobalSize is set to work group size * NumWorkGroups.

A lot of very annoying side effects going on.

Copy link
Contributor Author

@DBDuncan DBDuncan Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also this comment near the bottom of this class:

  /// Number of workgroups, used to record the number of workgroups from the
  /// simplest form of parallel_for_work_group. If set, all other fields must be
  /// zero
  std::array<size_t, 3> NumWorkGroups{0, 0, 0};
  std::array<size_t, 3> ClusterDimensions{1, 1, 1};

DBDuncan added 2 commits June 11, 2025 15:58
… extra dimensions to zero or one respectively weather LocalSizes is zero or not respectively
}

for (int I = Dims_; I < 3; ++I) {
LocalSize[I] = LocalSizes[0] ? 1 : 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of tests that depend on extra LocalSize dimensions higher than Dims_ being set to zero or one depending on whether LocalSizes[I] is zero or not respectively. RequiredWGSize.NoRequiredSize and RequiredWGSize.HasRequiredSize always fail if extra LocalSize dimensions are always set to 1 and various tests such as work_group_size_prop.cpp and six others fail if extra LocalSize dimensions are always set to zero. This preserves the old behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange to me that this was introduced in the first place. It really should not matter what the value of dimensions higher than Dims_ are and should just be ignored. But now a number of tests depend on this behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TODO to revisit this?

This sort of complexity will have a (small) impact on runtime, but it's also going to make it harder to make changes to NDRDescT later on. Making sure NDRDescT returns values we can't explain just to satisfy existing tests is one way to proceed -- but we could also look into whether those tests are actually useful, or rewrite them (and related functionality) to do the right thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the TODO. I can also create a github issue if you think it is a good idea to keep track of this.

… by setting extra dimension values to zero when using spercific constructor
@DBDuncan
Copy link
Contributor Author

DBDuncan commented Jun 18, 2025

@aelovikov-intel @Pennycook friendly ping. Double checking Aelovikov, you would like me to remove the throw in getExternalSemaphore() in sycl/source/detail/cg.hpp? Otherwise feedback is appreciated!

Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't realize I was holding this up. This looks like a definite improvement, and I'm satisfied that there's a TODO to come back and revisit the weird values.

@DBDuncan
Copy link
Contributor Author

@aelovikov-intel Friendly ping

@DBDuncan
Copy link
Contributor Author

Ah sorry, I did not notice you are on vacation. I did not see until hovering over your username and missed it on your profile page.

@aelovikov-intel
Copy link
Contributor

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throw's instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.

Looks to be from this extension: sycl/doc/extensions/proposed/sycl_ext_codeplay_cuda_cluster_group.asciidoc

I do not see any requirement to throw so asserts should be fine.

I'm not sure how to understand this reply. The PR seems to be using std::array still and not a "lighter" "noexcept" sycl::range.

@DBDuncan
Copy link
Contributor Author

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throw's instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.

Looks to be from this extension: sycl/doc/extensions/proposed/sycl_ext_codeplay_cuda_cluster_group.asciidoc
I do not see any requirement to throw so asserts should be fine.

I'm not sure how to understand this reply. The PR seems to be using std::array still and not a "lighter" "noexcept" sycl::range.

Ah, I realise now that I misunderstood you.

Your original post was asking about removing the throw from sycl;:range/sycl::id themselves and replacing the operator[] dimension check with an assert.

I am currently asking a co-worker about this idea. Looking at the sycl spec it does not seem to say anything about if these classes should throw. I am a bit uncertain as my immediate gut thought is that anything that a user could conceivably do easily should throw and asserts should be used more for sanity checks.

@DBDuncan
Copy link
Contributor Author

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throw's instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.

Looks to be from this extension: sycl/doc/extensions/proposed/sycl_ext_codeplay_cuda_cluster_group.asciidoc
I do not see any requirement to throw so asserts should be fine.

I'm not sure how to understand this reply. The PR seems to be using std::array still and not a "lighter" "noexcept" sycl::range.

Ah, I realise now that I misunderstood you.

Your original post was asking about removing the throw from sycl;:range/sycl::id themselves and replacing the operator[] dimension check with an assert.

I am currently asking a co-worker about this idea. Looking at the sycl spec it does not seem to say anything about if these classes should throw. I am a bit uncertain as my immediate gut thought is that anything that a user could conceivably do easily should throw and asserts should be used more for sanity checks.

I have gotten an answer from Gordon Brown. So turns out the spec deliberately does not mandate any kind of error checking for things like range/id subscript operator and leaves it up to the implementation. So it is OK to assert rather than throw.

I am unsure though as it is always good to catch errors that the use could make easily. I do think this PR removes the main performance issue of having throws in sycl::range and sycl::id as before it was performing a bunch of subscript operations on them at once. It definitely does increase performance but I have not compared against removing the throw with assert.

I think there is also a separate argument that sycl specific objects such as sycl::range/sycl::id should not be used so far into the runtime and instead std containers should be used instead if there is not a good reason. This far into the runtime sycl::range/sycl::id are basically being used as std::arrays with extra restrictions/baggage and less tested.

@ProGTX
Copy link
Contributor

ProGTX commented Jun 30, 2025

I don't think we should be replacing throws with asserts in id and range classes in this PR, that seems out of scope.

Also friendly ping @intel/llvm-reviewers-runtime for reviews.

Comment on lines +3841 to +3854
void setNDRangeDescriptor(sycl::range<3> N, bool SetNumWorkGroups);
void setNDRangeDescriptor(sycl::range<3> NumWorkItems, sycl::id<3> Offset);
void setNDRangeDescriptor(sycl::range<3> NumWorkItems,
sycl::range<3> LocalSize, sycl::id<3> Offset);

void setNDRangeDescriptor(sycl::range<2> N, bool SetNumWorkGroups);
void setNDRangeDescriptor(sycl::range<2> NumWorkItems, sycl::id<2> Offset);
void setNDRangeDescriptor(sycl::range<2> NumWorkItems,
sycl::range<2> LocalSize, sycl::id<2> Offset);

void setNDRangeDescriptor(sycl::range<1> N, bool SetNumWorkGroups);
void setNDRangeDescriptor(sycl::range<1> NumWorkItems, sycl::id<1> Offset);
void setNDRangeDescriptor(sycl::range<1> NumWorkItems,
sycl::range<1> LocalSize, sycl::id<1> Offset);
Copy link
Contributor

@aelovikov-intel aelovikov-intel Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these necessary to be exported? Both "new" and "old".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these functions are used in handler.cpp, handler.hpp and also SchedulerTestUtils.hpp.

There does not seem to be a straight forward way to stop these functions from being exported?
After chatting with a co-worker, I don't think there is an easy way to fix this. There are attributes to set visibility, but I strongly suspect doing so will break things. Complicated by the fact that setNDRangeDescriptor is used in SchedulerTestUtils.hpp.

I would have expected that only functions that are marked as __SYCL_EXPORT would actually be exported, but that appears to not matter? sycl_symbols_linux.dump contains loads of symbols that are in the detail namespace and should not be used directly by the user. Windows and linux have different visibility defaults but it appears both have been set to export all.

Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look elegant, but I'm not spending time myself trying to improve, so no justification for me to block this PR.

@DBDuncan
Copy link
Contributor Author

DBDuncan commented Jul 1, 2025

Thanks @aelovikov-intel. I have replied to your comment above. Just want to make sure you have the chance to read it before this PR gets merged. Otherwise I plan to get this PR merged pretty soon.

@aelovikov-intel aelovikov-intel merged commit 96caa74 into intel:sycl Jul 1, 2025
25 checks passed
@steffenlarsen
Copy link
Contributor

@DBDuncan - This PR was an ABI break as it removed symbols. We need those symbols back ASAP or we will have to revert this patch.

DBDuncan added a commit to codeplaysoftware/intel-llvm-mirror that referenced this pull request Jul 2, 2025
These symbols were accidentily removed in intel#18851
steffenlarsen pushed a commit that referenced this pull request Jul 3, 2025
These symbols were accidentality removed in #18851
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants