-
Notifications
You must be signed in to change notification settings - Fork 73
Add sycl_khr_free_function_commands extension #922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add sycl_khr_free_function_commands extension #922
Conversation
This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects. It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).
Previous "0 or more" wording only made sense when reductions could be optionally provided to functions like parallel_for; now that there are dedicated *_reduce functions, at least one reduction is required.
"is" is more consistent with ISO C++ wording.
Co-authored-by: Greg Lueck <[email protected]>
There is no need to constrain T here because T must be device-copyable in order to construct the accessor passed as an argument.
Renaming sycl::nd_item is not a necessary part of the API redesign for submitting work, so it should be moved to its own extension. This will also give us more time to consider the design and naming of any proposed replacement(s), including how they should interact with new functionality proposed in other KHRs.
There are currently no backends that define interop for reductions, so we can remove these functions for now. If we decide later that these functions are necessary, we can release a revision of the KHR.
Co-authored-by: Andrey Alekseenko <[email protected]>
Commands like copy, memcpy, fill, etc are not kernels and so passing a kernel_bundle as a requirement is not meaningful.
Commands like copy, memcpy, fill, etc take their arguments explicitly rather than being captured by a function, and so there is no need to inform the runtime about which accessors are used. If a command uses an accessor, it must have been passed as an argument.
Any accessor passed to a command that will run on the device must have target::device.
These functions are equivalent to the host task submission functions.
| namespace sycl::khr { | ||
|
|
||
| template <typename... Requirements> | ||
| std::optional<event> event_barrier(const queue& q, const requirements<Requirements...>& reqs = {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to reconsider the name of this function. Using the word "event" in the name makes little sense. Neither this function nor command_barrier above directly take any event objects as input. However, both this function and command_barrier do indirectly take event objects via requirements. In fact, both functions have exactly the same semantics regarding the input event(s) -- in both cases the barrier waits until all the events are complete.
Therefore, it is not the event semantics that distinguishes this API from command_barrier. The difference is that command_barrier implicitly waits for all previous commands in the queue to complete, while event_barrier does not.
At one point we had proposed partial_barrier as the name for this function. Can anyone remember why we didn't like that? That seems like a better name to me than event_barrier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think partial_barrier sounds good. We can also consider command_event_barrier and event_barrier. These seem more descriptive and reflect the actual behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, It did predate the requirements. But now that we have requirement, do we still need both function?
Can we just add a bool to command_barrier like implicit_all or something?
If not indeed partial_barrier sound better.
|
|
||
| _Constraints_: | ||
|
|
||
| * [code]#Requirements# does not contain a [code]#kernel_bundle#. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
John originally proposed in this comment that there should be a constraint here that Requirements does not have any accessors with target target::device. However, this depends on what semantic we want for this launch_host function.
If we expect that launch_host has all the same features as handler::host_task, then we need to allow both target::host_task and target::device accessors because target::device accessors have a special interaction with the interop_handle. However, we have noted in the past that it might be better to split out the interop_handle part of host tasks into a separate API.
If our goal is to provide a cleaner semantic of this launch_host command, then maybe we only want it to support the case where it runs pure host code and eliminate the backend interop part. We could add that later as a separate API. If we go this route, then we do probably want a constraint here that Requirements does not have any accessors with target target::device
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have assumed the standard and interop versions integrated into one, consistent with the handler version, but I can see, that it might be cleaner to separate them. We might then have something like launch_host_interop, making it a separate use case, and consistent with the interop_handle class name. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limit host tasks accessors to target::host_task done here: 166bd46
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 0.002$ is the same for the local_accessor, i should we should restrain this PR to do to much (and it doesn't already a lot). IMO It should be "easy" for people to move to this new KHR, just changing their API call and not their kernel.
If we expect that launch_host has all the same features as handler::host_task
So Yes, I will vote for that.
If one day we want to refactor host_task, to create a new one, we can do that with another API call.
If our goal is to provide a cleaner semantic of this launch_host command,
I think it's our goal too, but maybe not in the PR.
| {note}If an [code]#event_barrier# is submitted with no requirements, then this | ||
| operation may be a no-op.{endnote} | ||
|
|
||
| ''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a comment here from #644, so it doesn't get lost ... @aelovikov-intel noted in this comment that we should add an Examples section illustrating how to use buffers and accessors with this new KHR.
| namespace sycl::khr { | ||
|
|
||
| template <typename KernelType, typename... Requirements> | ||
| std::optional<event> launch(const queue& q, range<1> r, const requirements<Requirements...>& reqs, const KernelType& k); (1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a comment from #644, so it doesn't get lost ... @aelovikov-intel proposed in this comment that we should change the type of all the const KernelType& parameters to KernelType&&.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is done in DPC++, for the launch_grouped functions (handler-less), and if the kernel copy is required, it is moved if possible. This might be an opportunity to do some testing if needed.
Co-authored-by: Greg Lueck <[email protected]>
|
The WG discussed this, and feel we need a solution for local memory in this KHR. |
|
Regarding local memory: to me, it seems like the least invasive strategy (as in, it doesn't depend on many other changes) that fits with the current specification of this extension would be using requirements for local accessors - since it's a natural fit with how non-local accessors are proposed to be handled. A future extension for e.g. static work group memory could then make that superfluous where it applies. |
Revamp the proposed specification to provide convenience APIs that are similar to CUDA's `cudaEventRecord` and `cudaStreamWaitEvent` because this is the immediate request from our customer. I think we do still want to add a `record_event` property, but I think we could add that separately as part of the KHR being proposed in KhronosGroup/SYCL-Docs#922, or as a separate oneapi extension based on that KHR.
Revamp the proposed specification to provide convenience APIs that are similar to CUDA's `cudaEventRecord` and `cudaStreamWaitEvent` because this is the immediate request from our customer. I think we do still want to add a `record_event` property, but I think we could add that separately as part of the KHR being proposed in KhronosGroup/SYCL-Docs#922, or as a separate oneapi extension based on that KHR.
|
Agree with @PeterTh , would like to keep the change of this PR "minimal" so we can merge it and then we can discuss new feature. I want to avoid the feature creep problem. This PR is immensely useful as if, so no need to do everything in one go :) |
This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.
This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.
It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).