Skip to content

Conversation

@slawekptak
Copy link

This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.

This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.

It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).

Pennycook and others added 30 commits October 17, 2024 15:00
This extension provides an alternative mechanism for submitting commands to a
device via free-functions that require developers to opt-in to the creation of
event objects.

It also proposes alternative names for several commands (e.g., launch) and
simplifies some concepts (e.g., by removing the need for the nd_range class).
Previous "0 or more" wording only made sense when reductions could be
optionally provided to functions like parallel_for; now that there are
dedicated *_reduce functions, at least one reduction is required.
"is" is more consistent with ISO C++ wording.
There is no need to constrain T here because T must be device-copyable in order
to construct the accessor passed as an argument.
Renaming sycl::nd_item is not a necessary part of the API redesign for
submitting work, so it should be moved to its own extension.

This will also give us more time to consider the design and naming of any
proposed replacement(s), including how they should interact with new
functionality proposed in other KHRs.
There are currently no backends that define interop for reductions,
so we can remove these functions for now. If we decide later that
these functions are necessary, we can release a revision of the KHR.
Co-authored-by: Andrey Alekseenko <[email protected]>
Pennycook and others added 6 commits June 17, 2025 20:44
Commands like copy, memcpy, fill, etc are not kernels and so passing a
kernel_bundle as a requirement is not meaningful.
Commands like copy, memcpy, fill, etc take their arguments explicitly rather
than being captured by a function, and so there is no need to inform the
runtime about which accessors are used. If a command uses an accessor, it must
have been passed as an argument.
Any accessor passed to a command that will run on the device must have
target::device.
These functions are equivalent to the host task submission functions.
namespace sycl::khr {

template <typename... Requirements>
std::optional<event> event_barrier(const queue& q, const requirements<Requirements...>& reqs = {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to reconsider the name of this function. Using the word "event" in the name makes little sense. Neither this function nor command_barrier above directly take any event objects as input. However, both this function and command_barrier do indirectly take event objects via requirements. In fact, both functions have exactly the same semantics regarding the input event(s) -- in both cases the barrier waits until all the events are complete.

Therefore, it is not the event semantics that distinguishes this API from command_barrier. The difference is that command_barrier implicitly waits for all previous commands in the queue to complete, while event_barrier does not.

At one point we had proposed partial_barrier as the name for this function. Can anyone remember why we didn't like that? That seems like a better name to me than event_barrier.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think partial_barrier sounds good. We can also consider command_event_barrier and event_barrier. These seem more descriptive and reflect the actual behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, It did predate the requirements. But now that we have requirement, do we still need both function?

Can we just add a bool to command_barrier like implicit_all or something?

If not indeed partial_barrier sound better.


_Constraints_:

* [code]#Requirements# does not contain a [code]#kernel_bundle#.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

John originally proposed in this comment that there should be a constraint here that Requirements does not have any accessors with target target::device. However, this depends on what semantic we want for this launch_host function.

If we expect that launch_host has all the same features as handler::host_task, then we need to allow both target::host_task and target::device accessors because target::device accessors have a special interaction with the interop_handle. However, we have noted in the past that it might be better to split out the interop_handle part of host tasks into a separate API.

If our goal is to provide a cleaner semantic of this launch_host command, then maybe we only want it to support the case where it runs pure host code and eliminate the backend interop part. We could add that later as a separate API. If we go this route, then we do probably want a constraint here that Requirements does not have any accessors with target target::device

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have assumed the standard and interop versions integrated into one, consistent with the handler version, but I can see, that it might be cleaner to separate them. We might then have something like launch_host_interop, making it a separate use case, and consistent with the interop_handle class name. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limit host tasks accessors to target::host_task done here: 166bd46

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 0.002$ is the same for the local_accessor, i should we should restrain this PR to do to much (and it doesn't already a lot). IMO It should be "easy" for people to move to this new KHR, just changing their API call and not their kernel.

If we expect that launch_host has all the same features as handler::host_task

So Yes, I will vote for that.

If one day we want to refactor host_task, to create a new one, we can do that with another API call.

If our goal is to provide a cleaner semantic of this launch_host command,

I think it's our goal too, but maybe not in the PR.

{note}If an [code]#event_barrier# is submitted with no requirements, then this
operation may be a no-op.{endnote}

'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment here from #644, so it doesn't get lost ... @aelovikov-intel noted in this comment that we should add an Examples section illustrating how to use buffers and accessors with this new KHR.

namespace sycl::khr {

template <typename KernelType, typename... Requirements>
std::optional<event> launch(const queue& q, range<1> r, const requirements<Requirements...>& reqs, const KernelType& k); (1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment from #644, so it doesn't get lost ... @aelovikov-intel proposed in this comment that we should change the type of all the const KernelType& parameters to KernelType&&.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is done in DPC++, for the launch_grouped functions (handler-less), and if the kernel copy is required, it is moved if possible. This might be an opportunity to do some testing if needed.

@tomdeakin
Copy link
Contributor

The WG discussed this, and feel we need a solution for local memory in this KHR.

@PeterTh
Copy link

PeterTh commented Oct 16, 2025

Regarding local memory: to me, it seems like the least invasive strategy (as in, it doesn't depend on many other changes) that fits with the current specification of this extension would be using requirements for local accessors - since it's a natural fit with how non-local accessors are proposed to be handled. A future extension for e.g. static work group memory could then make that superfluous where it applies.

gmlueck added a commit to gmlueck/llvm that referenced this pull request Oct 29, 2025
Revamp the proposed specification to provide convenience APIs that are
similar to CUDA's `cudaEventRecord` and `cudaStreamWaitEvent` because
this is the immediate request from our customer.

I think we do still want to add a `record_event` property, but I think
we could add that separately as part of the KHR being proposed in
KhronosGroup/SYCL-Docs#922, or as a separate oneapi extension based on
that KHR.
gmlueck added a commit to gmlueck/llvm that referenced this pull request Oct 29, 2025
Revamp the proposed specification to provide convenience APIs that are
similar to CUDA's `cudaEventRecord` and `cudaStreamWaitEvent` because
this is the immediate request from our customer.

I think we do still want to add a `record_event` property, but I think
we could add that separately as part of the KHR being proposed in
KhronosGroup/SYCL-Docs#922, or as a separate oneapi extension based on
that KHR.
@TApplencourt
Copy link
Contributor

Agree with @PeterTh , would like to keep the change of this PR "minimal" so we can merge it and then we can discuss new feature. I want to avoid the feature creep problem. This PR is immensely useful as if, so no need to do everything in one go :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants