
Conversation

@gmlueck (Contributor) commented Oct 7, 2025

Add a proposed extension specification which allows an application to reuse the same event object in multiple command submissions, rather than creating a new event for each submission.

@gmlueck requested a review from a team as a code owner on October 7, 2025 21:43
@gmlueck (Author) commented Oct 7, 2025

This PR proposes a new extension that relies on changes in #20308. However, I think the two PRs can be approved and merged separately. Note that this PR (#20309) only proposes a new extension, so it's OK if this proposed extension is merged even before #20308 is implemented. It is only necessary that #20308 be implemented by the time the extension proposed in #20309 is implemented.

@CaoZhongZ commented:

Just catching up with the discussion: a reusable event might need a counter-based event, or at least would be very straightforward to implement with one. So, when the backend feature is available, we should use it.

Revamp the proposed specification to provide convenience APIs that are
similar to CUDA's `cudaEventRecord` and `cudaStreamWaitEvent` because
this is the immediate request from our customer.

I think we do still want to add a `record_event` property, but I think
we could add that separately as part of the KHR being proposed in
KhronosGroup/SYCL-Docs#922, or as a separate oneapi extension based on
that KHR.
@gmlueck changed the title from [SYCL][Doc] Add spec to record an event to [SYCL][Doc] Add spec to reuse an event on Oct 29, 2025
@gmlueck (Author) commented Oct 29, 2025

Just catching up with the discussion: a reusable event might need a counter-based event, or at least would be very straightforward to implement with one. So, when the backend feature is available, we should use it.

Thanks, @CaoZhongZ. I agree that counter-based events will make this easier. I updated the proposed spec, and I included a description about how it can work in Level Zero in the "Implementation notes" section. DPC++ already uses counter-based events for in-order queues.

Comment on lines 407 to 414
However, standard events do not map directly, because a standard Level Zero event
_E_ cannot be passed to `zeCommandListAppendSignalEvent` until all previous
"signal" operations on _E_ have completed and all previous commands using _E_
as a "wait event" (i.e., a dependency) have completed.
To compensate for this, the runtime can track the lifetime of the backend event.
If the backend event associated with `enqueue_event_signal` is still in use,
the runtime can disassociate that backend event from the SYCL event, and
allocate a new backend event.
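The scheme in this excerpt (disassociate a still-in-use backend event and allocate a fresh one) can be sketched with plain C++ mock types. BackendEvent, SyclEventImpl, and backend_for_signal are illustrative names, not DPC++ internals:

```cpp
#include <cassert>
#include <memory>

// Mock stand-in for a standard Level Zero event.  "in_use" models whether
// any previously submitted signal/wait command still references it.
struct BackendEvent {
  bool in_use = false;
};

// Mock stand-in for the runtime-side event implementation.
struct SyclEventImpl {
  std::shared_ptr<BackendEvent> backend = std::make_shared<BackendEvent>();

  // Called on each signal submission: if the current backend event is still
  // referenced by earlier commands, disassociate it and allocate a fresh
  // one.  The old event stays alive (via shared_ptr) until those commands
  // complete.
  std::shared_ptr<BackendEvent> backend_for_signal() {
    if (backend->in_use)
      backend = std::make_shared<BackendEvent>();
    backend->in_use = true;
    return backend;
  }
};
```

The key point of the sketch is that the SYCL-level event identity is stable while the backend event behind it can be swapped out whenever a previous use is still outstanding.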
@pbalcer (Contributor) commented Oct 31, 2025:

If the make_event function doesn't accept a queue, the runtime won't be able to choose what type of event to allocate, since that is tied directly to the type of the command list the event will be used with.

The trade-off here is that counter-based events have less overhead and better semantics, but potentially might not be supported by all types of queues. L0v1 doesn't support these types of events at all. On the other hand, in L0v2 we are exclusively using counter-based events, for now at least. If we go with the proposed specification as is, we'd be locked into that choice. That's fine, but we'd be giving up, e.g., the ability to use native out-of-order command lists to implement SYCL out-of-order queues in the new adapter. Personally, I think that's an OK trade-off to make.

So my suggestion is that we support this extension only on L0v2, using counter-based events. The runtime tracking and transparently allocating new standard L0 events in the background to support L0v1 seems very complex and potentially error-prone. Not sure if it's worth the effort.

@gmlueck (Author) replied:

My answer here addresses both your comment above and also the one asking what the "primary benefit" of this extension is.

I think there are two goals of this extension. One goal is to make the SYCL API easier for people migrating code from CUDA. I also thought that restructuring the API in this way would help with performance.

I also want to have a consistent style in SYCL for events. We are creating a new "KHR" extension that changes the way applications submit commands to kernels. This is also an opportunity for us to change the way events work. That new KHR can either return an event when a command is submitted (similar to what SYCL does now), or it can take a "signaling" event as input (similar to the way CUDA and Level Zero work). However, I want to choose one style and use it consistently for all types of queues and for all backends. I don't want to use one style for in-order queues and a different style for out-of-order queues, for example. We should view this PR as related to the KHR. If this PR takes a "signaling" event as input, I'd like to change the KHR in the same way.

@gujinghui is the motivating customer for this extension, but I think his situation could be common for other people migrating CUDA code to SYCL. @gujinghui is porting PyTorch to SYCL. Unsurprisingly, PyTorch code has an internal structure that is similar to the CUDA APIs. For example, I think they have internal APIs similar to cudaEventCreate, cudaEventRecord, and cudaStreamWaitEvent that need to be implemented using SYCL. I think this is possible even if SYCL returns an event when a command is submitted. Here is what I think the code would look like in this case:

// Hypothetical "event" implementation inside of PyTorch.
struct Event_s {
  sycl::event e;
};
using Event = Event_s*;

void EventCreate(Event *evt) {
  // Creating a PyTorch event is mostly a nop in this scenario.  We just create an empty wrapper around
  // a dummy "sycl::event" object.
  *evt = new Event_s;
}

void EventRecord(Event evt, sycl::queue q) {
  // Each time we record an event, we really create a new "sycl::event" and overwrite the previous one.
  evt->e = syclex::enqueue_event_signal(q);
}

void StreamWaitEvent(sycl::queue q, Event evt) {
  syclex::enqueue_event_wait(q, {evt->e});
}

My performance concern here is that each call to EventRecord will drop the shared_ptr reference to the event_impl and then create a new event_impl for the new event. This seems like a lot of recycling both in the SYCL runtime itself and also lower down in the stack because the native events also have to be "recycled".

By contrast, if the SYCL APIs take a signaling event as input, the PyTorch code would look like this:

struct Event_s {
  Event_s(const sycl::event &e) : e{e} {}
  sycl::event e;
};
using Event = Event_s*;

void EventCreate(Event *evt) {
  *evt = new Event_s{syclex::make_event()};
}

void EventRecord(Event evt, sycl::queue q) {
  syclex::enqueue_event_signal(q, evt->e);
}

void StreamWaitEvent(sycl::queue q, Event evt) {
  syclex::enqueue_event_wait(q, {evt->e});
}

So my suggestion is that we support this extension only on L0v2, using counter-based events. The runtime tracking and transparently allocating new standard L0 events in the background to support L0v1 seems very complex and potentially error-prone. Not sure if it's worth the effort.

I think the L0v2 is used only for in-order queues, right? As I said, I want the SYCL API to be the same for both in-order and out-of-order queues. However, that doesn't necessarily mean that the L0v1 adapter needs to change. We could decide instead to put this code in the SYCL runtime. That might make sense since (I think) we will need similar code to handle OpenCL. The OpenCL API is similar to the current SYCL API, where the call to submit a command returns an event. Therefore, I expect there will be some code that overwrites the native event object each time the sycl::event is "recorded". Perhaps this code could be shared between the OpenCL and L0v1 adapters. I suspect there will also be similar cases in the SYCL runtime when commands depend on a host task, so the same code might be shared there too.

As far as performance goes, I'm OK if out-of-order queues go through a slow path. Our main performance concern is making in-order queues fast on Level Zero, so we should optimize for that.

@gmlueck (Author) replied:

If the make_event function doesn't accept a queue, the runtime won't be able to choose what type of event to allocate, since that is tied directly to the type of the command list the event will be used with.

Oh, I forgot to address this part of the comment. This is a good observation. I can think of two possible solutions:

  • We could change make_event to take an additional property which tells whether the event will be used with in-order queues vs. out-of-order queues. It would then be an error to use the event as the "signaling" event for the wrong type of queue. Would that be enough to choose the right type of underlying L0 event? Or,

  • We could change make_event to take a queue parameter. In that case, it would be an error to use the event as the "signaling" event for any other queue. @gujinghui, would that be too restrictive for you?

Note that in both cases, we would have to support any type of event in enqueue_event_wait. For example, we need to support an in-order queue waiting on an out-of-order event (or even on an event from a different backend).
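To visualize the first option, here is a minimal sketch with mock types. queue_kind, reusable_event, and the error check are all assumed names for illustration, not the proposed API:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical tag saying which kind of queue the event will signal on,
// fixed when the event is created.  A Level Zero implementation could use
// this to pick counter-based vs. standard events up front.
enum class queue_kind { in_order, out_of_order };

// Mock stand-in for the reusable event.
struct reusable_event {
  queue_kind signal_scope;  // set at creation, never changes
};

reusable_event make_event(queue_kind scope) { return {scope}; }

// Using the event as the signaling event on the wrong kind of queue is an
// error under this scheme; waiting would still accept any event.
void enqueue_event_signal(queue_kind queue_scope, const reusable_event& e) {
  if (e.signal_scope != queue_scope)
    throw std::invalid_argument("event was created for a different queue kind");
}
```

This keeps make_event queue-independent (any in-order queue can signal an in-order event) while still giving the backend enough information at creation time.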

@pbalcer (Contributor) commented Oct 31, 2025:

I think the L0v2 is used only for in-order queues, right?

No, L0v2 out-of-order queues are implemented as a collection of in-order command lists that are used in a round-robin fashion, specifically so that we can use counter-based events. We can easily support the proposed extension on any of the platforms where L0v2 is the default. But there's a small performance gap between our current v2 out-of-order queue implementation and what could be achieved if we used native L0 out-of-order command lists. So far, all of this has been an implementation detail of the adapter that we can easily change. With this extension, we would be precluding the implementation from ever (efficiently) using native out-of-order command lists.
Like I said before, that is an OK trade-off in my view.

The added benefit of supporting this extension only in L0v2 (only with counter-based events) is that we wouldn't have to worry about standard events. This makes the question about queues moot.

A contributor replied:

We could decide instead to put this code in the SYCL runtime.

That might be best long term, in the context of liboffload and upstreaming.

@gmlueck (Author) replied:

The added benefit of supporting this extension only in L0v2 (only with counter based event) ...

What prevents us from dropping the L0v1 adapter entirely? Do we still need it for use with the LTS driver, for example? Remember that the KHR will follow the same pattern as this PR. I think it is not a good option to say that the KHR will not work with the LTS driver.

In reply to the two make_event options above:

We should not associate the event with a particular queue; that would be too restrictive.
If needed, it's OK to add an additional property to this API.

@gmlueck (Author) replied:

I think we may not need to change the interface to make_event. @sergey-semenov what do you think about the implementation proposal in 3af5004?

@pbalcer (Contributor) commented Oct 31, 2025

If the primary benefit of this extension is to improve performance, I suggest first creating a simple prototype and a microbenchmark that would show what sort of benefit we can expect.

This is a non-trivial change to the L0 adapter, and adds additional restrictions on the implementation. Right now I'm skeptical the benefit will be meaningful.

----
namespace sycl::ext::oneapi::experimental {

void enqueue_event_wait(queue q, const std::vector<event>& evts);
A contributor commented:

Is my understanding correct that this is a counterpart of the existing event barrier, but one that does not return an event? Would it make sense to add an optional event& evt argument here, similar to enqueue_event_signal, to provide the same functionality we have today for this type of barrier, but with a reusable event?
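A minimal sketch of the suggested overload set, using mock queue/event types so it compiles standalone (the int return value only tags which overload ran, for illustration; in the extension these would be sycl::queue and the reusable sycl::event):

```cpp
#include <cassert>
#include <vector>

// Mock types standing in for sycl::queue and the reusable sycl::event.
struct queue {};
struct event {};

// Existing barrier-style wait on a list of events (returns no event).
int enqueue_event_wait(queue, const std::vector<event>&) { return 0; }

// Suggested variant: also record completion into a reusable event,
// mirroring enqueue_event_signal.
int enqueue_event_wait(queue, const std::vector<event>&, event& /*signal*/) {
  return 1;
}
```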

I've been sketching out how reusable events can be added to the proposed
KHR sycl_khr_free_function_commands.  Align the names here with the ones
I plan to propose in the KHR.
@gmlueck (Author) commented Nov 17, 2025

I'd like to drive this proposal to some sort of conclusion. I think we need the following:

  • Someone from @intel/dpcpp-specification-reviewers needs to review and either approve or make comments on what needs to change, and
  • @sergey-semenov: Please comment here on whether you think the proposed implementation in 3af5004 is reasonable.

@steffenlarsen (Contributor) left a review:

LGTM!

@gmlueck (Author) commented Dec 3, 2025

@intel/llvm-gatekeepers I think this is ready to merge.

If we identify any other issues during implementation, we will still have time to consider changes to the API. This PR only makes the extension "proposed".

@againull merged commit abde64e into intel:sycl on Dec 9, 2025
3 checks passed

9 participants