
Prototype of realm subgraph compilation approach #391

Draft
rohany wants to merge 93 commits into StanfordLegion:main from rohany:subgraph-compilation-prototype

Conversation

@rohany
Contributor

@rohany rohany commented Jan 21, 2026

This PR contains the Realm components of the compilation approach described in https://rohany.github.io/publications/actor-task-duality.pdf. The goal of this PR is not to merge the code directly but to start a conversation about the overall approach, and what major steps we feel like the approach needs in order to be something that we can commit to in the future. Given that, the code itself should likely go through some major refactoring before it is merged, so this PR is here to describe the overall approach, rather than being committed to a particular implementation.

rohany added 30 commits January 19, 2026 12:00
Patched from original Legion commit: 74745a43e3e5194288b6e4172ad4c3c7e28ab713.
Port of Legion commit: 2ff5264743ce02d80f6250eea997274bd6de241d.
Port of Legion commit: 65107043fe1fc89b769bc2d59f75b36b8ef99066.
Port of Legion commit: a62d9c0f5ffb0f3d5b54d6ed23b2b20f5873e8e0.
Port of Legion commit: c3732666b435e022c633a0e17962ded42b580eb5.
Port of Legion commit: 7db7f55046b103c19e5b10135d4881dd1126906c.
Port of Legion commit: 9c3d7aab5cfc1723ef4eb0459ebbdde814790608.
Port of Legion commit: 72c4f27059b815a2d12aa1a7d0e54131c6444a95.
More of the work was around properly handling asynchronous effects in
subgraph executions, allowing for the decoupling of host and device
synchronizations.

This is a port of Legion commit 16e1e4354b93e2b831478c2275af28a633c497d9.
Port of Legion commit: ddcef563df81fafb2819fb68dfeec8944f21445a.
Port of Legion commit: 38e8b6b6071ac939d1d3f6b05afe2878994093ca.
Port of Legion commit: 6f94fac853b74b20981a0a8170cd9fd0bdebe020.
Port of Legion commit: 6821cdec6a2517d39b3646dfb1476ebe900b7878.
Port of Legion commit: c8be23d5a2439e7a282d6248224e7dea7528c1ed.
Port of Legion commit: 613f7e10fca0ae38b85f44b832d3ba9b26636d68.
Port of Legion commit: 71fe0f7d18827e82b2c7d41bbbab58aa22d874b8.
Port of Legion commit: b1373892b553df525198e8623fec4aa30f7e13a1.
* Clean up all subgraph-local state on SubgraphImpl::destroy
* Use a single finish event with an atomic counter instead of
  P finish events (where P is the number of procs in the replay).

Port of Legion commit: 6cc2376ec3401efe552179acdab8ed86473ff54b.
Port of Legion commit: b578e69284557c910594bc5a7af5e98b18a53168.
Port of Legion commit: 22f72f3ba41c848d8c0712189ca8fea7f53833b1.
Port of Legion commit: 841fdc34f609038f12fc1d056d0e7dec584dfdaa.
This commit also adds support for asynchronous runahead in copies.

Port of Legion commit: 35bc5845760ca00f8dd62cc56f748f87afea0470.
Port of Legion commit: f675d9d9be5336e5aa0d89fd642a60755c507aff.
Port of Legion commit: 7a442aca6d988e839592d89da82766a374e4ed9d.
Port of Legion commit: b347e706251761b9e21c0223c94982001cf070b4.
Port of Legion commit: 23719b5abc89c109c2f6de1b28f1c7a61e418907.
Port of Legion commit: 4d48b4ed97ece356f94a24428f82e09bfcb82525.
This commit fixes a few use-after free issues that may happen when the
subgraph state is deleted after another operation finishes while an
in-progress completion is still pending.

Port of Legion commit: 41238dfe68ab164c02a988f5c9569e8f415513b0.
@rohany rohany marked this pull request as draft January 21, 2026 00:21
@lightsighter
Contributor

Making a note for myself: there are some things in here that seem like obvious things to do, as they are general across Realm programs regardless of which kind of hardware we're working on, such as the need for transfer descriptors to be reusable for graph replays, or for tasks to bypass some of the processor scheduling logic during graph replays. Those shouldn't require changes to the programming model from the client side. The parts that will require more scrutiny involve some of the assumptions made about GPU tasks and their effects, as it's not obvious how clients can safely opt in to those without changes to the interface in some way. We also need to make sure we're handling the case where they can't opt in to some of those constraints. @rohany, do you think you can describe the different regimes that GPU tasks fall into and what constraints they currently need to abide by for the Realm graphs to work at SOL?

@rohany
Contributor Author

rohany commented Jan 22, 2026

The main contract we need from tasks for this to work is that tasks must promise that they aren't going to recursively launch more tasks.

For GPU tasks in particular, we can go into the "runahead" fast path if the task can promise that it is "stream-ordered", i.e. it only launches kernels / copies etc queued against dependencies in the Realm stream the task is given. If a task can't promise this, i.e. it needs to block on the input stream or something, we can still do most of what is described here, we just won't consider the task as "asynchronous", meaning that it needs to wait for both the host side and device side of upstream tasks and copies to complete when executed within the graph.
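The "stream-ordered" contract above can be illustrated with a minimal toy sketch (all names here are hypothetical, standing in for Realm's stream and task machinery): a stream-ordered task only enqueues work on the stream it is given and returns immediately, so its host-side completion does not imply device-side completion, while a task that blocks on the stream ties the two together.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Toy model of a Realm-provided stream: work enqueued here runs later,
// in order, on the "device". All names are hypothetical.
struct ToyStream {
  std::queue<std::function<void()>> pending;
  void enqueue(std::function<void()> op) { pending.push(std::move(op)); }
  void drain() {  // device-side progress analogue
    while (!pending.empty()) { pending.front()(); pending.pop(); }
  }
};

// A "stream-ordered" task: returns to the caller having only *enqueued*
// device work on the given stream. The graph may run ahead: downstream
// host-side work can start before the device work runs.
void stream_ordered_task(ToyStream &s, int *out) {
  s.enqueue([out] { *out = 42; });  // kernel-launch analogue
  // no blocking on the stream here: host side is done immediately
}

// A task that blocks on the stream cannot be treated as "asynchronous":
// its host-side return already implies device-side completion, so the
// graph must wait for both sides of upstream work before running it.
void blocking_task(ToyStream &s, int *out) {
  s.enqueue([out] { *out = 7; });
  s.drain();  // cudaStreamSynchronize analogue
}
```

The "runahead" fast path in the PR corresponds to treating only the first kind of task as asynchronous; the second kind still executes inside the graph, just with stricter synchronization.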

@lightsighter
Contributor

I know we talked about this offline, but documenting some things for others' visibility.

The main contract we need from tasks for this to work is that tasks must promise that they aren't going to recursively launch more tasks.

The problem here doesn't actually have anything to do with sub-tasks specifically. It's more a matter of forward progress. If a task in the graph launches a sub-task, that sub-task may also need to run on the processor where the task in the graph is currently executing. This is especially problematic in cases where the parent task waits on the completion of the child task in the body of its execution. In this case the graph executor will need to yield back the processor to tasks outside of the graph (e.g. when the task body waits on the event). This might be challenging to implement (hard to say without looking at the code more) but should be doable. Note this is true for tasks running on any kind of processor.

For GPU tasks in particular, we can go into the "runahead" fast path if the task can promise that it is "stream-ordered", i.e. it only launches kernels / copies etc queued against dependencies in the Realm stream the task is given.

It actually has to promise even more than that. It has to promise that it's not going to access any data that it might use as part of its execution except on the device. Peeking at instance data without synchronizing with the stream will result in races. It seems like this is a property of the task itself and not necessarily the graph, which might require an extension to the code descriptor interface to know this property of tasks. Alternatively we can annotate it on the graphs themselves since that is the natural place where the optimization will be applied. TBD which is the better approach.

One thing that I'm not entirely sure about here is how this interacts with CUDA's forward progress model. If all you're doing is replaying the graph, then it's pretty obvious we can issue operations to CUDA in a topologically sorted way so that any false dependences CUDA might introduce will not create any cycles. However, I think we can't just issue things to CUDA immediately. We do need to wait for all the event preconditions coming into the graph to be satisfied before we start issuing things to CUDA. Otherwise we might issue CUDA work out of order: operations from inside the graph could reach CUDA before the copies/tasks that are preconditions of the graph issue their own work to CUDA. That can lead to cycles given CUDA's penchant for introducing false dependences. I'm trying to think if there are other scenarios where this could potentially get us into trouble, for example if we have two different graphs replaying in parallel, or a graph replaying in parallel with normal Realm work. It's not obvious to me this is safe, although I think if we're careful we can make it safe.
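The gating rule described above can be sketched as a toy (hypothetical names throughout): internal operations are issued in topological order, but issuing does not begin until every external event precondition of the graph has triggered.

```cpp
#include <cassert>
#include <vector>

// Toy replay gate: internal ops are pre-sorted topologically, but nothing
// is handed to the device until all external preconditions have triggered.
struct ToyGraph {
  std::vector<int> topo_order;     // internal ops in topological order
  std::vector<bool> external_pre;  // external event preconditions
  std::vector<int> issued;         // order ops reached the device queue

  bool preconditions_ready() const {
    for (bool p : external_pre)
      if (!p) return false;
    return true;
  }

  // Only start issuing once *every* external precondition has triggered;
  // otherwise work inside the graph could reach CUDA before upstream work
  // outside the graph does, and CUDA's implicit ordering could then
  // manufacture a dependence cycle.
  bool try_replay() {
    if (!preconditions_ready()) return false;
    for (int op : topo_order) issued.push_back(op);
    return true;
  }
};
```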

@rohany
Contributor Author

rohany commented Jan 23, 2026

We do need to wait for all the event preconditions coming into the graph to be satisfied before we start issuing things to CUDA.

Yes, normal external preconditions are handled as opaque dependencies currently. The runahead implemented here is only possible for dependencies that are entirely internal to the subgraph.

@rohany
Contributor Author

rohany commented Jan 28, 2026

I added a few comments in the code around some new API functions per request from @eddy16112. Happy to add more around what is confusing, just let me know (here or on slack).


// If we're not working on a subgraph right now, see if there's
// anything to pop on the pending subgraph queue.
if (current_subgraph == nullptr) {
Contributor

I don't think leaking the subgraph details into the scheduler is a good idea. This becomes very fragile: future functionality changes to the subgraph risk breaking this. We need to abstract it away behind some form of SubgraphExecutor at least.

}
}

void XferDes::reset(const std::vector<off_t>& ib_offsets) {
Contributor

@apryakhin apryakhin Mar 10, 2026

That's not a very safe design: the pre-planned XD retains channel bindings, ports, and memory references. If any of those change, the operation becomes incorrect. I am just leaving a note here and will write up a suggestion later; I think this needs a better approach.

// then we can trigger the subgraph and tell it that the control side
// of this XD execution has completed, which will allow other components
// of the subgraph that depend on launched asynchronous work to continue.
if (subgraph_replay_state != nullptr && launches_async_work()) {
Contributor

I think we should remove the branching and do something similar to:

class XferDes {
    XDCompletionHandler* completion_handler; // set at creation time
    // ...
    void mark_completed() {
        completion_handler->on_ib_free(...);
        if (launches_async_work())
            completion_handler->on_async_complete(this);
        else
            completion_handler->on_control_complete(this);
    }
};
class TransferOpCompletionHandler : public XDCompletionHandler {
class SubgraphCompletionHandler : public XDCompletionHandler {

I doubt that this is the best design but I think that's incrementally better than what you have here.
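A compilable toy version of the handler pattern suggested in this review comment might look like the following (the names follow the reviewer's sketch but are hypothetical, and the bodies are invented for illustration): the handler is chosen once at XferDes creation time, so the completion path dispatches through the handler rather than branching on subgraph state.

```cpp
#include <cassert>
#include <string>

// Completion interface: set once when the XferDes is created, so the
// XferDes itself carries no subgraph-specific branching at completion.
struct XDCompletionHandler {
  virtual ~XDCompletionHandler() = default;
  virtual void on_async_complete() = 0;    // device work still outstanding
  virtual void on_control_complete() = 0;  // host side fully done
};

struct ToyXferDes {
  XDCompletionHandler *handler;  // set at creation time
  bool async_work;               // stands in for launches_async_work()
  void mark_completed() {
    if (async_work)
      handler->on_async_complete();
    else
      handler->on_control_complete();
  }
};

// One concrete handler recording which path was taken; a real system would
// have one handler type per completion policy (transfer op vs. subgraph).
struct RecordingHandler : XDCompletionHandler {
  std::string last;
  void on_async_complete() override { last = "async"; }
  void on_control_complete() override { last = "control"; }
};
```

The design choice being illustrated: the decision of *what completion means* moves from the XferDes into whoever created it, so adding a new execution mode means adding a handler subclass rather than another branch in `mark_completed`.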


virtual void shutdown(void);

virtual void push_subgraph_replay_context() override;
Contributor

We don't necessarily need GPUProcessor to know anything about sub-graphs. I feel this begs for another abstraction to be added, something like:

class ProcessorImpl {
public:
    virtual SubgraphReplayContext* create_replay_context() { return nullptr; }

and then you would add

class GPUSubgraphReplayContext : public SubgraphReplayContext {

then you would:

for (auto* proc : all_procs) {
    replay_contexts.push_back(proc->create_replay_context());
}

then

auto* ctx = replay->subgraph->replay_contexts[proc_index];
ctx->enter();
...
pimpl->execute_task(task_id, args);
...
ctx->exit();
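Filling in the reviewer's fragments, a self-contained toy of this factory pattern could look like the following (all names and bodies are hypothetical): processors with no replay support return a null context, and the GPU processor returns a context whose enter/exit brackets task execution.

```cpp
#include <cassert>
#include <memory>

// Base context: processors that know nothing about subgraph replays get a
// do-nothing default via a null factory result.
struct SubgraphReplayContext {
  virtual ~SubgraphReplayContext() = default;
  virtual void enter() {}
  virtual void exit() {}
};

struct ProcessorImpl {
  virtual ~ProcessorImpl() = default;
  virtual std::unique_ptr<SubgraphReplayContext> create_replay_context() {
    return nullptr;  // default: no replay support needed on this processor
  }
};

// GPU-specific context: enter/exit stand in for pushing/popping whatever
// per-replay GPU state (e.g. stream bindings) the real code needs.
struct GPUSubgraphReplayContext : SubgraphReplayContext {
  int &depth;
  explicit GPUSubgraphReplayContext(int &d) : depth(d) {}
  void enter() override { ++depth; }
  void exit() override { --depth; }
};

struct GPUProcessor : ProcessorImpl {
  int replay_depth = 0;
  std::unique_ptr<SubgraphReplayContext> create_replay_context() override {
    return std::make_unique<GPUSubgraphReplayContext>(replay_depth);
  }
};
```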

return false;
}

void sync_subgraph_incoming_deps(ProcSubgraphReplayState* all_proc_states, unsigned index, GPUStream* stream) {
Contributor

this needs some thinking I would move this out or abstract away into some form of Subgraph hooks

// Handle synchronization against any pending subgraph work. Optimization
// only applies when this XD is running locally.
if (subgraph_replay_state != nullptr && running_locally()) {
sync_subgraph_incoming_deps(subgraph_replay_state, subgraph_index, stream);
Contributor

This inserts cuStreamWaitEvent on the stream returned by select_stream()? which is the shared DMA stream pool - the same streams used by non-subgraph copies. Any non-subgraph work already queued on this stream now gets falsely serialized behind the subgraph's upstream dependency. I think subgraph XDs need their own dedicated streams to avoid polluting the ordering of unrelated transfers.

Contributor Author

I'm happy to do that, but is that really changing anything about how this works today? DMA streams are already getting false dependencies on other DMA operations that happened to get mapped to the same DMA stream anyway. This subgraph work is just another DMA operation like any other pending copy.

// all of these callbacks in the state machine to finish before the
// subgraph instantiation is considered done.
if (operation_has_async_effects(it.op_kind, it.op_index)) {
it.is_final_event = true;
Contributor

Assuming your async work on a device has already completed, you now have to wait for all the host callbacks to run, even though the device work finished long ago? That must be terribly slow?

Contributor Author

This isn't as bad as it sounds. I'm making sure here that all XD host callbacks are done before the subgraph as a whole marks itself as completed. This doesn't affect dependencies between individual operations in the subgraph, only the final event trigger of the entire subgraph. What could happen (as described in the comment) is that the subgraph could declare itself done and then delete itself before an XD callback for the subgraph was called, causing memory corruption in the XD callback.
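This bookkeeping is in the spirit of the single finish event with an atomic counter mentioned in the commit notes; a minimal sketch (hypothetical names) of why it prevents the use-after-free: the subgraph's finish event only triggers after every contributor, including outstanding host-side XD callbacks, has arrived.

```cpp
#include <atomic>
#include <cassert>

// Toy finish event: counts down one slot per contributing completion
// (per-proc replay completions plus outstanding host-side XD callbacks).
// Teardown of subgraph state is only safe once the last arrival lands.
struct ToyFinishEvent {
  std::atomic<int> remaining;
  bool triggered = false;
  explicit ToyFinishEvent(int contributions) : remaining(contributions) {}
  void arrive() {
    // fetch_sub returns the previous value; the arrival that observes 1
    // is the last one, and only it may trigger/tear down.
    if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1)
      triggered = true;
  }
};
```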

if (queue_entry_value == SUBGRAPH_EMPTY_QUEUE_ENTRY && sg_sched_spin) {
  // If we didn't find anything and we're supposed to spin, start spinning.
  while (queue_entry_value == SUBGRAPH_EMPTY_QUEUE_ENTRY) {
    queue_entry_value = queue_entry.load_acquire();
Contributor

I don't understand why we're spinning here? This is inside the scheduler's worker loop - spinning blocks all normal task processing on this processor until some other processor happens to write into this queue slot. If the subgraph is unbalanced, the worker is stuck indefinitely and normal tasks starve. Why not just fall through to the scheduler and let it sleep/wake naturally?

Maybe I am just missing context and don't really understand what you are doing here?

Contributor Author

This code was only enabled for certain benchmarks (hence the extra guard of sg_sched_spin). In those benchmarks, even the latency of sleeping the thread and yielding to the OS was noticeable, and there also weren't non-subgraph tasks to run. This is a knob that isn't super important to make it into a final version, though we'd probably want to leave some kind of comment there that it's something to try for certain workloads. Unfortunately, I no longer remember the absolute value of the impact it had, though I doubt it was extremely significant.
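One possible middle ground between the unbounded spin and the reviewer's suggested fall-through, sketched here as a hypothetical toy (not what the PR implements): spin a bounded number of iterations for a queue entry, then return control to the scheduler's normal sleep/wake path.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sentinel for an empty subgraph queue slot.
constexpr long kEmptyEntry = -1;

// Spin for at most max_spins iterations waiting for a queue entry.
// Returns the entry if one appeared, else kEmptyEntry, signaling the
// caller to fall through to the regular scheduler loop so normal tasks
// on this processor cannot starve indefinitely.
long bounded_spin_pop(std::atomic<long> &slot, int max_spins) {
  for (int i = 0; i < max_spins; ++i) {
    long v = slot.load(std::memory_order_acquire);
    if (v != kEmptyEntry) return v;
  }
  return kEmptyEntry;  // give the processor back to normal task processing
}
```

This keeps the low-latency benefit for the balanced-benchmark case while bounding the worst case; the right value of `max_spins` would be workload-dependent.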

// wait for the instance to become ready. For limited cases like this, just
// spin until the waited on event is triggered within a subgraph.
if (ThreadLocal::in_subgraph_exec) {
  while (!e->has_triggered(gen, poisoned)) {}
Contributor

This spin-wait has no bound and no escape hatch. If the event depends on anything that needs this processor thread to make progress (e.g., a completion callback, an active message handler, or another task queued on this same processor), this deadlocks - the thread is spinning and can never service the work that would trigger the event.

The comment says subgraph tasks should never wait on events. Then we should find a way to enforce that. If we can't enforce it because Legion's DeferredBuffer breaks the contract, then the contract is wrong and needs to be renegotiated at the Legion level, not worked around with an unbounded spin in the Realm event system.

Contributor Author

Yes, I don't have a good answer here. This code is a hack. Perhaps with bounded pools there are different contracts we can negotiate, but there might be yet other uses of event waits inside of tasks that Legion does. @lightsighter does anything else other than deferred buffer waits come to mind here?

rohany added a commit to rohany/realm that referenced this pull request Mar 25, 2026
This commit ports the XferDes reset functionality from StanfordLegion#391.