
Prototype of realm subgraph compilation approach #391

Draft
rohany wants to merge 93 commits into StanfordLegion:main from rohany:subgraph-compilation-prototype

Conversation

@rohany
Contributor

@rohany rohany commented Jan 21, 2026

This PR contains the Realm components of the compilation approach described in https://rohany.github.io/publications/actor-task-duality.pdf. The goal of this PR is not to merge the code directly but to start a conversation about the overall approach, and what major steps we feel like the approach needs in order to be something that we can commit to in the future. Given that, the code itself should likely go through some major refactoring before it is merged, so this PR is here to describe the overall approach, rather than being committed to a particular implementation.

rohany added 30 commits January 19, 2026 12:00
Patched from original Legion commit: 74745a43e3e5194288b6e4172ad4c3c7e28ab713.
Port of Legion commit: 2ff5264743ce02d80f6250eea997274bd6de241d.
Port of Legion commit: 65107043fe1fc89b769bc2d59f75b36b8ef99066.
Port of Legion commit: a62d9c0f5ffb0f3d5b54d6ed23b2b20f5873e8e0.
Port of Legion commit: c3732666b435e022c633a0e17962ded42b580eb5.
Port of Legion commit: 7db7f55046b103c19e5b10135d4881dd1126906c.
Port of Legion commit: 9c3d7aab5cfc1723ef4eb0459ebbdde814790608.
Port of Legion commit: 72c4f27059b815a2d12aa1a7d0e54131c6444a95.
More of the work was around properly handling asynchronous effects in
subgraph executions, allowing for the decoupling of host and device
synchronizations.

This is a port of Legion commit 16e1e4354b93e2b831478c2275af28a633c497d9.
Port of Legion commit: ddcef563df81fafb2819fb68dfeec8944f21445a.
Port of Legion commit: 38e8b6b6071ac939d1d3f6b05afe2878994093ca.
Port of Legion commit: 6f94fac853b74b20981a0a8170cd9fd0bdebe020.
Port of Legion commit: 6821cdec6a2517d39b3646dfb1476ebe900b7878.
Port of Legion commit: c8be23d5a2439e7a282d6248224e7dea7528c1ed.
Port of Legion commit: 613f7e10fca0ae38b85f44b832d3ba9b26636d68.
Port of Legion commit: 71fe0f7d18827e82b2c7d41bbbab58aa22d874b8.
Port of Legion commit: b1373892b553df525198e8623fec4aa30f7e13a1.
* Clean up all subgraph-local state on SubgraphImpl::destroy
* Use a single finish event with an atomic counter instead of
  P finish events (where P is the number of procs in the replay).

Port of Legion commit: 6cc2376ec3401efe552179acdab8ed86473ff54b.
Port of Legion commit: b578e69284557c910594bc5a7af5e98b18a53168.
Port of Legion commit: 22f72f3ba41c848d8c0712189ca8fea7f53833b1.
Port of Legion commit: 841fdc34f609038f12fc1d056d0e7dec584dfdaa.
This commit also adds support for asynchronous runahead in copies.

Port of Legion commit: 35bc5845760ca00f8dd62cc56f748f87afea0470.
Port of Legion commit: f675d9d9be5336e5aa0d89fd642a60755c507aff.
Port of Legion commit: 7a442aca6d988e839592d89da82766a374e4ed9d.
Port of Legion commit: b347e706251761b9e21c0223c94982001cf070b4.
Port of Legion commit: 23719b5abc89c109c2f6de1b28f1c7a61e418907.
Port of Legion commit: 4d48b4ed97ece356f94a24428f82e09bfcb82525.
This commit fixes a few use-after free issues that may happen when the
subgraph state is deleted after another operation finishes while an
in-progress completion is still pending.

Port of Legion commit: 41238dfe68ab164c02a988f5c9569e8f415513b0.
@rohany rohany marked this pull request as draft January 21, 2026 00:21
@lightsighter
Contributor

Making a note for myself: there are some things in here that seem like obvious things to do, as they are general across Realm programs regardless of which kind of hardware we're working on, such as the need for transfer descriptors to be reusable for graph replays, or for tasks to bypass some of the processor scheduling logic during graph replays. Those shouldn't require changes to the programming model from the client side. The parts that will require more scrutiny involve some of the assumptions made about GPU tasks and their effects, as it's not obvious how clients can safely opt in to those without changes to the interface in some way. We also need to make sure we're handling the case where they can't opt in to some of those constraints. @rohany, do you think you can describe the different regimes that GPU tasks fall into and what constraints they currently need to abide by for the Realm graphs to work at SOL?

@rohany
Contributor Author

rohany commented Jan 22, 2026

The main contract we need from tasks for this to work is that tasks must promise that they aren't going to recursively launch more tasks.

For GPU tasks in particular, we can go into the "runahead" fast path if the task can promise that it is "stream-ordered", i.e. it only launches kernels / copies etc queued against dependencies in the Realm stream the task is given. If a task can't promise this, i.e. it needs to block on the input stream or something, we can still do most of what is described here, we just won't consider the task as "asynchronous", meaning that it needs to wait for both the host side and device side of upstream tasks and copies to complete when executed within the graph.
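The "stream-ordered" contract above can be illustrated with a minimal toy sketch (all names here are hypothetical, standing in for Realm's stream and task machinery): a stream-ordered task only enqueues work on the stream it is given and returns immediately, so its host-side completion does not imply device-side completion, while a task that blocks on the stream ties the two together.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Toy model of a Realm-provided stream: work enqueued here runs later,
// in order, on the "device". All names are hypothetical.
struct ToyStream {
  std::queue<std::function<void()>> pending;
  void enqueue(std::function<void()> op) { pending.push(std::move(op)); }
  void drain() {  // device-side progress analogue
    while (!pending.empty()) { pending.front()(); pending.pop(); }
  }
};

// A "stream-ordered" task: returns to the caller having only *enqueued*
// device work on the given stream. The graph may run ahead: downstream
// host-side work can start before the device work runs.
void stream_ordered_task(ToyStream &s, int *out) {
  s.enqueue([out] { *out = 42; });  // kernel-launch analogue
  // no blocking on the stream here: host side is done immediately
}

// A task that blocks on the stream cannot be treated as "asynchronous":
// its host-side return already implies device-side completion, so the
// graph must wait for both sides of upstream work before running it.
void blocking_task(ToyStream &s, int *out) {
  s.enqueue([out] { *out = 7; });
  s.drain();  // cudaStreamSynchronize analogue
}
```

The "runahead" fast path in the PR corresponds to treating only the first kind of task as asynchronous; the second kind still executes inside the graph, just with stricter synchronization.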

@lightsighter
Contributor

I know we talked about this offline, but documenting some things for others' visibility.

The main contract we need from tasks for this to work is that tasks must promise that they aren't going to recursively launch more tasks.

The problem here doesn't actually have anything to do with sub-tasks specifically. It's more a matter of forward progress. If a task in the graph launches a sub-task, that sub-task may also need to run on the processor where the task in the graph is currently executing. This is especially problematic in cases where the parent task waits on the completion of the child task in the body of its execution. In this case the graph executor will need to yield back the processor to tasks outside of the graph (e.g. when the task body waits on the event). This might be challenging to implement (hard to say without looking at the code more) but should be doable. Note this is true for tasks running on any kind of processor.

For GPU tasks in particular, we can go into the "runahead" fast path if the task can promise that it is "stream-ordered", i.e. it only launches kernels / copies etc queued against dependencies in the Realm stream the task is given.

It actually has to promise even more than that. It has to promise that it's not going to access any data that it might use as part of its execution except on the device. Peeking at instance data without synchronizing with the stream will result in races. It seems like this is a property of the task itself and not necessarily the graph, which might require an extension to the code descriptor interface to know this property of tasks. Alternatively we can annotate it on the graphs themselves since that is the natural place where the optimization will be applied. TBD which is the better approach.

One thing that I'm not entirely sure about here is how this interacts with CUDA's forward progress model. If all you're doing is replaying the graph, then it's pretty obvious we can issue operations to CUDA in a topologically sorted way so that any false dependences CUDA might introduce will not create any cycles. However, I think we can't just issue things to CUDA immediately. We do need to wait for all the event preconditions coming into the graph to be satisfied before we start issuing things to CUDA. Otherwise we might issue CUDA work out of order: operations from inside the graph could reach CUDA before the copies/tasks that are preconditions of the graph issue their own work to CUDA. That can lead to cycles given CUDA's penchant for introducing false dependences. I'm trying to think if there are other scenarios where this could potentially get us into trouble, for example if we have two different graphs replaying in parallel, or a graph replaying in parallel with normal Realm work. It's not obvious to me this is safe, although I think if we're careful we can make it safe.
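The gating rule described above can be sketched as a toy (hypothetical names throughout): internal operations are issued in topological order, but issuing does not begin until every external event precondition of the graph has triggered.

```cpp
#include <cassert>
#include <vector>

// Toy replay gate: internal ops are pre-sorted topologically, but nothing
// is handed to the device until all external preconditions have triggered.
struct ToyGraph {
  std::vector<int> topo_order;     // internal ops in topological order
  std::vector<bool> external_pre;  // external event preconditions
  std::vector<int> issued;         // order ops reached the device queue

  bool preconditions_ready() const {
    for (bool p : external_pre)
      if (!p) return false;
    return true;
  }

  // Only start issuing once *every* external precondition has triggered;
  // otherwise work inside the graph could reach CUDA before upstream work
  // outside the graph does, and CUDA's implicit ordering could then
  // manufacture a dependence cycle.
  bool try_replay() {
    if (!preconditions_ready()) return false;
    for (int op : topo_order) issued.push_back(op);
    return true;
  }
};
```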

@rohany
Contributor Author

rohany commented Jan 23, 2026

We do need to wait for all the event preconditions coming into the graph to be satisfied before we start issuing things to CUDA.

Yes, normal external preconditions are handled as opaque dependencies currently. The runahead implemented here is only possible for dependencies that are entirely internal to the subgraph.

@rohany
Contributor Author

rohany commented Jan 28, 2026

I added a few comments in the code around some new API functions per request from @eddy16112. Happy to add more around what is confusing, just let me know (here or on slack).


// If we're not working on a subgraph right now, see if there's
// anything to pop on the pending subgraph queue.
if (current_subgraph == nullptr) {
Contributor

I don't think leaking the subgraph details into the scheduler is a good idea. This becomes very fragile: future functionality changes to the subgraph risk breaking this. We need to abstract it away behind some form of SubgraphExecutor at least.

}
}

void XferDes::reset(const std::vector<off_t>& ib_offsets) {
Contributor

@apryakhin apryakhin Mar 10, 2026

That's not a very safe design: the pre-planned XD retains channel bindings, ports, and memory references. If any of those change, the operation becomes incorrect. I am just leaving a note here and will write up a suggestion later; I think this needs a better approach.

// then we can trigger the subgraph and tell it that the control side
// of this XD execution has completed, which will allow other components
// of the subgraph that depend on launched asynchronous work to continue.
if (subgraph_replay_state != nullptr && launches_async_work()) {
Contributor

I think we should remove the branching and do something similar to:

class XferDes {
    XDCompletionHandler* completion_handler; // set at creation time
    // ...
    void mark_completed() {
        completion_handler->on_ib_free(...);
        if (launches_async_work())
            completion_handler->on_async_complete(this);
        else
            completion_handler->on_control_complete(this);
    }
};
class TransferOpCompletionHandler : public XDCompletionHandler {
class SubgraphCompletionHandler : public XDCompletionHandler {

I doubt that this is the best design but I think that's incrementally better than what you have here.
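A compilable toy version of the handler pattern suggested in this review comment might look like the following (the names follow the reviewer's sketch but are hypothetical, and the bodies are invented for illustration): the handler is chosen once at XferDes creation time, so the completion path dispatches through the handler rather than branching on subgraph state.

```cpp
#include <cassert>
#include <string>

// Completion interface: set once when the XferDes is created, so the
// XferDes itself carries no subgraph-specific branching at completion.
struct XDCompletionHandler {
  virtual ~XDCompletionHandler() = default;
  virtual void on_async_complete() = 0;    // device work still outstanding
  virtual void on_control_complete() = 0;  // host side fully done
};

struct ToyXferDes {
  XDCompletionHandler *handler;  // set at creation time
  bool async_work;               // stands in for launches_async_work()
  void mark_completed() {
    if (async_work)
      handler->on_async_complete();
    else
      handler->on_control_complete();
  }
};

// One concrete handler recording which path was taken; a real system would
// have one handler type per completion policy (transfer op vs. subgraph).
struct RecordingHandler : XDCompletionHandler {
  std::string last;
  void on_async_complete() override { last = "async"; }
  void on_control_complete() override { last = "control"; }
};
```

The design choice being illustrated: the decision of *what completion means* moves from the XferDes into whoever created it, so adding a new execution mode means adding a handler subclass rather than another branch in `mark_completed`.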


virtual void shutdown(void);

virtual void push_subgraph_replay_context() override;
Contributor

We don't necessarily need GPUProcessor to know anything about sub-graphs. I feel this begs for another abstraction to be added, something like:

class ProcessorImpl {
public:
    virtual SubgraphReplayContext* create_replay_context() { return nullptr; }

and then you would add

class GPUSubgraphReplayContext : public SubgraphReplayContext {

then you would:

for (auto* proc : all_procs) {
    replay_contexts.push_back(proc->create_replay_context());
}

then

auto* ctx = replay->subgraph->replay_contexts[proc_index];
ctx->enter();
...
pimpl->execute_task(task_id, args);
...
ctx->exit();
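Filling in the reviewer's fragments, a self-contained toy of this factory pattern could look like the following (all names and bodies are hypothetical): processors with no replay support return a null context, and the GPU processor returns a context whose enter/exit brackets task execution.

```cpp
#include <cassert>
#include <memory>

// Base context: processors that know nothing about subgraph replays get a
// do-nothing default via a null factory result.
struct SubgraphReplayContext {
  virtual ~SubgraphReplayContext() = default;
  virtual void enter() {}
  virtual void exit() {}
};

struct ProcessorImpl {
  virtual ~ProcessorImpl() = default;
  virtual std::unique_ptr<SubgraphReplayContext> create_replay_context() {
    return nullptr;  // default: no replay support needed on this processor
  }
};

// GPU-specific context: enter/exit stand in for pushing/popping whatever
// per-replay GPU state (e.g. stream bindings) the real code needs.
struct GPUSubgraphReplayContext : SubgraphReplayContext {
  int &depth;
  explicit GPUSubgraphReplayContext(int &d) : depth(d) {}
  void enter() override { ++depth; }
  void exit() override { --depth; }
};

struct GPUProcessor : ProcessorImpl {
  int replay_depth = 0;
  std::unique_ptr<SubgraphReplayContext> create_replay_context() override {
    return std::make_unique<GPUSubgraphReplayContext>(replay_depth);
  }
};
```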

return false;
}

void sync_subgraph_incoming_deps(ProcSubgraphReplayState* all_proc_states, unsigned index, GPUStream* stream) {
Contributor

this needs some thinking I would move this out or abstract away into some form of Subgraph hooks

// Handle synchronization against any pending subgraph work. Optimization
// only applies when this XD is running locally.
if (subgraph_replay_state != nullptr && running_locally()) {
sync_subgraph_incoming_deps(subgraph_replay_state, subgraph_index, stream);
Contributor

This inserts cuStreamWaitEvent on the stream returned by select_stream()? which is the shared DMA stream pool - the same streams used by non-subgraph copies. Any non-subgraph work already queued on this stream now gets falsely serialized behind the subgraph's upstream dependency. I think subgraph XDs need their own dedicated streams to avoid polluting the ordering of unrelated transfers.

Contributor Author

I'm happy to do that, but is that really changing anything about how this works today? DMA streams are already getting false dependencies on other DMA operations that happened to get mapped to the same DMA stream anyway. This subgraph work is just another DMA operation like any other pending copy.

// all of these callbacks in the state machine to finish before the
// subgraph instantiation is considered done.
if (operation_has_async_effects(it.op_kind, it.op_index)) {
it.is_final_event = true;
Contributor

Assuming your async work on a device has already completed, you now have to wait for all the host callbacks to run, even though the device work finished long ago? That must be terribly slow?

Contributor Author

This isn't as bad as it sounds. I'm making sure here that all XD host callbacks are done before the subgraph as a whole marks itself as completed. This doesn't affect dependencies between individual operations in the subgraph, only the final event trigger of the entire subgraph. What could happen (as described in the comment) is that the subgraph could declare itself done and then delete itself before an XD callback for the subgraph was called, causing memory corruption in the XD callback.
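This bookkeeping is in the spirit of the single finish event with an atomic counter mentioned in the commit notes; a minimal sketch (hypothetical names) of why it prevents the use-after-free: the subgraph's finish event only triggers after every contributor, including outstanding host-side XD callbacks, has arrived.

```cpp
#include <atomic>
#include <cassert>

// Toy finish event: counts down one slot per contributing completion
// (per-proc replay completions plus outstanding host-side XD callbacks).
// Teardown of subgraph state is only safe once the last arrival lands.
struct ToyFinishEvent {
  std::atomic<int> remaining;
  bool triggered = false;
  explicit ToyFinishEvent(int contributions) : remaining(contributions) {}
  void arrive() {
    // fetch_sub returns the previous value; the arrival that observes 1
    // is the last one, and only it may trigger/tear down.
    if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1)
      triggered = true;
  }
};
```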

if (queue_entry_value == SUBGRAPH_EMPTY_QUEUE_ENTRY && sg_sched_spin) {
  // If we didn't find anything and we're supposed to spin, start spinning.
  while (queue_entry_value == SUBGRAPH_EMPTY_QUEUE_ENTRY) {
    queue_entry_value = queue_entry.load_acquire();
Contributor

I don't understand why we're spinning here? This is inside the scheduler's worker loop - spinning blocks all normal task processing on this processor until some other processor happens to write into this queue slot. If the subgraph is unbalanced, the worker is stuck indefinitely and normal tasks starve. Why not just fall through to the scheduler and let it sleep/wake naturally?

Maybe I am just missing context and don't really understand what you are doing here?

Contributor Author

This code was only enabled for certain benchmarks (hence the extra guard of sg_sched_spin). In those benchmarks, even the latency of sleeping the thread and yielding to the OS was noticeable, and there also weren't non-subgraph tasks to run. This is a knob that isn't super important to make it into a final version, though we'd probably want to leave some kind of comment there that it's something to try for certain workloads. Unfortunately, I no longer remember the absolute value of the impact it had, though I doubt it was extremely significant.
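One possible middle ground between the unbounded spin and the reviewer's suggested fall-through, sketched here as a hypothetical toy (not what the PR implements): spin a bounded number of iterations for a queue entry, then return control to the scheduler's normal sleep/wake path.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sentinel for an empty subgraph queue slot.
constexpr long kEmptyEntry = -1;

// Spin for at most max_spins iterations waiting for a queue entry.
// Returns the entry if one appeared, else kEmptyEntry, signaling the
// caller to fall through to the regular scheduler loop so normal tasks
// on this processor cannot starve indefinitely.
long bounded_spin_pop(std::atomic<long> &slot, int max_spins) {
  for (int i = 0; i < max_spins; ++i) {
    long v = slot.load(std::memory_order_acquire);
    if (v != kEmptyEntry) return v;
  }
  return kEmptyEntry;  // give the processor back to normal task processing
}
```

This keeps the low-latency benefit for the balanced-benchmark case while bounding the worst case; the right value of `max_spins` would be workload-dependent.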

// wait for the instance to become ready. For limited cases like this, just
// spin until the waited on event is triggered within a subgraph.
if (ThreadLocal::in_subgraph_exec) {
  while (!e->has_triggered(gen, poisoned)) {}
Contributor

This spin-wait has no bound and no escape hatch. If the event depends on anything that needs this processor thread to make progress (e.g., a completion callback, an active message handler, or another task queued on this same processor), this deadlocks - the thread is spinning and can never service the work that would trigger the event.

The comment says subgraph tasks should never wait on events. Then we should find a way to enforce that. If we can't enforce it because Legion's DeferredBuffer breaks the contract, then the contract is wrong and needs to be renegotiated at the Legion level, not worked around with an unbounded spin in the Realm event system.

Contributor Author

Yes, I don't have a good answer here. This code is a hack. Perhaps with bounded pools there are different contracts we can negotiate, but there might be yet other uses of event waits inside of tasks that Legion does. @lightsighter does anything else other than deferred buffer waits come to mind here?

rohany added a commit to rohany/realm that referenced this pull request Mar 25, 2026
This commit ports the XferDes reset functionality from StanfordLegion#391.