Prototype of realm subgraph compilation approach #391
rohany wants to merge 93 commits into StanfordLegion:main from
Conversation
Patched from original Legion commit: 74745a43e3e5194288b6e4172ad4c3c7e28ab713.
Port of Legion commit: 2ff5264743ce02d80f6250eea997274bd6de241d.
Port of Legion commit: 65107043fe1fc89b769bc2d59f75b36b8ef99066.
Port of Legion commit: a62d9c0f5ffb0f3d5b54d6ed23b2b20f5873e8e0.
Port of Legion commit: c3732666b435e022c633a0e17962ded42b580eb5.
This reverts commit d055c09.
Port of Legion commit: 7db7f55046b103c19e5b10135d4881dd1126906c.
Port of Legion commit: 9c3d7aab5cfc1723ef4eb0459ebbdde814790608.
Port of Legion commit: 72c4f27059b815a2d12aa1a7d0e54131c6444a95.
More of the work was around properly handling asynchronous effects in subgraph executions, allowing for the decoupling of host and device synchronizations. This is a port of Legion commit 16e1e4354b93e2b831478c2275af28a633c497d9.
Port of Legion commit: ddcef563df81fafb2819fb68dfeec8944f21445a.
Port of Legion commit: 38e8b6b6071ac939d1d3f6b05afe2878994093ca.
Port of Legion commit: 6f94fac853b74b20981a0a8170cd9fd0bdebe020.
Port of Legion commit: 6821cdec6a2517d39b3646dfb1476ebe900b7878.
Port of Legion commit: c8be23d5a2439e7a282d6248224e7dea7528c1ed.
Port of Legion commit: 613f7e10fca0ae38b85f44b832d3ba9b26636d68.
Port of Legion commit: 71fe0f7d18827e82b2c7d41bbbab58aa22d874b8.
Port of Legion commit: b1373892b553df525198e8623fec4aa30f7e13a1.
* Clean up all subgraph-local state on SubgraphImpl::destroy
* Use a single finish event with an atomic counter instead of P finish events (where P is the number of procs in the replay).

Port of Legion commit: 6cc2376ec3401efe552179acdab8ed86473ff54b.
Port of Legion commit: b578e69284557c910594bc5a7af5e98b18a53168.
Port of Legion commit: 22f72f3ba41c848d8c0712189ca8fea7f53833b1.
Port of Legion commit: 841fdc34f609038f12fc1d056d0e7dec584dfdaa.
This commit also adds support for asynchronous runahead in copies. Port of Legion commit: 35bc5845760ca00f8dd62cc56f748f87afea0470.
Port of Legion commit: f675d9d9be5336e5aa0d89fd642a60755c507aff.
Port of Legion commit: 7a442aca6d988e839592d89da82766a374e4ed9d.
Port of Legion commit: b347e706251761b9e21c0223c94982001cf070b4.
Port of Legion commit: 23719b5abc89c109c2f6de1b28f1c7a61e418907.
Port of Legion commit: 4d48b4ed97ece356f94a24428f82e09bfcb82525.
This commit fixes a few use-after-free issues that may happen when the subgraph state is deleted after another operation finishes while an in-progress completion is still pending. Port of Legion commit: 41238dfe68ab164c02a988f5c9569e8f415513b0.
Making a note for myself: some things in here seem like obvious things to do, since they are general across Realm programs regardless of which kind of hardware we're working on, such as making transfer descriptors reusable for graph replays, or letting tasks bypass some of the processor scheduling logic during graph replays. Those shouldn't require changes to the programming model on the client side. The parts that will require more scrutiny involve some of the assumptions made about GPU tasks and their effects, since it's not obvious how clients can safely opt in to those without changes to the interface in some way. We also need to make sure we're handling the case where they can't opt in to some of those constraints. @rohany, do you think you can describe the different regimes that GPU tasks fall into and what constraints they currently need to abide by for the Realm graphs to work at SOL?
The main contract we need from tasks for this to work is that tasks must promise not to recursively launch more tasks. For GPU tasks in particular, we can take the "runahead" fast path if the task can promise that it is "stream-ordered", i.e., it only launches kernels, copies, etc. queued against dependencies on the Realm stream the task is given. If a task can't promise this, e.g. it needs to block on the input stream, we can still do most of what is described here; we just won't consider the task "asynchronous", meaning it must wait for both the host side and the device side of upstream tasks and copies to complete when executed within the graph.
I know we talked about this offline, but documenting some things for others' visibility.
The problem here doesn't actually have anything to do with sub-tasks specifically. It's more a matter of forward progress. If a task in the graph launches a sub-task, that sub-task may also need to run on the processor where the task in the graph is currently executing. This is especially problematic in cases where the parent task waits on the completion of the child task in the body of its execution. In this case the graph executor will need to yield back the processor to tasks outside of the graph (e.g. when the task body waits on the event). This might be challenging to implement (hard to say without looking at the code more) but should be doable. Note this is true for tasks running on any kind of processor.
It actually has to promise even more than that. It has to promise that it will not access any data it might use as part of its execution except on the device; peeking at instance data without synchronizing with the stream will result in races. This seems like a property of the task itself and not necessarily of the graph, which might require an extension to the code descriptor interface so that this property of tasks is known. Alternatively, we can annotate it on the graphs themselves, since that is the natural place where the optimization will be applied. TBD which is the better approach.

One thing that I'm not entirely sure about here is how this interacts with CUDA's forward progress model. If all you're doing is replaying the graph, then it's pretty obvious we can issue operations to CUDA in a topologically sorted order so that any false dependences CUDA might introduce will not create cycles. However, I think we can't just issue things to CUDA immediately. We need to wait for all the event preconditions coming into the graph to be satisfied before we start issuing things to CUDA. Otherwise we might issue CUDA work out of order, with operations from inside the graph being issued to CUDA before things like copies/tasks that end up being preconditions of the graph have issued their own work to CUDA. That can lead to cycles, given CUDA's penchant for introducing false dependences. I'm trying to think whether there are other scenarios where this could potentially get us into trouble, for example two different graphs replaying in parallel, or a graph replaying in parallel with normal Realm work. It's not obvious to me that this is safe, although I think if we're careful we can make it safe.
Yes, normal external preconditions are handled as opaque dependencies currently. The runahead implemented here is only possible for dependencies that are entirely internal to the subgraph.
I added a few comments in the code around some new API functions per request from @eddy16112. Happy to add more around what is confusing, just let me know (here or on slack). |
```cpp
// If we're not working on a subgraph right now, see if there's
// anything to pop on the pending subgraph queue.
if (current_subgraph == nullptr) {
```
I don't think leaking the subgraph details into the scheduler is a good idea. This becomes very fragile: any functional change to the subgraph risks breaking the scheduler. We need to abstract this away into some form of SubgraphExecutor at least.
```cpp
void XferDes::reset(const std::vector<off_t>& ib_offsets) {
```
That's not a very safe design: the pre-planned XD retains channel bindings, ports, and memory references, and if any of those change the result would be incorrect operation. I'm just leaving a note here and will write up a suggestion later; I think this needs a better approach.
```cpp
// then we can trigger the subgraph and tell it that the control side
// of this XD execution has completed, which will allow other components
// of the subgraph that depend on launched asynchronous work to continue.
if (subgraph_replay_state != nullptr && launches_async_work()) {
```
I think we should remove the branching and do something similar to:
```cpp
class XferDes {
  XDCompletionHandler* completion_handler;  // set at creation time
  // ...
  void mark_completed() {
    completion_handler->on_ib_free(...);
    if (launches_async_work())
      completion_handler->on_async_complete(this);
    else
      completion_handler->on_control_complete(this);
  }
};

class TransferOpCompletionHandler : public XDCompletionHandler { /* ... */ };
class SubgraphCompletionHandler : public XDCompletionHandler { /* ... */ };
```
I doubt that this is the best design, but I think it's incrementally better than what you have here.
```cpp
virtual void shutdown(void);
virtual void push_subgraph_replay_context() override;
```
We don't necessarily need GPUProcessor to know anything about subgraphs. I feel this begs for another abstraction to be added, something like:
```cpp
class ProcessorImpl {
public:
  virtual SubgraphReplayContext* create_replay_context() { return nullptr; }
```

and then you would add

```cpp
class GPUSubgraphReplayContext : public SubgraphReplayContext {
```

then you would:

```cpp
for (auto* proc : all_procs) {
  replay_contexts.push_back(proc->create_replay_context());
}
```

then

```cpp
auto* ctx = replay->subgraph->replay_contexts[proc_index];
ctx->enter();
// ...
pimpl->execute_task(task_id, args);
// ...
ctx->exit();
```
```cpp
void sync_subgraph_incoming_deps(ProcSubgraphReplayState* all_proc_states, unsigned index, GPUStream* stream) {
```
This needs some thinking. I would move this out, or abstract it away into some form of subgraph hooks.
```cpp
// Handle synchronization against any pending subgraph work. Optimization
// only applies when this XD is running locally.
if (subgraph_replay_state != nullptr && running_locally()) {
  sync_subgraph_incoming_deps(subgraph_replay_state, subgraph_index, stream);
```
This inserts a cuStreamWaitEvent on the stream returned by select_stream(), which comes from the shared DMA stream pool: the same streams used by non-subgraph copies. Any non-subgraph work already queued on this stream now gets falsely serialized behind the subgraph's upstream dependency. I think subgraph XDs need their own dedicated streams to avoid polluting the ordering of unrelated transfers.
I'm happy to do that, but is that really changing anything about how this works today? DMA streams are already getting false dependencies on other DMA operations that happened to get mapped to the same DMA stream anyway. This subgraph work is just another DMA operation like any other pending copy.
```cpp
// all of these callbacks in the state machine to finish before the
// subgraph instantiation is considered done.
if (operation_has_async_effects(it.op_kind, it.op_index)) {
  it.is_final_event = true;
```
Assuming the async work on a device has already completed, you now have to wait for all the host callbacks to run, even though the device work finished long ago? Isn't that terribly slow?
This isn't as bad as it sounds. I'm making sure here that all XD host callbacks are done before the subgraph as a whole marks itself as completed. This doesn't affect dependencies between individual operations in the subgraph, only the final event trigger of the entire subgraph. What could happen (as described in the comment) is that the subgraph could declare itself done and then delete itself before an XD callback for the subgraph was invoked, causing memory corruption in the XD callback.
```cpp
if (queue_entry_value == SUBGRAPH_EMPTY_QUEUE_ENTRY && sg_sched_spin) {
  // If we didn't find anything and we're supposed to spin, start spinning.
  while (queue_entry_value == SUBGRAPH_EMPTY_QUEUE_ENTRY) {
    queue_entry_value = queue_entry.load_acquire();
```
I don't understand why we're spinning here. This is inside the scheduler's worker loop: spinning blocks all normal task processing on this processor until some other processor happens to write into this queue slot. If the subgraph is unbalanced, the worker is stuck indefinitely and normal tasks starve. Why not just fall through to the scheduler and let it sleep/wake naturally?
Maybe I'm just missing context and don't really understand what you are doing here?
This code was only enabled for certain benchmarks (hence the extra guard of sg_sched_spin). In those benchmarks, even the latency of sleeping the thread and yielding to the OS was noticeable, and there also weren't non-subgraph tasks to run. This knob isn't important enough to need to make it into a final version, though we'd probably want to leave some kind of comment noting it's something to try for certain workloads. Unfortunately, I no longer remember the absolute value of the impact it had, though I doubt it was extremely significant.
```cpp
// wait for the instance to become ready. For limited cases like this, just
// spin until the waited-on event is triggered within a subgraph.
if (ThreadLocal::in_subgraph_exec) {
  while (!e->has_triggered(gen, poisoned)) {}
```
This spin-wait has no bound and no escape hatch. If the event depends on anything that needs this processor thread to make progress (e.g., a completion callback, an active message handler, or another task queued on this same processor), this deadlocks - the thread is spinning and can never service the work that would trigger the event.
The comment says subgraph tasks should never wait on events. Then we should find a way to enforce that. If we can't enforce it because Legion's DeferredBuffer breaks the contract, then the contract is wrong and needs to be renegotiated at the Legion level, not worked around with an unbounded spin in the Realm event system.
Yes, I don't have a good answer here. This code is a hack. Perhaps with bounded pools there are different contracts we can negotiate, but there might be yet other uses of event waits inside of tasks that Legion does. @lightsighter does anything else other than deferred buffer waits come to mind here?
This commit ports the XferDes reset functionality from StanfordLegion#391.
This PR contains the Realm components of the compilation approach described in https://rohany.github.io/publications/actor-task-duality.pdf. The goal of this PR is not to merge the code directly, but to start a conversation about the overall approach and what major steps we feel the approach needs in order to be something we can commit to in the future. Given that, the code itself should likely go through some major refactoring before it is merged, so this PR is here to describe the overall approach rather than to commit to a particular implementation.