Add comprehensive IREE scheduling architecture documentation with dynamic dispatch, custom scheduler implementation, and AOT scheduling guide #1
Conversation
Co-authored-by: copparihollmann <70057799+copparihollmann@users.noreply.github.com>
@copilot So far it seems that we have the workloads pinned to particular cores. Instead, I was thinking of not reserving them as fixed. This would require very high granularity to dispatch the kernels of dispatches to different topologies or CPUs. What would be the best way to approach this?
…ling without fixed core pinning Co-authored-by: copparihollmann <70057799+copparihollmann@users.noreply.github.com>
Added comprehensive documentation for dynamic dispatch without fixed core pinning in commit dfb3450. The docs now cover:

Key approach: instead of a fixed mask, compute the affinity per dispatch:

```c
iree_task_affinity_set_t affinity = compute_dynamic_affinity(scheduler, dispatch);
task->affinity_set = affinity;  // Different per-dispatch!
```

This enables high granularity where each kernel can go to different cores based on real-time availability, without reserving cores for specific workloads. See the new "Dynamic Dispatch" sections in both guides for complete implementation examples.
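As an illustrative sketch only (`compute_dynamic_affinity` is a placeholder name, not an IREE API), a minimal least-loaded policy for building such a per-dispatch affinity bitmask might look like:

```c
#include <stdint.h>

// Hypothetical stand-in for iree_task_affinity_set_t: a 64-bit bitmask
// where bit i enables worker/core i.
typedef uint64_t affinity_set_t;

// Pick the `cores_needed` least-loaded cores and return them as a bitmask.
// `core_load` is an assumed per-core load metric supplied by the caller
// (e.g., queued task count); lower means more available.
affinity_set_t compute_dynamic_affinity(const uint32_t* core_load,
                                        int core_count, int cores_needed) {
  affinity_set_t mask = 0;
  for (int n = 0; n < cores_needed; ++n) {
    int best = -1;
    for (int i = 0; i < core_count; ++i) {
      if (mask & (1ull << i)) continue;  // already selected
      if (best < 0 || core_load[i] < core_load[best]) best = i;
    }
    if (best >= 0) mask |= 1ull << best;
  }
  return mask;
}
```

In a real integration the load metric would come from the executor's worker state rather than a caller-supplied array; the point is only that the mask is recomputed per dispatch instead of being fixed at startup.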
@copilot So, as we investigated, IREE does a lot of scheduling in different sections. It also does certain ahead-of-time scheduling and then more at runtime. Since we have the information of our particular architecture (we know the software and we know the hardware), we are thinking of doing very heavy ahead-of-time scheduling with high granularity. We were thinking of Flexible Job Shop Scheduling with windowed sections, where we take into consideration everything we need, and that would then be the exact schedule. But I don't know whether we would need to do that at the stream layer of the compilation or somewhere else. I also don't know how this would be handled by IREE during runtime, because it sounds like we are replacing part of the runtime with this approach; but then we should question whether that makes sense. So the idea is to maybe annotate the graph: we will compare the different workloads that need to run asynchronously and design a schedule for them, as much as possible ahead of time. But I don't know if this is the best approach, how to integrate it, or which information we would need to design that schedule.

This is what I'm thinking of using to combine the networks:

> …structions (asynchronous) The compilation portion is the same as above just with the addition of the flag telling the compiler to make the modules export functions with async support. It's possible to have modules that contain a mix of synchronous and asynchronous exports and calls if desired. iree-run-module and other IREE tooling automatically detects asynchronous calls and handles adding the wait and signal fences while applications directly using IREE APIs for execution will need to add the fences themselves.
> Run the program by passing in the modules in load order (dependencies first): `iree-run-module`

And here is some information that I found relevant from the task executor (`iree/task/executor.h`, comments only):

```c
// Copyright 2020 The IREE Authors

#ifndef IREE_TASK_EXECUTOR_H_

#include <stdint.h>

#include "iree/base/api.h"

#ifdef __cplusplus

//==============================================================================
// A bitfield specifying the scheduling mode used for configuring how (or if)
// Options controlling task executor behavior.
// Base value added to each executor-local worker index.
// TODO(benvanik): add a scope_spin_ns to control wait-idle and other
// Maximum duration in nanoseconds each worker should spin waiting for
// Minimum size in bytes of each worker thread stack.
// Defines the bytes to be allocated and reserved by each worker to use for
// Initializes |out_options| to default values.
// Base task system executor interface.
// Creates a task executor using the specified topology.
// Retains the given |executor| for the caller.
// Releases the given |executor| from the caller.
// Trims pools and caches used by the executor and its workers.
// Returns the number of live workers usable by the executor.
// Returns an iree_event_t pool managed by the executor.
// Acquires a fence for the given |scope| from the executor fence pool.
// TODO(benvanik): scheduling mode mutation, compute quota control, etc.
// Submits a batch of tasks for execution.
// Flushes any pending task batches for execution.
// Donates the calling thread to the executor until either |wait_source|

#ifdef __cplusplus
#endif  // IREE_TASK_EXECUTOR_H_
```

But I do believe we could take great advantage of scheduling everything ahead of time for maximum hardware utilization before execution, while still keeping some sort of fallback to how IREE normally handles it. What would be a minimal implementation to test? I would rather prototype fast and fail fast, then build on top, than have something super complicated I can't test.
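One shape such a minimal prototype could take (all names below are hypothetical, not IREE APIs): keep the AOT schedule as a plain lookup table from dispatch id to planned affinity, and fall back to a default mask, i.e. IREE's normal scheduling path, whenever a dispatch has no entry. This gives the fail-fast property asked for above: the table can start nearly empty and grow as the FJSP solver covers more of the graph.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical AOT schedule entry; illustrative only, not part of IREE.
typedef struct {
  int dispatch_id;         // which dispatch this entry applies to
  uint64_t start_time_ns;  // planned start time within the window
  uint64_t affinity_set;   // planned worker bitmask
} schedule_entry_t;

// Look up the precomputed affinity for a dispatch. Unscheduled dispatches
// return `fallback_mask`, meaning "let IREE's runtime scheduler decide".
uint64_t resolve_affinity(const schedule_entry_t* schedule, size_t count,
                          int dispatch_id, uint64_t fallback_mask) {
  for (size_t i = 0; i < count; ++i) {
    if (schedule[i].dispatch_id == dispatch_id) {
      return schedule[i].affinity_set;
    }
  }
  return fallback_mask;  // fall back to IREE's normal path
}
```

The runtime hook then becomes a one-line change at task construction: set `task->affinity_set` from `resolve_affinity(...)` instead of a constant, which keeps the fallback behavior identical to stock IREE when the table is empty.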
…ion and windowed execution Co-authored-by: copparihollmann <70057799+copparihollmann@users.noreply.github.com>
Added comprehensive AOT scheduling documentation in commit b7db76b addressing your questions about ahead-of-time Flexible Job Shop Scheduling.

Key answers:

Where to implement AOT scheduling?
How to encode the schedule in IR?

```mlir
stream.cmd.dispatch @workload[%x, %y, %z](%buffer) {
  iree.scheduling.start_time_ns = 1500000 : i64,
  iree.scheduling.affinity = 240 : i64,
  iree.scheduling.resource = "npu"
}
```

Does this replace the runtime?
Information needed:
Using async-external:

Windowed scheduling: for long-running robotics workloads, compute the schedule in sliding windows (e.g., 100 ms windows). The runtime executes window-by-window, with optional replanning if deviations are large.

See the new sections in both guides for complete implementation examples, including compiler pass code, runtime integration, and an end-to-end workflow.
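The windowing and replanning decisions described above reduce to two small pure functions, sketched here under assumed names (nothing below is IREE code): mapping a planned start time to its window, and deciding whether observed drift is large enough to invalidate the rest of the plan.

```c
#include <stdint.h>

// 100 ms windows, matching the example window size suggested above.
#define WINDOW_NS 100000000ull

// Returns the index of the scheduling window a planned start time falls
// into, so the runtime can execute (and optionally replan) one window at
// a time.
uint64_t window_index(uint64_t start_time_ns) {
  return start_time_ns / WINDOW_NS;
}

// Decide whether the runtime should replan the next window: if actual
// completion drifted past the planned time by more than `tolerance_ns`,
// the remaining AOT schedule is likely stale.
int should_replan(uint64_t planned_end_ns, uint64_t actual_end_ns,
                  uint64_t tolerance_ns) {
  uint64_t drift = actual_end_ns > planned_end_ns
                       ? actual_end_ns - planned_end_ns
                       : 0;  // finishing early never triggers a replan
  return drift > tolerance_ns;
}
```

Keeping these as side-effect-free functions makes the replanning policy trivial to unit test before wiring it into any executor.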
Documents IREE's multi-layer scheduling system and provides implementation guidance for custom schedulers targeting heterogeneous computing clusters with specialized accelerators (e.g., NPU extensions). Includes static pinning, dynamic dispatch, and ahead-of-time (AOT) scheduling strategies for fine-grained per-kernel scheduling.
Problem
User needs to understand:
Changes
Core Documentation (`docs/website/docs/developers/design-docs/`)

- `iree-scheduling-deep-dive.md` (1,650+ lines): stream scheduling passes (`ScheduleExecution`, `ScheduleAllocation`, `ScheduleConcurrency`), `iree_task_affinity_set_t` (64-bit bitmask), topology-aware worker mapping
- `custom-scheduler-implementation-guide.md` (1,780+ lines): `local_task`
- `SCHEDULING_QUICK_REFERENCE.md` (400+ lines)

Key Findings
IREE provides:

- Per-task affinity masks (e.g., `task->affinity_set = 0b11110000`)
- `queue_execute()`

IREE lacks:
Scheduling Approaches
Static Pinning (Simpler)
Dynamic Dispatch (Flexible)
Ahead-of-Time Scheduling (Predictable) — NEW
Scheduling Approach Decision Matrix
Recommendation for Robotics: Hybrid AOT+Runtime with windowed scheduling
Files

- `iree-scheduling-deep-dive.md` - architecture analysis + dynamic dispatch + AOT scheduling
- `custom-scheduler-implementation-guide.md` - implementation with static, dynamic & AOT approaches
- `SCHEDULING_QUICK_REFERENCE.md` - API reference + dynamic patterns
- `SCHEDULING_README.md` - index
- `IREE_SCHEDULING_INVESTIGATION_SUMMARY.md` - executive summary

Total: 4,800+ lines of documentation. No runtime or compiler code modified.
Original prompt