372 changes: 372 additions & 0 deletions IREE_SCHEDULING_INVESTIGATION_SUMMARY.md
@@ -0,0 +1,372 @@
# IREE Scheduling Investigation Summary

**Date**: December 2, 2024
**Issue**: Understanding IREE's scheduling architecture and implementing custom Flexible Job Shop Scheduling

## Investigation Results

### What We Discovered

IREE implements a sophisticated **multi-layer scheduling system** that operates at three distinct levels:

#### 1. Compiler-Level Scheduling (Ahead-of-Time)
- **Location**: `compiler/src/iree/compiler/Dialect/Stream/Transforms/`
- **Key Passes**:
  - `ScheduleExecution.cpp` - Partitions operations into executable regions
  - `ScheduleAllocation.cpp` - Schedules memory allocations/deallocations
  - `ScheduleConcurrency.cpp` - Optimizes concurrent execution opportunities

#### 2. HAL Layer (Device Abstraction)
- **Location**: `runtime/src/iree/hal/`
- **Components**:
  - Command buffers record work
  - Device queues submit work
  - Semaphores/fences provide synchronization
  - Drivers implement device-specific scheduling

#### 3. Runtime Task System (Fine-Grained Execution)
- **Location**: `runtime/src/iree/task/`
- **Architecture**: Wavefront-style work-stealing scheduler
- **Components**:
  - Task executor coordinates scheduling
  - Workers (thread pool) execute tasks
  - Topology maps workers to CPU cores
  - Affinity sets control task-to-core assignment

### Current Capabilities

✅ **IREE Already Has**:
- Sophisticated task scheduling with work-stealing
- Topology-aware worker placement
- Affinity control for task-to-core mapping
- Concurrent execution of multiple models via timelines
- Pipelined execution with fine-grained dependencies
- Stream-ordered memory allocation

❌ **IREE Does NOT Have**:
- Built-in job shop scheduling algorithm
- Explicit resource reservation (e.g., for NPU)
- Priority-based preemptive scheduling
- Reactive scheduling that adapts to the environment (e.g., temperature, memory pressure)
- Cross-model orchestration with custom policies

### Your Specific Use Case

**Hardware Configuration**:
- Cluster 0: 4 general-purpose cores (cores 0-3)
- Cluster 1: 4 cores with NPU RISC-V extension (cores 4-7)
- Requirement: Schedule multiple concurrent MLIR models with job shop constraints
- Target: Real-time robotics with reactive scheduling

**Key Questions Answered**:

1. **Q: How does IREE schedule multiple MLIR files executing concurrently?**
   - **A**: Via timeline-based execution using HAL semaphores. Each model can execute on an independent timeline, or models can use fork-join patterns to coordinate. The runtime task system handles fine-grained parallelism within each model.

2. **Q: How does IREE assign dispatches to CPU cores?**
   - **A**: Workers are mapped to CPU cores via `iree_task_topology_t`. Tasks specify affinity via `iree_task_affinity_set_t` (a 64-bit bitmask). The coordinator posts tasks to workers matching the affinity, and workers can steal work from others.

3. **Q: Can IREE handle heterogeneous clusters (one with NPU)?**
   - **A**: Yes, through custom topology and affinity control. However, explicit resource management (exclusive NPU access) requires a custom implementation.

4. **Q: Is IREE already doing Flexible Job Shop Scheduling?**
   - **A**: No. IREE has work-stealing and topology awareness, but not job shop scheduling with precedence constraints, resource reservation, and deadline handling. You need to implement this.
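
As a rough illustration of the 64-bit affinity bitmasks mentioned in answer 2, the sketch below builds masks for the two clusters (cores 0-3 and cores 4-7) and checks whether a task's mask stays inside one cluster. It uses a plain `uint64_t` rather than IREE's own affinity-set helpers, and `core_range_mask` and `mask_within_cluster` are hypothetical names introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper (not an IREE API): returns a 64-bit mask with bits
// [first, first + count) set, clamped to the 64-bit width.
static uint64_t core_range_mask(unsigned first, unsigned count) {
  if (count == 0 || first >= 64) return 0;
  if (count > 64 - first) count = 64 - first;
  // Shifting a 64-bit value by 64 is undefined, so handle the full mask apart.
  uint64_t ones = (count == 64) ? UINT64_MAX : ((UINT64_C(1) << count) - 1);
  return ones << first;
}

// Hypothetical helper: true when the task mask is non-empty and every core it
// allows belongs to the given cluster mask.
static bool mask_within_cluster(uint64_t task_mask, uint64_t cluster_mask) {
  return task_mask != 0 && (task_mask & ~cluster_mask) == 0;
}
```

With this layout, `core_range_mask(0, 4)` covers the general-purpose cluster and `core_range_mask(4, 4)` covers the NPU cluster, so a scheduler could reject any NPU job whose mask leaks onto cores 0-3.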

## Implementation Recommendations

### Recommended Approach: Custom HAL Driver ⭐

**Why**: Clean separation, full control, maintainable, testable

**Steps**:
1. Fork `runtime/src/iree/hal/drivers/local_task/` → `job_shop/`
2. Add job shop scheduler to device structure
3. Implement scheduling algorithm in `queue_execute()`
4. Add NPU resource manager with exclusive access
5. Expose scheduling metrics for monitoring

**Estimated Effort**: 2-3 weeks for an experienced C developer

### Key Components to Implement

#### 1. Job Shop Scheduler
```c
typedef struct iree_job_shop_scheduler_t {
  iree_slim_mutex_t queue_mutex;
  iree_scheduler_job_t* ready_queue;    // Priority-ordered
  iree_scheduler_job_t* pending_queue;  // Waiting on dependencies
  iree_cluster_state_t clusters[2];     // Track cluster state
  iree_npu_manager_t* npu_manager;      // Exclusive NPU access
  scheduling_policy_t policy;           // FIFO/Priority/Deadline/Shortest
} iree_job_shop_scheduler_t;
```
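
One way the pending-to-ready transition could work is with a per-job counter of unmet predecessors. The sketch below is an assumption about how precedence constraints might be tracked, not part of the proposed structs: `dep_graph_t`, `add_dependency`, and `complete_job` are hypothetical names, and the fixed-size matrix is chosen only to keep the example short. Callers are expected to keep all indices below `job_count`, which must not exceed `MAX_JOBS`.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_JOBS 16

// Hypothetical dependency tracker for a small, fixed number of jobs.
typedef struct dep_graph_t {
  size_t job_count;                        // Must be <= MAX_JOBS
  uint32_t unmet_deps[MAX_JOBS];           // Predecessors not yet finished
  uint8_t depends_on[MAX_JOBS][MAX_JOBS];  // depends_on[j][i]: j waits for i
} dep_graph_t;

// Records that job `after` must wait for job `before` (idempotent).
static void add_dependency(dep_graph_t* g, size_t before, size_t after) {
  if (!g->depends_on[after][before]) {
    g->depends_on[after][before] = 1;
    g->unmet_deps[after]++;
  }
}

// Marks `finished` as complete. Writes the indices of jobs that just became
// ready into `out_ready` (capacity >= job_count) and returns how many.
static size_t complete_job(dep_graph_t* g, size_t finished, size_t* out_ready) {
  size_t ready_count = 0;
  for (size_t j = 0; j < g->job_count; ++j) {
    if (!g->depends_on[j][finished]) continue;
    g->depends_on[j][finished] = 0;
    if (--g->unmet_deps[j] == 0) out_ready[ready_count++] = j;
  }
  return ready_count;
}
```

In a real scheduler the returned jobs would move from `pending_queue` to `ready_queue` while `queue_mutex` is held.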

#### 2. NPU Resource Manager
```c
typedef struct iree_npu_manager_t {
  iree_task_affinity_set_t npu_core_mask;  // Cores 4-7
  iree_atomic_int32_t npu_in_use;          // Exclusive access
  iree_scheduler_job_t* npu_queue;         // Pending NPU jobs
} iree_npu_manager_t;
```
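
A minimal sketch of the exclusive-access flag, assuming a non-blocking try-acquire is sufficient. It uses standard C11 `<stdatomic.h>` instead of IREE's atomic wrappers so that it compiles on its own, and `npu_lock_t`, `npu_try_acquire`, and `npu_release` are hypothetical names. Job ID 0 is reserved here to mean "free", which is an assumption of this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical ownership flag: 0 means the NPU is free, any other value is
// the ID of the job that currently owns it.
typedef struct npu_lock_t {
  atomic_uint_fast32_t owner_job_id;
} npu_lock_t;

static void npu_lock_init(npu_lock_t* lock) {
  atomic_init(&lock->owner_job_id, 0);
}

// Tries to claim the NPU for `job_id` without blocking. Returns false if the
// NPU is already owned or if `job_id` is the reserved value 0.
static bool npu_try_acquire(npu_lock_t* lock, uint_fast32_t job_id) {
  if (job_id == 0) return false;
  uint_fast32_t expected = 0;
  return atomic_compare_exchange_strong_explicit(
      &lock->owner_job_id, &expected, job_id,
      memory_order_acquire, memory_order_relaxed);
}

// Releases the NPU only if `job_id` is the current owner, so a stale or
// mistaken release from another job cannot free it.
static bool npu_release(npu_lock_t* lock, uint_fast32_t job_id) {
  if (job_id == 0) return false;
  uint_fast32_t expected = job_id;
  return atomic_compare_exchange_strong_explicit(
      &lock->owner_job_id, &expected, 0,
      memory_order_release, memory_order_relaxed);
}
```

Storing the owner's job ID rather than a plain boolean makes ownership bugs easier to detect; a job that fails `npu_try_acquire` would be appended to the pending NPU queue.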

#### 3. Job Metadata
```c
typedef struct iree_job_metadata_t {
  uint32_t job_id;              // Job identifier
  uint32_t operation_id;        // Operation sequence
  uint32_t priority;            // 0-255
  uint64_t deadline_ns;         // Absolute deadline
  bool requires_npu;            // Needs NPU access
  uint32_t estimated_duration;  // For scheduling decisions
} iree_job_metadata_t;
```
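
To show how the metadata could drive a priority-ordered ready queue, here is a small sketch that sorts jobs by priority and breaks ties by deadline. It assumes a higher `priority` value means more urgent, which the original does not specify, and it uses a reduced hypothetical struct (`job_key_t`) with `qsort` rather than the linked queues in the scheduler struct.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical subset of the job metadata that matters for ordering.
typedef struct job_key_t {
  uint32_t job_id;
  uint32_t priority;     // Assumption: larger value = more urgent
  uint64_t deadline_ns;  // Absolute deadline; earlier wins among equals
} job_key_t;

// qsort comparator: highest priority first, then earliest deadline, then
// lowest job ID so the order is deterministic.
static int compare_jobs(const void* lhs, const void* rhs) {
  const job_key_t* a = (const job_key_t*)lhs;
  const job_key_t* b = (const job_key_t*)rhs;
  if (a->priority != b->priority) return a->priority > b->priority ? -1 : 1;
  if (a->deadline_ns != b->deadline_ns) {
    return a->deadline_ns < b->deadline_ns ? -1 : 1;
  }
  if (a->job_id != b->job_id) return a->job_id < b->job_id ? -1 : 1;
  return 0;
}

// Sorts the array in place so that index 0 is the next job to run.
static void sort_ready_jobs(job_key_t* jobs, size_t count) {
  qsort(jobs, count, sizeof(jobs[0]), compare_jobs);
}
```

Comparing with explicit branches instead of subtracting fields avoids overflow when the unsigned values are far apart.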

#### 4. Reactive Scheduling
```c
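void update_scheduler_telemetry(
    iree_job_shop_scheduler_t* scheduler,
    uint32_t cluster_id,
    uint32_t temperature,  // Degrees Celsius
    uint32_t memory_pressure) {
  (void)memory_pressure;        // Not yet used by this sketch
  if (cluster_id >= 2) return;  // clusters[] has only two entries

  // Thermal throttling with hysteresis: throttle above 85°C, restore below
  // 70°C, and keep the current limit in between.
  if (temperature > 85) {
    scheduler->clusters[cluster_id].max_concurrent_jobs = 2;
  } else if (temperature < 70) {
    scheduler->clusters[cluster_id].max_concurrent_jobs = 4;
  }

  // Trigger rescheduling
  reschedule(scheduler);
}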
```

### Extension Points in IREE

1. **Topology Definition**: Define custom topology matching your hardware
   ```c
   iree_task_topology_t topology;
   // Configure clusters 0 and 1 with appropriate masks
   ```

2. **Affinity Control**: Set task affinity at compile or runtime
   ```c
   task->affinity_set = 0xF0;  // Bits 4-7 set: cores 4-7 (NPU cluster)
   ```

3. **HAL Driver**: Implement `iree_hal_device_t` interface
   ```c
   static iree_status_t job_shop_device_queue_execute(...) {
     // Extract metadata
     // Run scheduling algorithm
     // Assign to appropriate cluster
     // Submit to task executor
   }
   ```

4. **Compiler Annotations**: Add custom attributes to IR (optional)
   ```mlir
   stream.cmd.dispatch @kernel[%x, %y, %z]
     attributes {
       iree.dispatch.requires_npu = true,
       iree.dispatch.priority = 200
     }
   ```
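
The "assign to appropriate cluster" step in the HAL driver could look roughly like the sketch below. It is one possible policy under stated assumptions, not IREE behavior: NPU jobs may only run on a cluster that has the NPU, no cluster may exceed its concurrency limit, and among eligible clusters the least loaded one wins. `cluster_load_t`, `choose_cluster`, and `NO_CLUSTER` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER_COUNT 2
#define NO_CLUSTER (-1)

// Hypothetical per-cluster load snapshot.
typedef struct cluster_load_t {
  bool has_npu;
  uint32_t running_jobs;
  uint32_t max_concurrent_jobs;  // May be lowered by thermal throttling
} cluster_load_t;

// Returns the index of the least-loaded eligible cluster, or NO_CLUSTER when
// every eligible cluster is at its limit (the job should stay queued).
static int choose_cluster(const cluster_load_t clusters[CLUSTER_COUNT],
                          bool requires_npu) {
  int best = NO_CLUSTER;
  for (int i = 0; i < CLUSTER_COUNT; ++i) {
    const cluster_load_t* c = &clusters[i];
    if (requires_npu && !c->has_npu) continue;               // Wrong hardware
    if (c->running_jobs >= c->max_concurrent_jobs) continue;  // Full
    // Strict '<' keeps the lower-numbered cluster on ties.
    if (best == NO_CLUSTER || c->running_jobs < clusters[best].running_jobs) {
      best = i;
    }
  }
  return best;
}
```

A production policy might also keep general-purpose work off the NPU cluster whenever cluster 0 has spare capacity, so that NPU jobs are less likely to wait.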

## Documentation Created

We created three comprehensive documents:

### 1. [iree-scheduling-deep-dive.md](./iree-scheduling-deep-dive.md)
**856 lines** - Complete analysis of IREE's scheduling architecture
- Multi-layer scheduling overview
- Compiler scheduling passes
- Runtime task system internals
- CPU core allocation mechanisms
- Concurrent execution patterns
- Current capabilities vs. what's missing
- Key files for implementation

### 2. [custom-scheduler-implementation-guide.md](./custom-scheduler-implementation-guide.md)
**900 lines** - Step-by-step implementation guide
- Complete code examples for custom HAL driver
- Job shop scheduler data structures
- NPU resource manager implementation
- Reactive scheduling for robotics
- Build configuration (CMake/Bazel)
- Testing strategies
- Usage examples

### 3. [SCHEDULING_README.md](./SCHEDULING_README.md)
**94 lines** - Index and quick start guide
- Document overview
- Quick start instructions
- Use cases and key concepts
- Links to related documentation

## Key Files Identified

### Runtime (Task System)
```
runtime/src/iree/task/
├── executor.h/c - Main task executor (coordinator)
├── worker.h/c - Worker thread implementation
├── task.h/c - Task types and lifecycle
├── topology.h/c - CPU topology detection
├── affinity_set.h - Affinity bit manipulation
└── submission.h/c - Task submission API
```

### HAL (Device Abstraction)
```
runtime/src/iree/hal/
├── device.h/c - Device interface
└── drivers/
├── local_task/ - Task-based HAL device (YOUR STARTING POINT)
└── local_sync/ - Synchronous device (reference)
```

### Compiler (Scheduling)
```
compiler/src/iree/compiler/Dialect/Stream/Transforms/
├── ScheduleExecution.cpp - Main execution scheduling
├── ScheduleAllocation.cpp - Memory scheduling
└── ScheduleConcurrency.cpp - Concurrency optimization
```

## Next Steps for Implementation

1. **Phase 1: Prototype** (Week 1)
   - Define custom topology for your 2-cluster hardware
   - Test basic affinity control
   - Verify worker-to-core mapping

2. **Phase 2: Core Scheduler** (Week 2-3)
   - Fork `local_task` driver → `job_shop` driver
   - Implement job shop scheduler
   - Add NPU resource manager
   - Integrate with HAL device

3. **Phase 3: Reactive Features** (Week 4)
   - Add telemetry integration
   - Implement thermal throttling
   - Test deadline handling
   - Add priority preemption

4. **Phase 4: Validation** (Week 5)
   - Test with multiple concurrent models
   - Measure scheduling overhead
   - Validate NPU exclusivity
   - Tune for robotics workload

## Example Usage

```c
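// Sketch only, assumed to run inside a function returning iree_status_t.
// configure_2_cluster_topology, iree_hal_job_shop_device_create, load_models,
// execute_with_scheduling, and update_telemetry are proposed helpers that do
// not exist in IREE; options, allocator, and timeline are set up elsewhere.

// 1. Create topology for your hardware
iree_task_topology_t topology;
iree_task_topology_initialize(&topology);
configure_2_cluster_topology(&topology);  // Clusters 0-1

// 2. Create task executor
iree_task_executor_t* executor = NULL;
IREE_RETURN_IF_ERROR(
    iree_task_executor_create(options, &topology, allocator, &executor));

// 3. Create custom HAL device with scheduler
iree_hal_device_t* device = NULL;
IREE_RETURN_IF_ERROR(iree_hal_job_shop_device_create(
    "job-shop", executor, allocator, &device));

// 4. Load and execute multiple models (session creation omitted)
iree_runtime_session_t* session = NULL;
const char* model_paths[] = {
    "perception.vmfb", "planning_npu.vmfb", "control.vmfb"};
load_models(session, model_paths, 3);

// 5. Execute with pipelining
execute_with_scheduling(session, device, timeline);

// 6. Monitor and adapt
while (running) {
  update_telemetry(device, temperature, memory_pressure);
  // Scheduler adapts automatically
}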
```

## Performance Considerations

### Scheduling Overhead
- **Goal**: < 10 microseconds per scheduling decision
- **Critical Path**: Lock contention on ready queue
- **Optimization**: Per-cluster queues, lock-free when possible

### NPU Contention
- **Challenge**: Multiple jobs may want NPU
- **Solution**: Priority queue for NPU jobs, preemption if needed
- **Metric**: NPU utilization > 90%

### Thermal Throttling
- **Trigger**: Core temperature > 85°C
- **Action**: Reduce concurrency, migrate work
- **Hysteresis**: Restore at < 70°C
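
The trigger and restore thresholds above form a hysteresis band. The sketch below shows that behavior in isolation, assuming temperatures arrive as whole degrees Celsius; `thermal_governor_t` and `thermal_governor_update` are hypothetical names, and the concurrency limits of 4 and 2 are taken from the reactive scheduling snippet earlier in this document.

```c
#include <stdint.h>

// Hypothetical governor state for one cluster.
typedef struct thermal_governor_t {
  uint32_t throttle_above_c;  // Throttle when temperature exceeds this
  uint32_t restore_below_c;   // Restore when temperature drops below this
  uint32_t normal_limit;      // Concurrency limit when cool
  uint32_t throttled_limit;   // Concurrency limit when hot
  uint32_t current_limit;     // Limit currently in effect
} thermal_governor_t;

// Applies one temperature sample and returns the limit now in effect.
// Inside the band [restore_below_c, throttle_above_c] the previous limit is
// kept, which prevents rapid toggling around a single threshold.
static uint32_t thermal_governor_update(thermal_governor_t* g,
                                        uint32_t temp_c) {
  if (temp_c > g->throttle_above_c) {
    g->current_limit = g->throttled_limit;
  } else if (temp_c < g->restore_below_c) {
    g->current_limit = g->normal_limit;
  }
  return g->current_limit;
}
```

Starting from `{85, 70, 4, 2, 4}`, a reading of 80°C keeps the limit at 4, a reading of 86°C drops it to 2, and it stays at 2 until a reading below 70°C arrives.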

### Deadline Satisfaction
- **Requirement**: 99% of high-priority jobs meet deadlines
- **Approach**: Earliest Deadline First (EDF) scheduling
- **Fallback**: Abort low-priority jobs if needed
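
A minimal sketch of the EDF selection step, assuming each ready job carries an absolute deadline and an estimate of its remaining run time. `edf_job_t`, `edf_pick`, and `edf_can_meet_deadline` are hypothetical names; the feasibility check is one way a scheduler might decide when to fall back to aborting lower-priority work.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical view of a ready job for EDF purposes.
typedef struct edf_job_t {
  uint64_t deadline_ns;   // Absolute deadline
  uint64_t remaining_ns;  // Estimated remaining execution time
} edf_job_t;

// Returns the index of the job with the earliest deadline, or `count` when
// the array is empty. Ties go to the lower index.
static size_t edf_pick(const edf_job_t* jobs, size_t count) {
  size_t best = count;
  for (size_t i = 0; i < count; ++i) {
    if (best == count || jobs[i].deadline_ns < jobs[best].deadline_ns) {
      best = i;
    }
  }
  return best;
}

// True when the job would still finish on time if started at `now_ns` and
// run without interruption.
static bool edf_can_meet_deadline(const edf_job_t* job, uint64_t now_ns) {
  return now_ns <= job->deadline_ns &&
         job->remaining_ns <= job->deadline_ns - now_ns;
}
```

The subtraction in `edf_can_meet_deadline` is guarded by the first comparison, so it cannot wrap around when the deadline has already passed.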

## Testing Strategy

### Unit Tests
- Scheduler algorithm correctness
- NPU exclusivity
- Priority ordering
- Thermal throttling

### Integration Tests
- Multiple concurrent models
- Core assignment verification
- Timeline synchronization

### Performance Tests
- Scheduling overhead measurement
- Throughput comparison vs. default
- Stress test with many jobs

### Real-time Tests
- Deadline satisfaction rate
- Latency distribution
- Reactive scheduling behavior
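
For the deadline satisfaction rate, a test harness could record each job's deadline and finish time and then check the result against a target such as 99%. The sketch below assumes timestamps in nanoseconds on a shared clock; `job_result_t`, `count_deadlines_met`, and `meets_deadline_target` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical record captured by the test harness for each finished job.
typedef struct job_result_t {
  uint64_t deadline_ns;
  uint64_t finish_ns;
} job_result_t;

// Counts jobs that finished at or before their deadline.
static size_t count_deadlines_met(const job_result_t* results, size_t count) {
  size_t met = 0;
  for (size_t i = 0; i < count; ++i) {
    if (results[i].finish_ns <= results[i].deadline_ns) ++met;
  }
  return met;
}

// True when at least `required_percent` of `total` jobs met their deadline.
// Uses integer arithmetic to avoid floating-point rounding at the boundary;
// an empty run trivially passes.
static bool meets_deadline_target(size_t met, size_t total,
                                  unsigned required_percent) {
  if (total == 0) return true;
  return met * 100 >= total * (size_t)required_percent;
}
```

With a 99% target, 99 of 100 jobs on time passes while 3 of 4 does not, so short test runs need enough jobs for the percentage to be meaningful.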

## Conclusion

IREE provides an excellent foundation for custom scheduling:
- ✅ Clean architecture with clear extension points
- ✅ Powerful task system with topology awareness
- ✅ Timeline-based execution for pipelining
- ✅ Well-documented code

You need to add:
- ❌ Job shop scheduling algorithm
- ❌ Resource reservation (NPU)
- ❌ Priority/deadline enforcement
- ❌ Reactive adaptation

**The Good News**: IREE's abstractions are clean and its extension points are well defined, so custom scheduling can be implemented as a separate HAL driver without modifying core IREE code.

**Estimated Total Effort**: 4-6 weeks for complete implementation and testing

## Resources

- **Documentation**: `docs/website/docs/developers/design-docs/`
- **IREE Discord**: https://discord.gg/wEWh6Z9nMU
- **GitHub Issues**: https://github.com/iree-org/iree/issues
- **Mailing List**: iree-technical-discussion@lists.lfaidata.foundation

## Contact

For questions about this investigation or implementation:
- Open a GitHub issue with tag `question`
- Ask in IREE Discord #performance channel
- Email the technical discussion mailing list

---

**Investigation completed by**: GitHub Copilot Coding Agent
**Date**: December 2, 2024