[WIP][cuebot][FIX] Fix GPU Resource Reservation Preventing Non-GPU Frame Dispatch #2095
Conversation
Looking good so far. It looks like you might have forgotten to add the new property to the test/resource's opencue.properties:

It's now ready for review. Thanks for the hint.

I think you might have accidentally marked this PR as a draft again.

It's actually on purpose. I've been testing it again on our end lately, and I'm not seeing the behavior this fix was supposed to produce. I'll keep you posted on any progress. Apologies for the noise.
Problem
Hosts with GPU capabilities (ALLOW_GPU=true) incorrectly reject frames that have zero GPU memory requirements. This prevents efficient resource utilization where GPU-enabled hosts should accept both GPU and CPU-only workloads.
Root Cause
The getGpuJobs() method in CoreUnitDispatcher calls removeGpu() when no GPU-specific jobs are found. This method reduces all host resources (CPU cores, memory, GPU resources) to "reserve space for future GPU frames." When normal frames are subsequently evaluated, they fail resource checks due to these artificially reduced limits.
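The problematic flow: getGpuJobs() finds no GPU jobs, removeGpu() shrinks the host's available resources, and the non-GPU frames evaluated afterwards fail their resource checks against the reduced limits.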
Solution
Add a configuration property dispatcher.gpu.skip_resource_reservation that disables GPU resource reservation when set to true. Default is false to maintain backward compatibility.
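The effect of the flag can be illustrated with a minimal, self-contained sketch. This is not the Cuebot implementation: the `Host` and `Frame` classes, the `RESERVED_CORES` value, and the `canDispatch` helper are all hypothetical simplifications, and only the idea (reservation is applied when no GPU jobs are found, unless the new property is true) comes from the description above.

```java
// Hypothetical model of the reservation logic; names and numbers are illustrative only.
class GpuReservationSketch {

    /** Simplified host: idle cores and idle GPU memory only. */
    static final class Host {
        final int idleCores;
        final long idleGpuMemory;

        Host(int idleCores, long idleGpuMemory) {
            this.idleCores = idleCores;
            this.idleGpuMemory = idleGpuMemory;
        }
    }

    /** Simplified frame: minimum cores and minimum GPU memory it needs. */
    static final class Frame {
        final int minCores;
        final long minGpuMemory;

        Frame(int minCores, long minGpuMemory) {
            this.minCores = minCores;
            this.minGpuMemory = minGpuMemory;
        }
    }

    // Assumed number of cores held back for future GPU frames.
    static final int RESERVED_CORES = 4;

    /** Stand-in for removeGpu(): returns the host as seen after the reservation. */
    static Host reserveForGpu(Host host) {
        int cores = Math.max(0, host.idleCores - RESERVED_CORES);
        return new Host(cores, 0L);
    }

    /**
     * Decides whether a frame fits on the host.
     * The reservation is applied only when no GPU jobs were found
     * and the skip flag (dispatcher.gpu.skip_resource_reservation) is false.
     */
    static boolean canDispatch(Host host, Frame frame,
                               boolean gpuJobsFound, boolean skipReservation) {
        Host effective = host;
        if (!gpuJobsFound && !skipReservation) {
            effective = reserveForGpu(host);
        }
        return frame.minCores <= effective.idleCores
                && frame.minGpuMemory <= effective.idleGpuMemory;
    }
}
```

In this model, a CPU-only frame that needs 6 cores is rejected on a host with 8 idle cores when the flag is false, because only 4 cores remain after the reservation, and accepted when the flag is true.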
Configuration
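The new behavior is controlled by the environment variable CUEBOT_DISPATCHER_SKIP_GPU_RESERVATION. Set it to true to skip the GPU resource reservation; when it is unset, the default of false applies.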
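As a rough illustration of the default-to-false semantics, the sketch below reads the variable from an environment map. How Cuebot actually binds the variable to dispatcher.gpu.skip_resource_reservation is not shown in this PR text, so the `SkipGpuReservationConfig` class and its `skipGpuReservation` method are hypothetical; only the variable name and the default come from the description.

```java
import java.util.Map;

// Hypothetical helper; not part of Cuebot.
class SkipGpuReservationConfig {

    static final String ENV_NAME = "CUEBOT_DISPATCHER_SKIP_GPU_RESERVATION";

    /**
     * Returns true only when the variable is present and equals "true"
     * (case-insensitive). A missing or unrecognized value yields false,
     * which keeps the existing reservation behavior.
     */
    static boolean skipGpuReservation(Map<String, String> env) {
        String raw = env.get(ENV_NAME);
        return raw != null && Boolean.parseBoolean(raw.trim());
    }
}
```

Calling `SkipGpuReservationConfig.skipGpuReservation(System.getenv())` would evaluate the flag against the real process environment; passing a map keeps the helper easy to test.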
Recap
Backward compatibility is preserved.
This resolves the counter-intuitive behavior where GPU-capable hosts reject valid CPU-only frames.