@anton-ubi
Contributor

@anton-ubi anton-ubi commented Dec 4, 2025

Problem
Hosts with GPU capabilities (ALLOW_GPU=true) incorrectly reject frames that have zero GPU memory requirements. As a result, GPU-enabled hosts that should accept both GPU and CPU-only workloads leave resources idle.

Root Cause
The getGpuJobs() method in CoreUnitDispatcher calls removeGpu() when no GPU-specific jobs are found. This method reduces all host resources (CPU cores, memory, GPU resources) to "reserve space for future GPU frames." When normal frames are subsequently evaluated, they fail resource checks due to these artificially reduced limits.

The problematic flow:

  1. Look for GPU jobs → none found
  2. Call removeGpu() → reduces idleCores (-100) and idleMemory (-4GB)
  3. Evaluate normal frames → rejected due to insufficient resources
  4. Call restoreGpu() → too late, frames already rejected
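
The flow above can be sketched in isolation. This is a simplified, hypothetical model and not Cuebot's actual code: the `Host` class, the `fits` check, and the `dispatchCpuFrame` helper are invented for illustration, and the units (core units, KB) are assumptions. Only the reservation amounts (100 and 4GB) and the `removeGpu()`/`restoreGpu()` names come from the description.

```java
// Hypothetical, simplified model of the reservation flow; not Cuebot's real classes.
final class ReservationFlowSketch {
    // Reservation amounts taken from the PR description; units are assumed.
    static final int RESERVED_CORES = 100;
    static final long RESERVED_MEMORY_KB = 4L * 1024 * 1024; // 4 GB expressed in KB

    static final class Host {
        int idleCores;
        long idleMemoryKb;

        Host(int idleCores, long idleMemoryKb) {
            this.idleCores = idleCores;
            this.idleMemoryKb = idleMemoryKb;
        }

        // Step 2: shrink the idle resources to hold space for future GPU frames.
        void removeGpu() {
            idleCores -= RESERVED_CORES;
            idleMemoryKb -= RESERVED_MEMORY_KB;
        }

        // Step 4: give the reserved resources back.
        void restoreGpu() {
            idleCores += RESERVED_CORES;
            idleMemoryKb += RESERVED_MEMORY_KB;
        }

        // Hypothetical resource check for a frame.
        boolean fits(int frameCores, long frameMemoryKb) {
            return idleCores >= frameCores && idleMemoryKb >= frameMemoryKb;
        }
    }

    // Evaluates a CPU-only frame inside the reservation window (steps 2 to 4),
    // assuming step 1 found no GPU jobs.
    static boolean dispatchCpuFrame(Host host, int frameCores, long frameMemoryKb) {
        host.removeGpu();
        boolean accepted = host.fits(frameCores, frameMemoryKb);
        host.restoreGpu(); // too late: the decision was already made
        return accepted;
    }

    public static void main(String[] args) {
        long sixGbInKb = 6L * 1024 * 1024;
        Host host = new Host(400, 8L * 1024 * 1024);
        System.out.println(host.fits(400, sixGbInKb));                   // true
        System.out.println(dispatchCpuFrame(host, 400, sixGbInKb));      // false
    }
}
```

In this sketch the host has enough idle resources for the frame, yet the frame is rejected because it is checked while the reservation is applied.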

Solution
Add a configuration property dispatcher.gpu.skip_resource_reservation that disables GPU resource reservation when set to true. Default is false to maintain backward compatibility.

Configuration
The new behavior is controlled by the environment variable CUEBOT_DISPATCHER_SKIP_GPU_RESERVATION:

  • Default (false): Preserves existing behavior
  • Skip (true): Disables resource reservation, allows full resource utilization
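
A minimal sketch of how such a flag could gate the reservation is shown here. It is not the PR's actual diff: the class, `readFlag`, and `acceptsCpuFrame` are hypothetical, and the units are assumed. Only the property key and the reservation amounts come from the description. Note that this sketch falls back to `false` when the key is absent, purely for simplicity.

```java
import java.util.Properties;

// Hypothetical sketch of a flag-guarded reservation; not Cuebot's real code.
final class SkipReservationSketch {
    // Property key named in the PR description.
    static final String SKIP_KEY = "dispatcher.gpu.skip_resource_reservation";

    // Reservation amounts from the PR description; units are assumed.
    static final int RESERVED_CORES = 100;
    static final long RESERVED_MEMORY_KB = 4L * 1024 * 1024;

    // Reads the flag, defaulting to false (an assumption made for this sketch).
    static boolean readFlag(Properties props) {
        return Boolean.parseBoolean(props.getProperty(SKIP_KEY, "false"));
    }

    // Decides whether a CPU-only frame fits, applying the GPU reservation
    // only when the skip flag is off.
    static boolean acceptsCpuFrame(int idleCores, long idleMemoryKb,
                                   int frameCores, long frameMemoryKb,
                                   boolean skipGpuReservation) {
        int availableCores = skipGpuReservation ? idleCores : idleCores - RESERVED_CORES;
        long availableMemoryKb = skipGpuReservation
                ? idleMemoryKb
                : idleMemoryKb - RESERVED_MEMORY_KB;
        return availableCores >= frameCores && availableMemoryKb >= frameMemoryKb;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(SKIP_KEY, "true");
        long eightGbInKb = 8L * 1024 * 1024;
        long sixGbInKb = 6L * 1024 * 1024;
        // With the flag on, the full idle resources are available to the frame.
        System.out.println(acceptsCpuFrame(400, eightGbInKb, 400, sixGbInKb, readFlag(props)));
    }
}
```

With the flag left at `false`, the same call reproduces the existing behavior and the frame is rejected.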

Recap

  • Before fix: GPU hosts may reject CPU-only frames despite having sufficient resources
  • After fix: GPU hosts handle both GPU and CPU-only workloads when the new option is enabled

Backward compatibility is preserved.
This resolves the counter-intuitive behavior where GPU-capable hosts reject valid CPU-only frames.

@anton-ubi anton-ubi changed the title from "skip_resource_reservation" to "Fix GPU Resource Reservation Preventing Non-GPU Frame Dispatch" Dec 4, 2025
@DiegoTavares
Collaborator

Looking good so far. It looks like you might have forgotten to add the new property to the test/resource's opencue.properties:

nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.imageworks.spcue.dispatcher.CoreUnitDispatcher]: Constructor threw exception; nested exception is java.lang.IllegalStateException: Required key 'dispatcher.gpu.skip_resource_reservation' not found

@anton-ubi anton-ubi changed the title from "Fix GPU Resource Reservation Preventing Non-GPU Frame Dispatch" to "[cuebot][FIX] Fix GPU Resource Reservation Preventing Non-GPU Frame Dispatch" Dec 20, 2025
@anton-ubi anton-ubi marked this pull request as ready for review December 20, 2025 08:18
@anton-ubi
Contributor Author

It's now ready for review. Thanks for the hint.

@anton-ubi anton-ubi marked this pull request as draft December 25, 2025 00:22
@DiegoTavares
Collaborator

I think you might have accidentally marked this PR as draft again.

@anton-ubi
Contributor Author

anton-ubi commented Jan 5, 2026

It's actually on purpose. I've been testing it again on our end lately and I'm not seeing the behavior this change was supposed to produce. I'll keep you posted on any progress.

Apologies for the noise.

@anton-ubi anton-ubi changed the title from "[cuebot][FIX] Fix GPU Resource Reservation Preventing Non-GPU Frame Dispatch" to "[WIP][cuebot][FIX] Fix GPU Resource Reservation Preventing Non-GPU Frame Dispatch" Jan 5, 2026
2 participants