@rattus128 (Contributor) commented Nov 1, 2025

Draft of a generic module prefetcher. This implements the core feature and gives one example of how to use it with QWEN.

This gets very close to compute saturation, whereas --async-offload as-is still has a few compute stalls.

Leaving as a draft for now, as I am still trying to find a better way.

Start ComfyUI with QWEN to try it out. You need the following startup args:

--async-offload --fast pinned_memory --reserve-vram 3 

It consumes a bit of extra VRAM, so you need --reserve-vram to avoid OOMing.

The async offload stream's reason for existence is to transfer from RAM to GPU. The post-processing compute steps are a bonus on the side stream, but if the compute stream is running a long kernel, it can stall the side stream as it waits to type-cast the bias before transferring the weight. So do a pure transfer of the weight straight up, then do everything bias, then go back to fix the weight type and apply the weight patches.
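
As a rough illustration of that reordering (a minimal sketch assuming an nn.Linear-style module with pinned-memory CPU parameters and a dedicated transfer stream, not the PR's actual code):

```python
import torch

def prefetch_module(module, device, transfer_stream, compute_dtype=torch.bfloat16):
    # Issued on the side (transfer) stream so the compute stream stays busy.
    with torch.cuda.stream(transfer_stream):
        # 1. Pure transfer of the weight straight up: no casting, so the
        #    copy is issued immediately and cannot wait on a cast kernel.
        weight = module.weight.to(device, non_blocking=True)
        # 2. Then do everything bias: transfer and type-cast together.
        bias = None
        if module.bias is not None:
            bias = module.bias.to(device, non_blocking=True).to(compute_dtype)
        # 3. Go back and fix the weight type last (weight patches would
        #    also be applied at this point).
        weight = weight.to(compute_dtype)
    return weight, bias
```

The point of the ordering is that step 1 only uses the copy engine, so the bulk of the data is already in flight before any cast kernel has to fight the compute stream for SM time.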
Implement an API that allows instrumenting a model with a prefetch queue. Units of work are at the nn.Module level.
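
For illustration only, here is a hypothetical sketch of what such a prefetch queue over nn.Module units could look like; the class and method names are made up for this example and are not the PR's real interface:

```python
import collections
import torch

class ModulePrefetcher:
    """Hypothetical prefetch queue; units of work are nn.Modules."""

    def __init__(self, modules, device, depth=2):
        self.todo = collections.deque(modules)   # modules still to transfer
        self.device = device
        self.stream = torch.cuda.Stream()        # side/transfer stream
        self.inflight = collections.deque()
        for _ in range(depth):                   # warm the pipeline
            self._issue_next()

    def _issue_next(self):
        if not self.todo:
            return
        module = self.todo.popleft()
        with torch.cuda.stream(self.stream):
            # Assumes the CPU-side parameters live in pinned memory
            # (--fast pinned_memory), so the copy is truly asynchronous.
            module.to(self.device, non_blocking=True)
        done = torch.cuda.Event()
        done.record(self.stream)
        self.inflight.append((module, done))

    def acquire(self):
        # Make the compute stream wait only for this module's transfer,
        # then immediately queue the next prefetch behind it.
        module, done = self.inflight.popleft()
        torch.cuda.current_stream().wait_event(done)
        self._issue_next()
        return module
```

The depth parameter controls how many modules are kept in flight ahead of the compute stream, which is also why the feature needs --reserve-vram: the prefetched modules occupy VRAM before they are used.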
