@rattus128
Contributor

These two changes increase the speed of bus-bound --async-offload. The tentative speedup is about 7% (although it's hard to commit to a number on RunPod, where transfer speed fluctuates).

This is one of the better speedup measurements (QWEN FP16 512x512 on 5090):

Before

Requested to load QwenImage
loaded partially; 22288.75 MB usable, 22278.42 MB loaded, 16689.48 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:15<00:00,  1.59s/it]
Requested to load WanVAE
loaded completely; 267.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 25.90 seconds
got prompt
loaded partially; 22250.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:15<00:00,  1.54s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 19.76 seconds
got prompt
loaded partially; 22248.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:15<00:00,  1.55s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 19.71 seconds

After

Requested to load QwenImage
loaded partially; 22298.75 MB usable, 22293.42 MB loaded, 16674.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:14<00:00,  1.40s/it]
Requested to load WanVAE
loaded completely; 267.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 119.17 seconds
loaded partially; 22248.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:13<00:00,  1.32s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 17.32 seconds
loaded partially; 22248.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:13<00:00,  1.32s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 17.13 seconds
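
For reference, the relative gain in this particular measurement can be worked out from the logs above. This is only a back-of-the-envelope sketch: it assumes the second and third runs of each log are the warm runs worth comparing, and the helper name `percent_faster` is made up for illustration.

```python
def percent_faster(before: float, after: float) -> float:
    """Relative reduction in elapsed time, in percent."""
    return (before - after) / before * 100.0


# Warm runs only: the first run of each log includes the initial model load.
before_s_per_it = (1.54 + 1.55) / 2
after_s_per_it = (1.32 + 1.32) / 2
before_prompt_s = (19.76 + 19.71) / 2
after_prompt_s = (17.32 + 17.13) / 2

step_gain = percent_faster(before_s_per_it, after_s_per_it)
prompt_gain = percent_faster(before_prompt_s, after_prompt_s)

print(f"per step:   {step_gain:.1f}% faster")
print(f"per prompt: {prompt_gain:.1f}% faster")
```

Both figures are well above the roughly 7% quoted as the tentative overall speedup, which fits this being one of the better measurements rather than a typical one.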

The async offload stream's reason for existence is to transfer from
RAM to GPU. The post-processing compute steps are a bonus on the side
stream, but if the compute stream is running a long kernel, it can
stall the side stream, which then waits to type-cast the bias before
transferring the weight. So do a pure transfer of the weight first,
then do everything for the bias, then go back to fix the weight type
and apply the weight patches.
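
The reordering described above can be sketched as a plain-Python model of the side stream's work queue. This is not the ComfyUI code: the step names are hypothetical, and the "old" order is an assumption inferred from the commit message (bias handled fully before the weight). The point it illustrates is that a compute step queued ahead of the weight copy can hold that copy up whenever the GPU is busy with a long kernel.

```python
from typing import List, Tuple

# (kind, description); kind is "copy" (bus transfer) or "compute" (GPU kernel).
Step = Tuple[str, str]

WEIGHT_COPY = "weight: RAM -> GPU"

# Assumed previous order: the bias is cast before the weight is transferred.
OLD_ORDER: List[Step] = [
    ("copy", "bias: RAM -> GPU"),
    ("compute", "bias: cast dtype"),
    ("copy", WEIGHT_COPY),
    ("compute", "weight: cast dtype"),
    ("compute", "weight: apply patches"),
]

# Order described in the commit message: pure weight transfer first,
# then everything for the bias, then the weight cast and patches.
NEW_ORDER: List[Step] = [
    ("copy", WEIGHT_COPY),
    ("copy", "bias: RAM -> GPU"),
    ("compute", "bias: cast dtype"),
    ("compute", "weight: cast dtype"),
    ("compute", "weight: apply patches"),
]


def compute_steps_before_weight_copy(order: List[Step]) -> int:
    """Count compute steps that must finish before the weight copy can start."""
    count = 0
    for kind, description in order:
        if description == WEIGHT_COPY:
            return count
        if kind == "compute":
            count += 1
    raise ValueError("order contains no weight copy")


print(compute_steps_before_weight_copy(OLD_ORDER))
print(compute_steps_before_weight_copy(NEW_ORDER))
```

In this model the new order leaves no compute step ahead of the large weight transfer, so the bus can stay busy even while the main compute stream is occupied.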
@comfyanonymous comfyanonymous merged commit 135fa49 into comfyanonymous:master Nov 1, 2025
10 checks passed

2 participants