Small speed improvements to --async-offload #10593

rattus128 · 2025-11-01T16:31:19Z

These two changes increase the speed of bus-bound --async-offload. The tentative speedup is about 7% (although it's hard to commit to a number on runpod, where transfer speed fluctuates).

This is one of the better speedup measurements (QWEN FP16 512x512 on 5090).

Before

Requested to load QwenImage
loaded partially; 22288.75 MB usable, 22278.42 MB loaded, 16689.48 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:15<00:00,  1.59s/it]
Requested to load WanVAE
loaded completely; 267.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 25.90 seconds
got prompt
loaded partially; 22250.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:15<00:00,  1.54s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 19.76 seconds
got prompt
loaded partially; 22248.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:15<00:00,  1.55s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 19.71 seconds

After

Requested to load QwenImage
loaded partially; 22298.75 MB usable, 22293.42 MB loaded, 16674.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:14<00:00,  1.40s/it]
Requested to load WanVAE
loaded completely; 267.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 119.17 seconds
loaded partially; 22248.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:13<00:00,  1.32s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 17.32 seconds
loaded partially; 22248.75 MB usable, 22242.42 MB loaded, 16725.47 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████| 10/10 [00:13<00:00,  1.32s/it]
Requested to load WanVAE
loaded completely; 265.67 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 17.13 seconds
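
For reference, the relative gain in this particular measurement can be worked out from the logged numbers. The sketch below is only arithmetic on the values printed above; it assumes the first run in each set is a warm-up (it includes the initial model load) and leaves it out. The helper name `relative_reduction` is made up for illustration.

```python
from statistics import mean

def relative_reduction(before, after):
    """Fractional drop from the mean of `before` to the mean of `after`."""
    return 1.0 - mean(after) / mean(before)

# Warm runs only, copied from the logs above.
# Assumption: the first run of each set is excluded as warm-up.
prompt_seconds_before = [19.76, 19.71]
prompt_seconds_after = [17.32, 17.13]
sec_per_it_before = [1.54, 1.55]
sec_per_it_after = [1.32, 1.32]

prompt_gain = relative_reduction(prompt_seconds_before, prompt_seconds_after)
step_gain = relative_reduction(sec_per_it_before, sec_per_it_after)

print(f"prompt time reduction: {prompt_gain:.1%}")
print(f"s/it reduction: {step_gain:.1%}")
```

On these runs that comes to roughly 13% lower prompt time and roughly 15% lower s/it, which fits this being one of the better measurements rather than the typical ~7%.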

The async offload stream's reason for existence is to transfer from RAM to GPU. The post-processing compute steps are a bonus on the side stream, but if the compute stream is running a long kernel, it can stall the side stream, as it waits to type-cast the bias before transferring the weight. So do a pure transfer of the weight first, then do everything for the bias, then go back to fix the weight type and apply the weight patches.
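
The effect of the reordering can be illustrated with a toy timeline model. This is not the ComfyUI code: the function name, the op names, and all durations are hypothetical, and the model rests on a simplifying assumption that a compute op queued on the side stream cannot start until the long kernel on the compute stream has finished, while a copy only waits for the previous op in its own stream.

```python
def side_stream_finish_times(ops, kernel_busy_until):
    """Return {op_name: finish_time} for ops queued in order on one side stream.

    ops: list of (name, kind, duration), kind is "copy" or "compute".
    Toy assumption: a "compute" op cannot start before `kernel_busy_until`;
    a "copy" op only waits for the previous op in the same stream.
    """
    t = 0
    finished = {}
    for name, kind, duration in ops:
        start = max(t, kernel_busy_until) if kind == "compute" else t
        t = start + duration
        finished[name] = t
    return finished

# Arbitrary illustrative durations (integer time units), not measurements.
BIAS_COPY = ("bias_copy", "copy", 1)
BIAS_CAST = ("bias_cast", "compute", 1)
WEIGHT_COPY = ("weight_copy", "copy", 40)
WEIGHT_CAST = ("weight_cast", "compute", 5)
WEIGHT_PATCHES = ("weight_patches", "compute", 5)

# Before: the bias cast sits in front of the big weight transfer.
old_order = [BIAS_COPY, BIAS_CAST, WEIGHT_COPY, WEIGHT_CAST, WEIGHT_PATCHES]
# After: pure weight transfer first, then the bias, then weight fix-ups.
new_order = [WEIGHT_COPY, BIAS_COPY, BIAS_CAST, WEIGHT_CAST, WEIGHT_PATCHES]

old = side_stream_finish_times(old_order, kernel_busy_until=50)
new = side_stream_finish_times(new_order, kernel_busy_until=50)

print("old order done at", old["weight_patches"])  # 101
print("new order done at", new["weight_patches"])  # 61
```

In this model the old order leaves the bus idle while the tiny bias cast waits behind the long kernel, so the weight transfer starts late. With the new order the weight transfer overlaps the long kernel, and only the cheap compute steps wait for it.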

rattus128 added 2 commits November 2, 2025 01:31

ops: dont take an offload stream if you dont need one

fcdb4a5

rattus128 requested a review from Kosinkadink as a code owner November 1, 2025 16:31

comfyanonymous merged commit 135fa49 into comfyanonymous:master Nov 1, 2025
10 checks passed
