GGUF RAM optimization, LoRA memory fix, Block swap fix by xiaolibai-sys · Pull Request #2036 · kijai/ComfyUI-WanVideoWrapper

xiaolibai-sys · 2026-06-23T10:22:04Z

Pull Request: GGUF RAM Optimization, LoRA Memory Fix, Block Swap Fix

TL;DR

Fixes critical RAM exhaustion when loading GGUF-quantized models (tested with Wan 2.2 Animate 14B Q5_K_M), rewrites LoRA application to avoid memory doubling, and fixes block swap for GGUF. Three bugfixes plus diagnostic tooling across 12 files.

Changes

1. GGUF On-Demand Tensor Loading (reduces RAM peak)

Problem: load_weights() built a full sd dict that held a data copy of every GGUF tensor in CPU RAM at the same time. For a 14B model this produced a very large RAM peak, often triggering disk swap on 24GB machines.

Fix: Replaced pre-built sd dict with tensor_map (name → reader tensor reference, zero data copies). Each tensor is loaded on-demand during the named_parameters loop, assigned to the model immediately, and freed before the next iteration. Only one tensor's data copy exists in RAM at any given time.

Files: nodes_model_loading.py
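
The load-and-release pattern described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code from this PR: load_on_demand, the loader callables in tensor_map, and the assign callback are hypothetical stand-ins for the GGUF reader tensor references and the named_parameters loop.

```python
from typing import Callable, Dict, Iterable


def load_on_demand(
    param_names: Iterable[str],
    tensor_map: Dict[str, Callable[[], bytes]],
    assign: Callable[[str, bytes], None],
) -> int:
    """Materialize one tensor at a time; return the largest single copy held."""
    peak = 0
    for name in param_names:
        loader = tensor_map.get(name)
        if loader is None:
            continue  # parameter not present in the checkpoint
        data = loader()            # the only data copy created by this loop
        peak = max(peak, len(data))
        assign(name, data)         # hand the data to the model right away
        del data                   # drop the local reference before the next tensor
    return peak


# Hypothetical stand-ins: each "tensor" is produced lazily as a bytes object.
sizes = {"blocks.0.weight": 4096, "blocks.1.weight": 8192, "head.weight": 2048}
tensor_map = {name: (lambda n=size: bytes(n)) for name, size in sizes.items()}

model_params: Dict[str, int] = {}
peak_bytes = load_on_demand(
    sizes.keys(),
    tensor_map,
    lambda name, data: model_params.__setitem__(name, len(data)),
)
```

In this sketch the loader never holds more than the largest single tensor, whereas a pre-built dict would hold the sum of all of them.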

2. GGUF Mmap Lifecycle Management (reduces RAM after load)

Problem: The GGUFReader uses np.memmap, causing the OS to cache the entire GGUF file (~14GB) in the process working set. After load_weights() completes, the file cache persists indefinitely.

Fix: Added close_gguf_readers() / reopen_gguf_readers() to manage mmap lifecycle:

close_gguf_readers(): called after initial weight load to release OS file cache
reopen_gguf_readers(): called before block swap weight reload (when gguf_reader was previously closed)

Files: utils.py (implementation), nodes_model_loading.py, nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py (call sites)
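
A close/reopen lifecycle of this kind might look like the sketch below. It uses Python's mmap module on a small temporary file rather than GGUFReader and np.memmap, and the ReopenableReader class is a hypothetical stand-in, not the PR's close_gguf_readers() / reopen_gguf_readers() implementation. Closing the mapping removes the pages from the process's mapped set; how quickly the OS reclaims its own file cache is left to the OS.

```python
import mmap
import os
import tempfile
from typing import Optional


class ReopenableReader:
    """Hypothetical reader that can drop and re-create its memory map."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._file = None
        self._map: Optional[mmap.mmap] = None
        self.reopen()

    @property
    def is_open(self) -> bool:
        return self._map is not None

    def reopen(self) -> None:
        # No-op if the mapping is already open.
        if self._map is None:
            self._file = open(self.path, "rb")
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def close(self) -> None:
        if self._map is not None:
            self._map.close()
            self._file.close()
            self._map = None
            self._file = None

    def read(self, offset: int, length: int) -> bytes:
        if self._map is None:
            raise RuntimeError("reader is closed; call reopen() first")
        return self._map[offset:offset + length]


# Demo on a small temporary file standing in for a GGUF checkpoint.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header" + bytes(1024))
demo_path = tmp.name

reader = ReopenableReader(demo_path)
first = reader.read(0, 6)
reader.close()                  # after the initial weight load
closed_state = reader.is_open
reader.reopen()                 # before a block swap weight reload
second = reader.read(0, 6)
reader.close()
os.remove(demo_path)
```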

3. Progressive LoRA Loading (reduces RAM peak)

Problem: set_lora_params() recursively collected all float32 LoRA tensors, then converted them to bfloat16, keeping BOTH copies alive simultaneously.

Fix: Rewrote LoRA application in custom_linear.py to process per-patch:

Pre-convert each LoRA diff to CPU bf16 immediately
Delete dict entries after consumption
Call patcher.patches.clear() after all patching completes

Additionally, gguf/gguf.py passes force_cpu=True and a _diag dict for diagnostic output, and the patches and scale_weights parameters are now passed through _replace_linear.

Files: custom_linear.py (complete rewrite), gguf/gguf.py, nodes_sampler.py, skyreels/nodes.py
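
The per-patch idea can be illustrated with a small stdlib-only sketch. It is not the custom_linear.py rewrite: apply_patches_progressively and to_half are hypothetical names, plain float lists stand in for float32 tensors, and struct's float16 format stands in for bfloat16 because the standard library has no bf16 type.

```python
import struct
from typing import Callable, Dict, List


def apply_patches_progressively(
    patches: Dict[str, List[float]],
    convert: Callable[[List[float]], bytes],
    apply: Callable[[str, bytes], None],
) -> None:
    """Consume patches one entry at a time so raw and converted copies never pile up."""
    for key in list(patches.keys()):   # snapshot of keys; the dict shrinks as we go
        raw = patches.pop(key)         # remove the entry from the dict on consumption
        converted = convert(raw)       # stand-in for the float32 -> bf16 conversion
        del raw                        # the raw copy is no longer referenced here
        apply(key, converted)
        del converted


def to_half(values: List[float]) -> bytes:
    # float16 packing as a stand-in for a bf16 cast (assumption for this sketch)
    return struct.pack(f"<{len(values)}e", *values)


patches = {"blocks.0.attn.q": [0.5, -1.0], "blocks.0.attn.k": [0.25, 2.0, 4.0]}
applied: Dict[str, bytes] = {}
apply_patches_progressively(patches, to_half, applied.__setitem__)
```

At most one raw entry and one converted entry are alive inside the loop at any time, and the source dict is empty when the function returns.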

4. Block Swap Fix for GGUF

Problem: init_blockswap() had an early-return guard (if not patched_linear: return) that skipped block swap entirely when using GGUF models.

Fix: Restructured init_blockswap() to use three independent conditions:

If block_swap_args → always apply block swap (regardless of GGUF)
If auto_cpu_offload → apply offload
Only guard transformer.to(device) with not patched_linear

Tested configuration: blocks=8, prefetch=2 on a 40-block 14B model. Prefetch uses independent CUDA streams for async block transfers, synchronized with CUDA events. Before the fix, block swap was skipped, so the system always fell back to disk swap regardless of config. After the fix, the GPU stays at full load with no disk swap.

Files: utils.py
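
The restructured control flow can be pictured with the sketch below. It only models the three conditions as listed in this description; plan_init and the action names are hypothetical, and how the real init_blockswap() combines these branches beyond what is stated here is an assumption.

```python
from typing import List, Optional


def plan_init(
    block_swap_args: Optional[dict],
    auto_cpu_offload: bool,
    patched_linear: bool,
) -> List[str]:
    """Return the actions that would run, modelling three independent checks."""
    actions: List[str] = []
    if block_swap_args:
        actions.append("block_swap")    # applied whether or not GGUF is in use
    if auto_cpu_offload:
        actions.append("cpu_offload")
    if not patched_linear:
        actions.append("to_device")     # only the full-model move keeps the guard
    return actions


# GGUF case from the description: patched_linear is True and block swap is configured.
gguf_plan = plan_init({"blocks": 8, "prefetch": 2}, False, True)
plain_plan = plan_init(None, False, False)
```

With the old early return, the first call would have produced no actions at all.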

5. Weight Load Signature Caching

Added _weights_load_signature(), which computes a hashable tuple of the current weight-load configuration (device, dtype, block_swap_args, compile_args, GGUF presence, patched_linear state, patch count). This signature is stored on the transformer and compared on subsequent calls. If it is unchanged and the weights haven't been offloaded, load_weights() is skipped entirely. Critical for fast multi-window WanAnimate and MultiTalk pipelines.

Files: nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py
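
A signature check of this kind might be sketched as below. The field list follows the description above, but weights_load_signature, _freeze, maybe_load_weights, and FakeTransformer are hypothetical illustrations, not the PR's code.

```python
from typing import Any, Callable, Hashable, Mapping


def _freeze(value: Any) -> Hashable:
    """Turn nested dicts/lists into tuples; leaf values are assumed hashable."""
    if isinstance(value, Mapping):
        return tuple(sorted((str(k), _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    return value


def weights_load_signature(device, dtype, block_swap_args, compile_args,
                           has_gguf, patched_linear, patch_count) -> tuple:
    return (str(device), str(dtype), _freeze(block_swap_args), _freeze(compile_args),
            bool(has_gguf), bool(patched_linear), int(patch_count))


class FakeTransformer:
    last_signature = None
    weights_offloaded = False


def maybe_load_weights(transformer, signature: tuple, load: Callable[[], None]) -> bool:
    """Run load() unless the stored signature matches and weights are still resident."""
    if transformer.last_signature == signature and not transformer.weights_offloaded:
        return False
    load()
    transformer.last_signature = signature
    transformer.weights_offloaded = False
    return True


sig_a = weights_load_signature("cuda:0", "bf16", {"blocks": 8, "prefetch": 2}, None, True, True, 3)
sig_b = weights_load_signature("cuda:0", "bf16", {"prefetch": 2, "blocks": 8}, None, True, True, 3)

calls = []
model = FakeTransformer()
first_loaded = maybe_load_weights(model, sig_a, lambda: calls.append("load"))
second_loaded = maybe_load_weights(model, sig_b, lambda: calls.append("load"))
```

Sorting the dict items makes the signature independent of key order, so the second call is treated as unchanged and skipped.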

6. State Dict Release After Load

After load_weights() completes, patcher.model["sd"] is immediately set to None and gc.collect() + mm.soft_empty_cache() are called. This releases the original state dict before the sampling loop starts, preventing the full weight tensor data from persisting alongside the model's parameter buffers.

Files: nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py
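
The effect of dropping the reference can be observed with a weak reference, as in this small sketch. FakeStateDict and patcher_model are hypothetical stand-ins for patcher.model["sd"], and the ComfyUI-specific mm.soft_empty_cache() call is left out because it is not available outside ComfyUI.

```python
import gc
import weakref


class FakeStateDict(dict):
    """dict subclass so that a weak reference can observe when it is freed."""


patcher_model = {"sd": FakeStateDict(weight=bytearray(1024))}
sd_ref = weakref.ref(patcher_model["sd"])

alive_before = sd_ref() is not None
patcher_model["sd"] = None   # drop the only strong reference
gc.collect()                 # also clears any reference cycles
released = sd_ref() is None
```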

7. Memory Diagnostics

Added log_ram_usage() (psutil-based RSS reporting) and log_memory_peak() (GPU peak memory tracking) at key stages throughout the loading and sampling pipeline. Enables systematic profiling of RAM/VRAM at each pipeline stage.

Files: utils.py (implementation), nodes_model_loading.py, nodes_sampler.py, wanvideo/modules/model.py (call sites)

8. LRU-Bounded Caches

Prevent unbounded dict growth in four caches:

Prompt extender cache (nodes.py): OrderedDict, max 128 entries
Radial attention mask cache (wanvideo/radial_attention/attn_mask.py): OrderedDict, max 8 entries
Context window tracker (context_windows/context.py): max 64 windows with eviction
EDM2 constant cache (Ovi/vae/edm2_utils.py): OrderedDict, max 64 entries

Without bounding, these caches grow indefinitely during long-running ComfyUI sessions.

Files: nodes.py, wanvideo/radial_attention/attn_mask.py, context_windows/context.py, Ovi/vae/edm2_utils.py
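
An OrderedDict-based bound like the ones listed above can be sketched as follows. BoundedCache and get_or_create are hypothetical names for illustration; the caches in this PR may be structured differently.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedCache:
    """Least-recently-used cache with a fixed maximum number of entries."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        if key in self._data:
            self._data.move_to_end(key)      # mark as most recently used
            return self._data[key]
        value = factory()
        self._data[key] = value
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict the least recently used entry
        return value

    def __len__(self) -> int:
        return len(self._data)

    def __contains__(self, key: Hashable) -> bool:
        return key in self._data


cache = BoundedCache(max_entries=2)
cache.get_or_create("a", lambda: 1)
cache.get_or_create("b", lambda: 2)
cache.get_or_create("a", lambda: -1)   # hit: returns 1 and refreshes "a"
cache.get_or_create("c", lambda: 3)    # evicts "b", the least recently used
```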

9. Minor Fixes

adapter_attn_mask parameter (wanvideo/modules/model.py): Added to WanSelfAttention.forward() and WanI2VCrossAttention.forward() signatures
NAG attention intermediate tensor cleanup (wanvideo/modules/model.py): renamed the NAG output to x_text; added del x_positive, x_negative after the NAG branch; removed del k, v (the tensors are freed by reference counting when they go out of scope, and the explicit del caused issues with graph retention)
rope_negative_offset default (wanvideo/modules/model.py): changed from 0 to 5 to prevent inverted attention patterns on negative prompt (improves first-frame quality)
Error handler fix (nodes_sampler.py): the force_offload call moved from the finally block into the except block, so offload runs only on an actual error and not on normal exit
Removed unused multitalk_audio_stride variable in nodes_sampler.py
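
The difference between cleanup in except and cleanup in finally can be shown with a small sketch. run_sampling and the callbacks are hypothetical, not the code in nodes_sampler.py.

```python
from typing import Any, Callable, List


def run_sampling(step: Callable[[], Any], force_offload: Callable[[], None]) -> Any:
    """Offload only when the step raises; a normal return leaves the model in place."""
    try:
        return step()
    except Exception:
        force_offload()   # runs on error only; a finally block would also run on success
        raise


offloads: List[str] = []


def failing_step() -> None:
    raise RuntimeError("simulated sampling failure")


ok_result = run_sampling(lambda: "done", lambda: offloads.append("offload"))
offloads_after_success = len(offloads)

try:
    run_sampling(failing_step, lambda: offloads.append("offload"))
    error_propagated = False
except RuntimeError:
    error_propagated = True
```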

Files Changed (12 files)

File	Change
`custom_linear.py`	LoRA progressive loading rewrite
`nodes_model_loading.py`	GGUF on-demand loading + mmap lifecycle + diagnostics
`utils.py`	GGUF mmap lifecycle + block swap fix + diagnostics
`nodes_sampler.py`	Weight signature caching + sd release + LoRA clear + diagnostics
`nodes.py`	LRU extender cache
`gguf/gguf.py`	set_lora_params_gguf signature (force_cpu, _diag)
`wanvideo/modules/model.py`	Diagnostics + adapter_attn_mask + NAG fix + rope_negative_offset
`multitalk/multitalk_loop.py`	Weight signature + sd release + GGUF reopen
`skyreels/nodes.py`	Weight signature + sd release + LoRA clear + GGUF reopen
`context_windows/context.py`	Window map LRU bounding
`Ovi/vae/edm2_utils.py`	Constant cache LRU bounding
`wanvideo/radial_attention/attn_mask.py`	Radial mask cache LRU bounding

Memory Impact Summary

Platform: RTX 5070 Ti 16GB + 24GB DDR5 RAM | Model: Wan 2.2 Animate 14B Q5_K_M (GGUF) | Attention: sageattn3

Scenario	Before	After
GGUF on-demand loading (CPU RAM peak)	Full sd dict preload; low-RAM machines OOM	Per-parameter load-and-release; RAM stays under control
GGUF mmap lifecycle (CPU RAM)	OS caches entire GGUF file; persists after load	Closed after load; system RAM released
LoRA progressive loading (RSS)	f32 raw + bf16 copy coexist simultaneously	Per-patch processing; f32 and bf16 never coexist
Block swap + GGUF	Guard logic skips entirely; always triggered disk swap regardless of config	Works correctly; VRAM stays within GPU capacity
Block swap config	N/A	blocks=8, prefetch=2 on 40-block 14B model; async CUDA copies overlapped with compute
Weight signature caching	Full load_weights() on every loop iteration	Skipped when config unchanged
sd release after load	sd dict persists throughout sampling	Released immediately after load
LRU cache bounding	Unbounded dict growth	Max 8–128 entries per cache
Combined effect (GPU power draw)	~78W (severe swap bottleneck; GPU severely underutilized)	250W+ approaching full 300W TDP (swap eliminated; GPU runs at full load)

Testing

Platform: RTX 5070 Ti 16GB + 24GB DDR5 RAM + Windows 11
Model: Wan 2.2 Animate 14B Q5_K_M (GGUF)
Attention backend: sageattn3 (via sageattn3_blackwell, Blackwell-optimized kernel; non-Blackwell GPUs fall back to sageattn v2)

Metric	Before	After	Notes
CPU RAM peak	Severe; triggers OS disk swap	Dramatically lower; no swap	GGUF on-demand + mmap close + LoRA progressive combined
GPU power draw	~78W (GPU idle waiting for memory paging)	250W+ approaching full 300W TDP	Eliminating swap removes IO bottleneck; GPU executes at full speed
GGUF model load	OOM (24GB RAM insufficient for peak)	Loads successfully	RAM bottleneck removed
LoRA application	f32+bf16 full coexistence → RSS peak spike	Progressive processing; peak stays controlled	No dual-copy coexistence
Block swap + GGUF	Non-functional (skipped by guard); always triggered disk swap regardless of config	Works correctly with blocks=8, prefetch=2	VRAM stays within 16GB
Attention backend	sageattn3	sageattn3 (Blackwell-optimized kernel)	Compatible with RTX 5070 Ti

Key finding: In the original code on a 24GB RAM system running a GGUF 14B model, three memory issues compound (GGUF sd dict full preload + mmap OS cache + LoRA f32/bf16 dual copies), driving RAM peak well beyond physical memory and forcing frequent OS swap to disk. The GPU idles at ~78W waiting on CPU/IO. Even with block swap configured, disk swap always triggers. After optimization, RAM peak drops to safe levels, swap is eliminated entirely, and GPU power draw recovers to 250W+ approaching full 300W TDP.

Functional Verification

Wan 2.2 Animate 14B Q5_K_M model loads successfully; CPU RAM stays within safe limits
Block swap works correctly under GGUF with blocks=8, prefetch=2; reopen_gguf_readers operates without errors
LoRA CustomLinear modules all matched correctly; no memory doubling
sageattn3 attention backend compatible
All existing workflows remain backward compatible

- GGUF on-demand tensor loading: tensor_map avoids full sd dict in RAM
- GGUF mmap lifecycle: close/reopen readers for block swap
- LoRA progressive loading: per-patch processing avoids f32/bf16 coexistence
- Block swap fix: remove guard that skipped block swap for GGUF entirely
- Weight load signature caching: skip redundant load_weights calls
- State dict release after load: free patcher.model['sd'] immediately
- LRU-bounded caches: extender, radial mask, context window, EDM2 constant
- Memory diagnostics: log_ram_usage() and log_memory_peak() at key stages
- adapter_attn_mask parameter, NAG tensor cleanup, rope_negative_offset=5
- Error handler: force_offload moved inside except block

Tested: RTX 5070 Ti 16GB + 24GB DDR5, Wan 2.2 Animate 14B Q5_K_M GGUF, sageattn3, blocks=8 prefetch=2. GPU power: 78W -> 250W+ (full 300W TDP)

xiaolibai-sys force-pushed the fix/gguf-ram-optimization-lora-memory-block-swap branch from 04c231d to 772c3e6 Compare June 23, 2026 10:50

xiaolibai-sys changed the title ~~GGUF RAM optimization, LoRA memory fix, Block swap fix & LongCat 1.5 support~~ GGUF RAM optimization, LoRA memory fix, Block swap fix Jun 23, 2026

xiaolibai-sys wants to merge 1 commit into kijai:main from xiaolibai-sys:fix/gguf-ram-optimization-lora-memory-block-swap
