
GGUF RAM optimization, LoRA memory fix, Block swap fix#2036

Open
xiaolibai-sys wants to merge 1 commit into kijai:main from xiaolibai-sys:fix/gguf-ram-optimization-lora-memory-block-swap

@xiaolibai-sys xiaolibai-sys commented Jun 23, 2026

Pull Request: GGUF RAM Optimization, LoRA Memory Fix, Block Swap Fix

TL;DR

Fixes critical RAM exhaustion when loading GGUF-quantized models (tested with Wan 2.2 Animate 14B Q5_K_M), rewrites LoRA application to avoid memory doubling, and fixes block swap for GGUF. Three bugfixes plus diagnostic tooling across 12 files.


Changes

1. GGUF On-Demand Tensor Loading (reduces RAM peak)

Problem: load_weights() built a full sd dict holding a data copy of every GGUF tensor in CPU RAM at the same time. For a 14B model this produced a very large RAM peak, often triggering disk swap on 24GB machines.

Fix: Replaced pre-built sd dict with tensor_map (name → reader tensor reference, zero data copies). Each tensor is loaded on-demand during the named_parameters loop, assigned to the model immediately, and freed before the next iteration. Only one tensor's data copy exists in RAM at any given time.

Files: nodes_model_loading.py
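
The load-and-release pattern described in the fix can be sketched as follows. This is a minimal illustration, not the PR's code: build_tensor_map, load_on_demand, and the list-based "tensors" are hypothetical stand-ins for the GGUF reader and the model's parameters.

```python
from typing import Callable, Dict, List


def build_tensor_map(raw: Dict[str, List[float]]) -> Dict[str, Callable[[], List[float]]]:
    """Map each tensor name to a zero-argument loader (hypothetical reader stand-in).

    Building the map copies no tensor data; a copy is made only when a loader runs.
    """
    return {name: (lambda values=values: list(values)) for name, values in raw.items()}


def load_on_demand(param_names, tensor_map, assign):
    """Materialize one tensor at a time, hand it to `assign`, then drop it."""
    loaded = 0
    for name in param_names:
        loader = tensor_map.get(name)
        if loader is None:
            continue              # parameter has no counterpart in the file
        data = loader()           # the only copy this loop holds
        assign(name, data)        # the model takes ownership
        del data                  # release our reference before the next tensor
        loaded += 1
    return loaded


raw_file = {"a.weight": [1.0, 2.0], "b.weight": [3.0]}
tensor_map = build_tensor_map(raw_file)
model = {}
loaded = load_on_demand(["a.weight", "b.weight", "missing.bias"], tensor_map, model.__setitem__)
print(loaded, sorted(model))
```

The point of the sketch is the loop shape: the map holds references only, and the temporary copy is released inside each iteration rather than after the whole loop.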

2. GGUF Mmap Lifecycle Management (reduces RAM after load)

Problem: The GGUFReader uses np.memmap, causing the OS to cache the entire GGUF file (~14GB) in the process working set. After load_weights() completes, the file cache persists indefinitely.

Fix: Added close_gguf_readers() / reopen_gguf_readers() to manage mmap lifecycle:

  • close_gguf_readers(): called after initial weight load to release OS file cache
  • reopen_gguf_readers(): called before block swap weight reload (when gguf_reader was previously closed)

Files: utils.py (implementation), nodes_model_loading.py, nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py (call sites)
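
A close/reopen lifecycle of this kind can be illustrated with the standard-library mmap module. The MappedFile class below is a hypothetical stand-in, not the GGUFReader API or the PR's close_gguf_readers() / reopen_gguf_readers() implementation; it only shows the idea of unmapping after the initial load and mapping again before a later reload.

```python
import mmap
import os
import tempfile


class MappedFile:
    """Hypothetical wrapper that can drop and re-create a read-only mapping."""

    def __init__(self, path):
        self.path = path
        self._file = None
        self._map = None
        self.reopen()

    @property
    def is_open(self):
        return self._map is not None

    def reopen(self):
        if self.is_open:
            return
        self._file = open(self.path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def close(self):
        if not self.is_open:
            return
        self._map.close()
        self._file.close()
        self._map = None
        self._file = None

    def read(self, offset, size):
        if not self.is_open:
            raise RuntimeError("mapping is closed; call reopen() first")
        return bytes(self._map[offset:offset + size])


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    demo_path = tmp.name

reader = MappedFile(demo_path)
first = reader.read(2, 3)        # read while mapped
reader.close()                   # after the initial load: unmap the file
closed_state = reader.is_open
reader.reopen()                  # before a later reload: map it again
second = reader.read(2, 3)
reader.close()
os.remove(demo_path)
```

Making both close() and reopen() idempotent keeps the call sites simple, since a caller does not need to know whether an earlier stage already closed the reader.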

3. Progressive LoRA Loading (reduces RAM peak)

Problem: set_lora_params() recursively collected all float32 LoRA tensors, then converted them to bfloat16, keeping BOTH copies alive simultaneously.

Fix: Rewrote LoRA application in custom_linear.py to process per-patch:

  • Pre-convert each LoRA diff to CPU bf16 immediately
  • Delete dict entries after consumption
  • Call patcher.patches.clear() after all patching completes

Additionally, gguf/gguf.py passes force_cpu=True and a _diag dict for diagnostic output, and the patches and scale_weights parameters are passed through _replace_linear.

Files: custom_linear.py (complete rewrite), gguf/gguf.py, nodes_sampler.py, skyreels/nodes.py
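
The per-patch approach can be sketched without PyTorch. In this hypothetical example, array("d") to array("f") stands in for the float32 to bfloat16 conversion, and apply_patches_progressively is an illustrative name rather than a function from custom_linear.py.

```python
from array import array


def apply_patches_progressively(patches, apply_patch):
    """Convert and apply one patch at a time, removing each from `patches`."""
    applied = 0
    for key in list(patches):                 # snapshot keys; the dict shrinks as we go
        wide = patches.pop(key)               # the dict no longer holds the wide copy
        narrow = array("f", wide.tolist())    # stand-in for the f32 -> bf16 conversion
        del wide                              # drop the wide copy before the next patch
        apply_patch(key, narrow)
        applied += 1
    return applied


patches = {
    "blocks.0.attn.q": array("d", [0.5, -1.0]),
    "blocks.0.attn.k": array("d", [2.0, 0.25]),
}
applied_to = {}
count = apply_patches_progressively(patches, applied_to.__setitem__)
print(count, len(patches))
```

Because each entry is popped before conversion, at most one wide and one narrow copy of a single patch are alive at any moment, instead of two full sets.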

4. Block Swap Fix for GGUF

Problem: init_blockswap() had an if not patched_linear: return guard that skipped block swap entirely when using GGUF models.

Fix: Restructured init_blockswap() to use three independent conditions:

  • If block_swap_args → always apply block swap (regardless of GGUF)
  • If auto_cpu_offload → apply offload
  • Only guard transformer.to(device) with not patched_linear

Tested configuration: blocks=8, prefetch=2 on a 40-block 14B model. Prefetch uses independent CUDA streams for asynchronous block transfers, synchronized with CUDA events. Before the fix, runs always fell into disk swap regardless of the block swap configuration. After the fix, the GPU stays at full load with no disk swap.

Files: utils.py
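
The three independent conditions can be written as a small decision function. This is a sketch of the control flow only; plan_blockswap and the action names are hypothetical, the argument dict keys are illustrative, and the real init_blockswap() performs the actions instead of listing them.

```python
def plan_blockswap(block_swap_args, auto_cpu_offload, patched_linear):
    """Return the actions implied by the three independent conditions."""
    actions = []
    if block_swap_args:                  # applies whether or not the model is GGUF
        actions.append("block_swap")
    if auto_cpu_offload:
        actions.append("auto_cpu_offload")
    if not patched_linear:               # only the whole-model device move is guarded
        actions.append("transformer_to_device")
    return actions


# Illustrative keys; the patched_linear=True case must still get block swap.
gguf_plan = plan_blockswap({"blocks": 8, "prefetch": 2}, False, True)
plain_plan = plan_blockswap(None, False, False)
print(gguf_plan, plain_plan)
```

The difference from an early-return guard is that no single flag can short-circuit the other two checks.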

5. Weight Load Signature Caching

Added _weights_load_signature() that computes a hashable tuple of the current weight-load configuration (device, dtype, block_swap_args, compile_args, GGUF presence, patched_linear state, patch count). This signature is stored on the transformer and compared on subsequent calls — if unchanged and weights haven't been offloaded, load_weights() is skipped entirely. Critical for fast multi-window WanAnimate and MultiTalk pipelines.

Files: nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py
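
A signature check of this kind might look like the sketch below. The field list follows the description above, but the function signature, the Transformer stand-in, and maybe_load_weights are assumptions made for illustration, not the PR's code.

```python
def weights_load_signature(device, dtype, block_swap_args, compile_args,
                           has_gguf, patched_linear, patch_count):
    """Build a hashable tuple; dicts are frozen into sorted item tuples."""
    def freeze(value):
        if isinstance(value, dict):
            return tuple(sorted((k, freeze(v)) for k, v in value.items()))
        if isinstance(value, (list, tuple)):
            return tuple(freeze(v) for v in value)
        return value

    return (str(device), str(dtype), freeze(block_swap_args), freeze(compile_args),
            bool(has_gguf), bool(patched_linear), int(patch_count))


class Transformer:
    """Hypothetical stand-in holding the cached signature and offload state."""

    def __init__(self):
        self.load_signature = None
        self.offloaded = False


def maybe_load_weights(transformer, signature, load_weights):
    """Skip the load when the configuration is unchanged and weights are resident."""
    if transformer.load_signature == signature and not transformer.offloaded:
        return False
    load_weights()
    transformer.load_signature = signature
    transformer.offloaded = False
    return True


calls = []
model_t = Transformer()
sig = weights_load_signature("cuda:0", "bf16", {"blocks": 8, "prefetch": 2}, None, True, True, 3)
first_run = maybe_load_weights(model_t, sig, lambda: calls.append("load"))
second_run = maybe_load_weights(model_t, sig, lambda: calls.append("load"))
model_t.offloaded = True          # simulate an offload between windows
third_run = maybe_load_weights(model_t, sig, lambda: calls.append("load"))
```

Freezing nested dicts into sorted tuples keeps the signature hashable and independent of key order, so two equivalent configurations compare equal.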

6. State Dict Release After Load

After load_weights() completes, patcher.model["sd"] is immediately set to None and gc.collect() + mm.soft_empty_cache() are called. This releases the original state dict before the sampling loop starts, preventing the full weight tensor data from persisting alongside the model's parameter buffers.

Files: nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py
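
The effect of dropping the reference can be observed with a weak reference. This is a standard-library sketch: patcher_model and FakeStateDict are hypothetical stand-ins for patcher.model and its state dict, and the ComfyUI-specific mm.soft_empty_cache() call is left out.

```python
import gc
import weakref


class FakeStateDict(dict):
    """dict subclass, so a weak reference can observe its lifetime."""


patcher_model = {"sd": FakeStateDict(weight=[0.0] * 1024)}
probe = weakref.ref(patcher_model["sd"])

# ... weights would be copied into the model's parameters here ...

patcher_model["sd"] = None    # drop the last strong reference to the state dict
gc.collect()                  # also collect any reference cycles
released = probe() is None
print("state dict released:", released)
```

The release only works if no other strong reference to the state dict survives, which is why the sketch clears the slot itself rather than a local alias.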

7. Memory Diagnostics

Added log_ram_usage() (psutil-based RSS reporting) and log_memory_peak() (GPU peak memory tracking) at key stages throughout the loading and sampling pipeline. Enables systematic profiling of RAM/VRAM at each pipeline stage.

Files: utils.py (implementation), nodes_model_loading.py, nodes_sampler.py, wanvideo/modules/model.py (call sites)
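
An RSS reporter along these lines could be as small as the sketch below. The function name matches the one mentioned above, but the signature, the log format, and the fallback when psutil is missing are assumptions.

```python
import os

try:
    import psutil  # third-party; the description says the real helper is psutil-based
except ImportError:
    psutil = None


def log_ram_usage(stage, log=print):
    """Report this process's resident set size (RSS) in GiB, if psutil is available."""
    if psutil is None:
        log(f"[RAM] {stage}: psutil not installed")
        return None
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    log(f"[RAM] {stage}: RSS {rss_gib:.2f} GiB")
    return rss_gib


messages = []
value = log_ram_usage("after load_weights", messages.append)
```

Calling the helper with the same stage labels before and after each pipeline step gives a simple per-stage RSS delta.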

8. LRU-Bounded Caches

Prevent unbounded dict growth in four caches:

  • Prompt extender cache (nodes.py): OrderedDict, max 128 entries
  • Radial attention mask cache (wanvideo/radial_attention/attn_mask.py): OrderedDict, max 8 entries
  • Context window tracker (context_windows/context.py): max 64 windows with eviction
  • EDM2 constant cache (Ovi/vae/edm2_utils.py): OrderedDict, max 64 entries

Without bounding, these caches grow indefinitely during long-running ComfyUI sessions.

Files: nodes.py, wanvideo/radial_attention/attn_mask.py, context_windows/context.py, Ovi/vae/edm2_utils.py
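
An OrderedDict-based bound of the kind listed above can be sketched as follows. BoundedCache is a hypothetical helper for illustration; the PR may inline the same logic at each cache site instead.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache with a fixed maximum number of entries."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict the least recently used entry

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes the most recently used entry
cache.put("c", 3)     # evicts "b", the least recently used entry
```

With max_entries set to 128, 8, or 64, the same structure covers the entry limits named in the list.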

9. Minor Fixes

  • adapter_attn_mask parameter (wanvideo/modules/model.py): Added to WanSelfAttention.forward() and WanI2VCrossAttention.forward() signatures
  • NAG attention intermediate tensor cleanup (wanvideo/modules/model.py): renamed the NAG output to x_text and added del x_positive, x_negative after the NAG branch. The explicit del k, v was removed: Python frees the tensors once their last reference goes out of scope, and the explicit del caused issues with graph retention.
  • rope_negative_offset default (wanvideo/modules/model.py): changed from 0 to 5 to prevent inverted attention patterns on the negative prompt (improves first-frame quality)
  • Error handler fix (nodes_sampler.py): force_offload in exception handler moved inside except block instead of finally (ensures offload only on actual error, not on normal exit)
  • Removed unused multitalk_audio_stride variable in nodes_sampler.py
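
The error handler change in the list above comes down to the difference between except and finally. A minimal sketch, with run_sampling, sample, and force_offload as hypothetical callables:

```python
def run_sampling(sample, force_offload):
    """Offload only when sampling raises; a normal return leaves weights in place."""
    try:
        return sample()
    except Exception:
        force_offload()    # runs only on an actual error
        raise              # re-raise so the caller still sees the failure


offloads = []


def failing_sample():
    raise RuntimeError("simulated failure")


ok_result = run_sampling(lambda: "done", lambda: offloads.append("offload"))
offloads_after_success = len(offloads)

try:
    run_sampling(failing_sample, lambda: offloads.append("offload"))
    error_seen = False
except RuntimeError:
    error_seen = True
```

Had the offload call been placed in a finally block, it would also run after every successful call.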

Files Changed (12 files)

| File | Change |
| --- | --- |
| custom_linear.py | LoRA progressive loading rewrite |
| nodes_model_loading.py | GGUF on-demand loading + mmap lifecycle + diagnostics |
| utils.py | GGUF mmap lifecycle + block swap fix + diagnostics |
| nodes_sampler.py | Weight signature caching + sd release + LoRA clear + diagnostics |
| nodes.py | LRU extender cache |
| gguf/gguf.py | set_lora_params_gguf signature (force_cpu, _diag) |
| wanvideo/modules/model.py | Diagnostics + adapter_attn_mask + NAG fix + rope_negative_offset |
| multitalk/multitalk_loop.py | Weight signature + sd release + GGUF reopen |
| skyreels/nodes.py | Weight signature + sd release + LoRA clear + GGUF reopen |
| context_windows/context.py | Window map LRU bounding |
| Ovi/vae/edm2_utils.py | Constant cache LRU bounding |
| wanvideo/radial_attention/attn_mask.py | Radial mask cache LRU bounding |

Memory Impact Summary

Platform: RTX 5070 Ti 16GB + 24GB DDR5 RAM | Model: Wan 2.2 Animate 14B Q5_K_M (GGUF) | Attention: sageattn3

| Scenario | Before | After |
| --- | --- | --- |
| GGUF on-demand loading (CPU RAM peak) | Full sd dict preload; low-RAM machines OOM | Per-parameter load-and-release; RAM stays under control |
| GGUF mmap lifecycle (CPU RAM) | OS caches entire GGUF file; persists after load | Closed after load; system RAM released |
| LoRA progressive loading (RSS) | f32 raw + bf16 copy coexist simultaneously | Per-patch processing; f32 and bf16 never coexist |
| Block swap + GGUF | Guard logic skips entirely; always triggered disk swap regardless of config | Works correctly; VRAM stays within GPU capacity |
| Block swap config | N/A | blocks=8, prefetch=2 on 40-block 14B model; async CUDA copies overlapped with compute |
| Weight signature caching | Full load_weights() on every loop iteration | Skipped when config unchanged |
| sd release after load | sd dict persists throughout sampling | Released immediately after load |
| LRU cache bounding | Unbounded dict growth | Max 8–128 entries per cache |
| Combined effect (GPU power draw) | ~78W (severe swap bottleneck; GPU severely underutilized) | 250W+ approaching full 300W TDP (swap eliminated; GPU runs at full load) |

Testing

Platform: RTX 5070 Ti 16GB + 24GB DDR5 RAM + Windows 11
Model: Wan 2.2 Animate 14B Q5_K_M (GGUF)
Attention backend: sageattn3 (via sageattn3_blackwell, Blackwell-optimized kernel; non-Blackwell GPUs fall back to sageattn v2)

| Metric | Before | After | Notes |
| --- | --- | --- | --- |
| CPU RAM peak | Severe; triggers OS disk swap | Dramatically lower; no swap | GGUF on-demand + mmap close + LoRA progressive combined |
| GPU power draw | ~78W (GPU idle waiting for memory paging) | 250W+ approaching full 300W TDP | Eliminating swap removes IO bottleneck; GPU executes at full speed |
| GGUF model load | OOM (24GB RAM insufficient for peak) | Loads successfully | RAM bottleneck removed |
| LoRA application | f32+bf16 full coexistence → RSS peak spike | Progressive processing; peak stays controlled | No dual-copy coexistence |
| Block swap + GGUF | Non-functional (skipped by guard); always triggered disk swap regardless of config | Works correctly with blocks=8, prefetch=2 | VRAM stays within 16GB |
| Attention backend | sageattn3 | sageattn3 (Blackwell-optimized kernel) | Compatible with RTX 5070 Ti |

Key finding: In the original code on a 24GB RAM system running a GGUF 14B model, three memory issues compound (GGUF sd dict full preload + mmap OS cache + LoRA f32/bf16 dual copies), driving RAM peak well beyond physical memory and forcing frequent OS swap to disk. The GPU idles at ~78W waiting on CPU/IO. Even with block swap configured, disk swap always triggers. After optimization, RAM peak drops to safe levels, swap is eliminated entirely, and GPU power draw recovers to 250W+ approaching full 300W TDP.

Functional Verification

  • Wan 2.2 Animate 14B Q5_K_M model loads successfully; CPU RAM stays within safe limits
  • Block swap works correctly under GGUF with blocks=8, prefetch=2 — reopen_gguf_readers operates without errors
  • LoRA CustomLinear modules all matched correctly; no memory doubling
  • sageattn3 attention backend compatible
  • All existing workflows remain backward compatible

- GGUF on-demand tensor loading: tensor_map avoids full sd dict in RAM
- GGUF mmap lifecycle: close/reopen readers for block swap
- LoRA progressive loading: per-patch processing avoids f32/bf16 coexistence
- Block swap fix: remove guard that skipped block swap for GGUF entirely
- Weight load signature caching: skip redundant load_weights calls
- State dict release after load: free patcher.model['sd'] immediately
- LRU-bounded caches: extender, radial mask, context window, EDM2 constant
- Memory diagnostics: log_ram_usage() and log_memory_peak() at key stages
- adapter_attn_mask parameter, NAG tensor cleanup, rope_negative_offset=5
- Error handler: force_offload moved inside except block

Tested: RTX 5070 Ti 16GB + 24GB DDR5, Wan 2.2 Animate 14B Q5_K_M GGUF,
sageattn3, blocks=8 prefetch=2. GPU power: 78W -> 250W+ (full 300W TDP)
@xiaolibai-sys xiaolibai-sys force-pushed the fix/gguf-ram-optimization-lora-memory-block-swap branch from 04c231d to 772c3e6 Compare June 23, 2026 10:50
@xiaolibai-sys xiaolibai-sys changed the title GGUF RAM optimization, LoRA memory fix, Block swap fix & LongCat 1.5 support GGUF RAM optimization, LoRA memory fix, Block swap fix Jun 23, 2026