GGUF RAM optimization, LoRA memory fix, Block swap fix #2036
Open
xiaolibai-sys wants to merge 1 commit into
- GGUF on-demand tensor loading: tensor_map avoids full sd dict in RAM
- GGUF mmap lifecycle: close/reopen readers for block swap
- LoRA progressive loading: per-patch processing avoids f32/bf16 coexistence
- Block swap fix: remove guard that skipped block swap for GGUF entirely
- Weight load signature caching: skip redundant load_weights calls
- State dict release after load: free patcher.model['sd'] immediately
- LRU-bounded caches: extender, radial mask, context window, EDM2 constant
- Memory diagnostics: log_ram_usage() and log_memory_peak() at key stages
- adapter_attn_mask parameter, NAG tensor cleanup, rope_negative_offset=5
- Error handler: force_offload moved inside except block

Tested: RTX 5070 Ti 16GB + 24GB DDR5, Wan 2.2 Animate 14B Q5_K_M GGUF, sageattn3, blocks=8 prefetch=2.
GPU power: 78W -> 250W+ (full 300W TDP)
Pull Request: GGUF RAM Optimization, LoRA Memory Fix, Block Swap Fix
TL;DR
Fixes critical RAM exhaustion when loading GGUF-quantized models (tested with Wan 2.2 Animate 14B Q5_K_M), rewrites LoRA application to avoid memory doubling, and fixes block swap for GGUF. Three bugfixes plus diagnostic tooling across 12 files.
Changes
1. GGUF On-Demand Tensor Loading (=== RAM peak ===)
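As a rough illustration of the on-demand pattern this change introduces (keep a map of lightweight references, materialize one tensor at a time), here is a minimal sketch. It is not the code from nodes_model_loading.py: `LazyTensor`, `build_tensor_map`, and `load_on_demand` are hypothetical names, and plain `bytes` objects stand in for tensor data.

```python
from typing import Callable, Dict, Iterable


class LazyTensor:
    """Hypothetical stand-in for a reader tensor: a reference that copies data only on load()."""

    def __init__(self, name: str, nbytes: int) -> None:
        self.name = name
        self.nbytes = nbytes
        self.loads = 0  # how many times the data was materialized

    def load(self) -> bytes:
        self.loads += 1
        return bytes(self.nbytes)  # the real code would read from the GGUF file here


def build_tensor_map(tensors: Iterable[LazyTensor]) -> Dict[str, LazyTensor]:
    """Map names to references only; no tensor data is copied at this point."""
    return {t.name: t for t in tensors}


def load_on_demand(
    param_names: Iterable[str],
    tensor_map: Dict[str, LazyTensor],
    assign: Callable[[str, bytes], None],
) -> int:
    """Load, assign, and release one tensor per iteration; returns how many were assigned."""
    assigned = 0
    for name in param_names:
        ref = tensor_map.get(name)
        if ref is None:
            continue  # parameter has no entry in the file
        data = ref.load()   # the only freshly loaded copy alive in this loop
        assign(name, data)  # hand it to the model right away
        del data            # drop the local reference before the next iteration
        assigned += 1
    return assigned
```

The contrast with a pre-built dict is that `build_tensor_map` never calls `load()`, so the peak is bounded by the largest single tensor rather than by the sum of all tensors.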
Problem: load_weights() built a full sd dict containing data copies of every GGUF tensor simultaneously in CPU RAM. For a 14B model this meant massive RAM peak, often triggering disk swap on 24GB machines.
Fix: Replaced pre-built sd dict with tensor_map (name → reader tensor reference, zero data copies). Each tensor is loaded on-demand during the named_parameters loop, assigned to the model immediately, and freed before the next iteration. Only one tensor's data copy exists in RAM at any given time.
Files: nodes_model_loading.py
2. GGUF Mmap Lifecycle Management (=== RAM reduction after load ===)
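The close/reopen lifecycle covered in this section can be sketched with the standard-library `mmap` module. This is an illustration under assumptions, not the implementation in utils.py: `MappedFile` is a hypothetical class, and the real reader is built on np.memmap rather than `mmap.mmap`.

```python
import mmap
from typing import BinaryIO, Optional


class MappedFile:
    """Hypothetical read-only mapping that can be closed and reopened on demand."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._fh: Optional[BinaryIO] = None
        self._mm: Optional[mmap.mmap] = None
        self.reopen()

    @property
    def closed(self) -> bool:
        return self._mm is None

    def reopen(self) -> None:
        """Map the file again if it is currently closed; no-op otherwise."""
        if self._mm is None:
            self._fh = open(self.path, "rb")
            self._mm = mmap.mmap(self._fh.fileno(), 0, access=mmap.ACCESS_READ)

    def close(self) -> None:
        """Unmap the file so the process no longer holds the mapped pages."""
        if self._mm is not None:
            self._mm.close()
            self._fh.close()
            self._mm = None
            self._fh = None

    def read(self, offset: int, size: int) -> bytes:
        if self._mm is None:
            raise ValueError("mapping is closed; call reopen() first")
        return self._mm[offset:offset + size]
```

A caller would `close()` once the initial load is done and `reopen()` only when a later stage needs to read from the file again.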
Problem: The GGUFReader uses np.memmap, causing the OS to cache the entire GGUF file (~14GB) in the process working set. After load_weights() completes, the file cache persists indefinitely.
Fix: Added close_gguf_readers() / reopen_gguf_readers() to manage mmap lifecycle:
- close_gguf_readers(): called after initial weight load to release OS file cache
- reopen_gguf_readers(): called before block swap weight reload (when gguf_reader was previously closed)
Files: utils.py (implementation), nodes_model_loading.py, nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py (call sites)
3. Progressive LoRA Loading (=== RAM peak ===)
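The per-patch idea in this section (convert and apply one patch, release it, then move to the next) can be sketched without PyTorch. Everything here is an assumption made for illustration: `apply_patches_progressively` is a hypothetical function, and `array('d')` / `array('f')` stand in for the higher- and lower-precision tensor copies.

```python
from array import array
from typing import Dict


def apply_patches_progressively(
    weights: Dict[str, array],
    patches: Dict[str, array],
    scale: float = 1.0,
) -> Dict[str, array]:
    """Apply additive patches one at a time, consuming the `patches` dict.

    At most one converted patch exists at any moment, instead of a full
    converted set living next to the full original set.
    """
    while patches:
        name, delta_hi = patches.popitem()  # take one patch out of the dict
        delta_lo = array("f", delta_hi)     # convert only this patch
        del delta_hi                        # release the high-precision copy
        target = weights[name]
        for i, d in enumerate(delta_lo):
            target[i] += scale * d          # in-place update of the weight
        del delta_lo                        # release before the next patch
    return weights
```

Because the function pops entries as it goes, the patch dict is empty when it returns, which mirrors clearing the patches after all patching completes.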
Problem: set_lora_params() recursively collected all float32 LoRA tensors, then converted them to bfloat16, keeping BOTH copies alive simultaneously.
Fix: Rewrote LoRA application in custom_linear.py to process per-patch:
- patcher.patches.clear() after all patching completes
Additionally, gguf/gguf.py passes force_cpu=True and a _diag dict for diagnostic output, and patches/scale_weights params flow through _replace_linear.
Files: custom_linear.py (complete rewrite), gguf/gguf.py, nodes_sampler.py, skyreels/nodes.py
4. Block Swap Fix for GGUF
Problem: init_blockswap() had an if not patched_linear: return guard that skipped block swap entirely when using GGUF models.
Fix: Restructured init_blockswap() to use three independent conditions:
- block_swap_args → always apply block swap (regardless of GGUF)
- auto_cpu_offload → apply offload
- transformer.to(device) with not patched_linear
Tested configuration: blocks=8, prefetch=2 on 40-block 14B model. Prefetch uses independent CUDA streams for async block transfers with CUDA event synchronization. Before fix: block swap always triggered disk swap regardless of config. After fix: GPU stays at full load, no disk swap.
Files: utils.py
5. Weight Load Signature Caching
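A minimal sketch of the signature check covered in this section: build a hashable tuple from the load configuration, store it on the model, and skip the reload when nothing changed. The names below (`weights_load_signature`, `ModelStub`, `ensure_weights_loaded`) and the reduced field set are assumptions for illustration; the PR's `_weights_load_signature()` covers more fields.

```python
from typing import Callable, Dict, Hashable, Optional, Tuple


def weights_load_signature(
    device: str,
    dtype: str,
    block_swap_args: Optional[Dict[str, int]],
    patch_count: int,
) -> Tuple[Hashable, ...]:
    """Reduce the load configuration to a hashable, comparable tuple."""
    swap = tuple(sorted(block_swap_args.items())) if block_swap_args else None
    return (device, dtype, swap, patch_count)


class ModelStub:
    """Hypothetical model object that remembers how its weights were loaded."""

    def __init__(self) -> None:
        self.load_signature: Optional[Tuple[Hashable, ...]] = None
        self.offloaded = True


def ensure_weights_loaded(
    model: ModelStub,
    signature: Tuple[Hashable, ...],
    load_fn: Callable[[ModelStub], None],
) -> bool:
    """Call load_fn only if the configuration changed or weights were offloaded.

    Returns True when a load actually happened.
    """
    if model.load_signature == signature and not model.offloaded:
        return False  # same configuration, weights still resident: skip
    load_fn(model)
    model.load_signature = signature
    model.offloaded = False
    return True
```

Sorting the dict items keeps the signature independent of key order, so two equivalent configurations compare equal.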
Added _weights_load_signature(), which computes a hashable tuple of the current weight-load configuration (device, dtype, block_swap_args, compile_args, GGUF presence, patched_linear state, patch count). This signature is stored on the transformer and compared on subsequent calls; if it is unchanged and weights haven't been offloaded, load_weights() is skipped entirely. Critical for fast multi-window WanAnimate and MultiTalk pipelines.
Files: nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py
6. State Dict Release After Load
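The release step in this section amounts to dropping the last reference to the state dict and then collecting. A small sketch, assuming a plain dict slot in place of patcher.model and a hypothetical `StateDict` subclass (used only so the object can be tracked with a weak reference); the ComfyUI cache call from the PR is left out.

```python
import gc
import weakref
from typing import Any, Dict


class StateDict(dict):
    """Hypothetical dict subclass; plain dicts cannot be weakly referenced."""


def release_state_dict(model_slot: Dict[str, Any]) -> None:
    """Drop the state dict held under 'sd' and run a collection pass."""
    model_slot["sd"] = None
    gc.collect()
```

The memory is only returned if no other reference to the state dict survives, so callers need to drop their own local references as well.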
After load_weights() completes, patcher.model["sd"] is immediately set to None and gc.collect() + mm.soft_empty_cache() are called. This releases the original state dict before the sampling loop starts, preventing the full weight tensor data from persisting alongside the model's parameter buffers.
Files: nodes_sampler.py, multitalk/multitalk_loop.py, skyreels/nodes.py
7. Memory Diagnostics
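For orientation, RSS reporting of the kind this section adds can look like the sketch below. The signature, message format, and the fallback when psutil is missing are assumptions, not the PR's `log_ram_usage()` implementation; GPU peak tracking is not shown.

```python
import os
from typing import Callable, Optional

try:
    import psutil  # third-party; treated as optional in this sketch
except ImportError:
    psutil = None


def rss_gib() -> Optional[float]:
    """Resident set size of this process in GiB, or None without psutil."""
    if psutil is None:
        return None
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 3)


def log_ram_usage(stage: str, log: Callable[[str], None] = print) -> str:
    """Emit one line describing process RAM usage at a named pipeline stage."""
    rss = rss_gib()
    detail = f"{rss:.2f} GiB RSS" if rss is not None else "psutil not installed"
    message = f"[RAM] {stage}: {detail}"
    log(message)
    return message
```

Calling it with a distinct stage label before and after each heavy step makes it possible to attribute a RAM jump to one stage.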
Added log_ram_usage() (psutil-based RSS reporting) and log_memory_peak() (GPU peak memory tracking) at key stages throughout the loading and sampling pipeline. Enables systematic profiling of RAM/VRAM at each pipeline stage.
Files: utils.py (implementation), nodes_model_loading.py, nodes_sampler.py, wanvideo/modules/model.py (call sites)
8. LRU-Bounded Caches
Prevent unbounded dict growth in four caches:
- Extender cache (nodes.py): OrderedDict, max 128 entries
- Radial mask cache (wanvideo/radial_attention/attn_mask.py): OrderedDict, max 8 entries
- Context window cache (context_windows/context.py): max 64 windows with eviction
- EDM2 constant cache (Ovi/vae/edm2_utils.py): OrderedDict, max 64 entries
Without bounding, these caches grow indefinitely during long-running ComfyUI sessions.
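The OrderedDict-based bounding used for these caches can be sketched as follows. `LRUCache` is a hypothetical helper written for illustration; the PR may inline the same logic at each call site rather than use a shared class.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the oldest entry

    def __contains__(self, key: Hashable) -> bool:
        return key in self._data

    def __len__(self) -> int:
        return len(self._data)
```

With `max_entries=8`, for example, a ninth distinct key evicts whichever entry was read or written least recently.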
Files: nodes.py, wanvideo/radial_attention/attn_mask.py, context_windows/context.py, Ovi/vae/edm2_utils.py
9. Minor Fixes
- adapter_attn_mask parameter (wanvideo/modules/model.py): Added to WanSelfAttention.forward() and WanI2VCrossAttention.forward() signatures
- NAG tensor cleanup (wanvideo/modules/model.py): renamed NAG output to x_text; del x_positive, x_negative after NAG branch; del k, v removed (PyTorch handles RAII, explicit del caused issues with graph retention)
- rope_negative_offset default (wanvideo/modules/model.py): changed from 0 to 5 to prevent inverted attention patterns on negative prompt (improves first-frame quality)
- Error handler (nodes_sampler.py): force_offload in exception handler moved inside except block instead of finally (ensures offload only on actual error, not on normal exit)
- multitalk_audio_stride variable in nodes_sampler.py
Files Changed (12 files)
- custom_linear.py
- nodes_model_loading.py
- utils.py
- nodes_sampler.py
- nodes.py
- gguf/gguf.py
- wanvideo/modules/model.py
- multitalk/multitalk_loop.py
- skyreels/nodes.py
- context_windows/context.py
- Ovi/vae/edm2_utils.py
- wanvideo/radial_attention/attn_mask.py
Memory Impact Summary
Platform: RTX 5070 Ti 16GB + 24GB DDR5 RAM | Model: Wan 2.2 Animate 14B Q5_K_M (GGUF) | Attention: sageattn3
Testing
Platform: RTX 5070 Ti 16GB + 24GB DDR5 RAM + Windows 11
Model: Wan 2.2 Animate 14B Q5_K_M (GGUF)
Attention backend: sageattn3 (via sageattn3_blackwell, Blackwell-optimized kernel; non-Blackwell GPUs fall back to sageattnv2)
Key finding: In the original code on a 24GB RAM system running a GGUF 14B model, three memory issues compound (GGUF sd dict full preload + mmap OS cache + LoRA f32/bf16 dual copies), driving RAM peak well beyond physical memory and forcing frequent OS swap to disk. The GPU idles at ~78W waiting on CPU/IO. Even with block swap configured, disk swap always triggers. After optimization, RAM peak drops to safe levels, swap is eliminated entirely, and GPU power draw recovers to 250W+ approaching full 300W TDP.
Functional Verification
reopen_gguf_readers operates without errors