Add support for NVFP4/FP8 mixed quantized checkpoints in ComfyUI by mattneel · Pull Request #2029 · kijai/ComfyUI-WanVideoWrapper

mattneel · 2026-06-03T04:38:04Z

This pull request adds support for loading ComfyUI-native quantized checkpoints (NVFP4/FP8 mixed precision) in WanVideoWrapper. It introduces a new loader that reconstructs quantized weights as QuantizedTensor objects, ensuring compatibility with ComfyUI's efficient inference kernels. The changes also update the model loading pipeline to detect and properly handle these quantized checkpoints, avoiding unnecessary conversions and ensuring correct dispatch to the optimized GEMM kernels.

ComfyUI-native quantized checkpoint (NVFP4/FP8) support:

Added a new module comfy_quant_linear.py that detects ComfyUI-native quantized checkpoints and reconstructs quantized weights as QuantizedTensor objects, enabling direct use of ComfyUI's NVFP4/FP8 GEMM kernels.
Updated the model loading function in nodes_model_loading.py to detect ComfyUI quantized checkpoints, invoke the new loader, and skip redundant weight assignments for quantized layers.
Modified the weight renaming logic to avoid interfering with ComfyUI-native quantized checkpoints, preserving their expected tensor names.

Integration with existing model code:

Updated custom_linear.py to ensure quantized weights are kept intact and dispatched correctly, bypassing any conversion that would break quantized inference.
Imported the new quantized checkpoint utilities into nodes_model_loading.py for use in the model loading pipeline.

ComfyUI core (>=0.23) ships native NVFP4 + mixed-precision quantization via comfy.quant_ops, with the FP4/FP8 GEMM kernels provided by comfy_kitchen. Such checkpoints store, per quantized linear, the packed weight (uint8 for NVFP4 / float8 for FP8) plus scale tensors and a per-layer `comfy_quant` JSON marker, and a top-level `_quantization_metadata` header. WanVideoWrapper's loader only handled GGUF and fp8-scaled, so these files failed to load.

This adds auto-detected support: when the state dict contains `*.comfy_quant` keys, the affected nn.Linear weights are reconstructed as comfy QuantizedTensor objects (the same way ComfyUI core's _lazy_load_from_state_dict does), so the linear dispatches to comfy_kitchen's scaled_mm_nvfp4 / FP8 GEMM via __torch_dispatch__. No new kernels are introduced; it reuses what ComfyUI ships.

- comfy_quant_linear.py: detection + QuantizedTensor reconstruction (NVFP4/FP8)
- nodes_model_loading.py: detect in load_weights, reconstruct, and skip the already-loaded quantized params in the main assignment loop

Notes/limitations (open to maintainer guidance):

- weights load on the main transformer device; block-swap-aware placement and LoRA-merge for quantized layers follow the existing "no merge for quantized weights" rule and are left as follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
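The detection step described above can be sketched without torch. This is a minimal illustration, not the code in comfy_quant_linear.py: the function names are hypothetical, the marker is assumed to be readable as UTF-8 JSON bytes or a string, and the `"format"` field in the toy marker is an assumed name.

```python
import json

QUANT_MARKER_SUFFIX = ".comfy_quant"  # per-layer marker key named in the PR


def is_comfy_quant_checkpoint(state_dict):
    """True if any key carries the per-layer comfy_quant marker."""
    return any(key.endswith(QUANT_MARKER_SUFFIX) for key in state_dict)


def find_comfy_quant_layers(state_dict):
    """Map each quantized layer prefix to its decoded JSON marker."""
    layers = {}
    for key, value in state_dict.items():
        if not key.endswith(QUANT_MARKER_SUFFIX):
            continue
        prefix = key[: -len(QUANT_MARKER_SUFFIX)]
        # Assumption: the marker arrives as UTF-8 JSON bytes or as a str.
        raw = value.decode("utf-8") if isinstance(value, (bytes, bytearray)) else value
        layers[prefix] = json.loads(raw)
    return layers


# Toy state dict; the tensor payloads are placeholders, not real data.
toy_sd = {
    "blocks.0.ffn.0.weight": "<packed uint8 data>",
    "blocks.0.ffn.0.comfy_quant": b'{"format": "nvfp4"}',  # field name assumed
    "blocks.0.norm.weight": "<unquantized data>",
}
quantized = find_comfy_quant_layers(toy_sd)
```

Layers whose prefix appears in `quantized` would take the reconstruction path; every other key would go through the existing assignment loop unchanged.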

The initial loader bound QuantizedTensor weights but the forward pass failed for NVFP4 layers with `mat1 and mat2 shapes cannot be multiplied (Mx5120 and 2560x5120)`.

Three NVFP4-specific issues (FP8 worked because its packed width equals its logical width):

- comfy_quant_linear: derive the logical shape from the packed qdata + packing factor (in = qdata.shape[1]*2 for NVFP4, *1 for FP8) instead of trusting module.in_features. The model's Linear is instantiated from the checkpoint's stored weight width; for NVFP4 that's the packed half-width (2560), so using it for Params.orig_shape made dequantize() return a half-width tensor and the GEMM failed. This is the root cause.
- comfy_quant_linear: route quantized CustomLinear through _linear_forward_direct so plain F.linear -> aten.linear.default dispatches to comfy_kitchen's NVFP4/FP8 GEMM, instead of the wanvideo.linear_forward custom op (a custom-op boundary can strip the tensor subclass). Also clear scale_weight / is_gguf on those modules.
- custom_linear: _prepare_weight returns the QuantizedTensor intact for comfy_quant layers; a `.to(input)` cast there is unnecessary and risks collapsing it.
- nodes_model_loading: skip the `.weight_scale`->`.scale_weight` rename for comfy_quant checkpoints, which keep their own scale tensor names.

Validated end-to-end: a real Wan2.2-Animate generation (ViTPose pose+face conditioning, 20-step) on the NVFP4-mixed checkpoint matches the FP8 baseline at 27 dB PSNR / 0.94 correlation, visually identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
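The shape arithmetic behind the root cause can be checked with plain tuples. This is a sketch only: the helper names are hypothetical, and the packing factors are the ones stated in the commit message (2 for NVFP4, 1 for FP8).

```python
# Logical values stored per element along the input dimension.
PACKING_FACTOR = {"nvfp4": 2, "fp8": 1}


def logical_weight_shape(packed_shape, fmt):
    """Recover (out_features, in_features) from the stored (packed) weight shape."""
    out_features, packed_in = packed_shape
    return (out_features, packed_in * PACKING_FACTOR[fmt])


def can_apply_linear(activation_shape, weight_shape):
    """F.linear computes x @ W.T, so x's last dim must equal W's in_features."""
    return activation_shape[-1] == weight_shape[1]


stored = (5120, 2560)   # NVFP4 weight as stored: (out, in/2)
x_shape = (7, 5120)     # 7 tokens, 5120 features (M = 7 is arbitrary)

# Trusting the stored width reproduces the mismatch behind the error message.
broken = can_apply_linear(x_shape, stored)
# Deriving the logical width from the packing factor resolves it.
fixed = can_apply_linear(x_shape, logical_weight_shape(stored, "nvfp4"))
```

With FP8 the factor is 1, so the stored and logical shapes coincide, which is consistent with FP8 layers working before the fix.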

mattneel · 2026-06-03T04:40:13Z

Update — 3317577: validated end-to-end, not just loading.

The initial commit bound the QuantizedTensors, but NVFP4 layers still failed in the forward pass with `mat1 and mat2 shapes cannot be multiplied (Mx5120 and 2560x5120)`. The follow-up fixes three NVFP4-specific issues (FP8 worked throughout because its packed width equals its logical width):

Root cause — packed vs logical width. The wrapper instantiates each Linear from the checkpoint's stored weight shape. NVFP4 packs two FP4 values per uint8, so the stored weight is (out, in/2) → module.in_features is the packed half-width (2560 for a 5120-wide layer). Using it for Params.orig_shape made dequantize() return a half-width tensor and the GEMM failed. Fix: derive the logical shape from qdata.shape[1] * (2 if nvfp4 else 1).
Dispatch. Quantized CustomLinear now uses _linear_forward_direct (plain F.linear → aten.linear.default → comfy_kitchen NVFP4/FP8 GEMM) instead of the wanvideo.linear_forward custom op — a torch.library.custom_op boundary can strip the tensor subclass so __torch_dispatch__ never fires.
_prepare_weight keeps the QuantizedTensor intact for comfy_quant layers (no .to(input) collapse), and the .weight_scale→.scale_weight rename is skipped for these checkpoints.
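To make the "two FP4 values per uint8" point concrete, here is a small pure-Python pack/unpack round trip. It only illustrates why a row of `in` values is stored as `in/2` bytes; the nibble order (first code in the high nibble) is an assumption and may not match the layout comfy_kitchen uses, and both function names are hypothetical.

```python
def pack_fp4_codes(codes):
    """Pack an even-length sequence of 4-bit codes (0..15) into bytes."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits")
    # Assumption: the first code of each pair goes in the high nibble.
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_fp4_codes(packed):
    """Inverse of pack_fp4_codes: two 4-bit codes per stored byte."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return codes


row = [1, 2, 15, 0, 7, 8]          # 6 logical values
stored_row = pack_fp4_codes(row)   # 3 stored bytes: half the logical width
restored = unpack_fp4_codes(stored_row)
```

The stored row is half as wide as the logical row, which is the same relationship as the 2560 vs 5120 widths in the error above.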

Validation (RTX 5090, torch 2.12+cu130, comfy_kitchen): a real Wan2.2-Animate-14B generation — ViTPose pose+face conditioning, 832×480, 49 frames, 20-step / cfg 6, seed 42 — comparing an NVFP4-mixed checkpoint (238 NVFP4 + 242 FP8 layers, ~15 GB) against an all-FP8 build (~18 GB). Same seed + identical conditioning → 27.2 dB mean PSNR, 0.94 pixel correlation, ~1.6% mean delta, stable across all 49 frames; outputs are visually indistinguishable and the amplified difference is high-frequency only. ~17.3 GB peak VRAM.
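For anyone who wants to reproduce numbers of this kind, the sketch below shows one common way to compute them. The exact definitions used for the figures above are not stated in this PR, so treat these as assumptions: PSNR per frame against a peak of 255 and then averaged, Pearson correlation over flattened pixels, and mean delta as mean absolute difference relative to the peak. Frames are flat lists of 8-bit values and all function names are hypothetical.

```python
import math


def psnr(frame_a, frame_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)) / len(frame_a)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def mean_psnr(frames_a, frames_b, peak=255.0):
    """Average of the per-frame PSNR values."""
    scores = [psnr(a, b, peak) for a, b in zip(frames_a, frames_b)]
    return sum(scores) / len(scores)


def pixel_correlation(pixels_a, pixels_b):
    """Pearson correlation of two flattened pixel sequences."""
    n = len(pixels_a)
    mean_a = sum(pixels_a) / n
    mean_b = sum(pixels_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(pixels_a, pixels_b))
    var_a = sum((x - mean_a) ** 2 for x in pixels_a)
    var_b = sum((y - mean_b) ** 2 for y in pixels_b)
    if var_a == 0 or var_b == 0:
        return math.nan  # undefined for constant input
    return cov / math.sqrt(var_a * var_b)


def mean_delta_percent(pixels_a, pixels_b, peak=255.0):
    """Mean absolute difference as a percentage of the peak value."""
    total = sum(abs(x - y) for x, y in zip(pixels_a, pixels_b))
    return 100.0 * total / (len(pixels_a) * peak)
```

In practice the same formulas would be applied with array operations over decoded video frames; the list-based version is only meant to pin down the definitions.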

Scope / not yet tested: single-window animation mode, no LoRA (the unmerged WanVideoSetLoRAs path should apply to quantized weights like scaled-fp8 does, but I haven't verified it with NVFP4); block-swap with NVFP4 is untested post-fix. Checkpoints were produced with comfy's own TensorCoreNVFP4Layout.quantize, so the on-disk layout matches what MixedPrecisionOps loads.

nvfp4_vs_fp8_sidebyside.mp4

mattneel and others added 2 commits June 2, 2026 21:39

mattneel wants to merge 2 commits into kijai:main from mattneel:feat/nvfp4-comfy-quant
