
Add support for NVFP4/FP8 mixed quantized checkpoints in ComfyUI #2029

Open
mattneel wants to merge 2 commits into kijai:main from mattneel:feat/nvfp4-comfy-quant


@mattneel mattneel commented Jun 3, 2026


This pull request adds support for loading ComfyUI-native quantized checkpoints (NVFP4/FP8 mixed precision) in WanVideoWrapper. It introduces a new loader that reconstructs quantized weights as QuantizedTensor objects, ensuring compatibility with ComfyUI's efficient inference kernels. The changes also update the model loading pipeline to detect and properly handle these quantized checkpoints, avoiding unnecessary conversions and ensuring correct dispatch to the optimized GEMM kernels.

ComfyUI-native quantized checkpoint (NVFP4/FP8) support:

  • Added a new module comfy_quant_linear.py that detects ComfyUI-native quantized checkpoints and reconstructs quantized weights as QuantizedTensor objects, enabling direct use of ComfyUI's NVFP4/FP8 GEMM kernels.
  • Updated the model loading function in nodes_model_loading.py to detect ComfyUI quantized checkpoints, invoke the new loader, and skip redundant weight assignments for quantized layers.
  • Modified the weight renaming logic to avoid interfering with ComfyUI-native quantized checkpoints, preserving their expected tensor names.
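
The rename guard described in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `rename_scale_keys` and the `is_comfy_quant` flag are hypothetical, and only the `.weight_scale` to `.scale_weight` mapping is taken from the PR text.

```python
def rename_scale_keys(state_dict, is_comfy_quant):
    """Return a copy of state_dict with '<layer>.weight_scale' renamed to
    '<layer>.scale_weight', unless the checkpoint is ComfyUI-native quantized.

    Hypothetical helper: the real logic lives in nodes_model_loading.py.
    """
    if is_comfy_quant:
        # comfy_quant checkpoints keep their own scale tensor names.
        return dict(state_dict)

    old_suffix, new_suffix = ".weight_scale", ".scale_weight"
    renamed = {}
    for key, value in state_dict.items():
        if key.endswith(old_suffix):
            key = key[: -len(old_suffix)] + new_suffix
        renamed[key] = value
    return renamed
```

Passing the flag explicitly keeps the existing fp8-scaled path unchanged while leaving ComfyUI-native checkpoints untouched.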

Integration with existing model code:

  • Updated custom_linear.py to ensure quantized weights are kept intact and dispatched correctly, bypassing any conversion that would break quantized inference.
  • Imported the new quantized checkpoint utilities into nodes_model_loading.py for use in the model loading pipeline.

mattneel and others added 2 commits June 2, 2026 21:39
ComfyUI core (>=0.23) ships native NVFP4 + mixed-precision quantization via
comfy.quant_ops, with the FP4/FP8 GEMM kernels provided by comfy_kitchen. Such
checkpoints store, per quantized linear, the packed weight (uint8 for NVFP4 /
float8 for FP8) plus scale tensors and a per-layer `comfy_quant` JSON marker,
and a top-level `_quantization_metadata` header. WanVideoWrapper's loader only
handled GGUF and fp8-scaled, so these files failed to load.

This adds auto-detected support: when the state dict contains `*.comfy_quant`
keys, the affected nn.Linear weights are reconstructed as comfy QuantizedTensor
objects (the same way ComfyUI core's _lazy_load_from_state_dict does), so the
linear dispatches to comfy_kitchen's scaled_mm_nvfp4 / FP8 GEMM via
__torch_dispatch__. No new kernels are introduced; it reuses what ComfyUI ships.

- comfy_quant_linear.py: detection + QuantizedTensor reconstruction (NVFP4/FP8)
- nodes_model_loading.py: detect in load_weights, reconstruct, and skip the
  already-loaded quantized params in the main assignment loop
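
A rough sketch of the detection step, under stated assumptions: the helper name `find_comfy_quant_layers` is hypothetical, the marker is assumed to be UTF-8 JSON stored as byte-like data, and the `"format"` field in the usage example is illustrative only. The actual loader in comfy_quant_linear.py may differ.

```python
import json

MARKER_SUFFIX = ".comfy_quant"  # per-layer marker key suffix named in the commit message


def find_comfy_quant_layers(state_dict):
    """Map each quantized layer prefix to its decoded JSON marker.

    Assumption: each marker value is byte-like (bytes, bytearray, or a list of
    ints). With a torch uint8 tensor one would convert first, for example with
    bytes(tensor.tolist()).
    """
    layers = {}
    for key, value in state_dict.items():
        if not key.endswith(MARKER_SUFFIX):
            continue
        prefix = key[: -len(MARKER_SUFFIX)]
        layers[prefix] = json.loads(bytes(value).decode("utf-8"))
    return layers


def is_comfy_quant_checkpoint(state_dict):
    """True when at least one '<layer>.comfy_quant' key is present."""
    return any(key.endswith(MARKER_SUFFIX) for key in state_dict)
```

For example, a state dict containing `blocks.0.q.comfy_quant` with the bytes of `{"format": "nvfp4"}` would yield `{"blocks.0.q": {"format": "nvfp4"}}`, and the set of prefixes can then be used to skip those parameters in the main assignment loop.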

Notes/limitations (open to maintainer guidance):
- weights load on the main transformer device; block-swap-aware placement and
  LoRA-merge for quantized layers follow the existing "no merge for quantized
  weights" rule and are left as follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The initial loader bound QuantizedTensor weights but the forward pass failed
for NVFP4 layers with `mat1 and mat2 shapes cannot be multiplied (Mx5120 and
2560x5120)`. Three NVFP4-specific issues (FP8 worked because its packed width
equals its logical width):

- comfy_quant_linear: derive the logical shape from the packed qdata + packing
  factor (in = qdata.shape[1]*2 for NVFP4, *1 for FP8) instead of trusting
  module.in_features. The model's Linear is instantiated from the checkpoint's
  stored weight width; for NVFP4 that's the packed half-width (2560), so using
  it for Params.orig_shape made dequantize() return a half-width tensor and the
  GEMM failed. This is the root cause.

- comfy_quant_linear: route quantized CustomLinear through _linear_forward_direct
  so plain F.linear -> aten.linear.default dispatches to comfy_kitchen's NVFP4/FP8
  GEMM, instead of the wanvideo.linear_forward custom op (a custom-op boundary can
  strip the tensor subclass). Also clear scale_weight / is_gguf on those modules.

- custom_linear: _prepare_weight returns the QuantizedTensor intact for comfy_quant
  layers; a `.to(input)` cast there is unnecessary and risks collapsing it.

- nodes_model_loading: skip the `.weight_scale`->`.scale_weight` rename for
  comfy_quant checkpoints, which keep their own scale tensor names.

Validated end-to-end: a real Wan2.2-Animate generation (ViTPose pose+face
conditioning, 20-step) on the NVFP4-mixed checkpoint matches the FP8 baseline at
27 dB PSNR / 0.94 correlation, visually identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mattneel commented Jun 3, 2026


Update — 3317577: validated end-to-end, not just loading.

The initial commit bound the QuantizedTensors but NVFP4 layers still failed in the forward with mat1 and mat2 shapes cannot be multiplied (Mx5120 and 2560x5120). The follow-up fixes three NVFP4-specific issues (FP8 worked throughout because its packed width equals its logical width):

  1. Root cause: packed vs logical width. The wrapper instantiates each Linear from the checkpoint's stored weight shape. NVFP4 packs two FP4 values per uint8, so the stored weight is (out, in/2) and module.in_features is the packed half-width (2560 for a 5120-wide layer). Using it for Params.orig_shape made dequantize() return a half-width tensor and the GEMM failed. Fix: derive the logical shape from qdata.shape[1] * (2 if nvfp4 else 1).
  2. Dispatch. Quantized CustomLinear now uses _linear_forward_direct (plain F.linear → aten.linear.default → comfy_kitchen NVFP4/FP8 GEMM) instead of the wanvideo.linear_forward custom op, because a torch.library.custom_op boundary can strip the tensor subclass so __torch_dispatch__ never fires.
  3. _prepare_weight keeps the QuantizedTensor intact for comfy_quant layers (no .to(input) collapse), and the .weight_scale → .scale_weight rename is skipped for these checkpoints.
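
The shape fix in item 1 can be illustrated with plain Python. This is a sketch: `logical_shape` and `pack_fp4_codes` are hypothetical names, and the nibble order used for packing is an assumption made only to show why the stored width is halved.

```python
# Logical values per stored element along the input dimension (from the PR text).
PACKING_FACTOR = {"nvfp4": 2, "fp8": 1}


def logical_shape(qdata_shape, fmt):
    """Recover (out_features, in_features) from the packed qdata shape."""
    out_features, packed_in = qdata_shape
    return (out_features, packed_in * PACKING_FACTOR[fmt])


def pack_fp4_codes(codes):
    """Pack pairs of 4-bit codes into single bytes.

    Assumption: the first code of each pair goes in the high nibble. The real
    on-disk nibble order is defined by comfy's layout, not by this sketch.
    """
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 0xF for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


# A 5120-wide NVFP4 layer is stored as (5120, 2560); the logical shape is (5120, 5120).
nvfp4_shape = logical_shape((5120, 2560), "nvfp4")
# FP8 stores one value per element, so the shape is unchanged.
fp8_shape = logical_shape((5120, 5120), "fp8")
```

Deriving the width from the packed data rather than from module.in_features keeps the computation correct regardless of how the Linear was instantiated.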

Validation (RTX 5090, torch 2.12+cu130, comfy_kitchen): a real Wan2.2-Animate-14B generation — ViTPose pose+face conditioning, 832×480, 49 frames, 20-step / cfg 6, seed 42 — comparing an NVFP4-mixed checkpoint (238 NVFP4 + 242 FP8 layers, ~15 GB) against an all-FP8 build (~18 GB). Same seed + identical conditioning → 27.2 dB mean PSNR, 0.94 pixel correlation, ~1.6% mean delta, stable across all 49 frames; outputs are visually indistinguishable and the amplified difference is high-frequency only. ~17.3 GB peak VRAM.
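
For readers who want to reproduce a comparison of this kind, here is a minimal sketch of the three metrics over flattened 8-bit pixel values. The comment does not state how the numbers were computed, so these are the standard definitions and only an assumption about the method; the function names are hypothetical.

```python
import math


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak ** 2 / mse)


def pearson(a, b):
    """Pearson correlation coefficient between two pixel sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_delta(a, b, peak=255.0):
    """Mean absolute difference as a fraction of the peak value."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * peak)
```

Averaging `psnr` over frames gives a mean PSNR; whether the reported figure was averaged per frame or computed over the whole clip is not stated in the comment.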

Scope / not yet tested: single-window animation mode, no LoRA (the unmerged WanVideoSetLoRAs path should apply to quantized weights like scaled-fp8 does, but I haven't verified it with NVFP4); block-swap with NVFP4 is untested post-fix. Checkpoints were produced with comfy's own TensorCoreNVFP4Layout.quantize, so the on-disk layout matches what MixedPrecisionOps loads.

Attachments: AB2_contact_sheet, nvfp4_vs_fp8_sidebyside.mp4
