[draft] NVFP4 block-16 scale support for SM90 mixed-input grouped GEMM by changjonathanc · Pull Request #1 · poolsideai/cutlass

changjonathanc · 2026-06-08T18:09:14Z

Summary

Adds optional NVFP4 (e2m1 data + e4m3 scale) block-16 scaling to the SM90 warp-specialized mixed-input collectives, so weight-scaled W4A8 grouped GEMM works when the scale block size (16) is smaller than the GMMA K tile. +69/-12 across two headers; all new behaviour is gated by UseNvfp4Block16Scales/Broadcast, which is false for every pre-existing instantiation — so existing kernels are byte-for-byte unaffected.
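For reviewers less familiar with the format, here is a minimal host-side sketch of what "e2m1 data with one scale per 16 K elements" means numerically. It is not code from this PR: `decode_e2m1` and `dequant_block16` are hypothetical names, codes are stored one per byte rather than packed, and scales are passed as already-decoded floats instead of e4m3 to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one 4-bit e2m1 code (1 sign, 2 exponent, 1 mantissa bit) to float.
// The representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6.
inline float decode_e2m1(std::uint8_t code) {
  static constexpr float kMagnitude[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                          2.0f, 3.0f, 4.0f, 6.0f};
  const float mag = kMagnitude[code & 0x7];
  return (code & 0x8) ? -mag : mag;
}

// Hypothetical helper: dequantize one row of K weights where every run of
// 16 consecutive K elements shares a single scale (block-16 scaling).
// Assumption: one code per byte, scales already converted to float.
inline std::vector<float> dequant_block16(const std::vector<std::uint8_t>& codes,
                                          const std::vector<float>& scales) {
  constexpr std::size_t kBlock = 16;
  assert(codes.size() % kBlock == 0);
  assert(scales.size() == codes.size() / kBlock);
  std::vector<float> out(codes.size());
  for (std::size_t k = 0; k < codes.size(); ++k) {
    out[k] = decode_e2m1(codes[k]) * scales[k / kBlock];
  }
  return out;
}
```

With a hypothetical K tile of 64, one tile spans four such scale blocks, which is the case the stock one-scale-per-tile path cannot express.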

Currently this patch is carried out-of-tree in Forge as full-file cutlass_overrides/ copies (shadowing a second stock CUTLASS 4.4.2 submodule). That copies ~2,700 lines of stock CUTLASS to deliver these ~85 patched lines. Landing it in the fork lets Colonels drop both the overrides and the second submodule and build against this fork alone (matching how the existing _C extension already builds).

Changes

sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input.hpp

- UseNvfp4Block16Scales / ScaleAtomM / ScaleAtomK gate the new path (e2m1 A, e4m3 scale, TileK % 16 == 0); SmemLayoutAtomScale gains multiple scale columns per K tile in that case.
- Relax the size<1>(SmemLayoutAtomScale)==1 static_assert for that path only.
- Clamp grouped-GEMM init M/N/K to the tile shape; set an explicit StrideScale.
- Relax the can_implement chunk-size check to also allow TileK % chunk_size == 0 for NVFP4.
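The gating and the relaxed chunk-size check can be sketched in isolation as below. This is an illustration, not the diff itself. The tag types, `Nvfp4Block16Traits`, and `chunk_size_ok` are hypothetical names. The "one scale column per 16 K elements" count is inferred from the description. The stock condition is assumed to require `chunk_size` to be a multiple of TileK. Requires C++17.

```cpp
#include <type_traits>

// Stand-in tag types; the real collectives use CUTLASS numeric types.
struct E2M1 {};
struct E4M3 {};
struct Int4 {};

constexpr int kScaleBlockK = 16;

// Hypothetical trait mirroring the description: the new path is enabled only
// for e2m1 A with e4m3 scales and a K tile that is a multiple of 16.
template <class ElementA, class ElementScale, int TileK>
struct Nvfp4Block16Traits {
  static constexpr bool kEnabled =
      std::is_same_v<ElementA, E2M1> && std::is_same_v<ElementScale, E4M3> &&
      (TileK % kScaleBlockK == 0);
  // Assumption: one scale column per 16 K elements when enabled, else one.
  static constexpr int kScaleColumnsPerTile =
      kEnabled ? TileK / kScaleBlockK : 1;
};

// Assumed shape of the check: the stock rule accepts a scale chunk that is a
// multiple of TileK; the NVFP4 path additionally accepts a chunk that evenly
// divides TileK.
constexpr bool chunk_size_ok(int chunk_size, int tile_k, bool nvfp4) {
  if (chunk_size <= 0 || tile_k <= 0) return false;
  const bool stock = (chunk_size % tile_k == 0);
  const bool relaxed = nvfp4 && (tile_k % chunk_size == 0);
  return stock || relaxed;
}

static_assert(Nvfp4Block16Traits<E2M1, E4M3, 64>::kScaleColumnsPerTile == 4);
static_assert(!Nvfp4Block16Traits<Int4, E4M3, 64>::kEnabled);
static_assert(chunk_size_ok(16, 64, /*nvfp4=*/true));
static_assert(!chunk_size_ok(16, 64, /*nvfp4=*/false));
```

Because the trait evaluates to false for any non-e2m1/e4m3 instantiation, the sketch also shows why pre-existing kernels would keep their single scale column and the original check.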

mixed_input_utils.hpp

- UseNvfp4Block16ScaleBroadcast + get_mma_smem_layout_scale() build a broadcast MMA view over the compact scale columns (stride-0 across the 16-row atom).
- Refresh scales every k-block (not only k_block==0) when there are multiple scale columns; route scale/zero smem tensors through the broadcast layout.
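The stride-0 broadcast idea can be shown without CuTe. The sketch below uses a hand-rolled 2D strided view in place of a CuTe layout. `StridedView2D`, `make_broadcast_scale_view`, and `should_reload_scales` are hypothetical names, and the mapping of "16-row atom" onto the view's row mode is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 2D strided view: element (r, c) lives at
// data[r * stride_r + c * stride_c].
struct StridedView2D {
  const float* data;
  std::size_t rows, cols;
  std::size_t stride_r, stride_c;
  float operator()(std::size_t r, std::size_t c) const {
    assert(r < rows && c < cols);
    return data[r * stride_r + c * stride_c];
  }
};

// Hypothetical helper: expose N compactly stored scales as a 16 x N view.
// The row stride is 0, so all 16 rows alias the same stored scale and no
// expanded copy is ever materialized.
inline StridedView2D make_broadcast_scale_view(const std::vector<float>& compact) {
  return StridedView2D{compact.data(), 16, compact.size(), 0, 1};
}

// Refresh rule from the description as a predicate: with one scale column
// per K tile the scales are loaded at k_block == 0 only; with several
// columns they are reloaded on every k-block.
inline bool should_reload_scales(int k_block, int scale_columns_per_tile) {
  return k_block == 0 || scale_columns_per_tile > 1;
}
```

The caller must keep the compact scale storage alive for as long as the view is used, since the view only borrows the pointer.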

Validation

Built against this branch (CUTLASS 4.3.5 base) on H200 with the Colonels W4A8 MoE extension (no cutlass_overrides, no second submodule). On a real NVFP4 laguna-xs checkpoint, W4A8 prompt logprobs are bit-identical to the previous build that used stock 4.4.2 + the override copies (mean|Δ|=0, max|Δ|=0, Pearson=1.0), and large-batch MoE speedup is preserved (1.3–1.7× vs Marlin at M=16k–32k). So 4.3.5 is sufficient — no CUTLASS version bump required.
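For reference, the three comparison metrics quoted above (mean|Δ|, max|Δ|, Pearson) can be computed as in this sketch. It is not the script used for the validation run; `LogprobDiff` and `compare_logprobs` are hypothetical names, and the inputs are assumed to be two equal-length vectors of per-token logprobs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct LogprobDiff {
  double mean_abs;  // mean |a[i] - b[i]|
  double max_abs;   // max  |a[i] - b[i]|
  double pearson;   // Pearson correlation; NaN if either input is constant
};

// Hypothetical helper: compare two equal-length logprob vectors.
inline LogprobDiff compare_logprobs(const std::vector<double>& a,
                                    const std::vector<double>& b) {
  assert(a.size() == b.size() && !a.empty());
  const double n = static_cast<double>(a.size());
  double sum_abs = 0.0, max_abs = 0.0, mean_a = 0.0, mean_b = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = std::fabs(a[i] - b[i]);
    sum_abs += d;
    max_abs = std::max(max_abs, d);
    mean_a += a[i];
    mean_b += b[i];
  }
  mean_a /= n;
  mean_b /= n;
  double cov = 0.0, var_a = 0.0, var_b = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double da = a[i] - mean_a;
    const double db = b[i] - mean_b;
    cov += da * db;
    var_a += da * da;
    var_b += db * db;
  }
  const double pearson =
      (var_a > 0.0 && var_b > 0.0)
          ? cov / std::sqrt(var_a * var_b)
          : std::numeric_limits<double>::quiet_NaN();
  return LogprobDiff{sum_abs / n, max_abs, pearson};
}
```

A bit-identical comparison corresponds to `mean_abs == 0` and `max_abs == 0` exactly; the Pearson value is best checked against 1.0 with a small tolerance because of floating-point rounding.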

🤖 Generated with Claude Code

Adds optional NVFP4 (e2m1 data + e4m3 scale) block-16 scaling to the SM90 warp-specialized mixed-input collectives, enabling weight-scaled W4A8 grouped GEMM where the scale block size (16) is smaller than the GMMA K tile.

sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input.hpp:
- UseNvfp4Block16Scales / ScaleAtomM / ScaleAtomK gate the new path (e2m1 A, e4m3 scale, TileK % 16 == 0); SmemLayoutAtomScale gains multiple scale columns per K tile in that case.
- Relax the "size<1>(SmemLayoutAtomScale)==1" static_assert for that path.
- Clamp grouped-GEMM init M/N/K to the tile shape; set an explicit StrideScale.
- Relax can_implement chunk-size check to allow TileK % chunk_size == 0 for NVFP4.

mixed_input_utils.hpp:
- UseNvfp4Block16ScaleBroadcast + get_mma_smem_layout_scale() build a broadcast MMA view over the compact scale columns (stride-0 over the 16-row atom).
- Refresh scales every k-block (not just k_block==0) when there are multiple scale columns; route the scale/zero smem tensors through the broadcast layout.

No change to existing kernels: all new behaviour is guarded by UseNvfp4Block16Scales/Broadcast, which is false for every pre-existing instantiation. Used by the Colonels NVFP4 W4A8 MoE kernels.

changjonathanc · 2026-06-09T15:58:37Z

Superseded by the 4.4.2-based PR: building W4A8 against the 4.3.5 base is a ~5.3x perf regression vs 4.4.2 (measured on H200). Redone on a v4.4.2 base.

changjonathanc closed this Jun 9, 2026

changjonathanc mentioned this pull request Jun 9, 2026

[draft] NVFP4 block-16 scale support for SM90 mixed-input grouped GEMM (CUTLASS 4.4.2) #2

Draft

changjonathanc wants to merge 1 commit into overlapped-all-to-all-support-0d2b201e from nvfp4-block16-mixed-input
