
Fixes build for MXFP8 quantize #2781


Open
wants to merge 1 commit into base: main

Conversation

syed-ahmed (Contributor)

__CUDA_ARCH__ has undefined behavior in host code and should only be used in device code. Without this PR, we run into the following error:

NotImplementedError: "cat_cuda" not implemented for 'Float8_e8m0fnu'

Essentially, the mxfp8_quantize_kernel kernels are not launched.
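To make the failure mode concrete, here is a minimal sketch (illustrative only, not the torchao source; the kernel name and SM90 threshold are stand-ins): when a host-side launcher is guarded with __CUDA_ARCH__, the macro is undefined during nvcc's host compilation pass, so the launch is silently compiled out and the op falls through to an unimplemented path.

```cuda
// Illustrative only: why guarding a *host-side* launch with __CUDA_ARCH__
// silently removes the kernel launch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void quantize_demo_kernel(float* out) {
  out[threadIdx.x] = 1.0f;  // stand-in for the real MXFP8 quantize work
}

void launch_quantize_demo(float* d_out) {
// BROKEN: __CUDA_ARCH__ is only defined while nvcc compiles device code.
// This function is host code, so the condition below is false during the
// host pass and the launch disappears from the binary.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  quantize_demo_kernel<<<1, 32>>>(d_out);
#endif
}

int main() {
  float* d_out = nullptr;
  cudaMalloc(&d_out, 32 * sizeof(float));
  launch_quantize_demo(d_out);  // no-op with the broken guard above
  printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(d_out);
  return 0;
}
```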

Compilation command: TORCH_CUDA_ARCH_LIST="9.0a 10.0a 12.0a 7.5 8.0 8.6 9.0 10.0 12.0+PTX" pip install --no-build-isolation . -vvv
Before this PR, the compilation on B200 looks like:

[1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile
  --dependency-output /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o.d
  -Itorchao/csrc/cuda/mx_kernels -I/usr/local/cuda-12.8/include -I/usr/local/lib/python3.12/dist-packages/torch/include
  -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12
  -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu
  -o /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o
  -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
  --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS
  -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda
  -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0
  --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"'
  -DTORCH_EXTENSION_NAME=mxfp8_cuda
  -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a
  -gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_120a,code=sm_120a
  -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86
  -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a


In file included from /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu:3:
/opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh:29:2: warning: #warning "MXFP8 quantization requires SM90+ (Hopper) or SM100+ (Blackwell) architecture. Kernel will be disabled for this architecture." [-Wcpp]
   29 | #warning \
      |  ^~~~~~~
In file included from /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu:3:
/opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh:29:2: warning: #warning "MXFP8 quantization requires SM90+ (Hopper) or SM100+ (Blackwell) architecture. Kernel will be disabled for this architecture." [-Wcpp]


And after:

[1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile
  --dependency-output /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o.d
  -Itorchao/csrc/cuda/mx_kernels -I/usr/local/cuda-12.8/include -I/usr/local/lib/python3.12/dist-packages/torch/include
  -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12
  -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu
  -o /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o
  -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
  --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS
  -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda
  -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0
  --ftemplate-backtrace-limit=0
  -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_120,code=sm_120
  -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"'
  -DTORCH_EXTENSION_NAME=mxfp8_cuda
/opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh(489): warning #940-D: missing return statement at end of non-void function "quantize_block<OType,NUM_VALUES,ScalingMode>(float, e8m0_t &, const float (&)[NUM_VALUES], OType (&)[NUM_VALUES]) [with OType=fp8e4m3, NUM_VALUES=16, ScalingMode=ScaleCalculationMode::FLOOR]"
  }
  ^


pytorch-bot bot commented Aug 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2781

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit c2ae2eb with merge base 69e71d9:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Aug 15, 2025
danielvegamyhre (Contributor) commented Aug 15, 2025

This is odd. We run microbenchmarks and e2e training benchmarks using mxfp8 with the dim1 cast CUDA kernel, the extension builds successfully, and we have never seen that error.

To double check, I just did a fresh build of the latest commit on main with the command USE_CPP=1 pip install -e . --verbose. It built successfully, and I then benchmarked this kernel successfully (logs: https://www.internalfb.com/phabricator/paste/view/P1906184272).

I'm using torch CUDA 12.8; here are my env vars: https://www.internalfb.com/phabricator/paste/view/P1906186247

__CUDA_ARCH__ has undefined behavior in host code and should only be used in device code

I have a question about this - we use __CUDA_ARCH__ in .cuh files where the actual kernels are implemented. Does nvcc process preprocessor directives on the CPU first, before compiling the resulting source code for the device?

danielvegamyhre self-requested a review August 15, 2025 23:39
@@ -634,6 +634,10 @@ def get_extensions():
    mxfp8_src_files_exist = all(os.path.exists(f) for f in mxfp8_sources)
    if mxfp8_src_files_exist and build_for_sm100a:
        print("Building mxfp8_cuda extension")
        arch_flags = [
Contributor

In your compile logs from before this change in the PR description, I already see the proper gencodes:

...
-DTORCH_EXTENSION_NAME=mxfp8_cuda
-gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a
-gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_120a,code=sm_120a
-gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a

syed-ahmed (Contributor Author)

The proper gencodes are there, but the kernel compilation was still getting skipped... If you remove __CUDA_ARCH__, you'll see that we get other errors (e.g. instructions not available on SM90) because all of these gencodes are being passed. That's why I needed to conditionally add the gencodes through setup.py for specific source files, just as it's already done for other files in setup.py.

danielvegamyhre (Contributor)

This warning doesn't impact functionality, but it's still worth fixing; I will submit a PR:

missing return statement at end of non-void function "quantize_block

I took a look and the function incorrectly has a float return type when it should be void.
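For illustration, a hypothetical and heavily simplified version of that pattern (the real quantize_block also takes an e8m0_t & scale output and a ScalingMode parameter, which are omitted here); the fix is declaring the function void, since it only writes through its output reference:

```cuda
// Hypothetical simplification of the nvcc #940-D pattern; not the torchao code.
template <typename OType, int NUM_VALUES>
__device__ float quantize_block_before(float scale,
                                       const float (&in)[NUM_VALUES],
                                       OType (&out)[NUM_VALUES]) {
  for (int i = 0; i < NUM_VALUES; ++i) {
    out[i] = static_cast<OType>(in[i] * scale);
  }
  // No return here: nvcc warns "missing return statement at end of
  // non-void function", and using the result would be undefined behavior.
}

// Fixed: the function only writes through `out`, so `void` is correct.
template <typename OType, int NUM_VALUES>
__device__ void quantize_block_after(float scale,
                                     const float (&in)[NUM_VALUES],
                                     OType (&out)[NUM_VALUES]) {
  for (int i = 0; i < NUM_VALUES; ++i) {
    out[i] = static_cast<OType>(in[i] * scale);
  }
}
```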

danielvegamyhre (Contributor)

FYI, I just made #2782 to fix the other warning I see in your build log (warning #940-D: missing return statement at end of non-void function "quantize_block<OType,NUM_VALUES,ScalingMode>").

syed-ahmed (Contributor Author)

I forgot to mention this was with CUDA 13. Maybe with CUDA 12.8 we don't see this error (I'll verify soon). Regardless, the __CUDA_ARCH__ usage before mxfp8_quantize_kernel is undefined behavior, and "The host code (the non-GPU code) must not depend on it," per https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=__CUDA_ARCH__#virtual-architecture-macros.
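A minimal sketch of the pattern that guidance points to (names are hypothetical, not the torchao code): keep __CUDA_ARCH__ checks inside device code, and gate the launch from the host with a runtime compute-capability check instead.

```cuda
#include <cuda_runtime.h>

__global__ void mxfp8_demo_kernel(float* out) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Device code: __CUDA_ARCH__ is defined here, so the guard behaves as
  // intended and this body is compiled only for SM90+ targets.
  out[threadIdx.x] = 1.0f;
#else
  // Keep a compiled-but-unsupported path for older gencodes instead of
  // silently dropping the kernel.
  (void)out;
  __trap();
#endif
}

// Host code: never test __CUDA_ARCH__ here; ask the runtime instead.
bool launch_if_supported(float* d_out) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, /*device=*/0);
  if (prop.major < 9) {
    return false;  // caller can raise a clear "requires SM90+" error
  }
  mxfp8_demo_kernel<<<1, 32>>>(d_out);
  return cudaGetLastError() == cudaSuccess;
}
```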
