
Fixes build for MXFP8 quantize #2781


Open
wants to merge 1 commit into base: main

Conversation

syed-ahmed (Contributor)

__CUDA_ARCH__ has undefined behavior in host code and should only be used in device code. Without this PR, we run into the following error:

NotImplementedError: "cat_cuda" not implemented for 'Float8_e8m0fnu'

Essentially, the mxfp8_quantize_kernel kernels are not launched.
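To make the failure mode concrete, here is a minimal sketch (illustrative only, not the torchao source; the kernel name and SM90 threshold are stand-ins): when a host-side launcher is guarded with __CUDA_ARCH__, the macro is undefined during nvcc's host compilation pass, so the launch is silently compiled out and the op falls through to an unimplemented path.

```cuda
// Illustrative only: why guarding a *host-side* launch with __CUDA_ARCH__
// silently removes the kernel launch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void quantize_demo_kernel(float* out) {
  out[threadIdx.x] = 1.0f;  // stand-in for the real MXFP8 quantize work
}

void launch_quantize_demo(float* d_out) {
// BROKEN: __CUDA_ARCH__ is only defined while nvcc compiles device code.
// This function is host code, so the condition below is false during the
// host pass and the launch disappears from the binary.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  quantize_demo_kernel<<<1, 32>>>(d_out);
#endif
}

int main() {
  float* d_out = nullptr;
  cudaMalloc(&d_out, 32 * sizeof(float));
  launch_quantize_demo(d_out);  // no-op with the broken guard above
  printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(d_out);
  return 0;
}
```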

Compilation command: TORCH_CUDA_ARCH_LIST="9.0a 10.0a 12.0a 7.5 8.0 8.6 9.0 10.0 12.0+PTX" pip install --no-build-isolation . -vvv
Before this PR, the compilation on B200 looks like:

[1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile
  --dependency-output /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o.d
  -Itorchao/csrc/cuda/mx_kernels -I/usr/local/cuda-12.8/include -I/usr/local/lib/python3.12/dist-packages/torch/include
  -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12
  -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu
  -o /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o
  -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
  --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS
  -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda
  -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0
  --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"'
  -DTORCH_EXTENSION_NAME=mxfp8_cuda
  -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a
  -gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_120a,code=sm_120a
  -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86
  -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a


In file included from /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu:3:
/opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh:29:2: warning: #warning "MXFP8 quantization requires SM90+ (Hopper) or SM100+ (Blackwell) architecture. Kernel will be disabled for this architecture." [-Wcpp]
   29 | #warning \
      |  ^~~~~~~
In file included from /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu:3:
/opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh:29:2: warning: #warning "MXFP8 quantization requires SM90+ (Hopper) or SM100+ (Blackwell) architecture. Kernel will be disabled for this architecture." [-Wcpp]


And after:

[1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile
  --dependency-output /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o.d
  -Itorchao/csrc/cuda/mx_kernels -I/usr/local/cuda-12.8/include -I/usr/local/lib/python3.12/dist-packages/torch/include
  -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12
  -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu
  -o /opt/pytorch/ao/build/temp.linux-x86_64-cpython-312/torchao/csrc/cuda/mx_kernels/mxfp8_cuda.o
  -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
  --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS
  -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda
  -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0
  --ftemplate-backtrace-limit=0
  -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_120,code=sm_120
  -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"'
  -DTORCH_EXTENSION_NAME=mxfp8_cuda
/opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh(489): warning #940-D: missing return statement at end of non-void function "quantize_block<OType,NUM_VALUES,ScalingMode>(float, e8m0_t &, const float (&)[NUM_VALUES], OType (&)[NUM_VALUES]) [with OType=fp8e4m3, NUM_VALUES=16, ScalingMode=ScaleCalculationMode::FLOOR]"
  }
  ^


pytorch-bot bot commented Aug 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2781

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit c2ae2eb with merge base 69e71d9:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Aug 15, 2025
danielvegamyhre (Contributor) commented Aug 15, 2025

This is odd. We run microbenchmarks and e2e training benchmarks using mxfp8 with the dim1 cast CUDA kernel, the extension builds successfully, and we have never seen that error.

To double check, I just did a fresh build of the latest commit on main with the command USE_CPP=1 pip install -e . --verbose. It built successfully, and I then benchmarked this kernel successfully (logs: https://www.internalfb.com/phabricator/paste/view/P1906184272).

I'm using torch CUDA 12.8; here are my env vars: https://www.internalfb.com/phabricator/paste/view/P1906186247

__CUDA_ARCH__ has undefined behavior in host code and should only be used in device code

I have a question about this - we use __CUDA_ARCH__ in .cuh files where the actual kernels are implemented. Does nvcc process preprocessor directives on the CPU first, before compiling the resulting source code for the device?

danielvegamyhre self-requested a review August 15, 2025 23:39
@@ -634,6 +634,10 @@ def get_extensions():
    mxfp8_src_files_exist = all(os.path.exists(f) for f in mxfp8_sources)
    if mxfp8_src_files_exist and build_for_sm100a:
        print("Building mxfp8_cuda extension")
        arch_flags = [
Contributor

In your compile logs from before this change in the PR description, I already see the proper gencodes:

...
-DTORCH_EXTENSION_NAME=mxfp8_cuda
-gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a
-gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_120a,code=sm_120a
-gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a

syed-ahmed (Contributor Author)

The proper gencodes are there, but the kernel compilation was still getting skipped... If you remove __CUDA_ARCH__, you'll see that we get other errors (e.g. instructions not available on SM90) because all of these gencodes are being passed. That's why I needed to conditionally add the gencodes through setup.py for specific source files, just as it's already done for other files in setup.py.

danielvegamyhre (Contributor)

This warning doesn't impact functionality, but it's still worth fixing; I will submit a PR:

missing return statement at end of non-void function "quantize_block

I took a look and the function incorrectly has a float return type when it should be void.
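For illustration, a hypothetical and heavily simplified version of that pattern (the real quantize_block also takes an e8m0_t & scale output and a ScalingMode parameter, which are omitted here); the fix is declaring the function void, since it only writes through its output reference:

```cuda
// Hypothetical simplification of the nvcc #940-D pattern; not the torchao code.
template <typename OType, int NUM_VALUES>
__device__ float quantize_block_before(float scale,
                                       const float (&in)[NUM_VALUES],
                                       OType (&out)[NUM_VALUES]) {
  for (int i = 0; i < NUM_VALUES; ++i) {
    out[i] = static_cast<OType>(in[i] * scale);
  }
  // No return here: nvcc warns "missing return statement at end of
  // non-void function", and using the result would be undefined behavior.
}

// Fixed: the function only writes through `out`, so `void` is correct.
template <typename OType, int NUM_VALUES>
__device__ void quantize_block_after(float scale,
                                     const float (&in)[NUM_VALUES],
                                     OType (&out)[NUM_VALUES]) {
  for (int i = 0; i < NUM_VALUES; ++i) {
    out[i] = static_cast<OType>(in[i] * scale);
  }
}
```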

danielvegamyhre (Contributor)

FYI, I just made #2782 to fix the other warning I see in your build log (warning #940-D: missing return statement at end of non-void function "quantize_block<OType,NUM_VALUES,ScalingMode>").

syed-ahmed (Contributor Author)

I forgot to mention this was with CUDA 13. Maybe with CUDA 12.8 we don't see this error (I'll verify soon). Regardless, the __CUDA_ARCH__ usage before mxfp8_quantize_kernel is undefined behavior, and "The host code (the non-GPU code) must not depend on it," per https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=__CUDA_ARCH__#virtual-architecture-macros.
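A minimal sketch of the pattern that guidance points to (names are hypothetical, not the torchao code): keep __CUDA_ARCH__ checks inside device code, and gate the launch from the host with a runtime compute-capability check instead.

```cuda
#include <cuda_runtime.h>

__global__ void mxfp8_demo_kernel(float* out) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Device code: __CUDA_ARCH__ is defined here, so the guard behaves as
  // intended and this body is compiled only for SM90+ targets.
  out[threadIdx.x] = 1.0f;
#else
  // Keep a compiled-but-unsupported path for older gencodes instead of
  // silently dropping the kernel.
  (void)out;
  __trap();
#endif
}

// Host code: never test __CUDA_ARCH__ here; ask the runtime instead.
bool launch_if_supported(float* d_out) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, /*device=*/0);
  if (prop.major < 9) {
    return false;  // caller can raise a clear "requires SM90+" error
  }
  mxfp8_demo_kernel<<<1, 32>>>(d_out);
  return cudaGetLastError() == cudaSuccess;
}
```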
