integrate torch._scaled_mm into Float8BlockwiseLinear and add bench script #2785
Conversation
Do we actually need these triton kernels? It seems like scaled_mm is pretty universally better. Is there a valid case for the small shapes you mentioned previously? I guess the other valid argument is that some people aren't using a new enough Triton to have this fp8 mm support.
No, scaled_mm is definitely better. I just wanted to keep the Triton gemms so that, as a side project for learning, I can try optimizing them to match scaled_mm using additional features like TMA, etc.
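For reference, here is a minimal sketch of dispatching an fp8 gemm to torch._scaled_mm directly, using rowwise scales as a simpler illustration than the blockwise scaling this PR targets. torch._scaled_mm is a private PyTorch API, and its signature and supported scale layouts vary across versions and GPUs, so treat this as an assumption-laden example rather than the PR's implementation.

```python
# Minimal illustration (not the PR's code) of an fp8 gemm via torch._scaled_mm
# with rowwise scales. torch._scaled_mm is a private API; signature and
# supported scale layouts depend on the PyTorch version and GPU.
import torch

M, K, N = 16, 4096, 4096
fp8_dtype = torch.float8_e4m3fn
fp8_max = torch.finfo(fp8_dtype).max

a_hp = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b_hp = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

# One float32 scale per row of each operand (rowwise scaling).
a_scale = (a_hp.abs().amax(dim=1, keepdim=True).float() / fp8_max).clamp(min=1e-12)
b_scale = (b_hp.abs().amax(dim=1, keepdim=True).float() / fp8_max).clamp(min=1e-12)

a_fp8 = (a_hp / a_scale).to(fp8_dtype)  # (M, K), row-major
b_fp8 = (b_hp / b_scale).to(fp8_dtype)  # (N, K), row-major

# scaled_mm wants the second operand column-major; transposing a row-major
# (N, K) tensor gives a column-major (K, N) view.
out = torch._scaled_mm(
    a_fp8,
    b_fp8.t(),
    scale_a=a_scale,      # (M, 1), float32
    scale_b=b_scale.t(),  # (1, N), float32
    out_dtype=torch.bfloat16,
)
```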
```diff
@@ -48,7 +48,7 @@ def _scaled_grouped_mm(
     """
     # TODO: Remove logging once prototype is more mature. This is currently very useful for development and debugging.
     if scaling_type == MoEScalingType.FP8_ROWWISE:
-        # print("Using fp8 rowwise scaled_grouped_mm")
+        print("Using fp8 rowwise scaled_grouped_mm")
```
!
Yeah, this shouldn't be commented out; it was mistakenly left that way in a previous PR, so I'm fixing it here.
We shouldn't be printing, though. Maybe logging, but for sure not printing.
Agreed. I originally had this as a logging statement, but changed it to a print when I noticed the logs didn't show up in torchtitan training runs while I was trying to get ScaledGroupedMMTensor working e2e, and I wanted to focus on the task at hand rather than debug whatever torchtitan is doing with the logging module.
I switched back to logging statements in this PR: #2835
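For illustration, a minimal sketch of the print-to-logging swap discussed here; the actual change landed in #2835 and may differ in detail (the helper name below is hypothetical).

```python
# Minimal sketch, assuming a module-level logger; not the exact code from #2835.
import logging

logger = logging.getLogger(__name__)

def log_scaling_path(scaling_type: str) -> None:
    # Debug-level so it stays out of normal training output unless the
    # application opts in via its logging configuration.
    logger.debug("Using %s scaled_grouped_mm", scaling_type)

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log_scaling_path("fp8 rowwise")
```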
Stacked PRs:
integrate torch._scaled_mm into Float8BlockwiseLinear and add bench script
Summary
Add a `use_triton` flag to `Float8BlockwiseLinear` that defaults to `False`. When `True`, use Triton for gemms. When `False`, use `torch._scaled_mm` for gemms.
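A hypothetical usage sketch of the flag; the import path and constructor arguments below are assumptions for illustration, not the exact torchao API.

```python
# Hypothetical usage sketch; import path and constructor kwargs are assumed.
import torch

# Assumed location of the prototype module.
from torchao.prototype.blockwise_fp8_training.linear import Float8BlockwiseLinear

# Default path: gemms dispatch to torch._scaled_mm.
fp8_linear = Float8BlockwiseLinear(
    4096, 4096, bias=False, device="cuda", dtype=torch.bfloat16
)

# Opt back into the Triton gemms (e.g. for kernel experimentation).
fp8_linear_triton = Float8BlockwiseLinear(
    4096, 4096, bias=False, device="cuda", dtype=torch.bfloat16, use_triton=True
)

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
y = fp8_linear(x)
y.sum().backward()
```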
Benchmarks
Benchmarking an eager linear forward + backward pass with llama4 shapes, using torch._scaled_mm improves perf by ~2x but is still slower than bf16. The next step is to look at the trace to see which quantization kernels are slowing things down.
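A rough sketch of the kind of eager forward + backward micro-benchmark described above; the shape, the import path, and the use of triton.testing.do_bench are assumptions rather than the PR's actual bench script.

```python
# Rough micro-benchmark sketch; shape and import path are assumptions.
import torch
from triton.testing import do_bench

# Assumed location of the prototype module.
from torchao.prototype.blockwise_fp8_training.linear import Float8BlockwiseLinear

M, K, N = 16384, 5120, 8192  # assumed llama4-like linear shape

x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16, requires_grad=True)
bf16_lin = torch.nn.Linear(K, N, bias=False, device="cuda", dtype=torch.bfloat16)
fp8_lin = Float8BlockwiseLinear(K, N, bias=False, device="cuda", dtype=torch.bfloat16)

def fwd_bwd(mod):
    # One eager forward + backward pass through the given linear module.
    y = mod(x)
    y.sum().backward()
    x.grad = None

print(f"bf16 fwd+bwd:          {do_bench(lambda: fwd_bwd(bf16_lin)):.3f} ms")
print(f"fp8 blockwise fwd+bwd: {do_bench(lambda: fwd_bwd(fp8_lin)):.3f} ms")
```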