@danielvegamyhre commented Aug 14, 2025

Repro steps

  1. Checkout this PR
  2. Run rm -rf /tmp/torchinductor_${USER}; TORCH_TRACE=/tmp/repro-08-13 python benchmarks/prototype/moe_training/benchmark_scaled_grouped_mm_dq.py --compile --profile

This will benchmark and profile:

  • torch._grouped_mm forward + backward
  • torchao _scaled_grouped_mm autograd func, which applies dynamic fp8 rowwise quantization to the inputs of the grouped GEMMs, overriding torch._grouped_mm => torch._scaled_grouped_mm.

Profiles (bf16_profile.json and scaled_profile.json) will be saved to the local working directory and can be viewed in Perfetto.
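For reference, here is a minimal sketch of the kind of CUDA-event timing harness a benchmark like this typically uses (an assumed helper, not the actual code in benchmark_scaled_grouped_mm_dq.py):

```python
import torch

def bench_us(fn, *args, warmup=5, iters=50):
    # Warm up so compilation / autotuning doesn't pollute the timing.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) * 1e3)  # elapsed_time is ms; convert to us
    return sorted(times)[len(times) // 2]  # median
```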

Note: this PR is a branch off of the stack #2767; here we simply replace the Triton kernel for transposing + quantizing the expert weights with plain torch code and let Inductor codegen it.
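For illustration, a minimal sketch of what that "plain torch" transpose + rowwise fp8 quantization path might look like (function name and details are assumptions, not the exact torchao implementation):

```python
import torch

# Sketch: transpose 3D expert weights and quantize rowwise to fp8,
# letting torch.compile / Inductor codegen the fused kernel.
@torch.compile
def transpose_and_quantize_rowwise(w: torch.Tensor):
    # w: (num_experts, N, K) bf16 expert weights
    wt = w.transpose(-2, -1).contiguous()            # (E, K, N)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    amax = wt.abs().amax(dim=-1, keepdim=True)       # rowwise max |x|
    scale = (amax.float() / fp8_max).clamp(min=1e-12)
    wt_fp8 = (wt / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return wt_fp8, scale.reciprocal()                # fp8 data + reciprocal scales
```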

torch.compile for quantizing non-transposed expert weights in backward:

A_shape        B_shape             bf16_time_us    scaled_time_us  fp8_speedup
-------------  ----------------  --------------  ----------------  -------------
(16640, 5120)  (16, 8192, 5120)         10692.6           14106.8  0.758x

handwritten kernel for quantizing non-transposed expert weights in backward:

A_shape        B_shape             bf16_time_us    scaled_time_us  fp8_speedup
-------------  ----------------  --------------  ----------------  -------------
(16640, 5120)  (16, 8192, 5120)         10447.9           12657.7  0.825x

As you can see, fp8 rowwise currently results in a slowdown either way, but the slowdown is worse when using torch.compile for quantizing expert weights in the backward pass. Even the handwritten kernel is quite slow; it only achieves ~36% of peak memory bandwidth.
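For context, this is the kind of back-of-envelope arithmetic behind a "% of peak memory bandwidth" figure. The peak bandwidth and kernel time below are illustrative assumptions (e.g. H100 HBM3 ~3.35 TB/s), not measurements from this PR:

```python
E, N, K = 16, 8192, 5120                 # expert weight shape from the benchmark
bytes_moved = E * N * K * (2 + 1)        # read bf16 (2 B/elem) + write fp8 (1 B/elem)
peak_bw = 3.35e12                        # bytes/sec, assumed
measured_time_s = 1.67e-3                # hypothetical kernel time
print(f"achieved ~{bytes_moved / measured_time_s / peak_bw:.0%} of peak bandwidth")
```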

Tlparse link for the run using torch.compile for transposed expert weights quant: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpoDQmWo/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

stack-info: PR: #2733, branch: danielvegamyhre/stack/33
…kernels

stack-info: PR: #2734, branch: danielvegamyhre/stack/34
stack-info: PR: #2749, branch: danielvegamyhre/stack/35
stack-info: PR: #2756, branch: danielvegamyhre/stack/36
stack-info: PR: #2762, branch: danielvegamyhre/stack/38
…d_grouped_mm fwd+bwd against bf16

stack-info: PR: #2765, branch: danielvegamyhre/stack/40
stack-info: PR: #2767, branch: danielvegamyhre/stack/41

@pytorch-bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2768

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1904b58 with merge base 317179e:

NEW FAILURE - The following job has failed:

  • Code Analysis with Ruff / build (3.9) (gh)
    torchao/prototype/moe_training/scaled_grouped_mm.py:18:5: F401 [*] torchao.prototype.moe_training.kernels.triton_fp8_rowwise_3d_transpose_rhs imported but unused
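For what it's worth, the F401 presumably points at an import left behind after swapping the Triton kernel for plain torch code; a sketch of the usual fixes (either delete the unused import, or keep it intentionally for the repro branch and silence the lint):

```python
from torchao.prototype.moe_training.kernels import (
    triton_fp8_rowwise_3d_transpose_rhs,  # noqa: F401  (kept deliberately)
)
```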

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Aug 14, 2025
@danielvegamyhre changed the title [moe training] repro for slow inductor codegen in bwd → [not for land] repro for slow inductor codegen in bwd Aug 14, 2025
@danielvegamyhre added the topic: not user facing label Aug 14, 2025
@danielvegamyhre marked this pull request as draft August 14, 2025 01:00