@danielvegamyhre commented Aug 14, 2025

Repro steps

  1. Checkout this PR
  2. Run rm -rf /tmp/torchinductor_${USER}; TORCH_TRACE=/tmp/repro-08-13 python benchmarks/prototype/moe_training/benchmark_scaled_grouped_mm_dq.py --compile --profile

This will benchmark and profile:

  • torch._grouped_mm forward + backward
  • torchao _scaled_grouped_mm autograd func, which applies dynamic fp8 rowwise quantization to the inputs of the grouped GEMMs, overriding torch._grouped_mm => torch._scaled_grouped_mm.

Profiles (bf16_profile.json and scaled_profile.json) will be saved to the local working directory and can be viewed in Perfetto.
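For reference, here is a minimal sketch of the kind of CUDA-event timing harness a benchmark like this typically uses (an assumed helper, not the actual code in benchmark_scaled_grouped_mm_dq.py):

```python
import torch

def bench_us(fn, *args, warmup=5, iters=50):
    # Warm up so compilation / autotuning doesn't pollute the timing.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) * 1e3)  # elapsed_time is ms; convert to us
    return sorted(times)[len(times) // 2]  # median
```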

Note: this PR is a branch off of the stack #2767; here we simply replace the Triton kernel for transposing + quantizing the expert weights with plain torch code and let Inductor codegen it.
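For illustration, a minimal sketch of what that "plain torch" transpose + rowwise fp8 quantization path might look like (function name and details are assumptions, not the exact torchao implementation):

```python
import torch

# Sketch: transpose 3D expert weights and quantize rowwise to fp8,
# letting torch.compile / Inductor codegen the fused kernel.
@torch.compile
def transpose_and_quantize_rowwise(w: torch.Tensor):
    # w: (num_experts, N, K) bf16 expert weights
    wt = w.transpose(-2, -1).contiguous()            # (E, K, N)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    amax = wt.abs().amax(dim=-1, keepdim=True)       # rowwise max |x|
    scale = (amax.float() / fp8_max).clamp(min=1e-12)
    wt_fp8 = (wt / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return wt_fp8, scale.reciprocal()                # fp8 data + reciprocal scales
```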

torch.compile for quantizing non-transposed expert weights in backward:

A_shape        B_shape             bf16_time_us    scaled_time_us  fp8_speedup
-------------  ----------------  --------------  ----------------  -------------
(16640, 5120)  (16, 8192, 5120)         10692.6           14106.8  0.758x

handwritten kernel for quantizing non-transposed expert weights in backward:

A_shape        B_shape             bf16_time_us    scaled_time_us  fp8_speedup
-------------  ----------------  --------------  ----------------  -------------
(16640, 5120)  (16, 8192, 5120)         10447.9           12657.7  0.825x

As you can see, fp8 rowwise currently results in a slowdown either way, but the slowdown is worse when using torch.compile for quantizing expert weights in the backward pass. Even the handwritten kernel is quite slow; it only achieves ~36% of peak memory bandwidth.
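For context, this is the kind of back-of-envelope arithmetic behind a "% of peak memory bandwidth" figure. The peak bandwidth and kernel time below are illustrative assumptions (e.g. H100 HBM3 ~3.35 TB/s), not measurements from this PR:

```python
E, N, K = 16, 8192, 5120                 # expert weight shape from the benchmark
bytes_moved = E * N * K * (2 + 1)        # read bf16 (2 B/elem) + write fp8 (1 B/elem)
peak_bw = 3.35e12                        # bytes/sec, assumed
measured_time_s = 1.67e-3                # hypothetical kernel time
print(f"achieved ~{bytes_moved / measured_time_s / peak_bw:.0%} of peak bandwidth")
```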

Tlparse link for the run using torch.compile for transposed expert weights quant: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpoDQmWo/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

stack-info: PR: #2733, branch: danielvegamyhre/stack/33
…kernels

stack-info: PR: #2734, branch: danielvegamyhre/stack/34
stack-info: PR: #2749, branch: danielvegamyhre/stack/35
stack-info: PR: #2756, branch: danielvegamyhre/stack/36
stack-info: PR: #2762, branch: danielvegamyhre/stack/38
…d_grouped_mm fwd+bwd against bf16

stack-info: PR: #2765, branch: danielvegamyhre/stack/40
stack-info: PR: #2767, branch: danielvegamyhre/stack/41

@pytorch-bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2768

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1904b58 with merge base 317179e:

NEW FAILURE - The following job has failed:

  • Code Analysis with Ruff / build (3.9) (gh)
    torchao/prototype/moe_training/scaled_grouped_mm.py:18:5: F401 [*] torchao.prototype.moe_training.kernels.triton_fp8_rowwise_3d_transpose_rhs imported but unused
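For what it's worth, the F401 presumably points at an import left behind after swapping the Triton kernel for plain torch code; a sketch of the usual fixes (either delete the unused import, or keep it intentionally for the repro branch and silence the lint):

```python
from torchao.prototype.moe_training.kernels import (
    triton_fp8_rowwise_3d_transpose_rhs,  # noqa: F401  (kept deliberately)
)
```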

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Aug 14, 2025
@danielvegamyhre changed the title [moe training] repro for slow inductor codegen in bwd → [not for land] repro for slow inductor codegen in bwd Aug 14, 2025
@danielvegamyhre added the topic: not user facing label Aug 14, 2025
@danielvegamyhre marked this pull request as draft August 14, 2025 01:00