
Conversation

@xmfan (Member) commented Dec 10, 2025

Stacked PRs:


PP initialization calls apply_compile multiple times, once per PP stage, but apply_compile does some global patching. So I added an already_patched check to avoid patching the same method multiple times (sketch below).

If we patch multiple times, the second pass wraps _run_experts_grouped_mm_dynamic in another torch.compile(fullgraph=True), leading to the error in the issue below.

FIXES #2124
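
For context, a minimal sketch of the patching pattern and the guard, under stated assumptions: the already_patched check and the _run_experts_grouped_mm / _run_experts_grouped_mm_dynamic names come from the diff below, while the body of apply_compile and the wrapper's signature are simplified illustrations, not the actual torchtitan code.

    import torch

    def apply_compile(moe_module):
        # PP setup calls apply_compile once per stage, but the patch below mutates
        # a shared attribute, so it must only run once.
        already_patched = (
            "_run_experts_grouped_mm_dynamic"
            in moe_module._run_experts_grouped_mm.__qualname__
        )
        if not already_patched:
            # compile the original grouped-mm expert path once
            compiled = torch.compile(
                moe_module._run_experts_grouped_mm, fullgraph=True
            )

            def _run_experts_grouped_mm_dynamic(*args, **kwargs):
                # assume, for illustration, the token tensor is the first arg;
                # its dim 0 (tokens per step) varies, so mark it dynamic up front
                torch._dynamo.mark_dynamic(args[0], 0)
                return compiled(*args, **kwargs)

            # Without the guard, a second call would re-wrap this method in
            # torch.compile(fullgraph=True), hitting the error in #2124.
            moe_module._run_experts_grouped_mm = _run_experts_grouped_mm_dynamic

On a second call the qualname check sees _run_experts_grouped_mm_dynamic and the patching block is skipped.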

Comment on lines +575 to +580
# Patch some globals only once (apply_compile is called multiple times for PP setup)
already_patched = (
    "_run_experts_grouped_mm_dynamic"
    in moe_module._run_experts_grouped_mm.__qualname__
)
if not already_patched:
Contributor:
This sounds like a temp workaround. Will there be a "permanent" solution?

@xmfan (Member, Author) Dec 10, 2025:

do you mean (1) the need to mark dynamic, or (2) the need to define a globally patched method?

(1) afaik, marking dynamic is the permanent solution to avoid an initial recompile (see the sketch after this list)
(2) patching was chosen to avoid writing this into the model code. Two alternatives:

  • we could mark the outputs of token dispatch as dynamic when EP is enabled
  • we could have a global parallelize function for PP to hold code that can only run once
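
To illustrate point (1), a tiny standalone sketch (a toy function, not the torchtitan expert path) of how marking a dimension dynamic before the first call avoids the recompile that a changed token count would otherwise trigger:

    import torch

    @torch.compile(fullgraph=True)
    def run_tokens(x: torch.Tensor) -> torch.Tensor:
        # stand-in for the grouped-mm expert computation
        return x * 2

    x = torch.randn(16, 8)
    # dim 0 is the per-step token count; declare it dynamic before compiling
    torch._dynamo.mark_dynamic(x, 0)
    run_tokens(x)                   # compiles once with a symbolic dim 0
    run_tokens(torch.randn(64, 8))  # different token count, no recompile

Without the mark_dynamic call, the second invocation would recompile, since automatic dynamic shapes only kick in after the first shape change is observed.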

@tianyu-l (Contributor) left a comment:


we should add a test to the integration tests to guard against the repro in #2124

I'm a bit confused why we were not hitting this error in https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/models.py#L96

    num_tokens_per_expert: torch.Tensor,
) -> torch.Tensor:
    # dynamic number of tokens in expert parallel
    torch._dynamo.mark_dynamic(x, 0)
Contributor:

maybe not relevant to this PR: how are you going to deal with dynamism in the AOT approach?

@xmfan (Member, Author) Dec 11, 2025:

depends on the finalized API; you could do explicit dynamic-shape annotations like the one here, and error during guard evaluation when unexpected dynamic shapes are encountered (see the sketch below)
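
For instance, a rough sketch of explicit dynamic-shape annotation in an ahead-of-time flow using torch.export; the Experts module and the num_tokens bounds are made-up illustrations, and this is not necessarily the finalized AOT API being discussed:

    import torch
    from torch.export import Dim, export

    class Experts(torch.nn.Module):  # hypothetical stand-in module
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * 2

    # declare the token dimension as dynamic, with an explicit allowed range
    num_tokens = Dim("num_tokens", min=2, max=8192)
    ep = export(
        Experts(),
        (torch.randn(16, 8),),
        dynamic_shapes={"x": {0: num_tokens}},
    )

    ep.module()(torch.randn(64, 8))    # ok: dim 0 falls within the declared range
    # ep.module()(torch.randn(16, 9))  # raises: dim 1 was exported as static

Shapes outside the declared ranges (or dynamism that was never annotated) fail the exported program's guard checks at call time rather than silently recompiling.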


Closes #2124: DeepSeekV3 model component is not compilable when using Interleaved1F1B pipeline parallelism
