@ruisizhang123 commented Oct 2, 2025

This is a minimal repro of the loss & perf discrepancy. We don't consider any bucketing strategy here; this PR just enables/disables `run_with_post_grad_graph` (see the sketch below). (Note: if the code runs perfectly in vLLM, my suspicion is that something is wrong with bwd.)
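For reference, the toggle this refers to looks roughly like the following (a minimal sketch; the exact placement inside torchtitan/experiments/simple_fsdp/parallelize.py is an assumption, and the flag's availability depends on the PyTorch build this PR targets):

```python
# Minimal sketch of the toggle used in this repro.
import torch._inductor.config as inductor_config

# True  -> run the post-grad FX graph directly, skipping Inductor codegen
# False -> run the Inductor-generated code (the normal compile path)
inductor_config.run_with_post_grad_graph = True
```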

1. Loss mismatch repro:

Run:

CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name llama3_simple_fsdp --compile.enable

Set `torch._inductor.config.run_with_post_grad_graph` to `True` in torchtitan/experiments/simple_fsdp/parallelize.py (run the post-grad graph directly, without Inductor-generated code).

You will see the following loss:

[rank0]:[titan] 2025-10-02 10:44:15,976 - root - INFO - step:  1  loss:  8.0413  grad_norm: 1008.2104  memory:  1.12GiB(1.18%)  tps: 2,610  tflops: 0.19  mfu: 0.02%
[rank0]:[titan] 2025-10-02 10:44:15,976 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-10-02 10:44:16,089 - root - INFO - step:  2  loss:  8.0111  grad_norm: 2087.1572  memory:  1.12GiB(1.18%)  tps: 146,261  tflops: 10.46  mfu: 1.06%
[rank0]:[titan] 2025-10-02 10:44:16,201 - root - INFO - step:  3  loss:  7.9407  grad_norm: 1824.3092  memory:  1.12GiB(1.18%)  tps: 146,725  tflops: 10.49  mfu: 1.06%
[rank0]:[titan] 2025-10-02 10:44:16,308 - root - INFO - step:  4  loss:  7.9020  grad_norm: 1709.5800  memory:  1.12GiB(1.18%)  tps: 153,378  tflops: 10.97  mfu: 1.11%
[rank0]:[titan] 2025-10-02 10:44:16,416 - root - INFO - step:  5  loss:  7.8753  grad_norm: 2048.0801  memory:  1.12GiB(1.18%)  tps: 151,236  tflops: 10.82  mfu: 1.09%
[rank0]:[titan] 2025-10-02 10:44:16,527 - root - INFO - step:  6  loss:  7.8566  grad_norm: 2583.4026  memory:  1.12GiB(1.18%)  tps: 148,250  tflops: 10.60  mfu: 1.07%
[rank0]:[titan] 2025-10-02 10:44:16,654 - root - INFO - step:  7  loss:  7.8540  grad_norm: 2905.2500  memory:  1.12GiB(1.18%)  tps: 129,816  tflops: 9.28  mfu: 0.94%
[rank0]:[titan] 2025-10-02 10:44:16,759 - root - INFO - step:  8  loss:  7.8351  grad_norm: 2924.6506  memory:  1.12GiB(1.18%)  tps: 156,990  tflops: 11.23  mfu: 1.14%
[rank0]:[titan] 2025-10-02 10:44:16,862 - root - INFO - step:  9  loss:  7.8086  grad_norm: 2809.4810  memory:  1.12GiB(1.18%)  tps: 158,688  tflops: 11.35  mfu: 1.15%
[rank0]:[titan] 2025-10-02 10:44:16,973 - root - INFO - step: 10  loss:  7.7860  grad_norm: 2678.9766  memory:  1.12GiB(1.18%)  tps: 147,914  tflops: 10.58  mfu: 1.07%

Set `torch._inductor.config.run_with_post_grad_graph` to `False` in torchtitan/experiments/simple_fsdp/parallelize.py (run with Inductor-generated code).

You will see the following loss:

[rank0]:[titan] 2025-10-02 10:47:24,769 - root - INFO - step:  1  loss:  8.0408  grad_norm:  1.4127  memory:  0.92GiB(0.97%)  tps: 1,387  tflops: 0.10  mfu: 0.01%
[rank0]:[titan] 2025-10-02 10:47:24,769 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-10-02 10:47:24,814 - root - INFO - step:  2  loss:  7.7303  grad_norm:  1.4808  memory:  0.99GiB(1.04%)  tps: 368,892  tflops: 26.38  mfu: 2.67%
[rank0]:[titan] 2025-10-02 10:47:24,851 - root - INFO - step:  3  loss:  7.0050  grad_norm:  1.8650  memory:  0.99GiB(1.04%)  tps: 441,174  tflops: 31.55  mfu: 3.19%
[rank0]:[titan] 2025-10-02 10:47:24,891 - root - INFO - step:  4  loss:  6.1054  grad_norm:  2.3362  memory:  0.99GiB(1.04%)  tps: 414,553  tflops: 29.65  mfu: 3.00%
[rank0]:[titan] 2025-10-02 10:47:24,932 - root - INFO - step:  5  loss:  5.2373  grad_norm:  2.4162  memory:  0.99GiB(1.04%)  tps: 396,680  tflops: 28.37  mfu: 2.87%
[rank0]:[titan] 2025-10-02 10:47:24,976 - root - INFO - step:  6  loss:  4.6744  grad_norm:  2.2496  memory:  0.99GiB(1.04%)  tps: 376,303  tflops: 26.91  mfu: 2.72%
[rank0]:[titan] 2025-10-02 10:47:25,017 - root - INFO - step:  7  loss:  4.3670  grad_norm:  2.1542  memory:  0.99GiB(1.04%)  tps: 406,478  tflops: 29.07  mfu: 2.94%
[rank0]:[titan] 2025-10-02 10:47:25,057 - root - INFO - step:  8  loss:  4.1888  grad_norm:  1.9387  memory:  0.99GiB(1.04%)  tps: 412,255  tflops: 29.48  mfu: 2.98%
[rank0]:[titan] 2025-10-02 10:47:25,097 - root - INFO - step:  9  loss:  4.1742  grad_norm:  1.7265  memory:  0.99GiB(1.04%)  tps: 409,988  tflops: 29.32  mfu: 2.96%
[rank0]:[titan] 2025-10-02 10:47:25,137 - root - INFO - step: 10  loss:  4.0241  grad_norm:  1.6893  memory:  0.99GiB(1.04%)  tps: 409,462  tflops: 29.28  mfu: 2.96%
2. You can reproduce the perf mismatch by running:

CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.name llama3_simple_fsdp --compile.enable
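For a quick side-by-side comparison of the two loss curves, a small helper along these lines can be used (a hypothetical script, not part of this PR; it only parses the titan log lines shown above):

```python
import re
import sys

# Matches titan log lines like:
# [rank0]:[titan] ... - INFO - step:  3  loss:  7.9407  grad_norm: ...
STEP_LOSS = re.compile(r"step:\s*(\d+)\s+loss:\s*([0-9.]+)")

def losses(path):
    """Return {step: loss} parsed from a training log file."""
    out = {}
    with open(path) as f:
        for line in f:
            m = STEP_LOSS.search(line)
            if m:
                out[int(m.group(1))] = float(m.group(2))
    return out

if __name__ == "__main__":
    # Usage: python compare_loss.py post_grad.log inductor.log
    a, b = losses(sys.argv[1]), losses(sys.argv[2])
    for step in sorted(a.keys() & b.keys()):
        print(f"step {step:3d}  post_grad={a[step]:.4f}  "
              f"inductor={b[step]:.4f}  diff={a[step] - b[step]:+.4f}")
```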
