@ruisizhang123 commented Oct 2, 2025

This is a minimal repro of the loss & perf discrepancy. We don't consider any bucketing strategy here; this PR just enables/disables `run_with_post_grad_graph` (see the sketch below). (Note: if the code runs perfectly in vLLM, my suspicion is that something is wrong with bwd.)
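For reference, the toggle this refers to looks roughly like the following (a minimal sketch; the exact placement inside torchtitan/experiments/simple_fsdp/parallelize.py is an assumption, and the flag's availability depends on the PyTorch build this PR targets):

```python
# Minimal sketch of the toggle used in this repro.
import torch._inductor.config as inductor_config

# True  -> run the post-grad FX graph directly, skipping Inductor codegen
# False -> run the Inductor-generated code (the normal compile path)
inductor_config.run_with_post_grad_graph = True
```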

1. Loss mismatch repro:

Run:

CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name llama3_simple_fsdp --compile.enable

Set `torch._inductor.config.run_with_post_grad_graph` to `True` in torchtitan/experiments/simple_fsdp/parallelize.py (run the post-grad graph directly, without Inductor-generated code).

You will see the following loss:

[rank0]:[titan] 2025-10-02 10:44:15,976 - root - INFO - step:  1  loss:  8.0413  grad_norm: 1008.2104  memory:  1.12GiB(1.18%)  tps: 2,610  tflops: 0.19  mfu: 0.02%
[rank0]:[titan] 2025-10-02 10:44:15,976 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-10-02 10:44:16,089 - root - INFO - step:  2  loss:  8.0111  grad_norm: 2087.1572  memory:  1.12GiB(1.18%)  tps: 146,261  tflops: 10.46  mfu: 1.06%
[rank0]:[titan] 2025-10-02 10:44:16,201 - root - INFO - step:  3  loss:  7.9407  grad_norm: 1824.3092  memory:  1.12GiB(1.18%)  tps: 146,725  tflops: 10.49  mfu: 1.06%
[rank0]:[titan] 2025-10-02 10:44:16,308 - root - INFO - step:  4  loss:  7.9020  grad_norm: 1709.5800  memory:  1.12GiB(1.18%)  tps: 153,378  tflops: 10.97  mfu: 1.11%
[rank0]:[titan] 2025-10-02 10:44:16,416 - root - INFO - step:  5  loss:  7.8753  grad_norm: 2048.0801  memory:  1.12GiB(1.18%)  tps: 151,236  tflops: 10.82  mfu: 1.09%
[rank0]:[titan] 2025-10-02 10:44:16,527 - root - INFO - step:  6  loss:  7.8566  grad_norm: 2583.4026  memory:  1.12GiB(1.18%)  tps: 148,250  tflops: 10.60  mfu: 1.07%
[rank0]:[titan] 2025-10-02 10:44:16,654 - root - INFO - step:  7  loss:  7.8540  grad_norm: 2905.2500  memory:  1.12GiB(1.18%)  tps: 129,816  tflops: 9.28  mfu: 0.94%
[rank0]:[titan] 2025-10-02 10:44:16,759 - root - INFO - step:  8  loss:  7.8351  grad_norm: 2924.6506  memory:  1.12GiB(1.18%)  tps: 156,990  tflops: 11.23  mfu: 1.14%
[rank0]:[titan] 2025-10-02 10:44:16,862 - root - INFO - step:  9  loss:  7.8086  grad_norm: 2809.4810  memory:  1.12GiB(1.18%)  tps: 158,688  tflops: 11.35  mfu: 1.15%
[rank0]:[titan] 2025-10-02 10:44:16,973 - root - INFO - step: 10  loss:  7.7860  grad_norm: 2678.9766  memory:  1.12GiB(1.18%)  tps: 147,914  tflops: 10.58  mfu: 1.07%

Set `torch._inductor.config.run_with_post_grad_graph` to `False` in torchtitan/experiments/simple_fsdp/parallelize.py (run with Inductor-generated code).

You will see the following loss:

[rank0]:[titan] 2025-10-02 10:47:24,769 - root - INFO - step:  1  loss:  8.0408  grad_norm:  1.4127  memory:  0.92GiB(0.97%)  tps: 1,387  tflops: 0.10  mfu: 0.01%
[rank0]:[titan] 2025-10-02 10:47:24,769 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-10-02 10:47:24,814 - root - INFO - step:  2  loss:  7.7303  grad_norm:  1.4808  memory:  0.99GiB(1.04%)  tps: 368,892  tflops: 26.38  mfu: 2.67%
[rank0]:[titan] 2025-10-02 10:47:24,851 - root - INFO - step:  3  loss:  7.0050  grad_norm:  1.8650  memory:  0.99GiB(1.04%)  tps: 441,174  tflops: 31.55  mfu: 3.19%
[rank0]:[titan] 2025-10-02 10:47:24,891 - root - INFO - step:  4  loss:  6.1054  grad_norm:  2.3362  memory:  0.99GiB(1.04%)  tps: 414,553  tflops: 29.65  mfu: 3.00%
[rank0]:[titan] 2025-10-02 10:47:24,932 - root - INFO - step:  5  loss:  5.2373  grad_norm:  2.4162  memory:  0.99GiB(1.04%)  tps: 396,680  tflops: 28.37  mfu: 2.87%
[rank0]:[titan] 2025-10-02 10:47:24,976 - root - INFO - step:  6  loss:  4.6744  grad_norm:  2.2496  memory:  0.99GiB(1.04%)  tps: 376,303  tflops: 26.91  mfu: 2.72%
[rank0]:[titan] 2025-10-02 10:47:25,017 - root - INFO - step:  7  loss:  4.3670  grad_norm:  2.1542  memory:  0.99GiB(1.04%)  tps: 406,478  tflops: 29.07  mfu: 2.94%
[rank0]:[titan] 2025-10-02 10:47:25,057 - root - INFO - step:  8  loss:  4.1888  grad_norm:  1.9387  memory:  0.99GiB(1.04%)  tps: 412,255  tflops: 29.48  mfu: 2.98%
[rank0]:[titan] 2025-10-02 10:47:25,097 - root - INFO - step:  9  loss:  4.1742  grad_norm:  1.7265  memory:  0.99GiB(1.04%)  tps: 409,988  tflops: 29.32  mfu: 2.96%
[rank0]:[titan] 2025-10-02 10:47:25,137 - root - INFO - step: 10  loss:  4.0241  grad_norm:  1.6893  memory:  0.99GiB(1.04%)  tps: 409,462  tflops: 29.28  mfu: 2.96%
2. You can reproduce the perf mismatch by running:

CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.name llama3_simple_fsdp --compile.enable
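For a quick side-by-side comparison of the two loss curves, a small helper along these lines can be used (a hypothetical script, not part of this PR; it only parses the titan log lines shown above):

```python
import re
import sys

# Matches titan log lines like:
# [rank0]:[titan] ... - INFO - step:  3  loss:  7.9407  grad_norm: ...
STEP_LOSS = re.compile(r"step:\s*(\d+)\s+loss:\s*([0-9.]+)")

def losses(path):
    """Return {step: loss} parsed from a training log file."""
    out = {}
    with open(path) as f:
        for line in f:
            m = STEP_LOSS.search(line)
            if m:
                out[int(m.group(1))] = float(m.group(2))
    return out

if __name__ == "__main__":
    # Usage: python compare_loss.py post_grad.log inductor.log
    a, b = losses(sys.argv[1]), losses(sys.argv[2])
    for step in sorted(a.keys() & b.keys()):
        print(f"step {step:3d}  post_grad={a[step]:.4f}  "
              f"inductor={b[step]:.4f}  diff={a[step] - b[step]:+.4f}")
```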
