Conversation

@SherlockNoMad commented on Oct 3, 2025

This is an e2e prototype that runs llama3-simplefsdp through the export-style aot_autograd workflow.

Setup: shard_dp = 2, tp = 4.
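
For reference, this layout corresponds to a 2D device mesh over the 8 GPUs used in the repro command below (data_parallel_shard_degree=2 × tensor_parallel_degree=4). A minimal sketch using the standard init_device_mesh API; the dim names are illustrative, and torchtitan builds this mesh internally:

```python
# Illustrative only: 2 (dp_shard) x 4 (tp) = 8 ranks, matching NGPU=8 below.
from torch.distributed.device_mesh import init_device_mesh

mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard", "tp"))
dp_mesh = mesh_2d["dp_shard"]  # simpleFSDP shards parameters along this dim
tp_mesh = mesh_2d["tp"]        # tensor parallelism is applied along this dim
```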

MVP

  • [Done] Start with a simpleFSDP model, enable TP + FSDP
  • [Done] Apply aot_export_joint_with_descriptors on the parallelized module with DTensor inputs to get the joint graph
  • [Done] Apply min_cut_partitioner to get the forward and backward graph modules (fw_gm, bw_gm); see the sketch after this list
  • [Need verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder/group the communication collectives
  • [In Progress] Run the fw and bw graphs with the JointGraphRunner
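
A minimal sketch of the export + partition steps above. The aot_export_joint_with_descriptors call pattern, the joint.graph_module attribute, and num_fwd_outputs are assumptions and may not match the pytorch/pytorch#164557 branch exactly; this only illustrates the intended flow:

```python
# Sketch only: trace a TP + simpleFSDP module to a joint fw/bw graph, then
# split it with the min-cut partitioner. Exact signatures may differ.
import contextlib

import torch
from torch._functorch.aot_autograd import aot_export_joint_with_descriptors
from torch._functorch.partitioners import min_cut_rematerialization_partitioner


def export_and_partition(parallel_mod: torch.nn.Module, dtensor_args):
    with contextlib.ExitStack() as stack:
        # Export a single joint forward+backward graph, keeping DTensor inputs
        # as graph inputs (assumed call pattern).
        joint = aot_export_joint_with_descriptors(stack, parallel_mod, dtensor_args)
        joint_gm = joint.graph_module  # assumed attribute name

    # The min-cut partitioner decides which activations to save vs. recompute
    # and returns separate forward and backward GraphModules (fw_gm / bw_gm).
    fw_gm, bw_gm = min_cut_rematerialization_partitioner(
        joint_gm, dtensor_args, num_fwd_outputs=1  # placeholder output count
    )
    return fw_gm, bw_gm
```

The prefetch/bucketing passes would then run over fw_gm and bw_gm before they are handed to the JointGraphRunner.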

Issues

Repro steps:
This needs to be run on top of pytorch/pytorch#164557.
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

Sample output:
P1975157784: rank0_autograd_function_0fea2786.py
P1975158481: rank1_autograd_function_28587623.py

@SherlockNoMad changed the title from Joint Graph Runner to JointGraph-based Training Prototype on Oct 3, 2025