Conversation

@SherlockNoMad commented on Oct 3, 2025

This is an e2e prototype that runs llama3-simplefsdp through the export-style aot_autograd workflow.

Setup: shard_dp = 2, tp = 4.
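
For reference, this layout corresponds to a 2D device mesh over the 8 GPUs used in the repro command below (data_parallel_shard_degree=2 × tensor_parallel_degree=4). A minimal sketch using the standard init_device_mesh API; the dim names are illustrative, and torchtitan builds this mesh internally:

```python
# Illustrative only: 2 (dp_shard) x 4 (tp) = 8 ranks, matching NGPU=8 below.
from torch.distributed.device_mesh import init_device_mesh

mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard", "tp"))
dp_mesh = mesh_2d["dp_shard"]  # simpleFSDP shards parameters along this dim
tp_mesh = mesh_2d["tp"]        # tensor parallelism is applied along this dim
```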

MVP

  • [Done] Start with a simpleFSDP model, enable TP + FSDP
  • [Done] Apply aot_export_joint_with_descriptors on the parallelized module with DTensor inputs to get the joint graph
  • [Done] Apply min_cut_partitioner to get the forward and backward graph modules (fw_gm, bw_gm); see the sketch after this list
  • [Need verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder/group the communication collectives
  • [In Progress] Run the fw and bw graphs with the JointGraphRunner
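
A minimal sketch of the export + partition steps above. The aot_export_joint_with_descriptors call pattern, the joint.graph_module attribute, and num_fwd_outputs are assumptions and may not match the pytorch/pytorch#164557 branch exactly; this only illustrates the intended flow:

```python
# Sketch only: trace a TP + simpleFSDP module to a joint fw/bw graph, then
# split it with the min-cut partitioner. Exact signatures may differ.
import contextlib

import torch
from torch._functorch.aot_autograd import aot_export_joint_with_descriptors
from torch._functorch.partitioners import min_cut_rematerialization_partitioner


def export_and_partition(parallel_mod: torch.nn.Module, dtensor_args):
    with contextlib.ExitStack() as stack:
        # Export a single joint forward+backward graph, keeping DTensor inputs
        # as graph inputs (assumed call pattern).
        joint = aot_export_joint_with_descriptors(stack, parallel_mod, dtensor_args)
        joint_gm = joint.graph_module  # assumed attribute name

    # The min-cut partitioner decides which activations to save vs. recompute
    # and returns separate forward and backward GraphModules (fw_gm / bw_gm).
    fw_gm, bw_gm = min_cut_rematerialization_partitioner(
        joint_gm, dtensor_args, num_fwd_outputs=1  # placeholder output count
    )
    return fw_gm, bw_gm
```

The prefetch/bucketing passes would then run over fw_gm and bw_gm before they are handed to the JointGraphRunner.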

Issues

Repro steps:
This needs to be run on top of pytorch/pytorch#164557.
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

Sample output:
P1975157784: rank0_autograd_function_0fea2786.py
P1975158481: rank1_autograd_function_28587623.py

@SherlockNoMad changed the title from Joint Graph Runner to JointGraph-based Training Prototype on Oct 3, 2025