@TXacs TXacs commented Dec 5, 2025

Auto-Partition in torchtitan

Overview

This PR provides an automatic partitioning method that accounts for the computation cost of the embedding layers.
The method calculates the floating-point operations (FLOPs) of the embedding layers and builds an array containing the FLOPs of both the transformer and embedding layers. A heuristic algorithm then uses this array to find a balanced pipeline partition.
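
As a rough illustration of the cost array described above, the sketch below builds a per-module FLOPs list for a dense decoder. The formulas are the usual back-of-the-envelope estimates for a GPT-style block with a 4x MLP expansion, not the profiler output used in this PR. `build_cost_array`, `embed_flops`, and the other names are hypothetical, and the input embedding cost is left as a caller-supplied value because the thread does not show how the profiler counts it.

```python
from typing import List


def transformer_layer_flops(tokens: int, seq_len: int, dim: int) -> int:
    """Approximate forward FLOPs of one dense transformer block.

    Assumes a 4x MLP expansion: 24 * tokens * dim^2 for the projections
    and MLP, plus 4 * tokens * seq_len * dim for the attention matmuls.
    """
    return 24 * tokens * dim * dim + 4 * tokens * seq_len * dim


def vocab_matmul_flops(tokens: int, dim: int, vocab_size: int) -> int:
    """Forward FLOPs of a (tokens x dim) @ (dim x vocab_size) matmul."""
    return 2 * tokens * dim * vocab_size


def build_cost_array(
    n_layers: int,
    tokens: int,
    seq_len: int,
    dim: int,
    vocab_size: int,
    embed_flops: int = 0,  # assumption: input embedding cost supplied by the caller
) -> List[int]:
    """Return [input embedding, block 1..n, output projection] costs."""
    block = transformer_layer_flops(tokens, seq_len, dim)
    head = vocab_matmul_flops(tokens, dim, vocab_size)
    return [embed_flops] + [block] * n_layers + [head]
```

With a large vocabulary and a small hidden size, the final entry can exceed the cost of a transformer block, which is why a split by layer count alone can leave the last stage overloaded.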

Solution Architecture

  1. Dynamic Cost Analysis
  2. Adaptive Partitioning Algorithm
  3. Workload Balancing
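
The three steps above could be sketched as follows. This is a minimal sketch that assumes a precomputed list of integer per-module costs and uses a standard technique (binary search on the bottleneck cost with a greedy feasibility check) in place of the PR's heuristic, which lives in autopipe.cpp and is not shown in this thread. The function names are hypothetical.

```python
from typing import List


def stages_needed(costs: List[int], limit: int) -> int:
    """Count the stages a greedy left-to-right fill needs under a per-stage limit."""
    stages, load = 1, 0
    for cost in costs:
        if load + cost > limit:
            stages += 1
            load = cost
        else:
            load += cost
    return stages


def balanced_partition(costs: List[int], n_stages: int) -> List[List[int]]:
    """Split modules into n_stages contiguous, non-empty groups of indices,
    minimizing the cost of the most expensive group (integer costs assumed)."""
    if not 1 <= n_stages <= len(costs):
        raise ValueError("n_stages must be between 1 and len(costs)")

    # Binary search for the smallest feasible per-stage cost limit.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(costs, mid) <= n_stages:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the split under that limit, forcing cuts near the end so
    # that every stage receives at least one module.
    parts: List[List[int]] = []
    current: List[int] = []
    load = 0
    for i, cost in enumerate(costs):
        stages_still_to_open = n_stages - len(parts) - 1
        must_cut = len(costs) - i <= stages_still_to_open
        if current and (load + cost > lo or must_cut):
            parts.append(current)
            current, load = [], 0
        current.append(i)
        load += cost
    parts.append(current)
    return parts


# Heavy first and last modules stand in for the embedding and output layers.
print(balanced_partition([4, 1, 1, 1, 1, 4], 3))  # [[0], [1, 2, 3, 4], [5]]
```

For the same input, an even split by module count would give stage costs of 5, 2, and 5, while the cost-aware split gives 4, 4, and 4.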

Performance

Hardware configuration: 4x RTX 3090 (24 GB), with a pipeline-parallel degree of 4.

llama3 configuration comparison

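| hidden size | layers | autopipe TPS | default TPS | Speedup |
|-------------|--------|--------------|-------------|---------|
| dim=256     | 6      | 31,094       | 29,549      | +5.2%   |
| dim=256     | 12     | 21,803       | 21,923      | -0.5%   |
| dim=2048    | 12     | 3,348        | 2,616       | +28.0%  |
| dim=4096    | 12     | 981          | 761         | +28.9%  |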

deepseekv3 (without MoE) configuration comparison

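| hidden size | layers | autopipe TPS | default TPS | Speedup |
|-------------|--------|--------------|-------------|---------|
| dim=256     | 6      | 13,373       | 13,059      | +2.4%   |
| dim=256     | 12     | 7,714        | 6,859       | +12.5%  |
| dim=2048    | 12     | 4,331        | 3,810       | +13.7%  |
| dim=4096    | 12     | 2,888        | 2,561       | +12.8%  |
| dim=4096    | 16     | 2,207        | 2,008       | +9.9%   |
| dim=8192    | 16     | 4,331        | 3,935       | +10.1%  |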
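
The Speedup figures in both tables are consistent with the relative change in TPS, (autopipe / default - 1) x 100, rounded to one decimal place. A small helper for reproducing them (the function name is hypothetical):

```python
def speedup_percent(autopipe_tps: float, default_tps: float) -> float:
    """Relative TPS change of autopipe over the default split, in percent."""
    return round((autopipe_tps / default_tps - 1.0) * 100.0, 1)


# llama3, dim=2048, 12 layers
print(speedup_percent(3348, 2616))    # 28.0
# llama3, dim=256, 12 layers (the one regression in the tables)
print(speedup_percent(21803, 21923))  # -0.5
```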

1. Improve pipeline performance
2. Auto partition modules

meta-cla bot commented Dec 5, 2025

Hi @TXacs!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Contributor

@tianyu-l tianyu-l left a comment

Thanks. Is it true that the only "real" deltas are

  • autopipe.cpp
  • pipeline_parallel.py
  • profiler.py

Contributor

This looks interesting. How much benefit do you get from a C++ implementation compared with a Python one?

Author

TXacs commented Dec 5, 2025

> Thanks. Is it true that the only "real" deltas are
>
>   • autopipe.cpp
>   • pipeline_parallel.py
>   • profiler.py

Yes. Actually, profiler.py is also based on a file from DeepSpeed. It would be even better if TorchTitan could provide a more authoritative FLOPs calculation method in the future, so that we could also adapt it for MoE models.

@tianyu-l tianyu-l requested a review from H-Huang December 5, 2025 01:59

meta-cla bot commented Dec 5, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 5, 2025
@tianyu-l tianyu-l added the enhancement New feature or request label Dec 9, 2025

parts = pipeline(
    mflops_list,
    [i * 3 for i in mflops_list],  # Assume backward is 3x forward
Member

why is it assumed to be 3x?
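
For context on the question above: the thread does not say where the 3x factor comes from. A common rule of thumb for dense layers is that backward costs about 2x forward, because it runs two matmuls of the forward's size (one for the input gradient and one for the weight gradient). It rises to about 3x only if the forward pass is recomputed during backward (activation checkpointing). The sketch below works through that arithmetic for a single linear layer. The function names are hypothetical, and whether the PR intends the recompute case is an assumption.

```python
def linear_forward_flops(tokens: int, d_in: int, d_out: int) -> int:
    """FLOPs of y = x @ W for x of shape (tokens, d_in), W of shape (d_in, d_out)."""
    return 2 * tokens * d_in * d_out


def linear_backward_flops(tokens: int, d_in: int, d_out: int) -> int:
    """FLOPs of the two gradient matmuls of a linear layer."""
    grad_input = 2 * tokens * d_out * d_in   # dL/dx = dL/dy @ W^T
    grad_weight = 2 * d_in * tokens * d_out  # dL/dW = x^T @ dL/dy
    return grad_input + grad_weight


def backward_to_forward_ratio(
    tokens: int, d_in: int, d_out: int, recompute_forward: bool = False
) -> float:
    """Backward/forward FLOPs ratio, optionally charging a recomputed forward."""
    fwd = linear_forward_flops(tokens, d_in, d_out)
    bwd = linear_backward_flops(tokens, d_in, d_out)
    if recompute_forward:
        bwd += fwd  # assumption: full activation checkpointing re-runs forward
    return bwd / fwd


print(backward_to_forward_ratio(8, 256, 1024))                          # 2.0
print(backward_to_forward_ratio(8, 256, 1024, recompute_forward=True))  # 3.0
```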

# Profile each layer's FLOPS
mflops_list = []
for _, layer in enumerate(model):
    prof = FlopsProfiler(layer)
Member

I guess the FlopsProfiler does not estimate the backward flops?
