perf(pipeline): implement auto-partition algorithm #2113

TXacs · 2025-12-05T01:13:00Z

Auto-Partition in torchtitan

Overview

This PR adds an automatic pipeline-partitioning method that accounts for the computation cost of the embedding layers.
The method calculates the floating-point operations (FLOPs) of the embedding layers and builds an array that holds the FLOPs of both the transformer and embedding layers. A heuristic algorithm then uses this array to find a balanced pipeline partition.
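
As a rough illustration of the cost array described above, the sketch below builds a per-layer forward FLOPs list for a llama-style model. It is a back-of-envelope approximation (2 FLOPs per weight per token for matmuls, attention-score FLOPs ignored), not the profiler used in this PR. The function names, the choice to treat the input embedding as a lookup with negligible FLOPs, and the example sizes (`ffn_dim`, `vocab_size`) are all assumptions made for illustration.

```python
def transformer_block_flops(dim: int, ffn_dim: int) -> int:
    """Approximate forward FLOPs per token for one block (matmuls only)."""
    attn = 4 * dim * dim      # q, k, v, and output projections (assumes no GQA)
    ffn = 3 * dim * ffn_dim   # gate, up, and down projections (SwiGLU-style)
    return 2 * (attn + ffn)   # one multiply and one add per weight


def build_cost_array(dim: int, ffn_dim: int, n_layers: int, vocab_size: int) -> list:
    """Return [embedding, block_1, ..., block_n, output_head] FLOPs per token."""
    embed = 0                      # assumption: a table lookup performs no matmul
    head = 2 * dim * vocab_size    # final projection onto the vocabulary
    block = transformer_block_flops(dim, ffn_dim)
    return [embed] + [block] * n_layers + [head]


# Hypothetical sizes, chosen only to show the imbalance at small hidden sizes.
costs = build_cost_array(dim=256, ffn_dim=1024, n_layers=6, vocab_size=128256)
print(costs[1], costs[-1])
```

With these assumed sizes the output projection costs roughly 31 times as much as one transformer block, which is the kind of imbalance a split by layer count alone would miss.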

Solution Architecture

Dynamic Cost Analysis
Adaptive Partitioning Algorithm
Workload Balancing
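
The partitioning step can be pictured with a minimal sketch. The code below is not the heuristic implemented in autopipe.cpp; it is a standard alternative (binary search on the bottleneck stage cost plus a greedy feasibility check) that splits a list of integer per-layer costs into at most `n_stages` contiguous groups while minimizing the most expensive group. The names `can_split` and `balanced_partition` are hypothetical.

```python
def can_split(costs, n_stages, limit):
    """Greedy check: do the costs fit into <= n_stages contiguous groups of sum <= limit?"""
    stages, current = 1, 0
    for c in costs:
        if c > limit:
            return False
        if current + c > limit:
            stages += 1
            current = c
        else:
            current += c
    return stages <= n_stages


def balanced_partition(costs, n_stages):
    """Return contiguous groups of integer costs that minimize the largest group sum."""
    lo, hi = max(costs), sum(costs)
    while lo < hi:                      # binary search on the bottleneck value
        mid = (lo + hi) // 2
        if can_split(costs, n_stages, mid):
            hi = mid
        else:
            lo = mid + 1
    # Rebuild the groups greedily for the optimal bottleneck `lo`.
    groups, current, total = [], [], 0
    for c in costs:
        if current and total + c > lo:
            groups.append(current)
            current, total = [], 0
        current.append(c)
        total += c
    groups.append(current)
    return groups


print(balanced_partition([4, 3, 3, 3, 3, 3, 3, 6], 4))
```

Note that the greedy rebuild may return fewer than `n_stages` groups when one layer dominates the total cost; a real pipeline split would need to divide the remaining groups further so that every stage receives at least one layer.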

Performance

Hardware configuration: 4x RTX 3090 (24 GB each), with a pipeline-parallel degree of 4.

llama3 configuration comparison

hidden size	layers	autopipe TPS	default TPS	Speedup
dim=256	6	31,094	29,549	+5.2%
dim=256	12	21,803	21,923	-0.5%
dim=2048	12	3,348	2,616	+28.0%
dim=4096	12	981	761	+28.9%

deepseekv3 (without MoE) configuration comparison

hidden size	layers	autopipe TPS	default TPS	Speedup
dim=256	6	13,373	13,059	+2.4%
dim=256	12	7,714	6,859	+12.5%
dim=2048	12	4,331	3,810	+13.7%
dim=4096	12	2,888	2,561	+12.8%
dim=4096	16	2,207	2,008	+9.9%
dim=8192	16	4,331	3,935	+10.1%
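
The Speedup column appears to be the relative change in TPS of autopipe over the default partition. A small sketch that recomputes it for the llama3 rows, assuming that definition (the helper name `speedup_percent` is hypothetical):

```python
def speedup_percent(autopipe_tps: float, default_tps: float) -> float:
    """Relative throughput change of autopipe over the default split, in percent."""
    return (autopipe_tps / default_tps - 1.0) * 100.0


# (hidden size, layers, autopipe TPS, default TPS) copied from the llama3 table
rows = [
    (256, 6, 31094, 29549),
    (256, 12, 21803, 21923),
    (2048, 12, 3348, 2616),
    (4096, 12, 981, 761),
]
for dim, layers, auto, base in rows:
    print(f"dim={dim} layers={layers}: {speedup_percent(auto, base):+.1f}%")
```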

1. Improve pipeline performance 2. Auto partition modules

meta-cla · 2025-12-05T01:13:06Z

Hi @TXacs!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

tianyu-l

Thanks. Is it true that the only "real" deltas are

autopipe.cpp
pipeline_parallel.py
profiler.py

tianyu-l · 2025-12-05T01:41:52Z

torchtitan/experiments/autopartition/infra/cpp/autopipe.cpp

This looks interesting. How much benefit do you get from having a C++ implementation compared with a Python one?

TXacs · 2025-12-05T01:52:41Z

> Thanks. Is it true that the only "real" deltas are
> autopipe.cpp
> pipeline_parallel.py
> profiler.py

Yes. Actually, profiler.py also reuses a file from DeepSpeed. It would be even better if TorchTitan could provide a more authoritative FLOPs calculation method in the future, so that we could also adapt it for MoE models.

meta-cla · 2025-12-05T03:08:46Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

H-Huang · 2025-12-11T16:39:38Z

torchtitan/experiments/autopartition/infra/pipeline_parallel.py

+
+    parts = pipeline(
+        mflops_list,
+        [i * 3 for i in mflops_list],  # Assume backward is 3x forward


Why is the backward cost assumed to be 3x the forward cost?
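
For context on this question: for a dense layer Y = XW, the backward pass runs two matmuls of the same size as the forward one (the gradient with respect to the input and the gradient with respect to the weights), so backward is commonly estimated at about 2x forward. A 3x factor corresponds to also rerunning the forward pass during backward, as with full activation checkpointing. Whether that is the reasoning behind the value in this PR is not stated in the thread. A minimal sketch with a hypothetical helper and made-up costs:

```python
def backward_costs(forward_costs, recompute_activations=False):
    """Estimate per-layer backward FLOPs from forward FLOPs for matmul-dominated layers."""
    # Gradient w.r.t. input + gradient w.r.t. weights: two matmuls the size of the forward one.
    ratio = 2
    if recompute_activations:
        ratio += 1  # the forward pass is rerun inside the backward pass
    return [ratio * c for c in forward_costs]


forward = [10, 20, 20, 40]  # made-up forward costs, not measured values
print(backward_costs(forward))                              # [20, 40, 40, 80]
print(backward_costs(forward, recompute_activations=True))  # [30, 60, 60, 120]
```

Under a simple model where a stage's cost is the sum of its forward and backward costs, a uniform ratio scales every layer equally and would not change which split minimizes the largest stage sum.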

H-Huang · 2025-12-11T16:41:51Z

torchtitan/experiments/autopartition/infra/pipeline_parallel.py

+    # Profile each layer's FLOPS
+    mflops_list = []
+    for _, layer in enumerate(model):
+        prof = FlopsProfiler(layer)


I guess FlopsProfiler does not estimate the backward FLOPs?

perf(pipeline): implement auto-partition algorithm

6a06ed7

1. Improve pipeline performance 2. Auto partition modules

tianyu-l reviewed Dec 5, 2025

tianyu-l requested a review from H-Huang December 5, 2025 01:59

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 5, 2025

Format to fix and add license

1f8b2f4

tianyu-l added the enhancement New feature or request label Dec 9, 2025

H-Huang reviewed Dec 11, 2025
