
Conversation

xmfan (Member) commented Dec 9, 2025

Stacked PRs:


[Autoparallel] Add local_map variant of DSv3 and 2D mesh AP

Currently, the AP experiment monkey patches Titan's main DSv3 implementation. This is prone to breakage from both model definition changes in titan and HOP/partitioner-related changes in core, and when it breaks, people are usually blocked until I find the root cause.

I'm going on PTO for the rest of the year, so I'm adding an integration to AP's DSv3 model in an attempt to make development more stable for the upcoming PP integration.

Test: https://gist.github.com/xmfan/db15fda1e1bc1df7cd523005fe0baf33
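
For readers new to the local_map approach, here is a minimal sketch of the general pattern under stated assumptions (an 8-GPU torchrun launch, placeholder mesh sizes, and axis names chosen to mirror the "dp_shard_in_ep" axis in the diff below); it is not this PR's code, only an illustration of wrapping a per-rank function with torch.distributed.tensor.experimental.local_map over a 2D mesh.

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.experimental import local_map


def _gate_local(tokens: torch.Tensor, gate_weight: torch.Tensor) -> torch.Tensor:
    # The body sees plain local tensors; local_map converts DTensor args to
    # local shards on the way in and wraps the output back into a DTensor.
    return (tokens @ gate_weight).softmax(dim=-1)


# 2D mesh: data-parallel shards x expert parallelism (sizes are placeholders).
world_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard", "ep"))

gate = local_map(
    _gate_local,
    out_placements=(Shard(0), Replicate()),  # output sharded along dp_shard
    in_placements=(
        (Shard(0), Replicate()),      # tokens sharded along dp_shard
        (Replicate(), Replicate()),   # gate weight replicated on both axes
    ),
    device_mesh=world_mesh,
)

Called with DTensors laid out on world_mesh, gate runs on each rank's local shard and returns a DTensor with the declared output placements, which is roughly the mechanism implied by the PR title for the local_map DSv3 variant.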

xmfan requested a review from sanketpurandare December 9, 2025 18:13
for layer in model.layers.values():
    if layer.moe_enabled:
        layer.moe.mesh = world_mesh
        layer.moe.axis_name = "dp_shard_in_ep"
Contributor

This seems okay for now, but it will need to change when we enable TP, so just add a comment noting that this should be modified when enabling TP.

Member Author

I added an assert to check for dp2ep specifically, because I don't think we're handling the mesh setup well...
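
For illustration, a dp2ep-only guard of the kind described might look roughly like the following; parallel_dims.ep_enabled is an assumed attribute name (only tp_enabled appears elsewhere in this diff), so treat this as a sketch rather than the PR's actual assert.

# Hypothetical guard (attribute names are assumptions): only accept the dp2ep
# mesh setup that this local_map integration currently knows how to handle.
assert parallel_dims.ep_enabled and not parallel_dims.tp_enabled, (
    "AutoParallel DSv3 local_map variant currently assumes dp2ep "
    "(experts over the dp_shard_in_ep axis); other mesh setups are not handled yet."
)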

loss_parallel_enabled = (
    parallel_dims.tp_enabled
    and not job_config.parallelism.disable_loss_parallel
)
assert not loss_parallel_enabled
Contributor

Instead of failing here, why not disable loss parallel and give a warning?

Member Author

Warnings are easy to miss; you won't know loss parallel is disabled if you don't spot it before titan starts dumping its logs.
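
To make the trade-off concrete, the reviewer's alternative can be sketched with the names from the snippet above (the logger setup is added for the sketch, and whether job_config is mutable at this point is an assumption):

import logging

logger = logging.getLogger(__name__)

# Reviewer's suggestion, sketched: silently force loss parallel off and warn.
if parallel_dims.tp_enabled and not job_config.parallelism.disable_loss_parallel:
    logger.warning("Loss parallel is not supported here; disabling it.")
    job_config.parallelism.disable_loss_parallel = True

# The PR keeps the hard assert instead, so a misconfiguration fails loudly up
# front rather than being silently overridden and buried in the log output.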

Contributor

sanketpurandare left a comment

Just some minor comments that can be addressed optionally.

xmfan merged commit 995154f into main Dec 13, 2025
5 checks passed