
Conversation

ruisizhang123 (Member) commented on Nov 24, 2025

Validate DSV3 manual bucketing when EP/TP are enabled. Tested on the DSV3-16B model. Depends on a PyTorch PR.

(Single node: batch size = 1)

| Node | Method | Parallelism | Memory | TPS | Trace |
|---|---|---|---|---|---|
| 1-Node (8×H100) | SimpleFSDP (aot_eager) | FSDP=4 EP=2 | 51.11 GiB (53.80%) | 5,136 | Link |
| 1-Node (8×H100) | FSDP2-eager | FSDP=4 EP=2 | 59.54 GiB (62.68%) | 5,942 | Link |
| 1-Node (8×H100) | SimpleFSDP (aot_eager) | FSDP=2 TP=2 EP=2 | 42.21 GiB (44.43%) | 2,285 | Link |
| 1-Node (8×H100) | FSDP2-eager | FSDP=2 TP=2 EP=2 | 45.41 GiB (47.80%) | 2,349 | Link |
| 8-Node (64×H100) | SimpleFSDP (aot_eager) | FSDP=4 EP=2 | | | Link |
| 8-Node (64×H100) | FSDP2-eager | FSDP=4 EP=2 | | | Link |
| 8-Node (64×H100) | SimpleFSDP (aot_eager) | FSDP=2 TP=2 EP=2 | | | Link |
| 8-Node (64×H100) | FSDP2-eager | FSDP=2 TP=2 EP=2 | | | Link |
1. Example Trace

[Screenshot: example profiler trace]
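For context, "manual bucketing" here means grouping the FSDP collectives of user-specified module groups into shared buckets. A minimal sketch of the idea, with illustrative module FQNs and a hypothetical plan structure (not this PR's actual API):

```python
# Hypothetical sketch: each inner list names the module FQNs whose FSDP
# all-gather/reduce-scatter should be fused into one bucket. The FQNs are
# illustrative for a DSV3-like model, not taken from this PR.
num_layers = 16
bucket_plan = [
    ["tok_embeddings"],                             # embeddings bucket
    *[[f"layers.{i}"] for i in range(num_layers)],  # one bucket per block
    ["norm", "output"],                             # final norm + LM head
]
```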

meta-cla bot added the CLA Signed label on Nov 24, 2025
ruisizhang123 marked this pull request as draft on Nov 24, 2025, 17:19
ruisizhang123 force-pushed the ruisi/fix_manual_bucketing_dsv3 branch from f931aa9 to 88b700b on Dec 11, 2025, 05:23
ruisizhang123 marked this pull request as ready for review on Dec 11, 2025, 05:24
```python
    ),
    "16B": DeepSeekV3ModelArgs(
        vocab_size=102400,
        dim=2048,
```
ruisizhang123 (Member, Author) commented on the diff above:
@tianyu-l Should we add another config option to let users turn FlexAttention on/off? Currently, FlexAttention doesn't work well with AC (activation checkpointing) here. cc @soulitzer for the AC issue follow-up!

A contributor replied:

What was the symptom?

Also, if it doesn't work, why do we add an entry for it -- is it for repro?
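A toggle like the one proposed could look like the sketch below; the field name `use_flex_attn` and its wiring are illustrative assumptions, not code from this PR:

```python
from dataclasses import dataclass

@dataclass
class DeepSeekV3ModelArgsSketch:
    # Hypothetical args; the toggle is the only point of this sketch.
    vocab_size: int = 102400
    dim: int = 2048
    # Assumed flag: fall back to SDPA while the FlexAttention + AC
    # (activation checkpointing) interaction is unresolved.
    use_flex_attn: bool = False

def attention_kernel(args: DeepSeekV3ModelArgsSketch) -> str:
    # Dispatch on the flag; real code would construct the attention module.
    return "flex_attention" if args.use_flex_attn else "sdpa"
```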

```diff
 for m in modules:
     if isinstance(m, list):
-        result.append(convert_modules_to_fqns(m, module_to_fqn_mapping))
+        if fqn_list := convert_modules_to_fqns(m, module_to_fqn_mapping):
```
A contributor commented on the diff above:

What does the syntax mean -- assigning to fqn_list and checking it's not None? It feels a bit unusual to read.

Also, please add a comment on why we need this check.
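For reference, `:=` is Python's assignment expression (the "walrus" operator, PEP 572): it binds the call's result to `fqn_list`, and the `if` then tests its truthiness, so empty results are skipped rather than appended. A self-contained sketch of the pattern; the helper body is an illustrative reconstruction, not the PR's exact code:

```python
def convert_modules_to_fqns(modules, module_to_fqn_mapping):
    """Map (possibly nested lists of) modules to their FQNs, dropping
    entries that resolve to nothing. Illustrative reconstruction."""
    result = []
    for m in modules:
        if isinstance(m, list):
            # The walrus operator binds the recursive result to fqn_list;
            # the `if` skips empty (falsy) lists so they aren't appended.
            if fqn_list := convert_modules_to_fqns(m, module_to_fqn_mapping):
                result.append(fqn_list)
        elif (fqn := module_to_fqn_mapping.get(m)) is not None:
            result.append(fqn)
    return result

# Usage: the empty sublist is dropped instead of appearing as [].
mapping = {"attn": "layers.0.attn", "mlp": "layers.0.mlp"}
print(convert_modules_to_fqns([["attn", "mlp"], []], mapping))
# -> [['layers.0.attn', 'layers.0.mlp']]
```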
