[moe] brings batch/sequence-wise load balance loss #2061
Conversation
…d seq-wise aux loss for load balance
torchtitan/train.py
Outdated
    job_config, parallel_dims=parallel_dims, ft_manager=self.ft_manager
)

self.loss_fn = functools.partial(
We can add a condition here to decide whether to wrap the loss for MoE. All models in torchtitan currently return a single output, so this is fine for now.
If we subsume this MoE loss wrapper into build_loss_fn, we can avoid adding the logic here.
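One way the suggested gating could look. This is a hypothetical sketch, not torchtitan's actual code: the function name build_wrapped_loss and the is_moe flag are assumptions; the only idea taken from the thread is that MoE models return (output, aux_loss) while dense models return a single output, so the wrapper can be chosen once at build time.

```python
# Hypothetical sketch: choose at build time whether the loss needs to
# unpack an MoE (output, aux_loss) pair, instead of branching per step.
def build_wrapped_loss(base_loss_fn, is_moe):
    def moe_loss(pred_and_aux, labels):
        pred, aux_loss = pred_and_aux          # MoE models return (output, aux_loss)
        return base_loss_fn(pred, labels) + aux_loss
    return moe_loss if is_moe else base_loss_fn
```

The dense path is untouched: when is_moe is False the original loss function is returned unchanged.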
wwwjn
left a comment
Thank you! @shuhuayu is working on a more formal review, and I have some housekeeping comments.
torchtitan/config/job_config.py
Outdated
@dataclass
class ExtraLosses:
This section is specifically for the MoE load-balancing loss for now. Do you foresee any other loss-related params being used here? If not, let's make the name more descriptive and specific.
Follow-up here: should we merge these configs into the Model dataclass?
shuhuayu
left a comment
Thanks a lot for the PR @rakkit! Made some comments here.
torchtitan/train.py
Outdated
    job_config, parallel_dims=parallel_dims, ft_manager=self.ft_manager
)

self.loss_fn = functools.partial(
If we subsume this MoE loss wrapper into build_loss_fn, we can avoid adding the logic here.
Thanks a lot for the feedback, @wwwjn @shuhuayu (sorry for the late update)! Summary of new changes:
Be aware that PP with the aux loss still does not work.
torchtitan/models/moe/moe.py
Outdated
        self.load_balance_loss_weight,
    )
else:
    load_balance_loss = torch.tensor(0.0, device=out.device, dtype=out.dtype)
As far as I can see, out is not defined in this scope yet.
Fixed, thanks :)
@staticmethod
def sequence_wise_aux_loss(
    scores: torch.Tensor,
    indices: torch.Tensor,
This will use the biased topk(scores + expert_bias) instead of the unbiased topk(scores) from DSv3 Eq. 18.
Nope, that's top_scores.
Ah yeah, scores is the raw sigmoid output. But isn't indices (= selected_experts_indices) derived as topk(scores + expert_bias)?
Hmm, good question. Need to think about this.
I think you might be right; in Eq. 18 the topk doesn't have the bias.
Thanks. I fixed this and reran the two aux-loss types and the no-aux-loss case in the PR description.
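For reference, the quantity under discussion is the DeepSeek-V3 sequence-wise auxiliary balance loss (Eqs. 17-20). Below is a minimal illustrative sketch in plain Python, not torchtitan's implementation: it assumes post-sigmoid affinity scores and, per the thread's conclusion, uses the unbiased top-k over raw scores (no expert_bias term) when counting expert loads.

```python
def sequence_wise_aux_loss(scores, top_k, alpha):
    """Illustrative DSv3-style sequence-wise balance loss for one sequence.

    scores: T x N list of per-token expert affinities (post-sigmoid),
    top_k: experts selected per token, alpha: loss weight.
    Returns alpha * sum_i f_i * P_i (Eq. 17).
    """
    T = len(scores)
    N = len(scores[0])
    f = [0.0] * N  # per-expert load fraction (Eq. 18)
    P = [0.0] * N  # per-expert mean normalized affinity (Eqs. 19-20)
    for row in scores:
        # Unbiased top-k on raw scores -- no expert_bias here (Eq. 18).
        topk = sorted(range(N), key=lambda i: row[i], reverse=True)[:top_k]
        denom = sum(row) or 1.0  # normalize affinities across experts
        for i in topk:
            f[i] += N / (top_k * T)
        for i in range(N):
            P[i] += row[i] / denom / T
    return alpha * sum(fi * pi for fi, pi in zip(f, P))
```

Under perfectly uniform routing each f_i is 1 and each P_i is 1/N, so the loss reduces to alpha; imbalanced routing pushes it above that floor, which is what the regularizer penalizes.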
This is a draft PR for:
For now, it only applies to the DeepSeek model, but I can add it to all other MoE models at the end.
(Also, we don't log the aux loss, but I can add an optimizer hook to do this if you want.)
The main concern is that the aux loss does not work well with PP. From what I have tested, it works only with 1F1B; it is broken for ZBV and interleaved 1F1B.
To test it:

sequence-wise (default):

CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --training.extra_losses.load_balance_loss_weight=0.001

batch-wise (need to pick this in ModelArgs):

CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --training.extra_losses.load_balance_loss_weight=0.001

turned off:

CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh