
Commit 3e1b843

fix: pp grad accumulation is broken (#1732)
[problem]
Gradient accumulation is incompatible with the PipelineSchedule(..., scale_grads=True) option, which defaults to True. With this option set, every step scales all gradients down by the number of microbatches. That is fine for a single gradient accumulation step, but with multiple steps the total accumulated gradient gets rescaled by this factor at every step, not just once at the end of accumulation. The accumulated gradient is therefore an exponential moving average rather than a sum: the resulting gradients are much smaller than they should be, and gradient accumulation with PP is not equivalent to gradient accumulation without PP; the loss curves diverge substantially and the gradient norms are far off. A secondary consequence is that dividing every gradient by n_microbatches at each step is computationally expensive for a large model.

[solution]
Set scale_grads=False when creating the scheduler instance. Compute n_microbatches in the constructor and apply this factor, along with gradient_accumulation_steps, to the scale factor in rescale_accumulated_loss(). The loss, rather than the gradients, is then scaled at each step by the correct factor. A secondary benefit of this approach is that it avoids modifying all of the gradients, which is much cheaper computationally, and, unlike the previous behavior, it is correct. A side effect of this change is that the loss values returned by the pipeline are scaled by the same factor, making them too small by a factor of n_microbatches; this is corrected by rescaling the returned loss accordingly (summing instead of averaging the per-microbatch losses).

[testing]
Without these changes, a baseline run with 10 gradient accumulation steps on a single GPU was compared against a run (also without the changes) on a 2-GPU pipeline using 1F1B. The effective batch size is 320 in both cases, with all other variables controlled. The result is a substantial divergence between the loss curves and gradient norms of the two runs. With this change applied, the results are nearly identical, up to minor differences from non-determinism.

[references]
scale_grads option: https://github.com/pytorch/pytorch/blob/281bb56cc50073159c8418c5c99c7459c914c4db/torch/distributed/pipelining/schedules.py#L286
scale_grads implementation: https://github.com/pytorch/pytorch/blob/281bb56cc50073159c8418c5c99c7459c914c4db/torch/distributed/pipelining/stage.py#L567
Test code for reproducing the issue and testing the fix: https://github.com/jdinalt/forgather/tree/main/examples/torchtitan/test_parallelisms
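To make the failure mode concrete, here is a minimal standalone sketch (my illustration, not code from this commit) contrasting per-step gradient rescaling, which scale_grads=True effectively performs, with scaling the loss once per microbatch; the variable names are made up for the example:

# Illustration only: why rescaling the whole gradient buffer at every
# accumulation step breaks accumulation, while scaling the loss does not.
n_microbatches = 4
accum_steps = 3
per_microbatch_grad = 1.0  # pretend each microbatch contributes gradient 1.0

# Old behavior (scale_grads=True): the schedule divides the *entire*
# accumulated gradient by n_microbatches at every accumulation step.
grad_old = 0.0
for _ in range(accum_steps):
    grad_old += n_microbatches * per_microbatch_grad  # backward over microbatches
    grad_old /= n_microbatches                        # per-step rescale
# grad_old == 1.3125: an exponential moving average, not the intended average

# New behavior: leave gradients alone and scale each microbatch loss by
# 1 / (n_microbatches * accum_steps) before backward.
grad_new = 0.0
for _ in range(accum_steps):
    grad_new += n_microbatches * per_microbatch_grad / (n_microbatches * accum_steps)
# grad_new == 1.0: the correct average over the effective batch

print(grad_old, grad_new)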
1 parent: 0f34257 · commit: 3e1b843

File tree: 5 files changed, +19 −5 lines

torchtitan/components/validate.py

Lines changed: 4 additions & 1 deletion

@@ -147,7 +147,10 @@ def validate(
             # accumulate losses across pipeline microbatches
             # TODO: PP+FSDP unexpectedly puts the loss back to the CPU
             loss = (
-                torch.mean(torch.stack(losses)).to(device_type)
+                # using sum instead of mean because we already rescale the
+                # loss_fn down by a factor of n_microbatches in
+                # torchtitan/distributed/pipeline_parallel.py
+                torch.sum(torch.stack(losses)).to(device_type)
                 if self.pp_has_last_stage
                 else torch.tensor([-1.0], device=device_type)
             )
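Why summing is now correct (an illustration, not part of the diff): the loss_fn handed to the pipeline schedule is wrapped so that each per-microbatch loss is already divided by n_microbatches, so summing the returned losses reproduces the mean that the old code computed from unscaled losses:

# Illustration only: sum of pre-scaled microbatch losses == mean of raw losses.
import torch

raw_losses = torch.tensor([2.0, 4.0, 6.0, 8.0])  # hypothetical per-microbatch losses
n_microbatches = raw_losses.numel()
scaled_losses = raw_losses / n_microbatches       # what the wrapped loss_fn now returns

assert torch.allclose(torch.sum(scaled_losses), torch.mean(raw_losses))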

torchtitan/distributed/pipeline_parallel.py

Lines changed: 3 additions & 1 deletion

@@ -22,6 +22,7 @@
     ScheduleZBVZeroBubble,
 )
 
+from torchtitan.components.loss import rescale_accumulated_loss
 from torchtitan.config import JobConfig
 from torchtitan.tools.logging import logger
 
@@ -82,7 +83,8 @@ def build_pipeline_schedule(
     schedule = schedule_class(
         stages if looped_schedule else stages[0],
         n_microbatches=n_microbatches,
-        loss_fn=loss_fn,
+        loss_fn=rescale_accumulated_loss(loss_fn, n_microbatches),
+        scale_grads=False,
     )
     logger.info(
         f"Using pipeline schedule {job_config.parallelism.pipeline_parallel_schedule} "

torchtitan/experiments/deepseek_v3/train_ds_dev.py

Lines changed: 4 additions & 1 deletion

@@ -126,7 +126,10 @@ def run_full_model(
             y = pp_schedule.step(x)
         elif pp_rank == pp_size - 1:
             y = pp_schedule.step(target=label, losses=losses)
-            loss = torch.mean(torch.stack(losses))
+            # using sum instead of mean because we already rescale the
+            # loss_fn down by a factor of n_microbatches in
+            # torchtitan/distributed/pipeline_parallel.py
+            loss = torch.sum(torch.stack(losses))
         else:
             pp_schedule.step()
     else:

torchtitan/experiments/forge/example_train.py

Lines changed: 4 additions & 1 deletion

@@ -197,7 +197,10 @@ def forward_backward_step(
             # accumulate losses across pipeline microbatches
             # TODO: PP+FSDP unexpectedly puts the loss back to the CPU
             loss = (
-                torch.mean(torch.stack(losses)).to(self.device)
+                # using sum instead of mean because we already rescale the
+                # loss_fn down by a factor of n_microbatches in
+                # torchtitan/distributed/pipeline_parallel.py
+                torch.sum(torch.stack(losses)).to(self.device)
                 if self.pp_has_last_stage
                 else torch.tensor([-1.0], device=self.device)
             )

torchtitan/train.py

Lines changed: 4 additions & 1 deletion

@@ -457,7 +457,10 @@ def forward_backward_step(
             # accumulate losses across pipeline microbatches
             # TODO: PP+FSDP unexpectedly puts the loss back to the CPU
             loss = (
-                torch.mean(torch.stack(losses)).to(self.device)
+                # using sum instead of mean because we already rescale the
+                # loss_fn down by a factor of n_microbatches in
+                # torchtitan/distributed/pipeline_parallel.py
+                torch.sum(torch.stack(losses)).to(self.device)
                 if self.pp_has_last_stage
                 else torch.tensor([-1.0], device=self.device)
             )
