
Improve AMP stability #60

Merged

michaelmckinsey1 merged 10 commits into LBANN:main from michaelmckinsey1:fix-amp on Apr 30, 2026

Conversation

@michaelmckinsey1 (Collaborator)

  • Compute dice loss in FP32 to avoid val_dice_score=nan
  • Use BF16 to prevent numerical overflow.
  • The datatype for AMP is now centrally configurable in ScaFFold/utils/data_types.py for testing.
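
For reference, a minimal sketch of what such a central dtype switch could look like (illustrative only; the actual contents of ScaFFold/utils/data_types.py may differ):

```python
# Illustrative sketch of a central AMP dtype switch; the real
# ScaFFold/utils/data_types.py may differ.
import torch

# BF16 keeps FP32's exponent range (at the cost of mantissa bits),
# which avoids the numerical overflow seen with FP16.
AMP_DTYPE: torch.dtype = torch.bfloat16

def forward_amp(model, x):
    # Single switch point: tests can flip AMP_DTYPE without touching
    # call sites.
    with torch.autocast("cuda", dtype=AMP_DTYPE):
        return model(x)
```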

Comment thread ScaFFold/utils/evaluate.py Outdated
Comment on lines +98 to +109
# --- 2. Sharded Dice Loss ---
mask_pred_probs = F.softmax(local_preds.float(), dim=1)
mask_true_onehot = (
    F.one_hot(local_labels, n_categories + 1)
    .permute(0, 4, 1, 2, 3)
    .float()
)

# Dice loss uses probabilities
dice_score_probs = compute_sharded_dice(
    mask_pred_probs, mask_true_onehot, spatial_mesh
)
Collaborator:

It looks like this got inserted in the middle of the CE loss calc. Can you move it back to being after CE_loss = ...? That should also shrink the diff and make it clearer what the actual changes are here (not casting local_preds to float).

Collaborator (Author):

done
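
For context, a minimal self-contained sketch of the pattern under discussion: computing the Dice score in FP32 even when the forward pass runs in reduced precision (function and argument names here are illustrative, not ScaFFold's actual helpers):

```python
import torch
import torch.nn.functional as F

def dice_score_fp32(logits: torch.Tensor, labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    # Upcast before softmax: in FP16, small probabilities underflow and
    # the Dice ratio can become nan (the val_dice_score=nan this PR fixes).
    probs = F.softmax(logits.float(), dim=1)   # (N, C, D, H, W)
    onehot = (
        F.one_hot(labels, n_classes)           # (N, D, H, W, C)
        .permute(0, 4, 1, 2, 3)                # (N, C, D, H, W)
        .float()
    )
    eps = 1e-6
    intersection = (probs * onehot).sum(dim=(2, 3, 4))
    denom = probs.sum(dim=(2, 3, 4)) + onehot.sum(dim=(2, 3, 4))
    return ((2 * intersection + eps) / (denom + eps)).mean()
```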

Comment thread ScaFFold/utils/trainer.py Outdated

# Set up gradient scaler for AMP (Automatic Mixed Precision)
self.grad_scaler = torch.amp.GradScaler("cuda", enabled=self.config.torch_amp)
self.use_grad_scaler = self.config.torch_amp and self.amp_dtype == torch.float16
Collaborator:

Is the and self.amp_dtype == torch.float16 basically just catching the case where we're NOT running with bf16? Would it be better to write that explicitly, like and self.amp_dtype != torch.bfloat16?

Collaborator (Author):

Yes, basically the options are bf16 or f16. I don't think bf/f8 would work. I'm OK with making this change.
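
A sketch of the agreed change (attribute names follow the diff above; the surrounding class and the enabled= wiring are assumptions, not the PR's exact code). GradScaler only matters for FP16, since BF16 shares FP32's exponent range and its gradients don't underflow the same way:

```python
import torch

class Trainer:
    def __init__(self, config, amp_dtype: torch.dtype = torch.bfloat16):
        self.config = config
        self.amp_dtype = amp_dtype
        # Loss/gradient scaling is an FP16 workaround; skip it for BF16.
        self.use_grad_scaler = (
            self.config.torch_amp and self.amp_dtype != torch.bfloat16
        )
        # With enabled=False, scale()/step()/update() become no-ops,
        # so the training loop stays uniform across dtypes.
        self.grad_scaler = torch.amp.GradScaler(
            "cuda", enabled=self.use_grad_scaler
        )
```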

Comment thread ScaFFold/utils/trainer.py Outdated
Comment on lines +450 to +461
# 2. Sharded Dice Loss
local_preds_softmax = F.softmax(local_preds.float(), dim=1)
local_labels_one_hot = (
    F.one_hot(
        local_labels, num_classes=self.config.n_categories + 1
    )
    .permute(0, 4, 1, 2, 3)
    .float()
)
dice_scores = compute_sharded_dice(
    local_preds_softmax, local_labels_one_hot, self.spatial_mesh
)
Collaborator:

Same as in evaluate.py, I think this should stay after global_ce_sum = ...

Collaborator (Author):

done

Comment thread ScaFFold/utils/trainer.py Outdated
Comment on lines +648 to +665
# Sharded Dice Loss
local_preds_softmax = F.softmax(
    local_preds.float(), dim=1
)
local_labels_one_hot = (
    F.one_hot(
        local_labels,
        num_classes=self.config.n_categories + 1,
    )
    .permute(0, 4, 1, 2, 3)
    .float()
)
# Compute sharded dice
dice_scores = compute_sharded_dice(
    local_preds_softmax,
    local_labels_one_hot,
    self.spatial_mesh,
)
Collaborator:

Same as evaluate and warmup, this should come after global_ce_sum = ...

Collaborator (Author):

done

@michaelmckinsey1 merged commit 574c081 into LBANN:main on Apr 30, 2026
1 check passed
