Clear output of Torch SDPA for masked pieces #360
Merged
Description
Since Torch 2.1, the Torch memory-efficient SDPA GPU kernel returns NaN for pieces that are completely masked out. This leads to NaN propagation in the next attention layer, because masked pieces get an attention weight of zero, but zero times NaN is still NaN.
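The propagation can be seen with a tiny example (illustrative only, not code from this PR):

```python
import torch

# Even with an attention weight of zero, a NaN value coming out of the
# previous layer poisons the weighted sum for every query that attends
# over the sequence.
weights = torch.tensor([0.0, 1.0])          # the masked piece gets weight 0
values = torch.tensor([float("nan"), 2.0])  # NaN output for the masked piece
print((weights * values).sum())             # tensor(nan), not tensor(2.)
```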
This change fixes the issue by setting the representations of masked pieces to zero, clearing out any NaNs.
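A minimal sketch of the workaround, assuming self-attention with a boolean padding mask of shape (batch, 1, 1, key_len) where True marks pieces that can be attended to; the helper name and shapes are illustrative, not the actual code of this PR:

```python
import torch
import torch.nn.functional as F

def sdpa_with_nan_cleanup(
    query: torch.Tensor,  # (batch, heads, seq_len, head_dim)
    key: torch.Tensor,    # (batch, heads, key_len, head_dim)
    value: torch.Tensor,  # (batch, heads, key_len, head_dim)
    mask: torch.Tensor,   # (batch, 1, 1, key_len), bool, True = attend
) -> torch.Tensor:
    out = F.scaled_dot_product_attention(query, key, value, attn_mask=mask)
    # The memory-efficient kernel returns NaN for queries whose rows are
    # fully masked out. In self-attention the mask's key dimension aligns
    # with the output's sequence dimension, so transposing the mask to
    # (batch, 1, key_len, 1) broadcasts over heads and head_dim and zeroes
    # exactly the masked pieces.
    return out.masked_fill(mask.transpose(-1, -2).logical_not(), 0.0)
```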
We currently rely on the query dimension of the mask to be singular, but in the future we should probably redesign the AttentionMask class to account for the differences between attention masks and causal masks.

Checklist