
Are there errors in transformers mha.py? #216

Open
@lqzzy

Description


In the file `labml_nn/transformer/mha.py`:

    def forward(self, x: torch.Tensor):
        # Input has shape `[seq_len, batch_size, d_model]` or `[batch_size, d_model]`.
        # We apply the linear transformation to the last dimension and split that into
        # the heads.
        head_shape = x.shape[:-1]

        # Linear transform
        x = self.linear(x)

        # Split last dimension into heads
        x = x.view(*head_shape, self.heads, self.d_k)

        # Output has shape `[seq_len, batch_size, heads, d_k]` or `[batch_size, heads, d_model]`
        return x

I have two questions.

The first:

        # Input has shape `[seq_len, batch_size, d_model]` or `[batch_size, d_model]`.
        # Output has shape `[seq_len, batch_size, heads, d_k]` or `[batch_size, heads, d_model]`

I think the output shape should be `[seq_len, batch_size, heads, d_k]` or `[batch_size, heads, d_k]`.
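A quick shape check of what I mean for the 2-D input case (a standalone sketch with made-up sizes, assuming `d_model = heads * d_k` as in the module):

    import torch
    import torch.nn as nn

    batch_size, d_model, heads = 2, 512, 8
    d_k = d_model // heads  # 64

    # Input without a sequence dimension: `[batch_size, d_model]`
    x = torch.randn(batch_size, d_model)
    # The linear transform applied before splitting the last dimension into heads
    linear = nn.Linear(d_model, heads * d_k)

    head_shape = x.shape[:-1]  # (batch_size,)
    out = linear(x).view(*head_shape, heads, d_k)

    # torch.Size([2, 8, 64]) -> `[batch_size, heads, d_k]`, not `[batch_size, heads, d_model]`
    print(out.shape)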

The second:

class MultiHeadAttention(nn.Module):
    r"""
    <a id="MHA"></a>

    ## Multi-Head Attention Module

    This computes scaled multi-headed attention for given `query`, `key` and `value` vectors.

    $$\mathop{Attention}(Q, K, V) = \underset{seq}{\mathop{softmax}}\Bigg(\frac{Q K^\top}{\sqrt{d_k}}\Bigg)V$$

    In simple terms, it finds keys that match the query and gets the values of
     those keys.

    It uses dot-product of query and key as the indicator of how matching they are.
    Before taking the $softmax$ the dot-products are scaled by $\frac{1}{\sqrt{d_k}}$.
    This is done to avoid large dot-product values causing softmax to
    give very small gradients when $d_k$ is large.

    Softmax is calculated along the axis of the sequence (or time).
    """

    def __init__(self, heads: int, d_model: int, dropout_prob: float = 0.1, bias: bool = True):
        """
        * `heads` is the number of heads.
        * `d_model` is the number of features in the `query`, `key` and `value` vectors.
        """

        super().__init__()

        # Number of features per head
        self.d_k = d_model // heads
        # Number of heads
        self.heads = heads

        # These transform the `query`, `key` and `value` vectors for multi-headed attention.
        self.query = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias)
        self.key = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias)
        self.value = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=True)

        # Softmax for attention along the time dimension of `key`
        self.softmax = nn.Softmax(dim=1)

        # Output layer
        self.output = nn.Linear(d_model, d_model)
        # Dropout
        self.dropout = nn.Dropout(dropout_prob)
        # Scaling factor before the softmax
        self.scale = 1 / math.sqrt(self.d_k)

        # We store attentions so that it can be used for logging, or other computations if needed
        self.attn = None

I think that in the code

        # Softmax for attention along the time dimension of `key`
        self.softmax = nn.Softmax(dim=1)

`dim` should be `-1`.

The output of the query/key preparation should be `[seq_len, batch_size, heads, d_k]`, and after the dot-product with the keys the result should be `[seq_len, batch_size, heads, seq_len]`, so the key dimension is the last one and I think `dim` should be `-1`.
I also have trouble with the case where the output is `[batch_size, heads, d_k]`: I have no idea what shape the result is after the dot-product, or what `dim` should be then.
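Here is a minimal sketch of the shapes I have in mind (illustrative only, with made-up sizes; I am assuming the scores are formed by a dot-product over `d_k` with the key positions ending up in the last dimension, which may not be how `mha.py` actually arranges them):

    import torch

    seq_len, batch_size, heads, d_k = 5, 2, 4, 8

    # Query and key after the preparation step, assumed to be
    # `[seq_len, batch_size, heads, d_k]`
    q = torch.randn(seq_len, batch_size, heads, d_k)
    k = torch.randn(seq_len, batch_size, heads, d_k)

    # Dot-product over `d_k`, keeping the key positions as the last dimension
    scores = torch.einsum('ibhd,jbhd->ibhj', q, k)
    # torch.Size([5, 2, 4, 5]) -> `[seq_len, batch_size, heads, seq_len]`
    print(scores.shape)

    # With this layout, the softmax over the keys is along the last axis
    attn = scores.softmax(dim=-1)
    # Each row sums to 1 over the key positions
    print(attn.sum(dim=-1))

Under this layout, `dim=1` would normalize over `batch_size` rather than over the key positions.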
