About the word_emb for cross attention #8
Comments
Hi, we use [MASK] tokens for generation by iterative decoding, and [PAD] tokens to pad the shorter samples to a common length. The [PAD] tokens in the CLIP model can be treated in a similar manner. Since we only use the text tokens as a condition (not for generation), there is no need for [MASK] tokens on the text side.
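(As an aside, the [MASK]/[PAD] scheme described above corresponds to something like the generic masked-token sketch below. This is not the repository's code; `pad_id`, `mask_id`, and `mask_ratio` are placeholder names.)

```python
import torch

def pad_and_mask(motion_token_seqs, max_len, pad_id, mask_id, mask_ratio=0.5):
    # Pad variable-length motion-token sequences (LongTensors) to max_len with
    # pad_id, then replace a random subset of the real tokens with mask_id, as
    # in masked-token training for iterative decoding. Illustrative only.
    batch = torch.full((len(motion_token_seqs), max_len), pad_id, dtype=torch.long)
    for i, seq in enumerate(motion_token_seqs):
        batch[i, : len(seq)] = seq
    is_real = batch != pad_id
    to_mask = is_real & (torch.rand(batch.shape) < mask_ratio)
    inputs = batch.masked_fill(to_mask, mask_id)
    return inputs, batch, to_mask  # model inputs, targets, positions to predict
```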
I mean that when performing cross attention between the word embeddings (key and value) and the motion tokens (query), won't the [PAD] tokens from CLIP introduce noise into the motion tokens? Also, compared with the global text condition alone, does additionally using the fine-grained word embeddings bring a performance gain? Looking forward to your reply.
The model should learn to ignore the [PAD] tokens (following CLIP). We create a wrapper class here: lines 76 to 80 in 2f7e3b2.
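For readers who don't have the repository open, the usual pattern for such a wrapper is to run CLIP's text transformer manually so that the per-token (word) embeddings can be returned alongside the pooled sentence feature. The sketch below is a generic illustration assuming the OpenAI `clip` package, not the actual code at lines 76 to 80:

```python
import torch
import clip

class CLIPTextWrapper(torch.nn.Module):
    # Illustrative wrapper: runs CLIP's text transformer step by step so the
    # per-token (word) embeddings can be returned alongside the pooled feature.
    def __init__(self, name="ViT-B/32", device="cpu"):
        super().__init__()
        self.clip_model, _ = clip.load(name, device=device)
        self.device = device

    @torch.no_grad()
    def forward(self, texts):
        tokens = clip.tokenize(texts, truncate=True).to(self.device)  # (B, 77)
        m = self.clip_model
        x = m.token_embedding(tokens).type(m.dtype)       # (B, 77, D)
        x = x + m.positional_embedding.type(m.dtype)
        x = x.permute(1, 0, 2)                            # NLD -> LND
        x = m.transformer(x)
        x = x.permute(1, 0, 2)                            # LND -> NLD
        word_emb = m.ln_final(x).type(m.dtype)            # per-token features
        # Pooled sentence feature: take the EOT token (largest token id).
        sent_emb = word_emb[torch.arange(word_emb.shape[0]),
                            tokens.argmax(dim=-1)] @ m.text_projection
        return sent_emb, word_emb, tokens
```

Note that CLIP's tokenizer zero-fills the padded positions, so `tokens == 0` gives a padding mask if one wants to experiment with masking.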
Applying the local text embeddings shows a trade-off between R-precision and the FID score. Please see Table 9 in the supplementary material.
Thanks for your great work! I notice that the text length is usually less than 77 tokens. Why not mask the padding tokens in word_emb when performing cross attention?
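For concreteness, masking the padded positions as proposed here would amount to passing a key padding mask built from the CLIP token ids into the cross-attention layer, roughly as in this sketch (using `torch.nn.MultiheadAttention`; the shapes and names are illustrative, not the repository's implementation):

```python
import torch
import torch.nn as nn

B, T_motion, T_text, D = 2, 49, 77, 512
motion_tokens = torch.randn(B, T_motion, D)   # queries
word_emb = torch.randn(B, T_text, D)          # keys/values from CLIP
clip_tokens = torch.zeros(B, T_text, dtype=torch.long)
clip_tokens[:, :10] = torch.randint(1, 49408, (B, 10))  # pretend 10 real tokens

cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

# True where the key position should be ignored (CLIP zero-fills padding).
key_padding_mask = clip_tokens == 0

out, _ = cross_attn(query=motion_tokens, key=word_emb, value=word_emb,
                    key_padding_mask=key_padding_mask)
```

Whether such an explicit mask helps is exactly the trade-off discussed above; without it, the attention weights can still learn to down-weight the [PAD] positions.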