2019/12/10: We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).
Core Code:
- Code for parallel representation learning: fairseq/models/combine_transformer.py
- Code for combining convolution and self-attention: fairseq/modules/multihead_attention.py
- Code for acceleration ("bm" stands for big matrix): fairseq/models/transformer_bm.py
Relevant links:
- Arxiv pdf: https://arxiv.org/abs/1911.09483
- Pre-trained models as well as instructions for training: examples/parallel_intersected_multi-scale_attention(Prime)/README.md
- Reddit post link
About the paper:
TL;DR: A simple module consistently outperforms stand-alone self-attention and the Transformer model on major NMT datasets, achieving state-of-the-art performance.
We ask three questions:
- Is attention alone good enough?
- Is parallel representation learning applicable to sequence data and tasks?
- How can we design a module that combines the inductive biases of both convolution and self-attention?
We find that stand-alone self-attention has shortcomings, and we present a new module that maps the input to a hidden space and performs three operations in parallel: self-attention, convolution, and a nonlinearity. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on major NMT tasks under the standard setting.
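To make the idea concrete, below is a minimal PyTorch sketch of a block that runs self-attention, a depthwise convolution, and a point-wise feed-forward nonlinearity in parallel on a shared projection of the input. This is only an illustration of the concept, not the released implementation (see fairseq/modules/multihead_attention.py and fairseq/models/combine_transformer.py for that); the dimensions, the single kernel size, and the way the branches are fused here are simplifying assumptions.

```python
import torch
import torch.nn as nn


class ParallelAttentionConvBlock(nn.Module):
    """Self-attention, a depthwise convolution and a point-wise feed-forward
    branch run in parallel on a shared projection of the input; the three
    outputs are summed. Illustrative sketch only, not the released code."""

    def __init__(self, d_model=512, n_heads=8, kernel_size=3, d_ffn=2048, dropout=0.1):
        super().__init__()
        # Shared projection into the hidden space: the attention and the
        # convolution branches both read from (and write back through) it.
        self.shared_in = nn.Linear(d_model, d_model)
        self.shared_out = nn.Linear(d_model, d_model)

        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)

        # Depthwise convolution captures local patterns; varying kernel_size
        # across layers/heads would give the multi-scale behaviour.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)

        # Point-wise feed-forward branch supplies the nonlinearity.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))

        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # x: (seq_len, batch, d_model), the layout nn.MultiheadAttention expects
        residual = x
        x = self.norm(x)
        h = self.shared_in(x)  # shared hidden space for attention and convolution

        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        conv_out = self.conv(h.permute(1, 2, 0)).permute(2, 0, 1)  # (S,B,D)->(B,D,S)->(S,B,D)
        ffn_out = self.ffn(x)

        # Fuse the parallel branches by summation and project back.
        out = self.shared_out(attn_out + conv_out) + ffn_out
        return residual + self.dropout(out)


# Quick shape check
block = ParallelAttentionConvBlock()
y = block(torch.randn(20, 4, 512))  # -> torch.Size([20, 4, 512])
```

In the paper, sharing the projection that feeds the attention and convolution branches (rather than giving each branch its own) is reported as the key to making the combination work; the multi-scale behaviour comes from varying the convolution kernel size, which this sketch fixes to a single value for brevity.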
Key features:
- Designs a multi-branch scheme built around self-attention and, via the proposed shared projection, is the first to successfully combine convolution and self-attention in one module for sequence tasks.
- Achieves SOTA on three major translation datasets: WMT14 En-Fr, WMT14 En-De, and IWSLT14 De-En.
- Learns sequence representations in parallel, and thus has potential for acceleration.
Results:
- Outperforms previous models on large NMT datasets, and also scales to small datasets and the base model setting.
- The shared projection is key to combining convolution and self-attention; the model generates better long sequences and has potential for acceleration.
Task | Model size | Test BLEU
---|---|---
IWSLT14 De-En | Base | 36.3
WMT14 En-De | Large | 29.9
WMT14 En-Fr | Large | 43.5
Requirements:
- PyTorch version >= 1.0.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
- Tested with torch==1.3.1 and CUDA 10.0
Installing from source:
To install from source and develop locally:
pip install --editable . --user
We provide pre-trained models and detailed training and evaluation instructions in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.
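For orientation, evaluation follows the standard fairseq-0.6.2 generation workflow; the command below is only a sketch (the data directory and checkpoint path are placeholders), and the linked README is authoritative for the released models.

```bash
# Illustrative only -- see the linked README for the exact commands and flags.
python generate.py data-bin/wmt14_en_de \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --remove-bpe
```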
Please cite as:
@article{zhao2019muse,
title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
journal={arXiv preprint arXiv:1911.09483},
year={2019}
}
The code is based on fairseq-0.6.2.