- Related paper lists: [thunlp/PLMpapers], [Jiakui/awesome-bert].
- Transferring NLP models across languages and domains, [slides].
- [2017 ICML] Language Modeling with Gated Convolutional Networks, [paper], [bibtex], sources: [anantzoid/Language-Modeling-GatedCNN], [jojonki/Gated-Convolutional-Networks]. A minimal GLU sketch appears after this list.
- [2017 NIPS] Learned in Translation: Contextualized Word Vectors, [paper], [bibtex], sources: [salesforce/cove].
- [2018 ICLR] Regularizing and Optimizing LSTM Language Models, [paper], [bibtex], sources: [salesforce/awd-lstm-lm], author page: [Nitish Shirish Keskar].
- [2018 NAACL] Deep contextualized word representations, [paper], [bibtex], [homepage], sources: [allenai/bilm-tf], [HIT-SCIR/ELMoForManyLangs]. An extended application: [UKPLab/elmo-bilstm-cnn-crf].
- [2018 NeurIPS] GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations, [paper], [bibtex], sources: [YJHMITWEB/GLoMo-tensorflow].
- [2018 ArXiv] Improving Language Understanding by Generative Pre-Training, [paper], [bibtex], [homepage], sources: [openai/finetune-transformer-lm].
- [2019 AAAI] Character-Level Language Modeling with Deeper Self-Attention, [paper], [bibtex], sources: [nadavbh12/Character-Level-Language-Modeling-with-Deeper-Self-Attention-pytorch].
- [2019 NAACL] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, [paper], [bibtex], [slides], sources: [google-research/bert], [huggingface/pytorch-pretrained-BERT]. A minimal usage sketch appears after this list.
- [2019 ACL] Adaptive Attention Span in Transformers, [paper], [bibtex], sources: [facebookresearch/adaptive-span].
- [2019 ICML] BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning, [paper], [bibtex], [supplementary], sources: [AsaCooperStickland/Bert-n-Pals].
- [2019 ArXiv] GPT-2: Language Models are Unsupervised Multitask Learners, [paper], [bibtex], [homepage], sources: [openai/gpt-2].
- [2019 ICLR] What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations, [paper], [bibtex].
- [2019 ICML] MASS: Masked Sequence to Sequence Pre-training for Language Generation, [paper], [bibtex], sources: [xutaatmicrosoftdotcom/MASS].
- [2019 ACL] ERNIE: Enhanced Language Representation with Informative Entities, [paper], [bibtex], [blog], sources: [thunlp/ERNIE].
- [2019 ACL] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, [paper], [bibtex], sources: [kimiyoung/transformer-xl].
- [2019 IJCNLP] Cloze-driven Pretraining of Self-attention Networks, [paper], [bibtex].
- [2019 NeurIPS] XLNet: Generalized Autoregressive Pretraining for Language Understanding, [paper], [bibtex], [supplementary], sources: [zihangdai/xlnet].
- [2019 NeurIPS] Cross-lingual Language Model Pretraining, [paper], [bibtex], sources: [facebookresearch/XLM].
- [2019 NeurIPS] Unified Language Model Pre-training for Natural Language Understanding and Generation, [paper], [bibtex], sources: [microsoft/unilm].
- [2019 ICML] Improving Neural Language Modeling via Adversarial Training, [paper], [bibtex], sources: [ChengyueGongR/advsoft].
- [2019 ArXiv] RoBERTa: A Robustly Optimized BERT Pretraining Approach, [paper], [bibtex], sources: [pytorch/fairseq].
- [2019 ArXiv] NeZha: Neural Contextualized Representation for Chinese Language Understanding, [paper], [bibtex], sources: [huawei-noah/Pretrained-Language-Model/NEZHA].
- [2020 AAAI] K-BERT: Enabling Language Representation with Knowledge Graph, [paper], [bibtex].
- [2020 ICLR] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, [paper], [bibtex].
- [2020 ICLR] ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations, [paper], [bibtex], sources: [google-research/ALBERT].
- [2020 ICLR] FreeLB: Enhanced Adversarial Training for Natural Language Understanding, [paper], [bibtex], sources: [zhuchen03/FreeLB].
- [2020 ICLR] Improving Neural Language Generation with Spectrum Control, [paper], [bibtex].
- [2020 ACL] Emerging Cross-lingual Structure in Pretrained Language Models, [paper], [bibtex].
- [2020 ACL] MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, [paper], [bibtex], sources: [google-research/mobilebert].
- [2020 ACL] Entities as Experts: Sparse Memory Access with Entity Supervision, [paper], [bibtex].
- [2020 TACL] SpanBERT: Improving Pre-training by Representing and Predicting Spans, [paper], [bibtex], sources: [facebookresearch/SpanBERT].
- [2020 EMNLP] TinyBERT: Distilling BERT for Natural Language Understanding, [paper], [bibtex], sources: [huawei-noah/TinyBERT].
- [2020 NeurIPS] Language Through a Prism: A Spectral Approach for Multiscale Language Representations, [paper], [bibtex].
- [2020 ArXiv] Critical Thinking for Language Models, [paper], [bibtex].
- [2021 ICLR] DeBERTa: Decoding-enhanced BERT with Disentangled Attention, [paper], [bibtex], sources: [microsoft/DeBERTa].
- [2021 ArXiv] All NLP Tasks Are Generation Tasks: A General Pretraining Framework, [paper], [bibtex], sources: [THUDM/GLM].
- [2021 ArXiv] An Attention Free Transformer, [paper], [bibtex], sources: [rish-16/aft-pytorch].
- [2021 NAACL] INFOXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training, [paper], [bibtex], sources: [microsoft/infoxlm].
- [2021 EMNLP] UNKs Everywhere: Adapting Multilingual Language Models to New Scripts, [paper], [bibtex], sources: [Adapter-Hub/UNKs_everywhere].
- [2021 NeurIPS] Pay Attention to MLPs, [paper], [bibtex], sources: [rwightman/pytorch-image-models], [labmlai/annotated_deep_learning_paper_implementations], [xmu-xiaoma666/External-Attention-pytorch], [PaddleViT/gMLP].
- [2022 ArXiv] Efficient Language Modeling with Sparse all-MLP, [paper], [bibtex].
- [2022 ICLR] ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning, [paper], [bibtex], sources: [google-research/text-to-text-transfer-transformer], [tensorflow/mesh].
- [2019 ACL] How multilingual is Multilingual BERT?, [paper], [bibtex].
- [2019 ACL] What does BERT learn about the structure of language?, [paper], [bibtex], sources: [ganeshjawahar/interpret_bert].
- [2019 EMNLP] Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT, [paper], [bibtex], sources: [shijie-wu/crosslingual-nlp].
- [2019 EMNLP] How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings, [paper], [bibtex], sources: [kawine/contextual].
- [2019 ICLR] Representation Degeneration Problem in Training Natural Language Generation Models, [paper], [bibtex].
- [2019 ArXiv] What does BERT Learn from Multiple-Choice Reading Comprehension Datasets?, [paper], [bibtex].
- [2020 JMLR] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, [paper], [bibtex], sources: [google-research/text-to-text-transfer-transformer].
- [2018 NAACL] Self-Attention with Relative Position Representations, [paper], [bibtex], sources: [TensorUI/relative-position-pytorch], [tensorflow/tensor2tensor], [OpenNMT/OpenNMT-tf]. A sketch of the relative-position term appears after this list.
- [2020 ICML] Learning to Encode Position for Transformer with Continuous Dynamical Model, [paper], [bibtex], sources: [xuanqing94/FLOATER].
- [2021 ICLR] Rethinking Positional Encoding in Language Pre-training, [paper], [bibtex], sources: [guolinke/TUPE].
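For the gated convolutional language model (2017 ICML entry above), the core block is a causal convolution gated by a GLU: h = (X * W + b) ⊗ σ(X * V + c). A minimal PyTorch sketch, with illustrative channel and kernel sizes rather than the paper's hyperparameters, and no residual connections or stacking:

```python
# Sketch of one GLU block from "Language Modeling with Gated Convolutional
# Networks": h = (X * W + b) * sigmoid(X * V + c), with causal 1-D convolutions.
# Hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 4):
        super().__init__()
        self.pad = kernel_size - 1                       # left-pad only, for causality
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.gate = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                                # x: [batch, channels, time]
        x = nn.functional.pad(x, (self.pad, 0))          # no access to future positions
        return self.conv(x) * torch.sigmoid(self.gate(x))
```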
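For the BERT entry, a minimal masked-LM usage sketch. It assumes the `transformers` package (the successor of [huggingface/pytorch-pretrained-BERT]) and the public `bert-base-uncased` checkpoint; the class and checkpoint names are the library's, not something specified in the paper:

```python
# Minimal masked-LM inference with huggingface `transformers`
# (successor of pytorch-pretrained-BERT); assumes `bert-base-uncased`
# can be downloaded or is cached locally.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                      # [1, seq_len, vocab_size]

# Most likely token at the [MASK] position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))                    # e.g. "paris"
```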
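For Self-Attention with Relative Position Representations (2018 NAACL entry above), a single-head sketch of the content-to-position term: attention logits receive an extra contribution from learned embeddings indexed by the clipped relative distance j - i. Module and parameter names are illustrative and not taken from [tensorflow/tensor2tensor]:

```python
# Sketch (not the reference implementation) of relative-position self-attention:
# logits[i, j] = q_i . k_j + q_i . a_{clip(j - i)}, single head, no masking.
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.max_rel_dist = max_rel_dist
        # one learned embedding per clipped relative distance in [-k, k]
        self.rel_k = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x):                                # x: [batch, seq, d_model]
        q, k, v = self.q(x), self.k(x), self.v(x)
        scale = x.size(-1) ** 0.5
        logits = torch.matmul(q, k.transpose(-2, -1))    # content-content: [b, s, s]
        seq = x.size(1)
        pos = torch.arange(seq, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        rel_emb = self.rel_k(rel + self.max_rel_dist)    # [seq, seq, d_model]
        # content-position term: q_i . a_{ij}
        logits = logits + torch.einsum("bid,ijd->bij", q, rel_emb)
        attn = torch.softmax(logits / scale, dim=-1)
        return torch.matmul(attn, v)
```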