Jiltseb/awesome_speech_papers

About This Repository


This repository is for those who want to study or research speech tasks (speech recognition, speech synthesis, and so on).


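This page was created for newbies who want to study and research speech-related tasks (speech recognition, speech synthesis, etc.).
Rather than including as many papers as possible, it aims to include only important papers (sufficiently highly cited, carried out by reputable institutions,
and mostly published at top conferences) that are also recent. (The selection may be subjective.)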

It has suddenly turned into a grab bag of miscellaneous items.


temporal (training schemes or undefined)

  • Don't Decay the Learning Rate, Increase the Batch Size paper
  • When Does Label Smoothing Help? paper
  • Bag of Tricks for Efficient Text Classification paper
  • SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition paper
  • State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions paper

Index

  • 1.End-to-End Speech Recognition papers

    • CTC-based ASR papers
    • Attention-based ASR papers
    • Hybrid ASR papers
    • RNN-T based ASR papers
    • Streaming ASR papers
  • 2.End-to-End Speech Synthesis papers

  • 3.End-to-End Non-Autoregressive Sequence Generation papers

    • ASR
    • NMT
    • TTS
  • 4.End-to-End Spoken Language Understanding

    • Intent Classification papers
    • Spoken Question Answering papers
    • Speech Emotion Recognition papers
  • 5.Self-Supervised(or Semi-Supervised) Learning for Speech

  • TBC

    • Voice Conversion
    • Speaker Identification
    • MIR ?
    • Rescoring
    • Speech Translation



1. End-to-End Speech Recognition

1.1 CTC based ASR model

< Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin >


| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2006 | ICML | IDSIA, TU Munich | Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks | CTC | paper | - |
| 2014 | - | - | Deep speech: Scaling up end-to-end speech recognition | - | - | - |
| 2016 | ICML | - | Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin | CTC-based CNN model | paper | code(pytorch) |
| 2019 | Interspeech | Nvidia | Jasper: An End-to-End Convolutional Neural Acoustic Model | - | - | - |
| 2019 | - | Nvidia | Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions | - | - | - |

1.2 Attention based ASR model

< Listen, Attend and Spell >


| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2008 | - | - | Supervised Sequence Labelling with Recurrent Neural Networks | - | - | - |
| 2014 | ICML | - | Towards End-to-End Speech Recognition with Recurrent Neural Networks | - | - | - |
| 2015 | NIPS | - | Attention-Based Models for Speech Recognition | Seq2Seq | - | - |
| 2015 | ICASSP | Google | Listen, Attend and Spell | Seq2Seq | paper | code(pytorch) |
| 2016 | - | - | End-to-End Attention-based Large Vocabulary Speech Recognition | - | - | - |
| 2017 | ICLR | - | Monotonic Chunkwise Attention | - | - | - |
| 2018 | ICASSP | - | Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition | - | - | - |
| 2019 | - | - | Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR | - | - | - |
| 2019 | - | - | A Comparative Study on Transformer vs RNN in Speech Applications | - | paper | - |
| 2019 | - | - | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | - | paper | - |
| 2020 | - | Google | Conformer: Convolution-augmented Transformer for Speech Recognition | - | paper | - |

1.3 Hybrid Model

| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2019 | - | - | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | - | paper | - |

1.4 RNN-T based ASR model

< Streaming E2E Speech Recognition For Mobile Devices >


| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2012 | - | - | Sequence Transduction with Recurrent Neural Networks | - | - | - |
| 2018 | ICASSP | Google | Streaming E2E Speech Recognition For Mobile Devices | - | paper | - |
| 2018 | - | Google | Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer | - | - | - |
| 2019 | - | Google | Improving RNN Transducer Modeling for End-to-End Speech Recognition | - | - | - |
| 2019 | - | - | Self-Attention Transducers for End-to-End Speech Recognition | - | - | - |
| 2020 | ICASSP | - | Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss | - | - | - |
| 2020 | ICASSP | - | A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency | - | - | - |
| 2021 | ICASSP | - | FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization | - | - | - |
| 2021 | ICASSP | - | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | - | - | - |
| 2020 | - | Google | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | - | paper | - |

1.5 Streaming ASR

< Two-Pass End-to-End Speech Recognition >


| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2019 | - | Google | Two-Pass End-to-End Speech Recognition | LAS+RNN-T | paper | - |

1.6 ASR Rescoring / Spelling Correction (2-pass decoding)


temporal

| year | conference | research organization | title | model | task | link | code |
|---|---|---|---|---|---|---|---|
| 2019 | - | - | Automatic Speech Recognition Errors Detection and Correction | - | - | - | - |
| 2019 | - | - | A Spelling Correction Model For E2E Speech Recognition | - | - | - | - |
| 2019 | - | - | An Empirical Study Of Efficient ASR Rescoring With Transformers | - | - | - | - |
| 2019 | - | - | Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition | - | - | - | - |
| 2019 | - | - | Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model | - | - | - | - |
| 2019 | - | - | Effective Sentence Scoring Method Using BERT for Speech Recognition | - | asr | - | - |
| 2019 | - | - | Spelling Error Correction with Soft-Masked BERT | - | nlp | - | - |
| 2019 | - | - | Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition | - | asr | - | - |


2. End-to-End Speech Synthesis

< Tacotron: Towards End-to-End Speech Synthesis >



| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2016 | - | Deepmind | WaveNet: A Generative Model for Raw Audio | - | paper | - |
| 2017 | ICLR | - | SampleRNN: An Unconditional End-to-End Neural Audio Generation Model | - | paper | code(official) |
| 2017 | ICLR | Montreal Univ, CIFAR | Char2Wav: End-to-End Speech Synthesis | - | paper | - |
| 2017 | PMLR | Baidu Research | Deep Voice: Real-time Neural Text-to-Speech | - | paper | - |
| 2017 | NIPS | Baidu Research | Deep Voice 2: Multi-Speaker Neural Text-to-Speech | - | paper | - |
| 2017 | - | Baidu Research | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning | - | paper | code |
| 2017 | - | Google | Tacotron: Towards End-to-End Speech Synthesis | - | paper | code(tensorflow), code(pytorch) |
| 2017 | ICML | - | Emotional End-to-End Neural Speech Synthesizer | - | - | - |
| 2018 | ICML | - | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | - | - | - |
| 2018 | ICML | - | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | - | - | - |
| 2021 | ICLR | Google Research | Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling | - | paper | - |
| 2018 | - | - | Adversarial Audio Synthesis | GAN | paper | code(official, tensorflow) |
| 2019 | ICASSP | Nvidia | WaveGlow: a Flow-based Generative Network for Speech Synthesis | - | paper | code(official, pytorch) |
| 2019 | - | - | Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | - | paper | - |
| 2019 | NIPS | Microsoft, Zhejiang University | FastSpeech: Fast, Robust and Controllable Text to Speech | - | paper | - |
| 2020 | - | Microsoft, Zhejiang University | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | - | paper | - |
| 2020 | NIPS | Kakao Enterprise, SNU | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | - | paper | - |
| 2020 | ICASSP | - | Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow | - | paper | - |
| 2019 | AAAI | - | Neural Speech Synthesis with Transformer Network | - | paper | - |
| 2017 | - | - | Parallel WaveNet: Fast High-Fidelity Speech Synthesis | - | - | - |
| 2018 | - | - | WaveGlow: A Flow-based Generative Network for Speech Synthesis | - | - | - |
| 2020 | ICASSP | - | Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis | - | - | - |


3. End-to-End Non-Autoregressive Sequence Generation Model


Because there are still few papers on non-autoregressive models, this section covers machine translation (NMT), speech recognition (STT), and speech synthesis (TTS) together.

3.1 Non-Autoregressive(NA) NMT

< NON-AUTOREGRESSIVE NEURAL MACHINE TRANSLATION >


< Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior >


| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2018 | ICLR | The University of Hong Kong | NON-AUTOREGRESSIVE NEURAL MACHINE TRANSLATION | - | - | - |
| 2020 | - | Google | Non-Autoregressive Machine Translation with Latent Alignments | - | - | - |
| 2020 | - | CMU | FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | - | - | - |
| 2020 | - | CMU, Berkeley, Peking University | Fast Structured Decoding for Sequence Models | - | - | - |
| 2019 | ACL | - | Non-autoregressive Transformer by Position Learning | - | - | - |
| 2020 | - | - | ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation | - | - | - |
| 2019 | - | University of Tokyo, FAIR, MILA, NYU | Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior | - | - | - |

3.2 Non-Autoregressive(NA) ASR (STT)

< Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict >


< Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition >


| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2020 | Interspeech | - | Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict | CTC-based | - | - |
| 2020 | Interspeech | - | Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition | CTC-based | - | - |
| 2020 | - | - | A Study of Non-autoregressive Model for Sequence Generation | - | - | - |

3.3 Non-Autoregressive(NA) Speech Synthesis (TTS)

| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2020 | - | Baidu Research | Non-Autoregressive Neural Text-to-Speech | - | - | - |



4. End-to-End Spoken Language Understanding


A conventional Spoken Language Understanding (SLU) system takes speech as input: an ASR module outputs text,
and a Natural Language Understanding (NLU) module takes that text as input and outputs results such as emotion or intent (intent, slot).

End-to-End Spoken Language Understanding (SLU) takes speech as input and outputs the result directly.
The goal is to train in a fully differentiable way without being bound by the error rate of the speech recognition network.
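
The difference between the two designs can be sketched in a few lines of Python. Everything below is a toy stand-in invented for illustration: the function names (`toy_asr`, `noisy_asr`, `toy_nlu`, `cascade_slu`, `end_to_end_slu`), the energy threshold, and the two intents are assumptions and do not come from any of the papers listed here. The sketch only shows how a recognition error propagates through the cascade, while the end-to-end mapping has no intermediate transcript to corrupt.

```python
from typing import Callable, Sequence

Audio = Sequence[float]


def energy(audio: Audio) -> float:
    """Mean squared amplitude; a crude stand-in for real acoustic features."""
    return sum(x * x for x in audio) / max(len(audio), 1)


def toy_asr(audio: Audio) -> str:
    """Hypothetical ASR: loud audio -> 'on' command, quiet audio -> 'off' command."""
    return "turn on the lights" if energy(audio) > 0.5 else "turn off the lights"


def noisy_asr(audio: Audio) -> str:
    """Hypothetical ASR that misrecognizes the word 'on' as 'own'."""
    return toy_asr(audio).replace(" on ", " own ")


def toy_nlu(text: str) -> str:
    """Hypothetical keyword-based NLU operating on the transcript only."""
    words = text.split()
    if "on" in words:
        return "lights_on"
    if "off" in words:
        return "lights_off"
    return "unknown"


def cascade_slu(audio: Audio,
                asr: Callable[[Audio], str] = toy_asr,
                nlu: Callable[[str], str] = toy_nlu) -> str:
    """Conventional pipeline: audio -> transcript -> intent.
    Any ASR mistake is handed to NLU unchanged."""
    return nlu(asr(audio))


def end_to_end_slu(audio: Audio, threshold: float = 0.5) -> str:
    """End-to-end stand-in: audio -> intent with no intermediate text.
    In a real system this mapping would be a trained network."""
    return "lights_on" if energy(audio) > threshold else "lights_off"


loud = [1.0, -1.0, 1.0]
print(cascade_slu(loud))                 # lights_on
print(cascade_slu(loud, asr=noisy_asr))  # unknown (ASR error propagated)
print(end_to_end_slu(loud))              # lights_on
```

In a real end-to-end system the fixed threshold would be replaced by parameters learned from (audio, intent) pairs, which is what makes the whole path differentiable.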

( Conventional Pipeline for Spoken Language Understanding ( ASR -> NLU ) )


( End-to-End Spoken Language Understanding Network )


< Towards End-to-end Spoken Language Understanding >


4.1 Dataset ( including all speech slu dataset IC/SF/SQA ... )

  • Intent Classification (IC)
  • Spoken Question Answering (SQA)
  • Emotion Recognition (ER)

| task | dataset name | language | year | conference | title | paper link | dataset link |
|---|---|---|---|---|---|---|---|
| - | SLURP | english | 2020 | EMNLP | SLURP: A Spoken Language Understanding Resource Package | paper | dataset |
| IC | Fluent Speech Command(FSC) | english | 2019 | Interspeech | Speech Model Pre-training for End-to-End Spoken Language Understanding | paper | dataset |
| IC | SNIPS | english | 2018 | - | Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | paper | - |
| IC | ATIS | english | 1999 | - | The ATIS Spoken Language Systems Pilot Corpus | paper | - |
| IC | TOP or Facebook Semantic Parsing System (FSPS) | - | 2019 | - | Semantic Parsing for Task Oriented Dialog using Hierarchical Representations | paper | - |
| SQA | Spoken SQuAD(SSQD) | english | 2018 | Interspeech | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | paper | dataset |
| SQA | Spoken CoQA | english | 2020 | - | Towards Data Distillation for End-to-end Spoken Conversational Question Answering | paper | dataset |
| SQA | ODSQA | chinese | 20- | - | ODSQA: Open-domain spoken question answering dataset | - | - |
| ER | IEMOCAP | english | 2017 | - | IEMOCAP: Interactive emotional dyadic motion capture database | paper | dataset |
| ER | CMU-MOSEI | english | 2018 | - | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | paper | dataset |

4.2 Intent Classification (IC)

| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2018 | ICASSP | Facebook, MILA | Towards End-to-end Spoken Language Understanding | - | paper | - |
| 2019 | Interspeech | MILA, CIFAR | Speech Model Pre-training for End-to-End Spoken Language Understanding | - | paper | code(official) |

4.3 Spoken Question Answering (SQA)

| year | conference | research organization | title | model | link | code |
|---|---|---|---|---|---|---|
| 2018 | Interspeech | - | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | dataset | paper | github |

4.4 Emotion Recognition (ER)




5. Self-Supervised(or Semi-Supervised) Learning for Speech


Self-supervised (or semi-supervised) learning is one of the hottest topics in deep learning as of 2020, to the point that Yann LeCun has emphasized it.
It learns from large amounts of unlabeled data in a self-supervised (or semi-supervised) way to find better representations of the input.
A network pre-trained this way is then fine-tuned in a task-specific manner for other tasks such as speech recognition.

Pre-training methods have long existed in many forms, from autoencoders to BERT, but papers on approaches suited to speech have only been proposed recently,
and networks trained this way achieve higher performance than networks trained from scratch.
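
As one concrete example of a self-supervised objective, wav2vec 2.0 (listed below) pre-trains with a contrastive loss: the model has to pick the true target for a masked time step out of a set of distractors. The following is a minimal sketch of such an InfoNCE-style loss, assuming plain Python lists as vectors, cosine similarity, and a temperature of 0.1. The function names and the toy vectors are assumptions for illustration; this is not the fairseq implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context: Vector,
                     positive: Vector,
                     distractors: List[Vector],
                     temperature: float = 0.1) -> float:
    """InfoNCE-style loss: negative log-probability of the positive target
    under a softmax over similarities to the positive and the distractors."""
    candidates = [positive, *distractors]
    logits = [cosine(context, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


context = [1.0, 0.0]
# Context agrees with the true target: the loss is close to zero.
good = contrastive_loss(context, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Context agrees with a distractor instead: the loss is large.
bad = contrastive_loss(context, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(good < bad)  # True
```

No labels are involved: the positive target comes from the same unlabeled utterance as the context, which is what allows pre-training on untranscribed audio before fine-tuning on a labeled task.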

< wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations >


| year | conference | research organization | title | link | code |
|---|---|---|---|---|---|
| 2019 | - | Facebook AI Research (FAIR) | wav2vec: Unsupervised Pre-training for Speech Recognition | paper | code(official) |
| 2019 | - | FAIR | Unsupervised Cross-lingual Representation Learning at Scale | - | - |
| 2019 | ICLR | FAIR | vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | paper | code(official) |
| 2020 | - | FAIR | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | paper | code(official) |
| 2020 | - | FAIR | Unsupervised Cross-lingual Representation Learning for Speech Recognition | paper | - |
| 2019 | - | Deepmind | Learning robust and multilingual speech representations | paper | - |
| - | - | - | SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering | paper | - |
| - | - | - | Self-Supervised Representations Improve End-to-End Speech Translation | paper | - |
| - | - | - | Unsupervised Pretraining Transfers Well Across Languages | - | - |
| - | - | - | Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks | - | - |
| - | - | - | Learning robust and multilingual speech representations | - | - |
| - | - | - | Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN | - | - |
| 2020 | - | MIT CSAIL | SEMI-SUPERVISED SPEECH-LANGUAGE JOINT PRE-TRAINING FOR SPOKEN LANGUAGE UNDERSTANDING | paper | - |
