This repository is for newcomers who want to study or research speech tasks (speech recognition, speech synthesis, and so on).
Rather than including as many papers as possible, the aim is to keep only important and recent papers (sufficiently highly cited, from reputable research organizations, and published at top conferences). (This selection may be subjective.)
(It has admittedly grown into a bit of a grab bag.)
- Don't Decay the Learning Rate, Increase the Batch Size paper
- When Does Label Smoothing Help? paper
- Bag of Tricks for Efficient Text Classification paper
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition paper
- State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions paper
1. End-to-End Speech Recognition papers
- CTC-based ASR papers
- Attention-based ASR papers
- Hybrid ASR papers
- RNN-T based ASR papers
- Streaming ASR papers
2. End-to-End Speech Synthesis papers
3. End-to-End Non-Autoregressive Sequence Generation papers
- ASR
- NMT
- TTS
4. End-to-End Spoken Language Understanding
- Intent Classification papers
- Spoken Question Answering papers
- Speech Emotion Recognition papers
5. Self-Supervised (or Semi-Supervised) Learning for Speech
TBC
- Voice Conversion
- Speaker Identification
- MIR (Music Information Retrieval)?
- Rescoring
- Speech Translation
- If you're new to CTC-based ASR models, it's worth reading this blog post before the papers: post for CTC from Distill blog
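A minimal sketch of how a CTC loss is typically wired up in PyTorch, in case alignment-free training is new to you (all shapes, the vocabulary size, and the random tensors are illustrative assumptions, not taken from any paper below):

```python
import torch
import torch.nn as nn

# Illustrative CTC training step (class 0 is reserved for the CTC blank token).
T, N, C = 50, 4, 28                            # time steps, batch size, vocab size
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, C, (N, 12))         # unaligned label sequences, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                # trains without frame-level alignments
```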
< Deep Speech 2: End-to-End Speech Recognition in English and Mandarin >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2006 | ICML | IDSIA | Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks | CTC | paper | |
2014 | - | Baidu Research | Deep Speech: Scaling up end-to-end speech recognition | | | |
2016 | ICML | Baidu Research | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | CTC-based CNN model | paper | code(pytorch) |
2019 | Interspeech | Nvidia | Jasper: An End-to-End Convolutional Neural Acoustic Model | | | |
2019 | - | Nvidia | QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | | | |
- If you're new to seq2seq models with attention, you may want to check the following first (a minimal attention sketch follows this note):
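As a warm-up, here is a minimal sketch of one content-based attention step inside a seq2seq decoder (plain dot-product scoring with made-up shapes; LAS and the papers below use learned scoring functions, but the data flow is the same):

```python
import torch
import torch.nn.functional as F

B, T_enc, H = 4, 100, 256                      # batch, encoder time steps, hidden size
encoder_states = torch.randn(B, T_enc, H)      # "listener" outputs
decoder_state = torch.randn(B, H)              # current "speller" state (the query)

# Score every encoder frame against the decoder state, then normalize.
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (B, T_enc)
attn_weights = F.softmax(scores, dim=-1)       # where to "listen" at this output step

# Weighted sum of encoder frames = context vector fed to the next prediction.
context = torch.bmm(attn_weights.unsqueeze(1), encoder_states).squeeze(1)    # (B, H)
```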
< Listen, Attend and Spell >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2008 | - | - | Supervised Sequence Labelling with Recurrent Neural Networks | | | |
2014 | ICML | - | Towards End-to-End Speech Recognition with Recurrent Neural Networks | | | |
2015 | NIPS | - | Attention-Based Models for Speech Recognition | Seq2Seq | | |
2015 | ICASSP | - | Listen, Attend and Spell | Seq2Seq | paper | code(pytorch) |
2016 | ICASSP | - | End-to-End Attention-based Large Vocabulary Speech Recognition | | | |
2018 | ICLR | - | Monotonic Chunkwise Attention | | | |
2018 | ICASSP | - | Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition | | | |
2019 | - | - | Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR | | | |
2019 | - | - | A Comparative Study on Transformer vs RNN in Speech Applications | | paper | |
2019 | - | - | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | | paper | |
2020 | Interspeech | Google | Conformer: Convolution-augmented Transformer for Speech Recognition | | paper | |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2019 | - | - | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | paper | |
< Streaming E2E Speech Recognition For Mobile Devices >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2012 | - | - | Sequence Transduction with Recurrent Neural Networks | RNN-T | | |
2019 | ICASSP | Google | Streaming E2E Speech Recognition For Mobile Devices | RNN-T | paper | |
2018 | - | - | Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer | | | |
2019 | - | - | Improving RNN Transducer Modeling for End-to-End Speech Recognition | | | |
2019 | - | - | Self-Attention Transducers for End-to-End Speech Recognition | | | |
2020 | ICASSP | - | Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss | | | |
2020 | ICASSP | - | A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency | | | |
2021 | ICASSP | - | FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization | | | |
2021 | ICASSP | - | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | | | |
2020 | Interspeech | Google | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | | paper | |
< Two-Pass End-to-End Speech Recognition >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2019 | - | - | Two-Pass End-to-End Speech Recognition | LAS+RNN-T | paper | |
Temporary: ASR rescoring and error-correction papers
- This list is taken from link
year | conference | research organization | title | model | task | link | code |
---|---|---|---|---|---|---|---|
2019 | - | - | Automatic Speech Recognition Errors Detection and Correction | | asr | | |
2019 | - | - | A Spelling Correction Model For E2E Speech Recognition | | asr | | |
2019 | - | - | An Empirical Study Of Efficient ASR Rescoring With Transformers | | asr | | |
2019 | - | - | Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition | | asr | | |
2019 | - | - | Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model | | asr | | |
2019 | - | - | Effective Sentence Scoring Method Using BERT for Speech Recognition | | asr | | |
2020 | - | - | Spelling Error Correction with Soft-Masked BERT | | nlp | | |
2019 | - | - | Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition | | asr | | |
< Tacotron: Towards End-to-End Speech Synthesis >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2016 | - | DeepMind | WaveNet: A Generative Model for Raw Audio | | paper | |
2017 | ICLR | - | SampleRNN: An Unconditional End-to-End Neural Audio Generation Model | | paper | code(official) |
2017 | ICLR | Montreal Univ, CIFAR | Char2Wav: End-to-End Speech Synthesis | | paper | |
2017 | ICML | Baidu Research | Deep Voice: Real-time Neural Text-to-Speech | | paper | |
2017 | NIPS | Baidu Research | Deep Voice 2: Multi-Speaker Neural Text-to-Speech | | paper | |
2017 | - | Baidu Research | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning | | paper | code |
2017 | Interspeech | Google | Tacotron: Towards End-to-End Speech Synthesis | | paper | code(tensorflow), code(pytorch) |
2017 | ICML | - | Emotional End-to-End Neural Speech Synthesizer | | | |
2018 | ICML | Google | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | | | |
2018 | ICML | Google | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | | | |
2021 | ICLR | Google Research | Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling | | paper | |
2018 | - | - | Adversarial Audio Synthesis | GAN | paper | code(official, tensorflow) |
2019 | ICASSP | Nvidia | WaveGlow: A Flow-based Generative Network for Speech Synthesis | | paper | code(official, pytorch) |
2019 | - | - | Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | | paper | |
2019 | NeurIPS | Microsoft Research, Zhejiang University | FastSpeech: Fast, Robust and Controllable Text to Speech | | paper | |
2020 | - | Microsoft Research, Zhejiang University | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | | paper | |
2020 | NeurIPS | Kakao Enterprise, SNU | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | | paper | |
2020 | ICASSP | - | Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow | | paper | |
2019 | AAAI | - | Neural Speech Synthesis with Transformer Network | | paper | |
2017 | - | DeepMind | Parallel WaveNet: Fast High-Fidelity Speech Synthesis | | | |
2020 | ICASSP | - | Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis | | | |
Because there are relatively few non-autoregressive papers, this section covers machine translation (NMT), speech recognition (STT), and speech synthesis (TTS) together.
< Non-Autoregressive Neural Machine Translation >
< Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | ICLR | The University of Hong Kong | Non-Autoregressive Neural Machine Translation | | | |
2020 | - | - | Non-Autoregressive Machine Translation with Latent Alignments | | | |
2019 | EMNLP | CMU | FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | | | |
2019 | NeurIPS | CMU, Berkeley, Peking University | Fast Structured Decoding for Sequence Models | | | |
2019 | ACL | - | Non-autoregressive Transformer by Position Learning | | | |
2020 | - | - | ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation | | | |
2019 | - | University of Tokyo, FAIR, MILA, NYU | Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior | | | |
< Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict >
< Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2020 | Interspeech | - | Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict | CTC-based | | |
2020 | Interspeech | - | Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition | CTC-based | | |
2020 | - | - | A Study of Non-autoregressive Model for Sequence Generation | | | |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2020 | - | Baidu Research | Non-Autoregressive Neural Text-to-Speech | | | |
In the conventional Spoken Language Understanding (SLU) setup, an ASR module takes speech as input and outputs text,
and a Natural Language Understanding (NLU) module then takes that text and outputs results such as emotion or intent/slot labels.
End-to-End Spoken Language Understanding (SLU) instead takes speech as input and outputs the result directly,
with the goal of training the model in a fully differentiable way, unconstrained by the error rate of the ASR network.
( Conventional Pipeline for Spoken Language Understanding ( ASR -> NLU ) )
( End-to-End Spoken Language Understanding Network )
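To make the contrast concrete, here is a toy PyTorch sketch of both setups (both models are random stand-ins invented for this illustration, not any paper's architecture):

```python
import torch
import torch.nn as nn

class ToyASR(nn.Module):
    """Speech features -> discrete tokens (stands in for a full ASR system)."""
    def __init__(self, feat_dim=80, vocab=100):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h).argmax(-1)          # argmax: the gradient stops here

class ToyE2ESLU(nn.Module):
    """Speech features -> intent logits directly, one differentiable path."""
    def __init__(self, feat_dim=80, n_intents=10):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.cls = nn.Linear(128, n_intents)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.cls(h.mean(dim=1))         # utterance-level intent prediction

speech = torch.randn(2, 300, 80)               # (batch, frames, log-mel features)
tokens = ToyASR()(speech)                      # pipeline step 1: speech -> "text";
                                               # a separate NLU model would consume
                                               # these tokens, inheriting ASR errors
intent_logits = ToyE2ESLU()(speech)            # end-to-end: trainable with one loss
```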
< Towards End-to-end Spoken Language Understanding >
- Intent Classification (IC)
- Spoken Question Answering (SQA)
- Emotion Recognition (ER)
task | dataset name | language | year | conference | title | paper link | dataset link |
---|---|---|---|---|---|---|---|
- | SLURP | English | 2020 | EMNLP | SLURP: A Spoken Language Understanding Resource Package | paper | dataset |
IC | Fluent Speech Commands (FSC) | English | 2019 | Interspeech | Speech Model Pre-training for End-to-End Spoken Language Understanding | paper | dataset |
IC | SNIPS | English | 2018 | - | Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | paper | |
IC | ATIS | English | 1990 | - | The ATIS Spoken Language Systems Pilot Corpus | paper | |
IC | TOP or Facebook Semantic Parsing System (FSPS) | English | 2018 | EMNLP | Semantic Parsing for Task Oriented Dialog using Hierarchical Representations | paper | |
SQA | Spoken SQuAD (SSQD) | English | 2018 | Interspeech | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | paper | dataset |
SQA | Spoken CoQA | English | 2020 | - | Towards Data Distillation for End-to-end Spoken Conversational Question Answering | paper | dataset |
SQA | ODSQA | Chinese | 2018 | SLT | ODSQA: Open-Domain Spoken Question Answering Dataset | - | - |
ER | IEMOCAP | English | 2008 | - | IEMOCAP: Interactive emotional dyadic motion capture database | paper | dataset |
ER | CMU-MOSEI | English | 2018 | ACL | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | paper | dataset |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | ICASSP | Facebook, MILA | Towards End-to-end Spoken Language Understanding | | paper | |
2019 | Interspeech | MILA, CIFAR | Speech Model Pre-training for End-to-End Spoken Language Understanding | | paper | code(official) |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | Interspeech | - | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | dataset | paper | github |
Self-supervised (or semi-supervised) learning is, as of 2020, one of the hottest topics in deep learning, as Yann LeCun has emphasized;
the idea is to learn better representations of the input by training on vast amounts of unlabeled data in a self-supervised (or semi-supervised) way.
The pre-trained network is then fine-tuned task-specifically for downstream tasks such as speech recognition.
Pre-training methods have long existed in many forms, from autoencoders to BERT, but methods tailored to speech have been proposed only recently,
and networks trained this way achieve markedly better performance than networks trained from scratch.
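As a concrete example of the pre-train/fine-tune recipe, here is a minimal inference sketch with the Hugging Face `transformers` port of wav2vec 2.0 (this assumes the `transformers` library and the public `facebook/wav2vec2-base-960h` checkpoint; the random-noise input is just a placeholder, and actual fine-tuning would further train `Wav2Vec2ForCTC` with a CTC loss on labeled speech):

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load a network pre-trained on unlabeled speech and already fine-tuned for ASR.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.randn(16000 * 3)               # placeholder: 3 s of 16 kHz "audio"
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, frames, vocab)
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids))         # greedy CTC decoding to text
```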
< wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations >
year | conference | research organization | title | link | code |
---|---|---|---|---|---|
2019 | - | Facebook AI Research (FAIR) | wav2vec: Unsupervised Pre-training for Speech Recognition | paper | code(official) |
2019 | - | FAIR | Unsupervised Cross-lingual Representation Learning at Scale | ||
2019 | ICLR | FAIR | vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | paper | code(official) |
2020 | - | FAIR | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | paper | code(official) |
2020 | - | FAIR | Unsupervised Cross-lingual Representation Learning for Speech Recognition | paper | |
2019 | - | DeepMind | Learning robust and multilingual speech representations | paper | |
- | - | - | SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering | paper | |
- | - | - | Self-Supervised Representations Improve End-to-End Speech Translation | paper | |
- | - | - | Unsupervised Pretraining Transfers Well Across Languages | | |
- | - | - | Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks | | |
- | - | - | Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN | | |
2020 | - | MIT CSAIL | Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding | paper | |