As of 2021.10.24 this repository is still a work in progress.
It is hard to find the time to finish it, but I am leaving it public in the hope that it helps someone, even a little.
This repository is for those who want to study speech tasks such as Speech Recognition, Speech Synthesis, Spoken Language Understanding and so on.
Rather than trying to survey as many papers as possible, I include the papers that are crucial by my standards (sufficiently highly cited, from reputable groups, and preferably recent), so the selection may be subjective.
Fig. Overall Speech Dialogue System From Seunghyun SEO
- 1. Learnable Front-End for Speech
- 2. Self-Supervised(or Semi-Supervised) Learning for Speech
- 3. End-to-End Speech Recognition
- 3.1 CTC based ASR model
- 3.2 Seq2Seq with Attention based ASR model
- 3.3 CTC & Attention Hybrid Model
- 3.4 Neural Transducer(RNN-T) based ASR model
- 3.5 Streaming ASR
- 3.6 ASR Rescoring / Spelling Correction
- 4. End-to-End Spoken Language Understanding
- 4.1 Datasets (including speech SLU datasets for IC/SF/SQA ...)
- 4.2 Intent Classification (IC) + (Named Entity Recognition (NER) or Slot Filling (SF))
- 4.3 Spoken Question Answering (SQA)
- 4.4 Speech Emotion Recognition (SER)
- 5. End-to-End Speech Synthesis
- 6. End-to-End Non-Autoregressive Sequence Generation Model
- 6.1 Non-Autoregressive(NA) NMT
- 6.2 Non-Autoregressive(NA) ASR (STT)
- 6.3 Non-Autoregressive(NA) Speech Synthesis (TTS)
- 7. Some Trivial Schemes for Speech Tasks
- TBC
- Voice Conversion
- Speaker Identification
- MIR ?
- Rescoring
- Speech Translation
The input to most speech tasks has traditionally been features that encode domain knowledge, such as (Mel) spectrograms obtained with the Short-Time Fourier Transform and Mel filter banks, or MFCCs.
Recently proposed methods (although such attempts have existed for a while) instead extract features directly from the raw speech signal:
the parameters of the feature extractor are learned from data, and the learned features have been shown to outperform hand-crafted ones across a variety of speech tasks.
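To make the idea concrete, here is a minimal sketch (not the implementation of any particular paper below) of a learnable front-end: a stack of strided 1-D convolutions applied to the raw waveform in place of STFT + Mel filter banks. All layer sizes here are made up; SincNet and LEAF additionally constrain or parameterize the filters.

```python
import torch
import torch.nn as nn

class LearnableFrontEnd(nn.Module):
    """Toy learnable front-end: strided 1-D convs over the raw waveform.

    Only a sketch of the general idea (learned conv filters replacing fixed
    STFT/Mel filter banks); real systems (SincNet, LEAF, wav2vec) use carefully
    parameterized filters, normalization and compression on top of this.
    """
    def __init__(self, out_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            # ~25 ms window / 10 ms hop at 16 kHz (400 / 160 samples), spectrogram-like framing
            nn.Conv1d(1, out_dim, kernel_size=400, stride=160, padding=200),
            nn.ReLU(),
            nn.Conv1d(out_dim, out_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> features: (batch, frames, out_dim)
        feats = self.net(wav.unsqueeze(1))
        return feats.transpose(1, 2)

if __name__ == "__main__":
    frontend = LearnableFrontEnd()
    wav = torch.randn(2, 16000)        # 2 utterances, 1 second at 16 kHz
    print(frontend(wav).shape)         # torch.Size([2, 101, 80])
```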
Fig. Conventional Front-End feature, Spectrogram using Short-Time-Fourier-Transform(STFT)
Fig. Interpretable Convolutional Filters with SincNet, 2018
Fig. LEAF: A Learnable Frontend for Audio Classification, 2021
year | conference | research organization | title | link | code |
---|---|---|---|---|---|
2013 | ASRU | | Learning filter banks within a deep neural network framework | paper | |
2015 | Interspeech | | Learning the Speech Front-end With Raw Waveform CLDNNs | paper | |
2015 | ICASSP | Hebrew University of Jerusalem, Google | Speech acoustic modeling from raw multichannel waveforms | paper | |
2018 | ICASSP | Facebook AI Research (FAIR), CoML | Learning Filterbanks from Raw Speech for Phone Recognition | paper | code(pytorch, official) |
2018 | - | MILA | Interpretable Convolutional Filters with SincNet | paper | code(official) |
2018 | SLT | MILA | Speaker recognition from raw waveform with sincnet | paper | code(official) |
2021 | ICLR | | LEAF: A Learnable Frontend for Audio Classification | paper | |
- If you are new to SSL, you may want to read these blog articles first: Lil'Log post, Amit Chaudhary's post
Self-supervised (or semi-supervised) learning is, as of 2020, one of the hottest topics in deep learning, emphasized by Yann LeCun himself.
The idea is to learn better representations of the input by training on large amounts of unlabeled data in a self-supervised (or semi-supervised) way.
The pre-trained network is then fine-tuned in a task-specific manner for downstream tasks such as speech recognition.
Pre-training methods have long existed in many forms, from autoencoders to BERT, but papers adapting them to speech have appeared only recently,
and networks trained this way clearly outperform networks trained from scratch.
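As a rough illustration of the contrastive objective behind CPC/wav2vec-style pre-training (a sketch under simplified assumptions, not the exact loss of any paper below): each context vector has to identify its own target latent among in-batch negatives.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context: torch.Tensor, targets: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss sketch in the spirit of CPC / wav2vec.

    context: (batch, dim) context-network output at some time step.
    targets: (batch, dim) latent of the true future frame for each item;
             the other items in the batch act as negatives.
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(context.size(0))         # the positive is on the diagonal
    return F.cross_entropy(logits, labels)

# After pre-training an encoder with such a loss on unlabeled audio,
# the encoder is fine-tuned with a supervised loss (e.g. CTC) on labeled data.
```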
Fig. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020
year | conference | research organization | title | link | code |
---|---|---|---|---|---|
2019 | - | Facebook AI Research (FAIR) | Effectiveness of self-supervised pre-training for speech recognition | paper | |
2019 | Interspeech | Facebook AI Research (FAIR) | wav2vec: Unsupervised Pre-training for Speech Recognition | paper | code(official, pytorch) |
2020 | ACL | Facebook AI Research (FAIR) | Unsupervised Cross-lingual Representation Learning at Scale | paper | code(official, pytorch) |
2020 | ICLR | Facebook AI Research (FAIR) | vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | paper | code(official, pytorch) |
2020 | NIPS | Facebook AI Research (FAIR) | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | paper | code(official, pytorch) |
2020 | - | Facebook AI Research (FAIR) | Unsupervised Cross-lingual Representation Learning for Speech Recognition | paper | code(official, pytorch) |
2020 | Interspeech | Facebook AI | Self-Supervised Representations Improve End-to-End Speech Translation | paper | |
2020 | ICASSP | Facebook AI Research (FAIR) | Unsupervised Pretraining Transfers Well Across Languages | paper | |
<= | => | ||||
2019 | - | Universitat Politècnica de Catalunya | Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN | paper | |
2019 | Interspeech | Universitat Politècnica de Catalunya, MILA et al. | Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks | paper | code(official) |
2020 | ICASSP | MILA et al. | Multi-Task Self-Supervised Learning for Robust Speech Recognition | paper | code(official) |
<= | => | ||||
2018 | - | Deepmind | Representation Learning with Contrastive Predictive Coding | paper | code(official, pytorch) |
2019 | - | Deepmind | Learning robust and multilingual speech representations | paper | |
2020 | Interspeech | National Taiwan University | SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering | paper | |
2020 | - | MIT CSAIL | Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding | paper | |
2021 | - | MIT CSAIL | Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining | paper | code(official, pytorch) |
2021 | - | Facebook AI | Generative Spoken Language Modeling from Raw Audio | paper | |
2020 | ICASSP | University of Oxford, Naver | Disentangled Speech Embeddings using Cross-modal Self-supervision | paper |
- I recommend reading Graves' thesis: Supervised Sequence Labelling with Recurrent Neural Networks, 2008
- If you're new to CTC-based ASR models, you may want to read this blog post before the papers: post for CTC from Distill blog
One of the most important issues in sequence generation tasks such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) is the alignment problem: mapping the input sequence onto the target sequence.
In speech recognition, for example, the speech (input) and its transcript (target) have different lengths, so we do not know which stretch of audio corresponds to which token (word or character).
Connectionist Temporal Classification (CTC), proposed in 2006 by Alex Graves (later a researcher at DeepMind), introduced the CTC loss precisely to handle this alignment problem,
and it remains one of the two dominant approaches to end-to-end ASR, alongside the attention-based Seq2Seq approach of section 3.2.
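For reference, PyTorch ships a CTC loss; below is a minimal usage sketch with made-up shapes (the encoder that would produce the log-probabilities is omitted).

```python
import torch
import torch.nn as nn

# Toy example of PyTorch's built-in CTC loss. Shapes are made up;
# in a real model log_probs would come from an acoustic encoder.
T, N, C = 50, 2, 30                                   # frames, batch, vocab size (0 = blank)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)                # (time, batch, vocab), as CTCLoss expects
targets = torch.randint(1, C, (N, 12))                # label sequences, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                             # marginalizes over all valid alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                       # gradients flow back to the encoder logits
```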
The figure shows the frame-level character probabilities emitted by the CTC layer
Fig. Towards End-to-End Speech Recognition with Recurrent Neural Networks, 2014
Fig. Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin, 2016
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2006 | ICML | University of Toronto | Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks | CTC | paper | code(pytorch),warp-ctc,code(pytorch) |
2014 | ICML | Deepmind | Towards End-to-End Speech Recognition with Recurrent Neural Networks | LSTM-based CTC model | paper | |
2014 | - | Baidu Research | Deep Speech: Scaling up end-to-end speech recognition | | paper | code(tensorflow), code(pytorch) |
2016 | ICML | Baidu Research | Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin | CNN-based CTC model | paper | code(pytorch) |
2017 | - | Baidu Research | (Deep Speech 3) Exploring Neural Transducers for End-to-End Speech Recognition | | paper | |
2016 | - | Facebook AI Research (FAIR) | Wav2Letter: an End-to-End ConvNet-based Speech Recognition System | CNN-based CTC model | paper | code(official pytorch, C++) |
2018 | - | | State-of-the-art Speech Recognition with Sequence-to-Sequence Models | | paper | |
2019 | Interspeech | Nvidia | Jasper: An End-to-End Convolutional Neural Acoustic Model | CNN-based CTC model | paper | code(official),code(pytorch) |
2019 | - | Nvidia | QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | | paper | |
- If you're new to seq2seq-with-attention networks, you may want to check the following resources first
As mentioned above, the Seq2Seq-with-attention network is another highly popular approach to sequence generation tasks such as ASR, OCR and NMT. Although it conditions each output on the entire input sequence (which makes it less suitable for real-time use), it has achieved strong results on many ASR benchmark datasets.
The Seq2Seq-with-attention ASR network closely resembles the model of 'Neural Machine Translation by Jointly Learning to Align and Translate' (2014), a breakthrough in machine translation, and, like CTC, it effectively solves the alignment problem in speech recognition.
Decoding is auto-regressive, which is a drawback, but it remains one of the strongest end-to-end approaches.
The encoder and decoder of the Seq2Seq model are commonly viewed as taking over the roles of the acoustic model (AM) and language model (LM) in older HMM-GMM and HMM-DNN systems.
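The core of these models is a soft alignment recomputed at every decoding step. Below is a minimal dot-product attention sketch; the additive or location-aware variants used in the papers below differ only in how the scores are computed.

```python
import torch
import torch.nn.functional as F

def attention_step(decoder_state: torch.Tensor, encoder_states: torch.Tensor):
    """One attention step of an attend-and-spell style decoder (sketch).

    decoder_state:  (batch, dim)        current decoder hidden state (the "query")
    encoder_states: (batch, time, dim)  encoder outputs over the input speech
    Returns the context vector and the soft alignment weights.
    """
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)   # (batch, time)
    weights = F.softmax(scores, dim=-1)                                           # alignment over frames
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)          # (batch, dim)
    return context, weights
```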
Fig. Listen, Attend and Spell
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2015 | NIPS | University of Wrocław, Jacobs University Bremen, Université de Montréal et al. | Attention-Based Models for Speech Recognition | Seq2Seq with Attention | paper | code(pytorch), code2(pytorch) |
2015 | ICASSP | | Listen, Attend and Spell | Seq2Seq with Attention | paper | code(pytorch) |
2016 | ICASSP | Jacobs University Bremen, University of Wrocław, Université de Montréal, CIFAR Fellow | End-to-End Attention-based Large Vocabulary Speech Recognition | Seq2Seq with Attention | paper | |
2018 | ICASSP | | Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition | Seq2Seq with Attention | paper | code(official), another ref code |
2019 | ASRU | | A Comparative Study on Transformer vs RNN in Speech Applications | Seq2Seq with Attention | paper | |
2019 | - | | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | Training either CTC or Seq2Seq loss functions | paper | |
This network is trained jointly with both the CTC loss and the Seq2Seq (attention) loss,
which acts somewhat like an ensemble and makes end-to-end ASR training noticeably more stable.
The two losses are usually interpolated with weights that sum to 1, and the ratio can be scheduled (changed) over the course of training.
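In code the interpolation is a one-liner; a sketch (the CTC weight `lam` is a hyperparameter, commonly somewhere around 0.3, and may be scheduled during training):

```python
import torch

def joint_ctc_attention_loss(ctc_loss: torch.Tensor,
                             att_loss: torch.Tensor,
                             lam: float = 0.3) -> torch.Tensor:
    """Interpolate the two losses with weights that sum to 1.

    lam is the CTC weight; (1 - lam) goes to the attention (cross-entropy) loss.
    """
    return lam * ctc_loss + (1.0 - lam) * att_loss
```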
Fig. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, 2017
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2017 | - | | Hybrid CTC/Attention Architecture for End-to-End Speech Recognition | | paper | |
2017 | - | | Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning | | paper | code(pytorch) |
2019 | - | | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | paper | |
- If you're new to RNN-T, you may want to read this blog article first: Google AI Blog for RNN-Transducer
The Neural Transducer (RNN-T) was first introduced by Alex Graves in the paper 'Sequence Transduction with Recurrent Neural Networks'.
End-to-end ASR models based on RNNs with CTC or Seq2Seq losses already existed, but they either consume the whole utterance before emitting a transcript or are otherwise poorly suited to real-time (streaming) recognition,
and the Neural Transducer (RNN-T) was proposed to address exactly this.
The RNN components can of course be replaced with Transformers, which keep setting new state-of-the-art results not only in NLP but also in computer vision.
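Below is a structural sketch of a transducer (acoustic encoder + prediction network + joint network). Layer types and sizes are arbitrary, and the transducer loss itself (which sums over all monotonic alignments) is left to a library such as warp-transducer or torchaudio.

```python
import torch
import torch.nn as nn

class TransducerSketch(nn.Module):
    """Minimal RNN-T skeleton: only the wiring of the three sub-networks is shown."""

    def __init__(self, feat_dim: int = 80, vocab: int = 1000, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # acoustic encoder
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)   # label-history "LM"
        self.embed = nn.Embedding(vocab, hidden)
        self.joiner = nn.Linear(2 * hidden, vocab + 1)                # +1 for the blank symbol

    def forward(self, feats: torch.Tensor, prev_labels: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim), prev_labels: (B, U)
        enc, _ = self.encoder(feats)                                  # (B, T, H)
        pred, _ = self.predictor(self.embed(prev_labels))             # (B, U, H)
        # joint network over every (t, u) pair -> (B, T, U, vocab+1) logits
        joint = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joiner(torch.tanh(joint))
```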
Fig. Neural Transducer
Fig. Streaming E2E Speech Recognition For Mobile Devices, 2018
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2012 | ICML | University of Toronto | Sequence Transduction with Recurrent Neural Networks | | paper | |
2015 | NIPS | Google Brain, Deepmind, OpenAI | A Neural Transducer | | paper | |
2017 | ASRU | | Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer | | paper | |
2018 | ICASSP | | Streaming E2E Speech Recognition For Mobile Devices | | paper | code(tensorflow) |
2019 | ASRU | Microsoft | Improving RNN Transducer Modeling for End-to-End Speech Recognition | | paper | |
2019 | Interspeech | Chinese Academy of Sciences et al. | Self-Attention Transducers for End-to-End Speech Recognition | | paper | |
2020 | ICASSP | | Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss | | paper | code(pytorch) |
2020 | ICASSP | | A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency | | paper | |
2020 | Interspeech | | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | CNN based RNN-T | paper | |
2020 | Interspeech | | Conformer: Convolution-augmented Transformer for Speech Recognition | | paper | code(pytorch), code2(pytorch) |
2021 | ICASSP | | FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization | | paper | |
2021 | ICASSP | Facebook AI | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | | paper | |
You could argue that the RNN-T of section 3.4 was designed for streaming ASR in the first place, so this subsection looks redundant;
however, besides RNN-T there have also been attempts at streaming with purely attention-based seq2seq models, as well as models combining seq2seq and RNN-T,
so I split these out into a separate subsection.
Fig. Two-Pass End-to-End Speech Recognition, 2019
Fig. Streaming automatic speech recognition with the transformer model, 2020
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | ICLR | Google Brain | Monotonic Chunkwise Attention | Seq2Seq with Attention | paper | |
2019 | Interspeech | | Two-Pass End-to-End Speech Recognition | LAS+RNN-T | paper | |
2019 | Interspeech | Samsung Research | End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System | | paper | |
2020 | ICASSP | MERL | Streaming automatic speech recognition with the transformer model | | paper | |
2020 | Interspeech | | Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition | | paper | |
2021 | ICLR | | Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling | | paper | |
- This is from link
year | conference | research organization | title | model | task | link | code |
---|---|---|---|---|---|---|---|
2019 | ICASSP | University of California, Los Angeles, Google | A Spelling Correction Model For E2E Speech Recognition | LAS based | asr | paper | |
2019 | ACML | Seoul National University(SNU) | Effective Sentence Scoring Method Using BERT for Speech Recognition | BERT based | asr | paper | |
2020 | ICASSP | Moscow Institute of Physics and Technology, NVIDIA | Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model | Transformer based | asr | | |
Spoken Language Understanding (SLU) is the front end of a speech dialogue system.
In the conventional SLU pipeline, an ASR network first transcribes the input speech into text,
and a Natural Language Understanding (NLU) network then extracts semantic information such as emotion or intent/slots from that text.
This pipeline has a critical weakness: the ASR output may contain errors, and when it does,
the NLU module cannot make sense of the corrupted text and produces poor results.
End-to-End Spoken Language Understanding (E2E SLU) instead takes speech as input and predicts the semantic information directly,
so it is not limited by the ASR error rate; it has recently become a very active research area.
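A toy sketch of what "direct" means here: the model maps speech features straight to an intent label, with no intermediate transcript. In practice the encoder is usually pre-trained (on ASR or with SSL, as in the papers below), but the overall shape is the same; all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class E2EIntentClassifier(nn.Module):
    """Toy end-to-end SLU model: speech features -> intent, no text in between."""

    def __init__(self, feat_dim: int = 80, hidden: int = 256, num_intents: int = 31):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_intents)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim)
        enc, _ = self.encoder(feats)          # (B, T, 2H)
        pooled = enc.mean(dim=1)              # average pool over time
        return self.classifier(pooled)        # intent logits, trained with cross-entropy
```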
Fig. Conventional Pipeline for Spoken Language Understanding (ASR -> NLU)
Fig. End-to-End Spoken Language Understanding Network
Fig. Towards End-to-end Spoken Language Understanding, 2018
SLU itself has been studied for a long time, but E2E SLU has only recently become an active research area.
As a result, datasets whose input is speech rather than text (e.g., speech-intent pairs) are scarce,
and it is hard to find good public datasets for this line of research, so I list the relevant datasets first.
- Intent Classification (IC) + (Named Entity Recognition (NER) or Slot Filling (SF))
- Spoken Question Answering (SQA)
- Speech Emotion Recognition (SER)
task | dataset name | language | year | conference | title | paper link | dataset link |
---|---|---|---|---|---|---|---|
- | SLURP | english | 2020 | EMNLP | SLURP: A Spoken Language Understanding Resource Package | paper | dataset |
IC | Fluent Speech Command(FSC) | english | 2019 | Interspeech | Speech Model Pre-training for End-to-End Spoken Language Understanding | paper | dataset |
IC | SNIPS | english | 2018 | - | Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | paper | dataset |
IC | ATIS | english | 1990 | ACL | The ATIS Spoken Language Systems Pilot Corpus | paper | dataset |
IC | TOP or Facebook Semantic Parsing System (FSPS) | english | 2019 | - | Semantic Parsing for Task Oriented Dialog using Hierarchical Representations | paper | |
SQA | Spoken SQuAD(SSQD) | english | 2018 | Interspeech | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | paper | dataset |
SQA | Spoken CoQA | english | 2020 | - | Towards Data Distillation for End-to-end Spoken Conversational Question Answering | paper | dataset |
SQA | ODSQA | chinese | 20- | - | ODSQA: Open-domain spoken question answering dataset | - | - |
ER | IEMOCAP | english | 2008 | - | IEMOCAP: Interactive emotional dyadic motion capture database | paper | dataset |
ER | CMU-MOSEI | english | 2018 | - | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | paper | dataset |
year | conference | research organization | title | model | task | link | code |
---|---|---|---|---|---|---|---|
2018 | ICASSP | Facebook, MILA | Towards End-to-end Spoken Language Understanding | | IC only | paper | |
2019 | Interspeech | MILA, CIFAR | Speech Model Pre-training for End-to-End Spoken Language Understanding | | IC only | paper | code(official) |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | Interspeech | | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | dataset | paper | github |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
Fig. WaveNet: A Generative Model for Raw Audio, 2016
Fig. Tacotron: Towards End-to-End Speech Synthesis, 2017
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2016 | - | Deepmind | WaveNet: A Generative Model for Raw Audio | | paper | code(tensorflow), code(pytorch) |
2018 | ICML | Deepmind | Parallel WaveNet: Fast High-Fidelity Speech Synthesis | | paper | |
2017 | ICLR | University of Montreal et al. | SampleRNN: An Unconditional End-to-End Neural Audio Generation Model | | paper | code(official) |
2017 | ICLR | Montreal Univ, CIFAR | Char2Wav: End-to-End Speech Synthesis | | paper | |
<= | => | |||||
2017 | ICML | Baidu Research | Deep Voice: Real-time Neural Text-to-Speech | DeepVoice Series | paper | |
2017 | NIPS | Baidu Research | Deep Voice 2: Multi-Speaker Neural Text-to-Speech | DeepVoice Series | paper | |
2018 | ICLR | Baidu Research | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning | DeepVoice Series | paper | code |
<= | => | |||||
2017 | Interspeech | | Tacotron: Towards End-to-End Speech Synthesis | Tacotron Series | paper | code(tensorflow), code(pytorch) |
2017 | NIPS | KAIST et al. | Emotional End-to-End Neural Speech Synthesizer | Tacotron Series | paper | |
2018 | ICML | | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | Tacotron Series | paper | code(tensorflow) |
2018 | ICML | | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | Tacotron Series | paper | |
2018 | ICASSP | | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2) | Tacotron Series | paper | |
2021 | ICLR | Google Research | Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling | Tacotron Series | paper | |
<= | => | |||||
2019 | ICLR | UC San Diego | Adversarial Audio Synthesis | GAN | paper | code(official, tensorflow) |
2020 | ICASSP | LINE, NAVER | Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | GAN | paper | code(official) |
<= | => | |||||
2019 | AAAI | University of Electronic Science and Technology of China et al. | Neural Speech Synthesis with Transformer Network | | paper | |
2019 | NIPS | Zhejiang University, Microsoft | FastSpeech: Fast, Robust and Controllable Text to Speech | | paper | code(pytorch) |
2021 | ICLR | Zhejiang University, Microsoft | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | | paper | |
<= | => | |||||
2019 | ICASSP | Nvidia | WaveGlow: a Flow-based Generative Network for Speech Synthesis | Flow-based | paper | code(official, pytorch) |
2020 | NIPS | Kakao Enterprise, SNU | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | Flow-based | paper | |
<= | => | |||||
2019 | ICLR | Baidu Research | ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech | | paper | |
2020 | ICML | Baidu Research | Non-Autoregressive Neural Text-to-Speech | | paper | |
2020 | ICASSP | | Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis | | paper | |
Methods have recently been proposed to get rid of autoregressive decoding, one of the drawbacks of typical end-to-end speech recognition models.
Since there are still only a few papers on non-autoregressive speech recognition, this section covers non-autoregressive models for machine translation (NMT), speech recognition (STT) and speech synthesis (TTS) together.
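To make the contrast explicit, here is a sketch with a hypothetical `model` that exposes an autoregressive `decoder` and a parallel `nar_decoder` head (the interfaces are made up; real NAR models such as Mask-CTC additionally collapse or iteratively refine the parallel draft):

```python
import torch

@torch.no_grad()
def decode_autoregressive(model, enc, bos: int, eos: int, max_len: int = 100):
    """Standard left-to-right decoding: one token per forward pass."""
    ys = [bos]
    for _ in range(max_len):
        logits = model.decoder(enc, torch.tensor([ys]))   # (1, len(ys), vocab)
        nxt = logits[0, -1].argmax().item()
        ys.append(nxt)
        if nxt == eos:
            break
    return ys[1:]

@torch.no_grad()
def decode_non_autoregressive(model, enc):
    """NAR decoding: predict every output position in a single parallel pass."""
    logits = model.nar_decoder(enc)                       # (1, T_out, vocab)
    return logits.argmax(dim=-1)[0].tolist()
```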
Fig. NON-AUTOREGRESSIVE NEURAL MACHINE TRANSLATION, 2018
Fig. Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior, 2020
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | ICLR | The University of Hong Kong | Non-Autoregressive Neural Machine Translation | | paper | code(fairseq) |
2018 | ACL | NYU | Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement | | paper | code(official), code(fairseq) |
2019 | NIPS | Facebook AI Research (FAIR) | Levenshtein Transformer | | paper | code(official, fairseq) |
2019 | ACL | Nanjing University et al. | Non-autoregressive Transformer by Position Learning | | paper | |
2019 | NIPS | CMU, Berkeley, Peking University | Fast Structured Decoding for Sequence Models | | paper | code(fairseq) |
2020 | ACL | | Non-Autoregressive Machine Translation with Latent Alignments | | paper | code |
2019 | EMNLP | CMU, Facebook AI | FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | | paper | code(official) |
2020 | ACL | Toyota Technological Institute at Chicago et al. | ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation | | paper | |
2020 | AAAI | University of Tokyo, FAIR, MILA, NYU | Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior | | paper | code(official, pytorch) |
Fig. Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict, 2020
Fig. Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition, 2020
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2020 | Interspeech | Johns Hopkins University et al. | Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict | CTC-based | paper | |
2020 | Interspeech | Chinese Academy of Sciences et al. | Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition | CTC-based | paper | |
2020 | ACL | Zhejiang University | A Study of Non-autoregressive Model for Sequence Generation | | paper | |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
Fig. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, 2019
Fig. When does label smoothing help?, 2019
year | conference | research organization | title | link | code |
---|---|---|---|---|---|
2017 | ACL | Facebook AI Research (FAIR) | Bag of Tricks for Efficient Text Classification | paper | code(official) |
2017 | ICLR | Google Brain, University of Toronto | Regularizing Neural Networks by Penalizing Confident Output Distributions | paper | - |
2018 | ICLR | Google Brain | Don't decay the learning rate, Increase the batch size | paper | code(pytorch) |
2019 | NIPS | Google Brain, University of Toronto | When does label smoothing help? | paper | code(pytorch) |
2019 | Interspeech | Google Brain | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | paper | code, code2 |
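Since SpecAugment appears in the table above: torchaudio already provides its two masking transforms, so a minimal sketch looks like this (the mask parameters are example values, and the time-warping policy from the original paper is omitted here):

```python
import torch
import torchaudio

# SpecAugment-style masking on a (channel, freq, time) log-mel spectrogram.
spec = torch.randn(1, 80, 400)                                        # fake log-mel spectrogram
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)
augmented = time_mask(freq_mask(spec))   # zeros out a random frequency band and a random time span
```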