JUiscoming/awesome_speech_papers

About This Repository

As of 2021-10-24, this repository is still incomplete.

It is hard to find the time to finish it, but I am leaving it open in the hope that it helps someone at least a little.

This repository is for those who want to study or research speech tasks such as Speech Recognition, Speech Synthesis, Spoken Language Understanding, and so on.
Rather than surveying as many papers as possible, it collects the papers that are crucial by my (admittedly subjective) standards: sufficiently well cited, from reputable research organizations, and, where possible, recently published.


Fig. Overall Speech Dialogue System (from Seunghyun SEO)


Index

  1. Learnable Front-End for Speech
  2. Self-Supervised (or Semi-Supervised) Learning for Speech
  3. End-to-End Speech Recognition
  4. End-to-End Spoken Language Understanding
  5. End-to-End Speech Synthesis
  6. End-to-End Non-Autoregressive Sequence Generation Model
  7. Some Trivial Schemes for Speech Tasks

  • TBC
    • Voice Conversion
    • Speaker Identification
    • Music Information Retrieval (MIR)?
    • Rescoring
    • Speech Translation



1. Learnable Front-End for Speech

The input to most speech tasks has traditionally been a feature that encodes domain knowledge, such as a (Mel) spectrogram computed with the Short-Time Fourier Transform and Mel filter banks, or MFCCs.
Recently proposed techniques (though such attempts have a long history) instead extract features directly from the raw speech signal: the parameters of the feature extractor are learned, and the learned features have been shown to perform better across a variety of speech tasks. A sketch of the conventional, fixed front-end they replace is given below.
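For concreteness, here is a minimal sketch of such a fixed front-end using torchaudio (the window, hop, and filter-bank sizes are typical but hypothetical choices, not taken from any paper below):

```python
import torch
import torchaudio

# A conventional, non-learnable front-end: STFT + Mel filter bank + log compression.
# 400-sample windows (~25 ms at 16 kHz) with 160-sample hops (~10 ms) are common in ASR.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

waveform = torch.randn(1, 16000)            # 1 second of dummy 16 kHz audio
features = torch.log(mel(waveform) + 1e-6)  # (channels, n_mels, frames)
print(features.shape)                       # torch.Size([1, 80, 101])
```

Learnable front-ends such as SincNet and LEAF replace the fixed filter bank above with filters whose parameters are trained jointly with the rest of the network.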

Fig. Conventional front-end feature: spectrogram using the Short-Time Fourier Transform (STFT)


Fig. Interpretable Convolutional Filters with SincNet, 2018


Fig. LEAF: A Learnable Frontend for Audio Classification, 2021


| year | conference | research organization | title | link | code |
|------|------------|-----------------------|-------|------|------|
| 2013 | ASRU | Google | Learning filter banks within a deep neural network framework | paper | |
| 2015 | Interspeech | Google | Learning the Speech Front-end With Raw Waveform CLDNNs | paper | |
| 2015 | ICASSP | Hebrew University of Jerusalem, Google | Speech acoustic modeling from raw multichannel waveforms | paper | |
| 2018 | ICASSP | Facebook AI Research (FAIR), CoML | Learning Filterbanks from Raw Speech for Phone Recognition | paper | code(pytorch, official) |
| 2018 | - | MILA | Interpretable Convolutional Filters with SincNet | paper | code(official) |
| 2018 | SLT | MILA | Speaker recognition from raw waveform with SincNet | paper | code(official) |
| 2021 | ICLR | Google | LEAF: A Learnable Frontend for Audio Classification | paper | |



2. Self-Supervised (or Semi-Supervised) Learning for Speech

Self-supervised (or semi-supervised) learning, a topic Yann LeCun has famously emphasized, is one of the hottest subjects in deep learning as of 2020: vast amounts of unlabeled data are used to learn better representations of the input in a self-supervised (or semi-supervised) way.
The pre-trained network is then fine-tuned task-specifically for downstream tasks such as speech recognition.

Pre-training methods have long existed in many forms, from autoencoders to BERT, but papers adapting them to speech have appeared only recently; networks pre-trained this way clearly outperform networks trained from scratch. Many of the papers below build on a contrastive objective, sketched after this paragraph.
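A minimal sketch of the contrastive (InfoNCE-style) objective behind CPC and the wav2vec family; this is a simplification, not any specific paper's implementation, and all shapes and the temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, temperature=0.1):
    """Contrastive objective: the context vector must identify the true
    (e.g. future or masked) latent among K distractors.
    context, positive: (B, D); negatives: (B, K, D)."""
    pos = F.cosine_similarity(context, positive, dim=-1)                  # (B,)
    neg = F.cosine_similarity(context.unsqueeze(1).expand_as(negatives),
                              negatives, dim=-1)                          # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature      # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long)                # true one at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 100, 256))
```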

Fig. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020


| year | conference | research organization | title | link | code |
|------|------------|-----------------------|-------|------|------|
| 2019 | - | Facebook AI Research (FAIR) | Effectiveness of self-supervised pre-training for speech recognition | paper | |
| 2019 | Interspeech | Facebook AI Research (FAIR) | wav2vec: Unsupervised Pre-training for Speech Recognition | paper | code(official, pytorch) |
| 2020 | ACL | Facebook AI Research (FAIR) | Unsupervised Cross-lingual Representation Learning at Scale | paper | code(official, pytorch) |
| 2020 | ICLR | Facebook AI Research (FAIR) | vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | paper | code(official, pytorch) |
| 2020 | NIPS | Facebook AI Research (FAIR) | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | paper | code(official, pytorch) |
| 2020 | - | Facebook AI Research (FAIR) | Unsupervised Cross-lingual Representation Learning for Speech Recognition | paper | code(official, pytorch) |
| 2020 | Interspeech | Facebook AI | Self-Supervised Representations Improve End-to-End Speech Translation | paper | |
| 2020 | ICASSP | Facebook AI Research (FAIR) | Unsupervised Pretraining Transfers Well Across Languages | paper | |
| 2019 | - | Universitat Politècnica de Catalunya | Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN | paper | |
| 2019 | Interspeech | Universitat Politècnica de Catalunya, MILA et al. | Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks | paper | code(official) |
| 2020 | ICASSP | MILA et al. | Multi-Task Self-Supervised Learning for Robust Speech Recognition | paper | code(official) |
| 2018 | - | Deepmind | Representation Learning with Contrastive Predictive Coding | paper | code(official, pytorch) |
| 2020 | - | DeepMind, University of Oxford | Learning robust and multilingual speech representations | paper | |
| 2020 | Interspeech | National Taiwan University | SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering | paper | |
| 2020 | - | MIT CSAIL | Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding | paper | |
| 2021 | - | MIT CSAIL | Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining | paper | code(official, pytorch) |
| 2021 | - | Facebook AI | Generative Spoken Language Modeling from Raw Audio | paper | |
| 2020 | ICASSP | University of Oxford, Naver | Disentangled Speech Embeddings using Cross-modal Self-supervision | paper | |



3. End-to-End Speech Recognition

I recommend reading Alex Graves' thesis: Supervised Sequence Labelling with Recurrent Neural Networks, 2008

3.1 CTC based ASR model

  • If you're new to CTC-based ASR models, you'd better read this before the papers: post for CTC from Distill blog

  • One of the most important issues in sequence generation tasks such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) is the alignment problem: mapping the input sequence onto the target sequence. In 2006, Connectionist Temporal Classification (CTC) was proposed by Alex Graves, now a researcher at DeepMind. The CTC loss is designed to deal with exactly this alignment problem, and it remains one of the most popular methods in ASR along with the Seq2Seq approach.

Taking speech recognition as an example: because the speech (input) and its transcript (target) have different sequence lengths, there is no way to know a priori which span of the input maps to which token (word or character).

The CTC loss, introduced in Alex Graves' 2006 paper, solves this by marginalizing over all possible alignments between the two sequences; in speech recognition it stands alongside the attention-based Seq2Seq approach of Section 3.2. A minimal loss call is sketched below.
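A minimal sketch of the CTC objective using PyTorch's built-in torch.nn.CTCLoss (all shapes and the vocabulary size are dummy values):

```python
import torch

# log_probs: (T, N, C) = (input frames, batch, classes incl. blank at index 0)
T, N, C, S = 50, 4, 30, 10
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)    # index 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC sums over every alignment of the target within the T input frames,
# so no frame-level labels are needed.
ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```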

Fig. Frame-level character probabilities emitted by the CTC layer


Fig. Towards End-to-End Speech Recognition with Recurrent Neural Networks, 2014


Fig. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2016


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2006 | ICML | University of Toronto | Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks | CTC | paper | code(pytorch), warp-ctc, code(pytorch) |
| 2014 | ICML | Deepmind | Towards End-to-End Speech Recognition with Recurrent Neural Networks | LSTM-based CTC model | paper | |
| 2014 | - | Baidu Research | Deep Speech: Scaling up end-to-end speech recognition | | paper | code(tensorflow), code(pytorch) |
| 2016 | ICML | Baidu Research | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | CNN-based CTC model | paper | code(pytorch) |
| 2017 | - | Baidu Research | (Deep Speech 3) Exploring Neural Transducers for End-to-End Speech Recognition | | paper | |
| 2016 | - | Facebook AI Research (FAIR) | Wav2Letter: an End-to-End ConvNet-based Speech Recognition System | CNN-based CTC model | paper | code(official pytorch, C++) |
| 2018 | - | Google | State-of-the-Art Speech Recognition with Sequence-to-Sequence Models | | paper | |
| 2019 | Interspeech | Nvidia | Jasper: An End-to-End Convolutional Neural Acoustic Model | CNN-based CTC model | paper | code(official), code(pytorch) |
| 2019 | - | Nvidia | QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions | | paper | |

3.2 Seq2Seq with Attention based ASR model

The attention-based Seq2Seq ASR network closely resembles the network of 'Neural Machine Translation by Jointly Learning to Align and Translate' (2014), the breakthrough paper in machine translation; like CTC, it solves the alignment problem of speech recognition in an elegant way.

Although it has the drawback of decoding auto-regressively, it is one of the strongest-performing end-to-end techniques.

The encoder and decoder of the Seq2Seq model are understood to take over the roles played by the acoustic model (AM) and the language model (LM) in older HMM-GMM and HMM-DNN systems; a minimal attention step is sketched below.
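A minimal dot-product attention step; this is a simplification rather than the exact LAS mechanism, and all names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_states):
    """One decoding step: the decoder 'soft-aligns' to encoder frames,
    so no hard input/output alignment is ever needed.
    decoder_state: (B, D); encoder_states: (B, T, D)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (B, T)
    weights = F.softmax(scores, dim=-1)                   # soft alignment over frames
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)         # (B, D)
    return context, weights

context, alignment = dot_product_attention(torch.randn(4, 256), torch.randn(4, 120, 256))
```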

Fig. Listen, Attend and Spell


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2015 | NIPS | University of Wrocław, Jacobs University Bremen, Université de Montréal et al. | Attention-Based Models for Speech Recognition | Seq2Seq with Attention | paper | code(pytorch), code2(pytorch) |
| 2015 | ICASSP | Google | Listen, Attend and Spell | Seq2Seq with Attention | paper | code(pytorch) |
| 2016 | ICASSP | Jacobs University Bremen, University of Wrocław, Université de Montréal, CIFAR Fellow | End-to-End Attention-based Large Vocabulary Speech Recognition | Seq2Seq with Attention | paper | |
| 2018 | ICASSP | - | Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition | Seq2Seq with Attention | paper | code(official), another ref code |
| 2019 | ASRU | - | A Comparative Study on Transformer vs RNN in Speech Applications | Seq2Seq with Attention | paper | |
| 2019 | - | Facebook | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | trained with either CTC or Seq2Seq loss | paper | |

3.3 CTC & Attention Hybrid Model

This architecture is trained jointly with both the CTC loss and the Seq2Seq loss; somewhat like an ensemble, the joint objective makes end-to-end ASR training converge better.

Typically the two losses are interpolated with weights summing to 1, and this ratio is scheduled (changed) as training progresses, as in the sketch below.
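A sketch of that interpolation, with an entirely hypothetical weight schedule (the 0.3/0.5 values and warmup length are illustrative, not taken from any paper below):

```python
# Joint CTC/attention objective: loss = lam * L_ctc + (1 - lam) * L_att, lam in [0, 1].
def hybrid_loss(ctc_loss, attention_loss, lam=0.3):
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# One possible schedule: start CTC-heavy (its monotonic alignments stabilize
# early training), then shift weight toward the attention decoder.
def lam_schedule(step, warmup_steps=10000, start=0.5, end=0.3):
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```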

Fig. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, 2017


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2017 | - | - | Hybrid CTC/Attention Architecture for End-to-End Speech Recognition | | paper | |
| 2017 | - | - | Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning | | paper | code(pytorch) |
| 2019 | - | - | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | paper | |

3.4 Neural Transducer(RNN-T) based ASR model

The Neural Transducer (RNN-T) was first introduced by Alex Graves in the paper 'Sequence Transduction with Recurrent Neural Networks'.

End-to-end ASR models up to that point were RNN-based models trained with the CTC or Seq2Seq loss, but they predict the sentence only after consuming the entire utterance and are therefore ill-suited to real-time (streaming) recognition; the Neural Transducer (RNN-T) was proposed to address exactly this.

The RNN can of course be replaced by the Transformer, which keeps setting new state-of-the-art results not only in NLP but lately in CV as well. A minimal RNN-T loss call is sketched below.
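A minimal sketch of the transducer loss using torchaudio.functional.rnnt_loss (assuming torchaudio ≥ 0.10; all tensor sizes are dummy values):

```python
import torch
import torchaudio

# Joint-network output over every (input frame, output position) pair:
# logits: (batch B, input frames T, target length U + 1, vocab V)
B, T, U, V = 2, 40, 8, 30
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)   # blank (0) never appears in targets
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
```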

Fig. Neural Transducer


Fig. Streaming E2E Speech Recognition For Mobile Devices, 2018


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2012 | ICML | University of Toronto | Sequence Transduction with Recurrent Neural Networks | | paper | |
| 2015 | NIPS | Google Brain, Deepmind, OpenAI | A Neural Transducer | | paper | |
| 2017 | ASRU | Google | Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer | | paper | |
| 2018 | ICASSP | Google | Streaming E2E Speech Recognition For Mobile Devices | | paper | code(tensorflow) |
| 2019 | ASRU | Microsoft | Improving RNN Transducer Modeling for End-to-End Speech Recognition | | paper | |
| 2019 | Interspeech | Chinese Academy of Sciences et al. | Self-Attention Transducers for End-to-End Speech Recognition | | paper | |
| 2020 | ICASSP | Google | Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss | | paper | code(pytorch) |
| 2020 | ICASSP | Google | A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency | | paper | |
| 2020 | Interspeech | Google | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | CNN-based RNN-T | paper | |
| 2020 | Interspeech | Google | Conformer: Convolution-augmented Transformer for Speech Recognition | | paper | code(pytorch), code2(pytorch) |
| 2021 | ICASSP | Google | FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization | | paper | |
| 2021 | ICASSP | Facebook AI | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | | paper | |

3.5 Streaming ASR

Since the RNN-T of Section 3.4 was designed precisely for streaming ASR, one could argue this subsection is redundant; but besides RNN-T there have been attempts to stream with attention-based Seq2Seq models alone, as well as models that combine Seq2Seq and RNN-T, so streaming ASR gets its own subsection.

Fig. Two-Pass End-to-End Speech Recognition, 2019


Fig. Streaming automatic speech recognition with the transformer model, 2020


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2018 | ICLR | Google Brain | Monotonic Chunkwise Attention | Seq2Seq with Attention | paper | |
| 2019 | Interspeech | Google | Two-Pass End-to-End Speech Recognition | LAS + RNN-T | paper | |
| 2019 | Interspeech | Samsung Research | End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System | | paper | |
| 2020 | ICASSP | MERL | Streaming automatic speech recognition with the transformer model | | paper | |
| 2020 | Interspeech | Google | Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition | | paper | |
| 2021 | ICLR | Google | Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling | | paper | |

3.6 ASR Rescoring / Spelling Correction

| year | conference | research organization | title | model | task | link | code |
|------|------------|-----------------------|-------|-------|------|------|------|
| 2019 | ICASSP | University of California, Los Angeles, Google | A Spelling Correction Model For E2E Speech Recognition | LAS based | | paper | |
| 2019 | ACML | Seoul National University (SNU) | Effective Sentence Scoring Method Using BERT for Speech Recognition | BERT based | ASR rescoring | paper | |
| 2020 | ICASSP | Moscow Institute of Physics and Technology, NVIDIA | Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model | Transformer based | | | |



4. End-to-End Spoken Language Understanding

Spoken Language Understanding (SLU) is the front-end of a speech dialogue system.

In the conventional SLU pipeline, an ASR network takes speech as input and outputs text, and a Natural Language Understanding (NLU) network then takes that text and extracts semantic information such as emotion or intent/slots.

This pipeline has a critical weakness: the sentence produced by the ASR network may contain errors, and in that case the NLU network cannot make sense of it and inevitably produces poor results.

End-to-End Spoken Language Understanding (E2E SLU) instead takes speech as input and outputs the result directly, extracting semantic information without being limited by the ASR network's error rate; it has recently become a very active research area. A minimal sketch of the idea follows.
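A minimal sketch of the E2E idea with an entirely hypothetical architecture (the layer sizes, mean pooling, and the 31-intent output are illustrative only, not any paper's model):

```python
import torch
import torch.nn as nn

class E2ESLU(nn.Module):
    """Map speech features straight to an intent label,
    with no intermediate ASR transcript at all."""
    def __init__(self, n_mels=80, hidden=256, n_intents=31):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_intents)

    def forward(self, features):            # features: (B, frames, n_mels)
        encoded, _ = self.encoder(features)
        pooled = encoded.mean(dim=1)        # average over time
        return self.classifier(pooled)      # intent logits; no text in between

intent_logits = E2ESLU()(torch.randn(4, 100, 80))
```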

Fig. Conventional pipeline for Spoken Language Understanding (ASR → NLU)


Fig. End-to-End Spoken Language Understanding network


Fig. Towards End-to-end Spoken Language Understanding, 2018


4.1 Datasets (including speech SLU datasets for IC/SF/SQA/SER)

SLU has been studied steadily for a long time, but E2E SLU has only recently become an active research area.
As a result, datasets whose input is speech rather than text (for example, speech–intent pairs) are scarce.
Since good public datasets are hard to find, the relevant datasets are listed first:
  • Intent Classification (IC) + (Named Entity Recognition (NER) or Slot Filling (SF))
  • Spoken Question Answering (SQA)
  • Speech Emotion Recognition (SER)
| task | dataset name | language | year | conference | title | paper link | dataset link |
|------|--------------|----------|------|------------|-------|------------|--------------|
| - | SLURP | English | 2020 | EMNLP | SLURP: A Spoken Language Understanding Resource Package | paper | dataset |
| IC | Fluent Speech Commands (FSC) | English | 2019 | Interspeech | Speech Model Pre-training for End-to-End Spoken Language Understanding | paper | dataset |
| IC | SNIPS | English | 2018 | - | Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | paper | dataset |
| IC | ATIS | English | 1990 | ACL | The ATIS spoken language systems pilot corpus | paper | dataset |
| IC | TOP / Facebook Semantic Parsing System (FSPS) | English | 2019 | - | Semantic Parsing for Task Oriented Dialog using Hierarchical Representations | paper | |
| SQA | Spoken SQuAD (SSQD) | English | 2018 | Interspeech | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | paper | dataset |
| SQA | Spoken CoQA | English | 2020 | - | Towards Data Distillation for End-to-end Spoken Conversational Question Answering | paper | dataset |
| SQA | ODSQA | Chinese | 20- | - | ODSQA: Open-domain spoken question answering dataset | - | - |
| SER | IEMOCAP | English | 2017 | - | IEMOCAP: Interactive emotional dyadic motion capture database | paper | dataset |
| SER | CMU-MOSEI | English | 2018 | - | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | paper | dataset |

4.2 Intent Classification (IC) + (Named Entity Recognition (NER) or Slot Filling (SF))

| year | conference | research organization | title | model | task | link | code |
|------|------------|-----------------------|-------|-------|------|------|------|
| 2018 | ICASSP | Facebook, MILA | Towards End-to-end Spoken Language Understanding | | IC only | paper | |
| 2019 | Interspeech | MILA, CIFAR | Speech Model Pre-training for End-to-End Spoken Language Understanding | | IC only | paper | code(official) |

4.3 Spoken Question Answering (SQA)

| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2018 | Interspeech | - | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | dataset | paper | github |

4.4 Speech Emotion Recognition (SER)

| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|



5. End-to-End Speech Synthesis

Fig. WaveNet: A Generative Model for Raw Audio, 2016


Fig. Tacotron: Towards End-to-End Speech Synthesis, 2017



| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2016 | - | Deepmind | WaveNet: A Generative Model for Raw Audio | | paper | code(tensorflow), code(pytorch) |
| 2018 | ICML | Deepmind | Parallel WaveNet: Fast High-Fidelity Speech Synthesis | | paper | |
| 2017 | ICLR | University of Montreal et al. | SampleRNN: An Unconditional End-to-End Neural Audio Generation Model | | paper | code(official) |
| 2017 | ICLR | Montreal Univ, CIFAR | Char2Wav: End-to-End Speech Synthesis | | paper | |
| 2017 | ICML | Baidu Research | Deep Voice: Real-time Neural Text-to-Speech | DeepVoice Series | paper | |
| 2017 | NIPS | Baidu Research | Deep Voice 2: Multi-Speaker Neural Text-to-Speech | DeepVoice Series | paper | |
| 2018 | ICLR | Baidu Research | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning | DeepVoice Series | paper | code |
| 2017 | Interspeech | Google | Tacotron: Towards End-to-End Speech Synthesis | Tacotron Series | paper | code(tensorflow), code(pytorch) |
| 2017 | NIPS | KAIST et al. | Emotional End-to-End Neural Speech Synthesizer | Tacotron Series | paper | |
| 2018 | ICML | Google | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | Tacotron Series | paper | code(tensorflow) |
| 2018 | ICML | Google | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | Tacotron Series | paper | |
| 2018 | ICASSP | Google | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2) | Tacotron Series | paper | |
| 2021 | ICLR | Google Research | Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling | Tacotron Series | paper | |
| 2019 | ICLR | UC San Diego | Adversarial Audio Synthesis | GAN | paper | code(official, tensorflow) |
| 2020 | ICASSP | LINE, NAVER | Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | GAN | paper | code(official) |
| 2019 | AAAI | University of Electronic Science and Technology of China et al. | Neural Speech Synthesis with Transformer Network | | paper | |
| 2019 | NIPS | Zhejiang University, Microsoft | FastSpeech: Fast, Robust and Controllable Text to Speech | | paper | code(pytorch) |
| 2021 | ICLR | Zhejiang University, Microsoft | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | | paper | |
| 2019 | ICASSP | Nvidia | WaveGlow: a Flow-based Generative Network for Speech Synthesis | Flow-based | paper | code(official, pytorch) |
| 2020 | NIPS | Kakao Enterprise, SNU | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | Flow-based | paper | |
| 2019 | ICLR | Baidu Research | ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech | | paper | |
| 2020 | ICML | Baidu Research | Non-Autoregressive Neural Text-to-Speech | | paper | |
| 2020 | ICASSP | Google | Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis | | paper | |


6. End-to-End Non-Autoregressive Sequence Generation Model

Techniques have recently been proposed to remove autoregressive decoding, one of the drawbacks of typical end-to-end speech recognition models.
Since there are still few papers on non-autoregressive speech recognition, this section covers machine translation (NMT), speech recognition (STT), and speech synthesis (TTS) together; a schematic comparison of the two decoding styles follows.
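For intuition, a schematic contrast between the two decoding styles (the DummyDecoder is entirely hypothetical and ignores `memory`; this is not any specific paper's algorithm):

```python
import torch

class DummyDecoder(torch.nn.Module):
    """Stand-in decoder returning per-position vocabulary logits."""
    def __init__(self, vocab=30, dim=16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.out = torch.nn.Linear(dim, vocab)

    def forward(self, tokens, memory):      # memory (encoder states) unused here
        return self.out(self.emb(tokens))   # (batch, length, vocab)

def autoregressive_decode(model, memory, max_len, bos=1):
    # One forward pass per emitted token: O(max_len) sequential model calls.
    tokens = [bos]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]), memory)
        tokens.append(int(logits[0, -1].argmax()))
    return tokens

def non_autoregressive_decode(model, memory, out_len):
    # All positions predicted in parallel from a length-out_len placeholder:
    # a single forward pass, at the cost of modeling output tokens independently.
    logits = model(torch.zeros(1, out_len, dtype=torch.long), memory)
    return logits.argmax(dim=-1)[0].tolist()

model = DummyDecoder()
print(autoregressive_decode(model, None, max_len=5))
print(non_autoregressive_decode(model, None, out_len=5))
```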

6.1 Non-Autoregressive(NA) NMT

Fig. Non-Autoregressive Neural Machine Translation, 2018


Fig. Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior, 2020


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2018 | ICLR | The University of Hong Kong | Non-Autoregressive Neural Machine Translation | | paper | code(fairseq) |
| 2018 | ACL | NYU | Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement | | paper | code(official), code(fairseq) |
| 2019 | NIPS | Facebook AI Research (FAIR) | Levenshtein Transformer | | paper | code(official, fairseq) |
| 2019 | ACL | Nanjing University et al. | Non-autoregressive Transformer by Position Learning | | paper | |
| 2019 | NIPS | CMU, Berkeley, Peking University | Fast Structured Decoding for Sequence Models | | paper | code(fairseq) |
| 2020 | ACL | Google | Non-Autoregressive Machine Translation with Latent Alignments | | paper | code |
| 2019 | EMNLP | CMU, Facebook AI | FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | | paper | code(official) |
| 2020 | ACL | Toyota Technological Institute at Chicago et al. | ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation | | paper | |
| 2020 | AAAI | University of Tokyo, FAIR, MILA, NYU | Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior | | paper | code(official, pytorch) |

6.2 Non-Autoregressive(NA) ASR (STT)

Fig. Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict, 2020


Fig. Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition, 2020


| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|
| 2020 | Interspeech | Johns Hopkins University et al. | Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict | CTC-based | paper | |
| 2020 | Interspeech | Chinese Academy of Sciences et al. | Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition | CTC-based | paper | |
| 2020 | ACL | Zhejiang University | A Study of Non-autoregressive Model for Sequence Generation | | paper | |

6.3 Non-Autoregressive(NA) Speech Synthesis (TTS)

| year | conference | research organization | title | model | link | code |
|------|------------|-----------------------|-------|-------|------|------|



7. Some Trivial Schemes for Speech Tasks

Fig. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, 2019


Fig. When does label smoothing help?, 2019


| year | conference | research organization | title | link | code |
|------|------------|-----------------------|-------|------|------|
| 2017 | ACL | Facebook AI Research (FAIR) | Bag of Tricks for Efficient Text Classification | paper | code(official) |
| 2017 | ICLR | Google Brain, University of Toronto | Regularizing Neural Networks by Penalizing Confident Output Distributions | paper | - |
| 2018 | ICLR | Google Brain | Don't decay the learning rate, increase the batch size | paper | code(pytorch) |
| 2019 | NIPS | Google Brain, University of Toronto | When does label smoothing help? | paper | code(pytorch) |
| 2019 | Interspeech | Google Brain | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | paper | code, code2 |
