This repository is for newcomers who want to study or research speech tasks (speech recognition, speech synthesis, and so on).
Rather than including as many papers as possible, the aim is to keep only important and recent papers (sufficiently highly cited, from reputable research organizations, and published at top conferences). (This selection may be subjective.)
(It has admittedly grown into a bit of a grab bag.)
- Don't Decay the Learning Rate, Increase the Batch Size paper
- When Does Label Smoothing Help? paper
- Bag of Tricks for Efficient Text Classification paper
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition paper
- State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions paper
1. End-to-End Speech Recognition papers
- CTC-based ASR papers
- Attention-based ASR papers
- Hybrid ASR papers
- RNN-T based ASR papers
- Streaming ASR papers
2. End-to-End Speech Synthesis papers
3. End-to-End Non-Autoregressive Sequence Generation papers
- ASR
- NMT
- TTS
4. End-to-End Spoken Language Understanding
- Intent Classification papers
- Spoken Question Answering papers
- Speech Emotion Recognition papers
5. Self-Supervised (or Semi-Supervised) Learning for Speech
TBC
- Voice Conversion
- Speaker Identification
- MIR (Music Information Retrieval)?
- Rescoring
- Speech Translation
- If you're new to CTC-based ASR models, it's worth reading this blog post before the papers: post for CTC from Distill blog
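A minimal sketch of how a CTC loss is typically wired up in PyTorch, in case alignment-free training is new to you (all shapes, the vocabulary size, and the random tensors are illustrative assumptions, not taken from any paper below):

```python
import torch
import torch.nn as nn

# Illustrative CTC training step (class 0 is reserved for the CTC blank token).
T, N, C = 50, 4, 28                            # time steps, batch size, vocab size
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, C, (N, 12))         # unaligned label sequences, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                # trains without frame-level alignments
```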
< Deep Speech 2: End-to-End Speech Recognition in English and Mandarin >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2006 | ICML | IDSIA | Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks | CTC | paper | |
2014 | - | Baidu Research | Deep Speech: Scaling up end-to-end speech recognition | | | |
2016 | ICML | Baidu Research | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | CTC-based CNN model | paper | code(pytorch) |
2019 | Interspeech | Nvidia | Jasper: An End-to-End Convolutional Neural Acoustic Model | | | |
2019 | - | Nvidia | QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | | | |
- If you're new to seq2seq models with attention, you may want to check the following first (a minimal attention sketch follows this note):
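As a warm-up, here is a minimal sketch of one content-based attention step inside a seq2seq decoder (plain dot-product scoring with made-up shapes; LAS and the papers below use learned scoring functions, but the data flow is the same):

```python
import torch
import torch.nn.functional as F

B, T_enc, H = 4, 100, 256                      # batch, encoder time steps, hidden size
encoder_states = torch.randn(B, T_enc, H)      # "listener" outputs
decoder_state = torch.randn(B, H)              # current "speller" state (the query)

# Score every encoder frame against the decoder state, then normalize.
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (B, T_enc)
attn_weights = F.softmax(scores, dim=-1)       # where to "listen" at this output step

# Weighted sum of encoder frames = context vector fed to the next prediction.
context = torch.bmm(attn_weights.unsqueeze(1), encoder_states).squeeze(1)    # (B, H)
```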
< Listen, Attend and Spell >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2008 | - | - | Supervised Sequence Labelling with Recurrent Neural Networks | | | |
2014 | ICML | - | Towards End-to-End Speech Recognition with Recurrent Neural Networks | | | |
2015 | NIPS | - | Attention-Based Models for Speech Recognition | Seq2Seq | | |
2015 | ICASSP | - | Listen, Attend and Spell | Seq2Seq | paper | code(pytorch) |
2016 | ICASSP | - | End-to-End Attention-based Large Vocabulary Speech Recognition | | | |
2018 | ICLR | - | Monotonic Chunkwise Attention | | | |
2018 | ICASSP | - | Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition | | | |
2019 | - | - | Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR | | | |
2019 | - | - | A Comparative Study on Transformer vs RNN in Speech Applications | | paper | |
2019 | - | - | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | | paper | |
2020 | Interspeech | Google | Conformer: Convolution-augmented Transformer for Speech Recognition | | paper | |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2019 | - | - | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | paper | |
< Streaming E2E Speech Recognition For Mobile Devices >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2012 | - | - | Sequence Transduction with Recurrent Neural Networks | RNN-T | | |
2019 | ICASSP | Google | Streaming E2E Speech Recognition For Mobile Devices | RNN-T | paper | |
2018 | - | - | Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer | | | |
2019 | - | - | Improving RNN Transducer Modeling for End-to-End Speech Recognition | | | |
2019 | - | - | Self-Attention Transducers for End-to-End Speech Recognition | | | |
2020 | ICASSP | - | Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss | | | |
2020 | ICASSP | - | A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency | | | |
2021 | ICASSP | - | FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization | | | |
2021 | ICASSP | - | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | | | |
2020 | Interspeech | Google | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | | paper | |
< Two-Pass End-to-End Speech Recognition >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2019 | - | - | Two-Pass End-to-End Speech Recognition | LAS+RNN-T | paper | |
Temporary: ASR rescoring and error-correction papers
- This list is taken from link
year | conference | research organization | title | model | task | link | code |
---|---|---|---|---|---|---|---|
2019 | - | - | Automatic Speech Recognition Errors Detection and Correction | | asr | | |
2019 | - | - | A Spelling Correction Model For E2E Speech Recognition | | asr | | |
2019 | - | - | An Empirical Study Of Efficient ASR Rescoring With Transformers | | asr | | |
2019 | - | - | Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition | | asr | | |
2019 | - | - | Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model | | asr | | |
2019 | - | - | Effective Sentence Scoring Method Using BERT for Speech Recognition | | asr | | |
2020 | - | - | Spelling Error Correction with Soft-Masked BERT | | nlp | | |
2019 | - | - | Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition | | asr | | |
< Tacotron: Towards End-to-End Speech Synthesis >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2016 | - | DeepMind | WaveNet: A Generative Model for Raw Audio | | paper | |
2017 | ICLR | - | SampleRNN: An Unconditional End-to-End Neural Audio Generation Model | | paper | code(official) |
2017 | ICLR | Montreal Univ, CIFAR | Char2Wav: End-to-End Speech Synthesis | | paper | |
2017 | ICML | Baidu Research | Deep Voice: Real-time Neural Text-to-Speech | | paper | |
2017 | NIPS | Baidu Research | Deep Voice 2: Multi-Speaker Neural Text-to-Speech | | paper | |
2017 | - | Baidu Research | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning | | paper | code |
2017 | Interspeech | Google | Tacotron: Towards End-to-End Speech Synthesis | | paper | code(tensorflow), code(pytorch) |
2017 | ICML | - | Emotional End-to-End Neural Speech Synthesizer | | | |
2018 | ICML | Google | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | | | |
2018 | ICML | Google | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | | | |
2021 | ICLR | Google Research | Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling | | paper | |
2018 | - | - | Adversarial Audio Synthesis | GAN | paper | code(official, tensorflow) |
2019 | ICASSP | Nvidia | WaveGlow: A Flow-based Generative Network for Speech Synthesis | | paper | code(official, pytorch) |
2019 | - | - | Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | | paper | |
2019 | NeurIPS | Microsoft Research, Zhejiang University | FastSpeech: Fast, Robust and Controllable Text to Speech | | paper | |
2020 | - | Microsoft Research, Zhejiang University | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | | paper | |
2020 | NeurIPS | Kakao Enterprise, SNU | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | | paper | |
2020 | ICASSP | - | Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow | | paper | |
2019 | AAAI | - | Neural Speech Synthesis with Transformer Network | | paper | |
2017 | - | DeepMind | Parallel WaveNet: Fast High-Fidelity Speech Synthesis | | | |
2020 | ICASSP | - | Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis | | | |
Because there are relatively few non-autoregressive papers, this section covers machine translation (NMT), speech recognition (STT), and speech synthesis (TTS) together.
< Non-Autoregressive Neural Machine Translation >
< Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | ICLR | The University of Hong Kong | Non-Autoregressive Neural Machine Translation | | | |
2020 | - | - | Non-Autoregressive Machine Translation with Latent Alignments | | | |
2019 | EMNLP | CMU | FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | | | |
2019 | NeurIPS | CMU, Berkeley, Peking University | Fast Structured Decoding for Sequence Models | | | |
2019 | ACL | - | Non-autoregressive Transformer by Position Learning | | | |
2020 | - | - | ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation | | | |
2019 | - | University of Tokyo, FAIR, MILA, NYU | Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior | | | |
< Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict >
< Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition >
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2020 | Interspeech | - | Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict | CTC-based | | |
2020 | Interspeech | - | Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition | CTC-based | | |
2020 | - | - | A Study of Non-autoregressive Model for Sequence Generation | | | |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2020 | - | Baidu Research | Non-Autoregressive Neural Text-to-Speech | | | |
In the conventional Spoken Language Understanding (SLU) setup, an ASR module takes speech as input and outputs text,
and a Natural Language Understanding (NLU) module then takes that text and outputs results such as emotion or intent/slot labels.
End-to-End Spoken Language Understanding (SLU) instead takes speech as input and outputs the result directly,
with the goal of training the model in a fully differentiable way, unconstrained by the error rate of the ASR network.
( Conventional Pipeline for Spoken Language Understanding ( ASR -> NLU ) )
( End-to-End Spoken Language Understanding Network )
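To make the contrast concrete, here is a toy PyTorch sketch of both setups (both models are random stand-ins invented for this illustration, not any paper's architecture):

```python
import torch
import torch.nn as nn

class ToyASR(nn.Module):
    """Speech features -> discrete tokens (stands in for a full ASR system)."""
    def __init__(self, feat_dim=80, vocab=100):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h).argmax(-1)          # argmax: the gradient stops here

class ToyE2ESLU(nn.Module):
    """Speech features -> intent logits directly, one differentiable path."""
    def __init__(self, feat_dim=80, n_intents=10):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.cls = nn.Linear(128, n_intents)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.cls(h.mean(dim=1))         # utterance-level intent prediction

speech = torch.randn(2, 300, 80)               # (batch, frames, log-mel features)
tokens = ToyASR()(speech)                      # pipeline step 1: speech -> "text";
                                               # a separate NLU model would consume
                                               # these tokens, inheriting ASR errors
intent_logits = ToyE2ESLU()(speech)            # end-to-end: trainable with one loss
```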
< Towards End-to-end Spoken Language Understanding >
- Intent Classification (IC)
- Spoken Question Answering (SQA)
- Emotion Recognition (ER)
task | dataset name | language | year | conference | title | paper link | dataset link |
---|---|---|---|---|---|---|---|
- | SLURP | English | 2020 | EMNLP | SLURP: A Spoken Language Understanding Resource Package | paper | dataset |
IC | Fluent Speech Commands (FSC) | English | 2019 | Interspeech | Speech Model Pre-training for End-to-End Spoken Language Understanding | paper | dataset |
IC | SNIPS | English | 2018 | - | Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | paper | |
IC | ATIS | English | 1990 | - | The ATIS Spoken Language Systems Pilot Corpus | paper | |
IC | TOP or Facebook Semantic Parsing System (FSPS) | English | 2018 | EMNLP | Semantic Parsing for Task Oriented Dialog using Hierarchical Representations | paper | |
SQA | Spoken SQuAD (SSQD) | English | 2018 | Interspeech | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | paper | dataset |
SQA | Spoken CoQA | English | 2020 | - | Towards Data Distillation for End-to-end Spoken Conversational Question Answering | paper | dataset |
SQA | ODSQA | Chinese | 2018 | SLT | ODSQA: Open-Domain Spoken Question Answering Dataset | - | - |
ER | IEMOCAP | English | 2008 | - | IEMOCAP: Interactive emotional dyadic motion capture database | paper | dataset |
ER | CMU-MOSEI | English | 2018 | ACL | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | paper | dataset |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | ICASSP | Facebook, MILA | Towards End-to-end Spoken Language Understanding | | paper | |
2019 | Interspeech | MILA, CIFAR | Speech Model Pre-training for End-to-End Spoken Language Understanding | | paper | code(official) |
year | conference | research organization | title | model | link | code |
---|---|---|---|---|---|---|
2018 | Interspeech | - | Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension | dataset | paper | github |
Self-supervised (or semi-supervised) learning is, as of 2020, one of the hottest topics in deep learning, as Yann LeCun has emphasized;
the idea is to learn better representations of the input by training on vast amounts of unlabeled data in a self-supervised (or semi-supervised) way.
The pre-trained network is then fine-tuned task-specifically for downstream tasks such as speech recognition.
Pre-training methods have long existed in many forms, from autoencoders to BERT, but methods tailored to speech have been proposed only recently,
and networks trained this way achieve markedly better performance than networks trained from scratch.
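As a concrete example of the pre-train/fine-tune recipe, here is a minimal inference sketch with the Hugging Face `transformers` port of wav2vec 2.0 (this assumes the `transformers` library and the public `facebook/wav2vec2-base-960h` checkpoint; the random-noise input is just a placeholder, and actual fine-tuning would further train `Wav2Vec2ForCTC` with a CTC loss on labeled speech):

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load a network pre-trained on unlabeled speech and already fine-tuned for ASR.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.randn(16000 * 3)               # placeholder: 3 s of 16 kHz "audio"
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, frames, vocab)
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids))         # greedy CTC decoding to text
```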
< wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations >
year | conference | research organization | title | link | code |
---|---|---|---|---|---|
2019 | - | Facebook AI Research (FAIR) | wav2vec: Unsupervised Pre-training for Speech Recognition | paper | code(official) |
2019 | - | FAIR | Unsupervised Cross-lingual Representation Learning at Scale | ||
2019 | ICLR | FAIR | vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | paper | code(official) |
2020 | - | FAIR | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | paper | code(official) |
2020 | - | FAIR | Unsupervised Cross-lingual Representation Learning for Speech Recognition | paper | |
2019 | - | DeepMind | Learning robust and multilingual speech representations | paper | |
- | - | - | SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering | paper | |
- | - | - | Self-Supervised Representations Improve End-to-End Speech Translation | paper | |
- | - | - | Unsupervised Pretraining Transfers Well Across Languages | | |
- | - | - | Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks | | |
- | - | - | Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN | | |
2020 | - | MIT CSAIL | Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding | paper | |