This project implements three different end-to-end Automatic Speech Recognition (ASR) models in PyTorch, based on the following architectures:

- CTC: an encoder trained with the Connectionist Temporal Classification objective (Graves et al.)
- LAS: Listen, Attend and Spell, an attention-based encoder-decoder (Chan et al.)
- LAS-CTC: LAS jointly trained with a CTC objective (Kim et al.)
The models were trained and tested on a subset of the HarperValleyBank dataset, which is hosted here. The dataset is used to train models that predict the transcript one character at a time.
- Uses Librosa to extract log-mel spectrograms from WAV audio (see the feature-extraction sketch below)
- Character-level encoding of transcripts (see the encoding sketch below)
- Multiple implementations of ASR model architectures, including attention-based models
- Regularization of attention-based networks to respect CTC alignments (LAS-CTC)
- Utilizes the PyTorch Lightning Trainer API
- Training logs and visualizations with Weights & Biases (wandb)
- Teacher forcing during decoder training (see the sketch below)
- Greedy decoding (see the sketch below)
- Imposes a CTC objective on the decoder (see the joint-loss sketch below)
- CTC collapsing rules for decoding: merge repeated symbols, then remove blanks (see the greedy-decoding sketch below)
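
A minimal sketch of the feature-extraction step, assuming a 16 kHz sample rate, 80 mel bins, and a 10 ms hop; these are illustrative defaults, not necessarily the project's settings:

```python
import librosa
import numpy as np

def wav_to_log_mel(path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Load a WAV file and return a (time, n_mels) log-mel spectrogram."""
    # Resample to a fixed rate so all utterances share one feature space.
    waveform, _ = librosa.load(path, sr=sr)

    # Mel-scaled power spectrogram; window and hop sizes are illustrative.
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels
    )

    # Log-compress (convert power to decibels) before feeding the model.
    log_mel = librosa.power_to_db(mel, ref=np.max)

    return log_mel.T  # transpose to (time, n_mels) for sequence models
```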
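
Character encoding can be as simple as a lookup table over the transcript alphabet. The vocabulary and special-token indices below are hypothetical; the project may reserve different slots for the CTC blank and start/end tokens:

```python
# Hypothetical vocabulary; the real project may use different special tokens.
VOCAB = ["<blank>", "<sos>", "<eos>", " "] + list("abcdefghijklmnopqrstuvwxyz'")
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}

def encode(text: str) -> list[int]:
    """Map a transcript to character indices, dropping out-of-vocab symbols."""
    return [CHAR_TO_IDX[ch] for ch in text.lower() if ch in CHAR_TO_IDX]

def decode(indices: list[int]) -> str:
    """Map indices back to text, skipping the special tokens."""
    return "".join(IDX_TO_CHAR[i] for i in indices if i >= 3)
```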
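
Teacher forcing feeds the ground-truth previous character into the decoder at each step instead of the model's own prediction. A sketch, where `decoder_step` is a hypothetical stand-in for one step of the attention decoder:

```python
import torch

def unroll_with_teacher_forcing(decoder_step, targets, sos_idx: int = 1):
    """Unroll a decoder over gold targets of shape (batch, target_len)."""
    batch_size, target_len = targets.shape
    prev = torch.full((batch_size,), sos_idx, dtype=torch.long)
    logits = []
    for t in range(target_len):
        step_logits = decoder_step(prev)       # (batch, vocab)
        logits.append(step_logits)
        prev = targets[:, t]  # gold character, not step_logits.argmax(-1)
    return torch.stack(logits, dim=1)          # (batch, target_len, vocab)
```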
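
Greedy decoding with the CTC collapsing rules works frame by frame: take the argmax at every frame, merge consecutive repeats, then drop blanks. A sketch, assuming the blank symbol sits at index 0:

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """Greedy CTC decoding for one utterance.

    log_probs: (time, vocab) frame-level log-probabilities.
    Returns the collapsed label sequence.
    """
    # Rule 0: pick the most likely symbol at every frame.
    best = log_probs.argmax(dim=-1).tolist()

    decoded, prev = [], None
    for idx in best:
        # Rule 1: merge consecutive repeated symbols.
        if idx != prev:
            # Rule 2: remove blank symbols.
            if idx != blank:
                decoded.append(idx)
        prev = idx
    return decoded
```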
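
The LAS-CTC regularization follows Kim et al.'s multi-task recipe: interpolate the attention decoder's cross-entropy with a CTC loss computed on the encoder outputs. A sketch using PyTorch's built-in losses; the interpolation weight `lam=0.2` and the tensor shapes are illustrative assumptions:

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(
    ctc_log_probs,      # (time, batch, vocab) encoder log-probs for CTC
    decoder_logits,     # (batch, target_len, vocab) attention decoder outputs
    targets,            # (batch, target_len) gold character indices
    input_lengths,      # (batch,) encoder frame counts
    target_lengths,     # (batch,) transcript lengths
    lam: float = 0.2,   # interpolation weight; illustrative value
    blank: int = 0,
):
    """L = lam * L_CTC + (1 - lam) * L_attention (Kim et al., multi-task)."""
    ctc = F.ctc_loss(
        ctc_log_probs, targets, input_lengths, target_lengths, blank=blank
    )
    ce = F.cross_entropy(
        decoder_logits.transpose(1, 2),  # (batch, vocab, target_len)
        targets,
    )
    return lam * ctc + (1 - lam) * ce
```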
- Download and unzip the dataset:

  ```bash
  unzip harper_valley_bank_minified.zip -d data
  ```
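
Once the data is unpacked, launching a run with the Lightning Trainer and a wandb logger might look roughly like this. `TinyASR` is a toy stand-in for the project's actual LightningModule, and the hyperparameters are illustrative:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class TinyASR(pl.LightningModule):
    """Toy stand-in for the project's LightningModule."""

    def __init__(self, n_mels: int = 80, vocab_size: int = 31):
        super().__init__()
        self.rnn = torch.nn.LSTM(n_mels, 256, batch_first=True)
        self.head = torch.nn.Linear(256, vocab_size)

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        out, _ = self.rnn(feats)
        return self.head(out)                  # (batch, time, vocab)

    def training_step(self, batch, batch_idx):
        feats, targets = batch                 # targets: (batch, target_len)
        log_probs = self(feats).log_softmax(-1).transpose(0, 1)
        loss = torch.nn.functional.ctc_loss(
            log_probs, targets,
            input_lengths=torch.full((feats.size(0),), log_probs.size(0)),
            target_lengths=torch.full((targets.size(0),), targets.size(1)),
        )
        self.log("train_loss", loss)           # picked up by the wandb logger
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

trainer = pl.Trainer(max_epochs=50, logger=WandbLogger(project="asr"))
# trainer.fit(TinyASR(), train_dataloaders=...)  # DataLoader omitted here
```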
Model run report obtained from Wandb
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, A Graves et al.
- Listen, Attend and Spell, W Chan et al.
- Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, S Kim et al.
- CS224S: Spoken Language Processing
This README provides an overview of the project, highlighting its main features, the technology stack, and usage instructions. For more detailed documentation, please refer to the project files and comments within the code.