This project is a complete Text-to-Speech (TTS) pipeline built from scratch in PyTorch. It implements the Tacotron 2 model for generating mel spectrograms from text, paired with a pre-trained HiFi-GAN vocoder to synthesize the final audio waveform.
- End-to-End Pipeline: From raw text and audio to a synthesized waveform.
- Tacotron 2 Architecture: A faithful implementation of the spectrogram prediction network described in the original paper.
- Phoneme-Based Input: Uses a Grapheme-to-Phoneme (G2P) converter for robust handling of English pronunciation.
- High-Fidelity Vocoder: Integrates a pre-trained HiFi-GAN vocoder for fast and high-quality waveform synthesis.
- Universal Training Script: The training script is hardware-agnostic and automatically uses the best available device: NVIDIA GPU (`cuda`), Apple Silicon GPU (`mps`), or `cpu` (see the sketch after this list).
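The device selection is standard PyTorch; a minimal sketch of the pattern (train.py's actual code may differ slightly):

```python
import torch

# Prefer an NVIDIA GPU, then Apple's Metal (MPS) backend, then fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```

The main dependencies are: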
- Python 3
- PyTorch
- librosa (for audio processing)
- g2p-en & NLTK (for phonemization; see the snippet below)
- scipy (for saving audio files)
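For reference, phonemization with g2p-en (which is what the NLTK downloads in the setup steps support) is a one-liner; this standalone example uses only the published G2p API:

```python
from g2p_en import G2p

# Convert raw text to ARPAbet phonemes; words are separated by ' ' tokens.
g2p = G2p()
print(g2p("Hello world"))
# e.g. ['HH', 'AH0', 'L', 'OW1', ' ', 'W', 'ER1', 'L', 'D']
```

The repository is laid out as follows: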
```
multi-speaker-tts/
├── data/                 # For storing datasets like LibriSpeech
├── generated_audio/      # Default output directory for generated samples
├── notebooks/            # Jupyter notebooks for exploration
├── src/                  # Main source code for the project
│   ├── audio.py
│   ├── config.py
│   ├── data_utils.py
│   ├── model.py
│   └── text.py
├── trained_models/       # For storing trained model checkpoints
├── .gitignore
├── inference.py          # Script for generating audio from a trained model
├── prepare_metadata.py   # Script to parse the raw dataset
├── requirements.txt      # Project dependencies
└── train.py              # The main training script
```
To set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/Angad8285/pytorch-tacotron2-implementation.git
  cd pytorch-tacotron2-implementation
  ```
- Create and activate a Python virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download NLTK data: the phonemizer needs two NLTK data packages. Create a temporary Python script named `download_data.py` with the following content and run it once:

  ```python
  import nltk

  # Pronunciation dictionary and POS tagger used by g2p-en.
  nltk.download('cmudict')
  nltk.download('averaged_perceptron_tagger')
  ```

  ```bash
  python download_data.py
  ```
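If you would rather not create a temporary file, the same downloads should also work as a shell one-liner (assuming a POSIX-style shell handles the quoting):

```bash
python -c "import nltk; nltk.download('cmudict'); nltk.download('averaged_perceptron_tagger')"
```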
First, download a dataset such as LibriSpeech or LJSpeech and place it in the `data/` directory. Then run the metadata script to create the training manifest:

```bash
python prepare_metadata.py path/to/your/dataset_folder
```
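The exact manifest layout is defined by `prepare_metadata.py`; as a purely illustrative sketch, assuming a pipe-separated file with one audio path and transcript per row, it could be inspected like this:

```python
import csv

# Hypothetical layout: "<audio_path>|<transcript>" per line. Check
# prepare_metadata.py / src/data_utils.py for the actual columns and delimiter.
with open("data/metadata.csv", newline="", encoding="utf-8") as f:
    for audio_path, transcript in csv.reader(f, delimiter="|"):
        print(audio_path, "->", transcript[:40])
```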
To train the model, run the `train.py` script, providing the path to your metadata file and a directory for saving checkpoints:

```bash
python train.py data/metadata.csv trained_models/ --epochs 100 --batch_size 16
```
To generate audio from a trained model, use the `inference.py` script, providing the text to synthesize and the path to a saved checkpoint:

```bash
python inference.py "Hello world, this is a test." --checkpoint trained_models/tacotron2_epoch_100.pth --output_dir generated_audio
```

The output file will be saved as `output_1.wav` in the `generated_audio` folder; subsequent runs will create `output_2.wav`, and so on.
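For programmatic use, the two-stage pipeline behind `inference.py` can be approximated in a short script. Treat the following as a hedged sketch only: the names `Tacotron2`, `text_to_sequence`, `model.inference`, the checkpoint keys, and the `hifigan.pt` artifact are all assumptions about this repository, not its confirmed API.

```python
import torch
from scipy.io.wavfile import write

from src.model import Tacotron2        # assumed class name in src/model.py
from src.text import text_to_sequence  # assumed helper in src/text.py

# Mirror the training script: prefer CUDA, then MPS, then CPU.
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

# Stage 1: text -> mel spectrogram with the trained Tacotron 2.
model = Tacotron2().to(device).eval()
ckpt = torch.load("trained_models/tacotron2_epoch_100.pth", map_location=device)
model.load_state_dict(ckpt["model_state_dict"])      # assumed checkpoint layout

sequence = torch.LongTensor(text_to_sequence("Hello world, this is a test."))
with torch.no_grad():
    mel = model.inference(sequence.unsqueeze(0).to(device))  # assumed method name

# Stage 2: mel -> waveform with a pre-trained HiFi-GAN vocoder. How the
# vocoder is loaded depends on the release you use; a TorchScript export
# that maps mel -> audio is shown here as a placeholder.
vocoder = torch.jit.load("hifigan.pt", map_location=device).eval()
with torch.no_grad():
    audio = vocoder(mel).squeeze().cpu().numpy()

write("generated_audio/output.wav", 22050, audio)  # 22.05 kHz is a common TTS rate
```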