Scripts download and prepare datasets using the two largest speech corpora for Catalan: Common Voice v8.0 and ParlamentParla. Training scripts are based on the official Common Voice recipe. A phonetic dictionary derived from Alpha Cephei's VOSK model is provided in the dict/ca directory. The text corpus for training the language model is derived from the training and development transcripts, plus an additional clean text corpus derived from OpenSubtitles (corpus/CA_OpenSubtitles_clean.txt). Evaluation is performed on the test sets of both corpora.
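For reference, a Kaldi pronunciation dictionary maps each word to a space-separated sequence of phones, one entry per line. A minimal sketch of the format (lexicon.txt is the conventional Kaldi file name, and the entries below are illustrative, not copied from the actual dictionary):
head -2 dict/ca/lexicon.txt
# hola  o l a
# casa  k a z a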
You need to first install Kaldi and SRILM. We provide the instructions below; however, make sure to also follow the official guidelines in the Kaldi repository:
git clone https://github.com/kaldi-asr/kaldi.git
cd kaldi/tools
extras/check_dependencies.sh
make -j 4
cd ../src
./configure --shared
make -j clean depend
make -j 8
cd ..
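If you want to verify the installation before moving on, the tiny yesno recipe shipped with Kaldi serves as a quick smoke test; it downloads a few megabytes of audio and trains a toy model on CPU:
cd egs/yesno/s5
./run.sh
cd ../../..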
To install SRILM, run the installer from the tools directory of your Kaldi installation:
cd tools
./install_srilm.sh <name> <company> <email>
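The installer unpacks SRILM under tools/srilm and appends its paths to tools/env.sh, so one quick sanity check (ngram-count is a standard SRILM binary) is:
source ./env.sh
which ngram-count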
Then you should clone this repository under the egs directory of Kaldi:
cd egs
git clone https://github.com/CollectivaT-dev/kaldi-cat.git
cd kaldi-cat
Finally, make sure you have Python 3 and install the required modules:
pip install tqdm pandas
We provide a Docker setup that takes care of all installations.
git clone https://github.com/CollectivaT-dev/kaldi-cat.git
cd kaldi-cat/docker
docker build -t kaldidock kaldi
Once the image has been built, all you have to do is attach interactively to its bash terminal via the following command:
docker run -it -v <path-to-repo>:/opt/kaldi/egs/kaldi-cat \
-v <path-to-corpus-base-directory>:/mnt \
--gpus all --name <container-name> <built-docker-name> bash
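For example, assuming the repository was cloned to $HOME/kaldi-cat and the corpora should live under $HOME/corpora (both paths and the container name are placeholders):
docker run -it -v $HOME/kaldi-cat:/opt/kaldi/egs/kaldi-cat \
    -v $HOME/corpora:/mnt \
    --gpus all --name kaldi-cat-train kaldidock bash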
After this step you should be inside the Docker container's bash terminal, ready to start training.
All training scripts are inside the s5 directory:
cd <kaldi-dir>/egs/kaldi-cat/s5
If you're using GPUs (and you should), make sure to make them visible:
export CUDA_VISIBLE_DEVICES=0,1
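You can confirm that the GPUs are actually visible inside the container with:
nvidia-smi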
To start training, all you need to do is call run.sh, specifying a directory where the corpora will be downloaded:
bash run.sh --corpusbase <corpus-base-directory> #if running from docker <corpus-base-directory> is "/mnt"
To train toy models and check that the whole process runs smoothly, you can use the subset option. This will prepare a training dataset using only a specified number of samples:
bash run.sh --corpusbase <corpus-base-directory> --subset 1000
Evaluation is performed separately on the test portions of the two corpora. run.sh will print out WER scores at the end.
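If you want to inspect the scores again after the run finishes, the usual Kaldi idiom works here as well, assuming the s5 directory has the standard utils/ symlink and the exp/ layout of the Common Voice recipe:
for x in exp/*/decode*; do
  [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh
done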
To be published soon...
- Kaldi installation and mini-librispeech test run
- Common Voice dataset download scripts
- Dockerfile
- Text corpus from Common Voice
- Build language model
- Fake phonetic dictionary
- Audio data to Kaldi format
- Proper phonetic dictionary
- ParlamentParla download scripts
- ParlamentParla to Kaldi format
- Combined train/dev/test1/test2
- run.sh test training/evaluation
- Extend LM corpus
- G2P model
- Documentation
- Docker test
- Extend phonetic dictionary?