TWB neural machine translation pipeline
This repository contains the necessary scripts for training and evaluating a neural machine translation system from scratch. It serves both as a template and a memory of experiments for various language pairs within TWB's Gamayun initiative.
Master branch is kept as a template for training a toy model of Tigrinya to English. Make sure it works before building your own pipeline.
Development for each language pair is kept in branches.
- English -> Tigrinya
- OpenNMT-py
- Moses
- subword-nmt
- mt-tools
- pandas
Three main directories used in the pipeline are as follows:
- scripts contains the scripts for preparing the data, training models and evaluation
- corpora (created within scripts) is where datasets containing parallel sentences are downloaded and then processed (created within scripts)
- onmt (created within scripts) contains the models, logs, Open-NMT specific datasets, inference results and evaluation scores
The data processing is made visible with the use of suffixes. Each dataset obtains a suffix after a certain pipeline call. The naming convention while processing is as follows:
/corpora/<corpus-name>/<dataset>.<set>.<process1>.<process2>...<processX>.<lang-code>
For example:
/corpora/OPUS/Tatoeba.train.tok.low.en contains the English sentences in the training portion of the Tatoeba set downloaded from OPUS, tokenized and then lowercased.
Each script consists of three sections:
INITIALIZATIONS: Sets the paths for later use. This needs not be edited if the directory structure is kept as it isPROCEDURES: Contains the necessary functions of the pipeline stepCALLS: This is where the pipeline's procedure is called. Each call consists of a set of parameters being specified followed with a procedure call
The scripts to run are sorted under scripts directory with a number prefix. Scripts don't take any parameters themselves but need to be edited inside. To run a particular script just type:
bash #-<script-name>.sh
To be used for downloading parallel datasets. Download function is set to work with OPUS links by default. Parameters to specify for each call:
LINK: URL to corpusCORPUS: Subdirectory where dataset is kept undercorpora
This will place the parallel files in form of <corpus-name>.<lang1>, <corpus-name>.<lang2> under a directory with the name of the corpus under corpora. If you skip this step make sure you follow a similar pattern of filepath and naming.
Calls various text normalization scripts. For each call specify:
CORPUS: Subdirectory where dataset is keptC: Corpus nameLANGS: Language extensions of the parallel files (lang1,lang2)
Cleans empty and dirty samples, allocates a development set and also excludes test samples from a given dataset. Actual processing is done in data_prep.py. For each call specify:
CORPUS: Subdirectory where dataset is keptC: Corpus nameSRC: Source language codeTGT: Target language codeSUFFIX: Complete process suffix that the dataset to be processed hasEXCLUDESET: Set of sentences that need to be taken out from the train/dev portionEXCLUDEFROM: side of the exclude set assrcortgt
Size of development set is specified in data-prep.py.
Uses moses tokenizer or any other tokenization script for punctuation tokenization. Set MOSESDIR to where moses decoder is kept. If a special tokenizer is to be used then it's specified by SRCTOKENIZER or TGTTOKENIZER.
For each dataset to be tokenized specify the following:
CORPUS: subdirectory where dataset is keptC: Corpus nameSETS: Subsets (train/test/dev) (If this is used thentokenize_setprocedure needs to be called)SUFFIX: Complete process suffix that the dataset to be processed has.SRC: Source language codeTGT: Target language code
Trains byte-pair-encoding (BPE) tokens from a given set and stores under onmt/bpe. Uses subword-nmt library. This should be done on one big set using the parameters:
CORPUS: Subdirectory where dataset is keptC: Corpus nameSRC: Source language code for BPE trainingTGT: Target language code for BPE trainingBPEID: ID of the BPE modelOPS: Number of BPE operations.
Applies BPE tokenization to the datasets. This needs to be done to all train/dev/test sets.
Parameters related to BPE model:
BPEID: ID of the BPE model with #operationsBPESRC: Source language code used for BPE trainingBPETGT: Target language code used for BPE training
For datasets to be bpe-ized:
CORPUS: Subdirectory where dataset is keptC: Corpus nameSUFFIX: Complete process suffix that the dataset to be processed has.SETS: Subsets (train/test/dev) (If this is used thenapply_bpe_to_setsprocedure needs to be called)SRC: Source language codeTGT: Target language code
This converts the prepared training datasets into form that OpenNMT takes in. For each dataset both training and a development set needs to be specificed.
CORPUSTRAIN: Path to training set without language suffixCORPUSDEV: Path to development set without language suffixSRCTRAIN: Source language code for training setTGTTRAIN: Target language code for training setSRCDEV: Source language code for development setTGTDEV: Target language code for development setDATASET: Name for the training dataset
There are three training scripts: 7a-train-multilingual.sh, 7b-train-unilingual.sh, 7c-train-indomain.sh. This is for three stage training in low-resource settings.
MODELPREFIX: A label for the modelMODELID: An ID for the modelDATASET: Preprocessed training dataset nameBPEID: ID of the BPE model with #operations
Scripts 7b and 7c continues training on the best scoring model from the previous step. These additional parameters need to be set:
BASEMODELTYPE: Type of the base model (multilingual, unilingual or indomain)BASEMODELID: ID of the base model
Training parameters are specified in the do_train procedure while calling OpenNMT-py's training script.
Once training is complete, best scoring model from each step can be found under: onmt/models/<model-prefix>-<model-type>-<model-id>/<model-prefix>-<model-type>-<model-id>_best.pt
This step is for translating test sets using the trained models. For each call specify the following parameters:
MODELPREFIX: Label of the model to use for inferenceMODELTYPE: Type of the model to use for inferenceMODELID: ID of the model to use for inferenceBPEID: ID of the BPE model with #operationsCORPUSTEST: Path to testing set without the language suffixSRC: Source language codeTGT: Target language code
Inference results (direct and un-bpe'd) are saved under the path onmt/test.
This final steps calculates various MT evaluation metrics on the inferred translations. Evaluation scripts are accessed from mt-tools.
For each call specify the following:
MODELPREFIX: Label of the model to use for inferenceMODELTYPE: Type of the model to use for inferenceMODELID: ID of the model to use for inferenceBPEID: ID of the BPE model usedCORPUSTEST: Path to testing set without the language suffixSRC: Source language codeTGT: Target language code
Evaluation results are printed and also stored under the path onmt/eval.