The pre-training code of CPT is based on Megatron-LM.
For setup and data processing, refer to the README of Megatron-LM. In addition, the package `jieba_fast` is required for Whole Word Masking pre-training.
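To illustrate what Whole Word Masking means here, below is a minimal, self-contained sketch. In the real pipeline `jieba_fast` segments the raw text into words; to keep the sketch dependency-free, the example words are pre-segmented by hand, and the function `whole_word_mask` is hypothetical, not part of the repo.

```python
import random

def whole_word_mask(words, mask_token="[MASK]", ratio=0.15, rng=None):
    """Mask whole words: if a word is selected, every one of its
    characters is replaced by the mask token, so the model must
    recover the full word rather than a single character."""
    rng = rng or random.Random(0)
    out = []
    for word in words:
        if rng.random() < ratio:
            # One mask per character keeps the sequence length unchanged.
            out.extend([mask_token] * len(word))
        else:
            out.extend(list(word))
    return out

# Hand-segmented example sentence: "自然 / 语言 / 处理 / 很 / 有趣"
words = ["自然", "语言", "处理", "很", "有趣"]
tokens = whole_word_mask(words, rng=random.Random(42))
```

The key property is that masking decisions are made per word, while the output stays at character granularity, matching a character-level Chinese tokenizer.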
First, prepare files in the following folders:

- `dataset/`: Place the `.bin` and `.idx` files preprocessed from raw text.
- `vocab/`: Place the vocab files and the model config file.
- `roberta_zh/`: Place the checkpoint of Chinese RoBERTa, as CPT initializes its encoder from this checkpoint.
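The folder layout above can be created with a few commands (run from the pre-training directory; the checkpoint, vocab, and preprocessed data files still have to be copied in afterwards):

```shell
# Create the expected folder layout.
mkdir -p dataset vocab roberta_zh

# dataset/    <- preprocessed .bin / .idx files
# vocab/      <- vocab files and model config file
# roberta_zh/ <- Chinese RoBERTa checkpoint used to initialize the encoder
```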
Then, use the scripts `run_pretrain_bart.sh` and `run_pretrain_cpt.sh` to train Chinese BART and CPT, respectively.
**NOTE:** the training scripts are distributed examples for 8 GPUs. You may need to alter the number of GPUs and adjust the training steps to fit your setup.
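When scaling down, the settings to touch usually look like the fragment below. These variable names are illustrative, in the style of Megatron launch scripts; the actual names inside `run_pretrain_bart.sh` / `run_pretrain_cpt.sh` may differ.

```shell
# Illustrative distributed settings (names are assumptions, check the scripts).
GPUS_PER_NODE=8        # reduce this when fewer GPUs are available
NNODES=1               # number of machines
TRAIN_ITERS=100000     # scale training steps to your data and GPU budget
```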
- Add `bart_model` and `cpt_model` for Megatron under `megatron/model`, so that Megatron can train BART and CPT.
- Add `_HfBertTokenizer` in `megatron/tokenizer/tokenizer.py`, so that Megatron can use tokenizers from Huggingface-Transformers.
- Add `bart_dataset` and `cpt_dataset` under `megatron/data` to produce data for Whole Word Masking (WWM) and Denoising Auto-Encoder (DAE) pre-training.
- Add `tools/convert_ckpt.py` to convert Megatron checkpoints to Huggingface-Transformers format.
- Add `tools/preprocess_data.py` to preprocess and chunk large amounts of text data into the binary format used by Megatron.
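Megatron-style preprocessing scripts such as `tools/preprocess_data.py` typically consume loose JSON, one document per line, with the raw text under a key such as `"text"`. The sketch below writes a corpus in that format; the file name and the `"text"` key are assumptions to be checked against the script's arguments.

```python
import json
import os
import tempfile

# Two toy documents; in practice these would be the raw pre-training corpus.
docs = ["第一篇文档的正文。", "第二篇文档的正文。"]

# One JSON object per line, text under the (assumed) "text" key.
path = os.path.join(tempfile.gettempdir(), "corpus.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
```

The resulting file would then be passed to the preprocessing script, which chunks and binarizes it into the `.bin`/`.idx` pair placed under `dataset/`.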