Commit 24d76e1
add readme
1 parent 7f39016

7 files changed: +73 -162 lines

.gitignore (+4, -1)

@@ -4,4 +4,7 @@ results/
 checkpoints/
 __pycache__
 *.so
-pretrained_prompts/
+pretrained_prompts/
+pretrain_data
+downstream_data
+checkpoints

README.md (+69, -150)

This commit replaces the previous CPM-Finetune README (written in Chinese) with the PPT README below. The removed README documented fine-tuning the CPM model on the ChID idiom-cloze and STC dialogue datasets: installation (pip requirements, apex, a prebuilt Docker image, deepspeed), data preprocessing for both datasets, fp16/fp32 single- and multi-node fine-tuning and evaluation scripts, zero-shot ChID evaluation, reference ChID accuracy (fine-tune / zero-shot: CPM-small 0.657 / 0.433, CPM-medium 0.695 / 0.524, CPM-large 0.804 / 0.685), and the CPM citation.
# PPT

Code and datasets for our paper "PPT: Pre-trained Prompt Tuning for Few-shot Learning".

## 1 Environment

The code requires the CUDA 10.2 toolkit.
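Before installing the Python dependencies, you can quickly confirm the toolkit and driver on the machine (`nvcc` ships with the CUDA toolkit):

```bash
nvcc --version   # should report release 10.2
nvidia-smi       # shows the driver version and the visible GPUs
```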

##### Install basic dependencies

```bash
pip install -r requirements.txt
```

##### Install apex

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

##### Install DeepSpeed

The version we used is `v0.3.9`. It can be installed from its [repo](https://github.com/microsoft/DeepSpeed/releases/tag/v0.3.9) or with

```bash
pip install deepspeed==0.3.9
```

There are some **bugs** in DeepSpeed v0.3.9, so you need to make small modifications to the installed package; see this [issue](https://github.com/TsinghuaAI/CPM-2-Finetune/issues/11) for more information. Specifically, two lines of code in `${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed/runtime/zero/stage1.py` and `${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed/runtime/engine.py` must be changed. We provide the modified files as `src/ds_fix/stage1.py` and `src/ds_fix/engine.py` in this repo: simply replace the two installed files with the versions we provide.
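A minimal way to apply these replacements, assuming DeepSpeed was installed with pip into the active Python environment (the package location is resolved through the `deepspeed` module itself):

```bash
# Locate the installed deepspeed package
DS_DIR=$(python3 -c "import os, deepspeed; print(os.path.dirname(deepspeed.__file__))")

# Back up the originals, then drop in the patched files shipped in src/ds_fix/
cp "${DS_DIR}/runtime/zero/stage1.py" "${DS_DIR}/runtime/zero/stage1.py.bak"
cp "${DS_DIR}/runtime/engine.py" "${DS_DIR}/runtime/engine.py.bak"
cp src/ds_fix/stage1.py "${DS_DIR}/runtime/zero/stage1.py"
cp src/ds_fix/engine.py "${DS_DIR}/runtime/engine.py"
```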
## 2 Datasets

### 2.1 Downstream Datasets

The original datasets are obtained from [huggingface](https://huggingface.co/datasets).

The preprocessed datasets can be obtained from this link. If you do tuning (FT, PT, or PPT), you need to put the preprocessed data in `downstream_data/`.
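For example (a sketch; the archive name is hypothetical, so substitute whatever file the link actually provides):

```bash
mkdir -p downstream_data
# Hypothetical archive name for the downloaded preprocessed datasets
tar -xzf downstream_data.tar.gz -C downstream_data/
```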
### 2.2 Pre-training Data

Our pre-training data is sampled from [openwebtext](https://huggingface.co/datasets/openwebtext/tree/main). If you would like to preprocess the data from scratch, put `openwebtext.txt` in `pretrain_data/raw/` and run the following preprocessing scripts to construct the pre-training data:

```bash
bash scripts/tools/preprocess_pretrain_nsp.sh     # Next Sentence Prediction
bash scripts/tools/preprocess_pretrain_nss.sh     # Next Sentence Selection
bash scripts/tools/preprocess_pretrain_cls.sh     # Single Sentence Classification
bash scripts/tools/preprocess_pretrain_nss_uni.sh # Unified Next Sentence Selection (for Unified PPT)
```

For reproducibility, we also provide the preprocessed pre-training data at this link. You can move it directly to `pretrain_data/preprocessed/`.
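A sketch of the corresponding directory setup (assuming the raw dump is a single `openwebtext.txt` file, as described above):

```bash
# raw/ holds the input text, preprocessed/ holds the constructed pre-training data
mkdir -p pretrain_data/raw pretrain_data/preprocessed
cp /path/to/openwebtext.txt pretrain_data/raw/

# Either run the scripts/tools/preprocess_pretrain_*.sh scripts above,
# or place the released preprocessed files under pretrain_data/preprocessed/
```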
## 3 Pre-trained Checkpoints

### 3.1 Base Model

The original base model is obtained from [huggingface](https://huggingface.co/models). Before running the code, please use the transforming script to convert the original `pytorch_model.bin` checkpoint so that it fits our `deepspeed + megatron` framework:

```bash
mkdir -p checkpoints/t5-xxl/t5-MP4

python3 tools/transform.py \
    --hf_path ${PATH_TO_PYTORCH_MODEL_BIN} \
    --save_path "./checkpoints/t5-xxl/t5-MP4"
```
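If you do not have a local `pytorch_model.bin` yet, one common way to fetch a T5 checkpoint from the huggingface hub is a git-lfs clone; the model identifier below is only an illustrative choice of a T5-XXL variant, not a statement of the exact checkpoint used in the paper:

```bash
# git-lfs is required to pull the large weight files
git lfs install
git clone https://huggingface.co/google/t5-v1_1-xxl
```

Then point `--hf_path` at the downloaded `pytorch_model.bin` (or at its directory, depending on what `tools/transform.py` expects).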
### 3.2 Prompts

The pre-trained prompts can be obtained from this link. You need to move them to `pretrained_prompts/`.
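For example (a sketch; the layout of the downloaded prompt files is an assumption, so adjust the paths to whatever the download actually contains):

```bash
mkdir -p pretrained_prompts
# Move whatever prompt files the download provides into pretrained_prompts/
mv /path/to/downloaded_prompts/* pretrained_prompts/
```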

## 4 Run the code

All scripts are in the `scripts` directory.

Before running the code, please first change `WORKING_DIR` in each script to the current directory of this repo. If you are running multiple scripts on a single node, make sure that the `MASTER_PORT` of each script is different.
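A quick way to do this for every script at once (a sketch; it assumes each script defines the working directory in a line of the form `WORKING_DIR=...`, so double-check the variable name actually used in each script first):

```bash
# Point WORKING_DIR in all run scripts at the current checkout
find scripts -name "*.sh" -exec sed -i "s#^WORKING_DIR=.*#WORKING_DIR=$(pwd)#" {} +
```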
If the checkpoint is successfully loaded, the log printed to stdout should contain messages like `successfully loaded /path-to-checkpoint/t5-MP4/mp_rank_01_model_states.pt`. Otherwise, you will see `WARNING: could not find the metadata file /***/latest_checkpointed_iteration.txt will not load any checkpoints and will start from random`. Note that when the model is loaded successfully, you will also see messages like `The following zero checkpoints paths are missing: ['/path-to-checkpoint/eva/200000/zero_pp_rank_0_mp_rank_00_optim_states.pt', ...`, which mean that the optimizer states are not loaded. This **DOES NOT** affect model inference, and you can just ignore it.

### 4.1 Tuning

We use the boolq dataset as an example. For the t5-xxl model, PT and PPT can run on at least 4 * 32G V100 GPUs, while FT needs at least 16 * 32G V100 GPUs.

```bash
# few-shot 32 samples
bash scripts/boolq/few-shot/ft.sh              # Fine-tuning (FT)
bash scripts/boolq/few-shot/pt.sh              # Prompt Tuning (PT)
bash scripts/boolq/few-shot/pt_pretrain.sh     # Pre-trained Prompt Tuning (PPT)
bash scripts/boolq/few-shot/pt_uni_pretrain.sh # Unified Pre-trained Prompt Tuning (Unified PPT)

# full data
bash scripts/boolq/full/ft.sh                  # Fine-tuning (FT)
bash scripts/boolq/full/pt.sh                  # Prompt Tuning (PT)
bash scripts/boolq/full/pt_pretrain.sh         # Pre-trained Prompt Tuning (PPT)
bash scripts/boolq/full/pt_uni_pretrain.sh     # Unified Pre-trained Prompt Tuning (Unified PPT)
```
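For example, to run all four few-shot settings for boolq back to back and keep the output (a usage sketch; the scripts may also write their own logs elsewhere):

```bash
mkdir -p logs
for method in ft pt pt_pretrain pt_uni_pretrain; do
    bash scripts/boolq/few-shot/${method}.sh 2>&1 | tee logs/boolq_few-shot_${method}.log
done
```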
### 4.2 Pre-training

```bash
bash scripts/pretrain/pretrain_nsp.sh     # Next Sentence Prediction
bash scripts/pretrain/pretrain_nss.sh     # Next Sentence Selection
bash scripts/pretrain/pretrain_cls.sh     # Single Sentence Classification
bash scripts/pretrain/pretrain_nss_uni.sh # Unified Next Sentence Selection (for Unified PPT)
```

## 5 Cite

If you use the code, please cite the following paper:

```latex
@inproceedings{gu2022ppt,
  title={PPT: Pre-trained Prompt Tuning for Few-shot Learning},
  author={Gu, Yuxian and Han, Xu and Liu, Zhiyuan and Huang, Minlie},
  booktitle={Proceedings of ACL},
  year={2022}
}
```

scripts/tools/preprocess_pretrain_cls.sh (-2)

@@ -1,7 +1,5 @@
 WOKRING_DIR=/home/guyuxian/PPT-origin/
 
-OUTPUT_PATH=""
-
 
 OPTS=""
 OPTS+=" --input ${WOKRING_DIR}/pretrain_data/raw/openwebtext.txt"

scripts/tools/preprocess_pretrain_nsp.sh (-3)

@@ -1,8 +1,5 @@
 WOKRING_DIR=/home/guyuxian/PPT-origin/
 
-INPUT_PATH=""
-OUTPUT_PATH=""
-
 
 OPTS=""
 OPTS+=" --input ${WOKRING_DIR}/pretrain_data/raw/openwebtext.txt"

scripts/tools/preprocess_pretrain_nss.sh (-3)

@@ -1,8 +1,5 @@
 WOKRING_DIR=/home/guyuxian/PPT-origin/
 
-INPUT_PATH=""
-OUTPUT_PATH=""
-
 
 OPTS=""
 OPTS+=" --input ${WOKRING_DIR}/pretrain_data/raw/openwebtext.txt"

scripts/tools/preprocess_pretrain_nss_uni.sh (-3)

@@ -1,8 +1,5 @@
 WOKRING_DIR=/home/guyuxian/PPT-origin/
 
-INPUT_PATH=""
-OUTPUT_PATH=""
-
 
 OPTS=""
 OPTS+=" --input ${WOKRING_DIR}/pretrain_data/raw/openwebtext.txt"
File renamed without changes.
