GEC-t2t

Grammar Error Correction Based on Tensor2Tensor
A temporary project built at DeeCamp.

Train

The overall training procedure consists of two stages: pretraining and finetuning.

  1. Subword-nmt
    The model expects input in BPE format, so segment all corpora with subword-nmt first (see the first sketch after this list).

  2. Pretrain
    To improve performance on this seq2seq task, the model is first pretrained on a large corpus of native (well-formed) text. Source sentences are generated by adding noise to the native corpus; the denoising method follows https://github.com/zhawe01/fairseq-gec. The number of pretraining steps depends on the size of the native corpus and the batch-size parameter, and should cover at least one epoch of the corpus (see the second sketch after this list).
    Tip: in Tensor2Tensor, batch size is measured in tokens, not sentences.

  3. Finetune
    After pretraining, the model should be fine-tuned on a GEC corpus such as CoNLL-2014 (see the third sketch after this list).
    The number of finetuning steps depends on the loss and on performance on your task.
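A minimal sketch of step 1, using the standard subword-nmt CLI; the file names and merge count are placeholders:

```bash
# Learn a BPE model on the training text, then segment every corpus
# (native text for pretraining, GEC source/target for finetuning).
subword-nmt learn-bpe -s 32000 < native.txt > bpe.codes
subword-nmt apply-bpe -c bpe.codes < native.txt > native.bpe
subword-nmt apply-bpe -c bpe.codes < gec.src > gec.src.bpe
subword-nmt apply-bpe -c bpe.codes < gec.trg > gec.trg.bpe
```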
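A minimal sketch of step 2 with the standard Tensor2Tensor CLIs, assuming a hypothetical problem registration named `grammar_error_correction`; directories, the step count, and hyperparameters are placeholders:

```bash
# Generate TFRecords from the denoised native corpus.
t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR \
  --problem=grammar_error_correction   # hypothetical problem name

# Pretrain. batch_size is counted in tokens; choose train_steps so that
# train_steps * batch_size covers at least one epoch of the native corpus.
t2t-trainer --data_dir=$DATA_DIR --problem=grammar_error_correction \
  --model=transformer --hparams_set=transformer_base \
  --hparams='batch_size=4096' \
  --output_dir=$PRETRAIN_DIR --train_steps=500000
```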
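A minimal sketch of step 3; one common way to warm-start is to reuse the pretrained checkpoint directory, since t2t-trainer resumes from the latest checkpoint found in --output_dir. The extra step count is a placeholder:

```bash
# Copy the pretrained checkpoints, then continue training on the GEC data.
cp -r $PRETRAIN_DIR $FINETUNE_DIR
t2t-trainer --data_dir=$GEC_DATA_DIR --problem=grammar_error_correction \
  --model=transformer --hparams_set=transformer_base \
  --hparams='batch_size=4096' \
  --output_dir=$FINETUNE_DIR --train_steps=550000   # 50k steps beyond pretraining
```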

Test

For inference, we serve the trained model with TensorFlow Serving running in Docker.
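A minimal sketch of exporting and serving, assuming the same hypothetical problem name, the standard t2t-exporter and t2t-query-server CLIs, and the stock tensorflow/serving Docker image; the model name and ports are placeholders:

```bash
# Export the finetuned model as a SavedModel for TensorFlow Serving.
t2t-exporter --data_dir=$GEC_DATA_DIR --problem=grammar_error_correction \
  --model=transformer --hparams_set=transformer_base \
  --output_dir=$FINETUNE_DIR

# Serve the exported model with TensorFlow Serving in Docker.
docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=$FINETUNE_DIR/export,target=/models/gec \
  -e MODEL_NAME=gec -t tensorflow/serving

# Query the server with Tensor2Tensor's simple gRPC client.
t2t-query-server --server=localhost:8500 --servable_name=gec \
  --problem=grammar_error_correction --data_dir=$GEC_DATA_DIR
```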

Reference

  * Subword-nmt: https://github.com/rsennrich/subword-nmt
  * Tensor2Tensor: https://github.com/tensorflow/tensor2tensor
