This repository implements Chinese↔English neural machine translation with:
- Seq2Seq (GRU/LSTM) + attention (dot, multiplicative, additive)
- Transformer (from scratch) with ablations (absolute vs relative/rotary positions, LayerNorm vs RMSNorm)
- Decoding: greedy & beam search
- Training policies: Teacher Forcing vs Free Running
- Evaluation: BLEU (corpus-level), plus basic latency/throughput logging
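Corpus-level BLEU (as opposed to averaging sentence BLEU) pools n-gram counts over the whole test set before computing precisions. A minimal self-contained sketch for illustration; the repo's `nmt/metrics.py` may instead wrap a library such as sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """BLEU with counts pooled over the corpus; one reference per hypothesis."""
    p_num = [0] * max_n
    p_den = [0] * max_n
    hyp_len = ref_len = 0
    for h, r in zip(hyps, refs):
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            p_num[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())  # clipped matches
            p_den[n - 1] += sum(hc.values())
    if min(p_num) == 0:          # some n-gram order has zero matches
        return 0.0
    log_p = sum(math.log(n / d) for n, d in zip(p_num, p_den)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_p)  # brevity penalty * geometric mean of precisions
```

Note this sketch omits the smoothing that real toolkits apply, so any corpus with a zero 4-gram match scores 0.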
Detected JSONL schema (first three items of each split):
train_10k.jsonl → [{'en': '1929 or 1989?', 'zh': '1929年还是1989年?', 'index': 0}, {'en': 'PARIS – As the economic crisis deepens and widens, the world has been searching for historical analogies to help us understand what has been happening.', 'zh': '巴黎-随着经济危机不断加深和蔓延,整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况。', 'index': 1}, {'en': 'At the start of the crisis, many people likened it to 1982 or 1973, which was reassuring, because both dates refer to classical cyclical downturns.', 'zh': '一开始,很多人把这次危机比作1982年或1973年所发生的情况,这样得类比是令人宽心的,因为这两段时期意味着典型的周期性衰退。', 'index': 2}]
valid.jsonl → [{'en': 'Last week, the broadcast of period drama “Beauty Private Kitchen” was temporarily halted, and accidentally triggered heated debate about faked ratings of locally produced dramas.', 'zh': '上周,古装剧《美人私房菜》临时停播,意外引发了关于国产剧收视率造假的热烈讨论。', 'index': 0}, {'en': 'Civil rights group issues travel warning for Missouri', 'zh': '民权团体针对密苏里州发出旅行警告', 'index': 1}, {'en': "The National Association for the Advancement of Colored People has put out an alert for people of color traveling to Missouri because of the state's discriminatory policies and racist attacks.", 'zh': '由于密苏里州的歧视性政策和种族主义袭击,美国有色人种促进协会 (NAACP) 向准备前往密苏里州出游的有色人群发出旅行警告。', 'index': 2}]
test.jsonl → [{'en': 'Records indicate that HMX-1 inquired about whether the event might violate the provision.', 'zh': '记录指出 HMX-1 曾询问此次活动是否违反了该法案。', 'index': 0}, {'en': '"One question we asked was if it was a violation of the Hatch Act and were informed it was not," the commander wrote.', 'zh': '该指挥官写道“我们问的一个问题是这是否违反了《哈奇法案》,并被告知没有违反。”', 'index': 1}, {'en': '"Sounds like you are locked," the Deputy Commandant replied.', 'zh': '“听起来你被锁住了啊,”副司令回复道。', 'index': 2}]
Each line must be a JSON object containing at least `src` and `tgt` keys (Chinese source and English target). If your keys differ (as in the detected schema above, which uses `en`/`zh`), adjust `nmt/data.py` accordingly.
- Install deps (Python ≥ 3.9):

      pip install -r requirements.txt
- Place data (or symlink) under `data/`:

      ln -s /path/to/train_10k.jsonl data/train_10k.jsonl
      ln -s /path/to/valid.jsonl data/valid.jsonl
      ln -s /path/to/test.jsonl data/test.jsonl
- Train RNN (GRU, additive attention):

      python nmt/train_seq2seq.py --train data/train_10k.jsonl --valid data/valid.jsonl --tokenizer spm --attn add --cell gru --layers 2 --hidden 512 --emb 256 --epochs 10 --batch 128
- Train Transformer (absolute positions, LayerNorm):

      python nmt/train_transformer.py --train data/train_10k.jsonl --valid data/valid.jsonl --tokenizer spm --d_model 512 --nhead 8 --num_layers 6 --epochs 10 --batch 128 --pos abs --norm ln
- Evaluate on the test set (BLEU):

      python nmt/evaluate.py --ckpt runs/seq2seq/latest.pt --test data/test.jsonl --decode beam --beam_size 5
      python nmt/evaluate.py --ckpt runs/transformer/latest.pt --test data/test.jsonl --decode greedy
- One-click inference (required by assignment):

      python inference.py --ckpt runs/transformer/latest.pt --src "今天天气很好,我们去公园散步吧。"

- Ablations / knobs:
  - Attention: `--attn dot|mul|add`
  - Decoding: `--decode greedy|beam --beam_size 5`
  - Teacher forcing: `--teacher_forcing 0.5` (set to 0 for free running)
  - Transformer positions: `--pos abs|rope` (rope = rotary relative)
  - Normalization: `--norm ln|rms`
  - Scaling: `--d_model`, `--ffn`, `--num_layers`, `--batch`, `--lr`
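The three attention variants differ only in how the score between the decoder query and each encoder key is computed. A NumPy sketch for illustration; shapes and parameter names are assumptions, not the repo's actual API:

```python
import numpy as np

def attention_weights(query, keys, mode="dot", W=None, Wq=None, Wk=None, v=None):
    """query: (d,), keys: (T, d) -> softmax weights over the T source positions."""
    if mode == "dot":      # dot-product: score_t = k_t . q
        scores = keys @ query
    elif mode == "mul":    # multiplicative (Luong "general"): score_t = k_t^T W q
        scores = keys @ (W @ query)
    elif mode == "add":    # additive (Bahdanau): score_t = v^T tanh(Wq q + Wk k_t)
        scores = np.tanh(keys @ Wk.T + Wq @ query) @ v
    else:
        raise ValueError(f"unknown attention mode: {mode}")
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()
```

Dot attention requires query and key dimensions to match; multiplicative inserts a learned `W` so they need not; additive projects both into a shared hidden space of size `h` before scoring.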
- Repository layout:
      nmt/
        data.py               # JSONL reader, tokenizer (SentencePiece or whitespace/jieba fallback), vocab I/O
        metrics.py            # BLEU (corpus-level), timing utils
        utils.py              # training utils, seed control, gradient clip, schedulers
        decode.py             # greedy & beam search
        models/
          seq2seq.py          # Encoder/Decoder + attention (dot/mul/add)
          transformer.py      # Transformer with abs & rotary positions; LayerNorm/RMSNorm options
        train_seq2seq.py      # RNN training loop
        train_transformer.py  # Transformer training loop
        evaluate.py           # loads any ckpt, runs decoding over test.jsonl, computes BLEU
      inference.py            # one-click script: loads ckpt and prints translation
      requirements.txt
      README.md
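Of the two decoding strategies in `decode.py`, greedy search is the simpler: repeatedly take the argmax of the next-token distribution until EOS. A model-agnostic sketch, where `step_fn` stands in for the actual model forward pass (an assumption, not the repo's real API):

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=50):
    """step_fn(prefix_ids) -> sequence of logits over the vocab for the next token."""
    out = [bos_id]
    for _ in range(max_len):
        logits = step_fn(out)
        nxt = max(range(len(logits)), key=logits.__getitem__)  # argmax token id
        if nxt == eos_id:
            break
        out.append(nxt)
    return out[1:]  # strip BOS
```

Beam search generalizes this by keeping the `beam_size` highest-scoring prefixes at each step instead of a single argmax.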
- Tokenization: defaults to training a SentencePiece (unigram) model on the training split. If `sentencepiece` is unavailable, the code falls back to simple tokenizers (jieba for Chinese, whitespace for English). You can force a mode with `--tokenizer spm|basic`.
- Pretrained embeddings: optional via `--pretrained_vecs` (expects word2vec text format). If not provided, embeddings are learned from scratch.
- Checkpoints & logs: stored under `runs/{seq2seq|transformer}/`.
- Reproducibility: `--seed 2025` by default.
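The basic-tokenizer fallback described above can be sketched as follows (illustrative only; the actual implementation lives in `nmt/data.py`, and character-level splitting is assumed as the last resort when jieba is not installed):

```python
def basic_tokenize(text, lang):
    """Whitespace tokens for English; jieba segments for Chinese if available,
    otherwise fall back to individual characters."""
    if lang == "en":
        return text.split()
    try:
        import jieba  # optional dependency
        return list(jieba.cut(text))
    except ImportError:
        return [ch for ch in text if not ch.isspace()]
```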