This repository implements Chinese↔English neural machine translation with:
- Seq2Seq (GRU/LSTM) + attention (dot, multiplicative, additive)
- Transformer (from scratch) with ablations (absolute vs relative/rotary positions, LayerNorm vs RMSNorm)
- Decoding: greedy & beam search
- Training policies: Teacher Forcing vs Free Running
- Evaluation: BLEU (corpus-level), plus basic latency/throughput logging
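Corpus-level BLEU (as opposed to averaging sentence BLEU) pools n-gram counts over the whole test set before computing precisions. A minimal self-contained sketch for illustration; the repo's `nmt/metrics.py` may instead wrap a library such as sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """BLEU with counts pooled over the corpus; one reference per hypothesis."""
    p_num = [0] * max_n
    p_den = [0] * max_n
    hyp_len = ref_len = 0
    for h, r in zip(hyps, refs):
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            p_num[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())  # clipped matches
            p_den[n - 1] += sum(hc.values())
    if min(p_num) == 0:          # some n-gram order has zero matches
        return 0.0
    log_p = sum(math.log(n / d) for n, d in zip(p_num, p_den)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_p)  # brevity penalty * geometric mean of precisions
```

Note this sketch omits the smoothing that real toolkits apply, so any corpus with a zero 4-gram match scores 0.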
Detected JSONL schema (first three items of each split):
train_10k.jsonl → [{'en': '1929 or 1989?', 'zh': '1929年还是1989年?', 'index': 0}, {'en': 'PARIS – As the economic crisis deepens and widens, the world has been searching for historical analogies to help us understand what has been happening.', 'zh': '巴黎-随着经济危机不断加深和蔓延,整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况。', 'index': 1}, {'en': 'At the start of the crisis, many people likened it to 1982 or 1973, which was reassuring, because both dates refer to classical cyclical downturns.', 'zh': '一开始,很多人把这次危机比作1982年或1973年所发生的情况,这样得类比是令人宽心的,因为这两段时期意味着典型的周期性衰退。', 'index': 2}]
valid.jsonl → [{'en': 'Last week, the broadcast of period drama “Beauty Private Kitchen” was temporarily halted, and accidentally triggered heated debate about faked ratings of locally produced dramas.', 'zh': '上周,古装剧《美人私房菜》临时停播,意外引发了关于国产剧收视率造假的热烈讨论。', 'index': 0}, {'en': 'Civil rights group issues travel warning for Missouri', 'zh': '民权团体针对密苏里州发出旅行警告', 'index': 1}, {'en': "The National Association for the Advancement of Colored People has put out an alert for people of color traveling to Missouri because of the state's discriminatory policies and racist attacks.", 'zh': '由于密苏里州的歧视性政策和种族主义袭击,美国有色人种促进协会 (NAACP) 向准备前往密苏里州出游的有色人群发出旅行警告。', 'index': 2}]
test.jsonl → [{'en': 'Records indicate that HMX-1 inquired about whether the event might violate the provision.', 'zh': '记录指出 HMX-1 曾询问此次活动是否违反了该法案。', 'index': 0}, {'en': '"One question we asked was if it was a violation of the Hatch Act and were informed it was not," the commander wrote.', 'zh': '该指挥官写道“我们问的一个问题是这是否违反了《哈奇法案》,并被告知没有违反。”', 'index': 1}, {'en': '"Sounds like you are locked," the Deputy Commandant replied.', 'zh': '“听起来你被锁住了啊,”副司令回复道。', 'index': 2}]
Each line must be a JSON object containing at least `src` and `tgt` keys (Chinese source and English target). If your keys differ (as in the detected schema above, which uses `en`/`zh`), adjust `nmt/data.py` accordingly.
- Install deps (Python ≥ 3.9):

      pip install -r requirements.txt
- Place data (or symlink) under `data/`:

      ln -s /path/to/train_10k.jsonl data/train_10k.jsonl
      ln -s /path/to/valid.jsonl data/valid.jsonl
      ln -s /path/to/test.jsonl data/test.jsonl
- Train RNN (GRU, additive attention):

      python nmt/train_seq2seq.py --train data/train_10k.jsonl --valid data/valid.jsonl --tokenizer spm --attn add --cell gru --layers 2 --hidden 512 --emb 256 --epochs 10 --batch 128
- Train Transformer (absolute positions, LayerNorm):

      python nmt/train_transformer.py --train data/train_10k.jsonl --valid data/valid.jsonl --tokenizer spm --d_model 512 --nhead 8 --num_layers 6 --epochs 10 --batch 128 --pos abs --norm ln
- Evaluate on the test set (BLEU):

      python nmt/evaluate.py --ckpt runs/seq2seq/latest.pt --test data/test.jsonl --decode beam --beam_size 5
      python nmt/evaluate.py --ckpt runs/transformer/latest.pt --test data/test.jsonl --decode greedy
- One-click inference (required by assignment):

      python inference.py --ckpt runs/transformer/latest.pt --src "今天天气很好,我们去公园散步吧。"

- Ablations / knobs:
  - Attention: `--attn dot|mul|add`
  - Decoding: `--decode greedy|beam --beam_size 5`
  - Teacher forcing: `--teacher_forcing 0.5` (set to 0 for free running)
  - Transformer positions: `--pos abs|rope` (rope = rotary relative)
  - Normalization: `--norm ln|rms`
  - Scaling: `--d_model`, `--ffn`, `--num_layers`, `--batch`, `--lr`
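The three attention variants differ only in how the score between the decoder query and each encoder key is computed. A NumPy sketch for illustration; shapes and parameter names are assumptions, not the repo's actual API:

```python
import numpy as np

def attention_weights(query, keys, mode="dot", W=None, Wq=None, Wk=None, v=None):
    """query: (d,), keys: (T, d) -> softmax weights over the T source positions."""
    if mode == "dot":      # dot-product: score_t = k_t . q
        scores = keys @ query
    elif mode == "mul":    # multiplicative (Luong "general"): score_t = k_t^T W q
        scores = keys @ (W @ query)
    elif mode == "add":    # additive (Bahdanau): score_t = v^T tanh(Wq q + Wk k_t)
        scores = np.tanh(keys @ Wk.T + Wq @ query) @ v
    else:
        raise ValueError(f"unknown attention mode: {mode}")
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()
```

Dot attention requires query and key dimensions to match; multiplicative inserts a learned `W` so they need not; additive projects both into a shared hidden space of size `h` before scoring.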
- Repository layout:
      nmt/
        data.py               # JSONL reader, tokenizer (SentencePiece or whitespace/jieba fallback), vocab I/O
        metrics.py            # BLEU (corpus-level), timing utils
        utils.py              # training utils, seed control, gradient clip, schedulers
        decode.py             # greedy & beam search
        models/
          seq2seq.py          # Encoder/Decoder + attention (dot/mul/add)
          transformer.py      # Transformer with abs & rotary positions; LayerNorm/RMSNorm options
        train_seq2seq.py      # RNN training loop
        train_transformer.py  # Transformer training loop
        evaluate.py           # loads any ckpt, runs decoding over test.jsonl, computes BLEU
      inference.py            # one-click script: loads ckpt and prints translation
      requirements.txt
      README.md
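Of the two decoding strategies in `decode.py`, greedy search is the simpler: repeatedly take the argmax of the next-token distribution until EOS. A model-agnostic sketch, where `step_fn` stands in for the actual model forward pass (an assumption, not the repo's real API):

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=50):
    """step_fn(prefix_ids) -> sequence of logits over the vocab for the next token."""
    out = [bos_id]
    for _ in range(max_len):
        logits = step_fn(out)
        nxt = max(range(len(logits)), key=logits.__getitem__)  # argmax token id
        if nxt == eos_id:
            break
        out.append(nxt)
    return out[1:]  # strip BOS
```

Beam search generalizes this by keeping the `beam_size` highest-scoring prefixes at each step instead of a single argmax.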
- Tokenization: defaults to training a SentencePiece (unigram) model on the training split. If `sentencepiece` is unavailable, the code falls back to simple tokenizers (jieba for Chinese, whitespace for English). You can force a mode with `--tokenizer spm|basic`.
- Pretrained embeddings: optional via `--pretrained_vecs` (expects word2vec text format). If not provided, embeddings are learned from scratch.
- Checkpoints & logs: stored under `runs/{seq2seq|transformer}/`.
- Reproducibility: `--seed 2025` by default.
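The basic-tokenizer fallback described above can be sketched as follows (illustrative only; the actual implementation lives in `nmt/data.py`, and character-level splitting is assumed as the last resort when jieba is not installed):

```python
def basic_tokenize(text, lang):
    """Whitespace tokens for English; jieba segments for Chinese if available,
    otherwise fall back to individual characters."""
    if lang == "en":
        return text.split()
    try:
        import jieba  # optional dependency
        return list(jieba.cut(text))
    except ImportError:
        return [ch for ch in text if not ch.isspace()]
```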