Source code for language models, covering both ngram LMs and neural network LMs. The code is meant to demonstrate the details of LM algorithms, not for industrial use. Everything is written in Python; the neural network LM code is based on PyTorch. Hopefully this is useful for anyone who wants to learn about language models.
For the ngram LM there are two versions of the code. The first version trains an ngram LM on a single machine. The second version is based on Spark and can run on a cluster.
Accompanying blog post: https://zhpacer.github.io/blog/2021/09/13/ngram-language-model.html
The code is in NgramLm/*. There are three Python files under the directory:
ngram_count.py -> generates the ngram counts (1-gram, 2-gram, 3-gram, 4-gram, 5-gram, etc.); a rough sketch of this step follows the file list
ngram_train.py -> trains an ngram language model from the ngram counts generated by ngram_count.py
ngram.py -> loads a trained ngram language model and calculates the probability/PPL of a sentence
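For intuition, here is a minimal sketch of what the counting step does. It is illustrative only, not the actual ngram_count.py; the `<s>`/`</s>` padding tokens and the function name are my own assumptions:

```python
from collections import Counter

def count_ngrams(sentences, order):
    """Count every 1-gram up to order-gram in a list of sentences.

    Sentences are padded with <s>/</s> markers, a common convention
    (an assumption here, not necessarily what ngram_count.py does).
    """
    counts = Counter()
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Tiny example: the bigram ("the", "cat") occurs in both sentences.
counts = count_ngrams(["the cat sat", "the cat ran"], order=3)
print(counts[("the", "cat")])  # -> 2
```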
NOTE: only absolute discounting is implemented here. With this example, I think anyone can implement other discounting schemes quickly.
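For reference, the core idea in its interpolated bigram form looks roughly like this. This is a generic sketch of absolute discounting, not necessarily the exact formulation in ngram_train.py; the 0.75 discount is just a typical default:

```python
from collections import Counter, defaultdict

def absolute_discount(bigram_counts, discount=0.75):
    """Interpolated absolute discounting over bigram counts (sketch).

    P(w|h) = max(c(h,w) - D, 0) / c(h) + lambda(h) * P_uni(w)
    lambda(h) = D * N1+(h) / c(h), where N1+(h) is the number of
    distinct words seen after history h; the subtracted mass is
    redistributed through the unigram backoff so P(.|h) sums to 1.
    """
    history_total = Counter()          # c(h)
    continuations = defaultdict(set)   # distinct w following h
    unigram = Counter()                # continuation counts for the backoff
    for (h, w), c in bigram_counts.items():
        history_total[h] += c
        continuations[h].add(w)
        unigram[w] += c
    uni_total = sum(unigram.values())

    def prob(h, w):
        if history_total[h] == 0:      # unseen history: pure unigram backoff
            return unigram[w] / uni_total
        disc = max(bigram_counts.get((h, w), 0) - discount, 0) / history_total[h]
        backoff_weight = discount * len(continuations[h]) / history_total[h]
        return disc + backoff_weight * (unigram[w] / uni_total)

    return prob

# Tiny example
counts = {("the", "cat"): 2, ("the", "dog"): 1, ("a", "cat"): 1}
p = absolute_discount(counts)
print(p("the", "cat"))  # 1.25/3 discounted mass + 0.5 * (3/4) backoff ≈ 0.792
```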
DEMO USAGE - ngram model training:
python3 ngram_train.py -input ../corpus/europarl-v7.en.rand -order 3 -count ./eu_test_o3.count -lm ./eu_test_o3.lm
-order -> the ngram order for LM training
-input -> the input text for LM training
-count -> the output file for ngram counts
-lm -> the output LM file
DEMO USAGE - ngram model inference:
python3 ngram.py -lm ./eu_test_o3.lm -order 3 -text ./input_puretxt -output ./en_sample.ppl
-order -> the order of the trained LM
-text -> the input text for probability/PPL calculation
-output -> the output file for the probability/PPL results
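For reference, perplexity is computed from per-token log probabilities. A minimal sketch, assuming a bigram model `prob(h, w)` like the discounting sketch above; ngram.py's exact token accounting (e.g., whether `</s>` is counted) may differ:

```python
import math

def sentence_ppl(prob, tokens):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_{i-1})) for one sentence."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for h, w in zip(padded, padded[1:]):
        log_prob += math.log(prob(h, w))   # assumes prob(h, w) > 0
    n = len(padded) - 1                    # number of predicted tokens
    return math.exp(-log_prob / n)

# e.g. sentence_ppl(p, "the cat".split()) with a model trained on a real corpus
```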
The code is in NgramLm_Spark/*. There is only one Python file under the directory:
ngram_lm_train.py -> a Spark version of the single-machine trainer (see the counting sketch after the flag list below)
DEMO USAGE - ngram model training:
spark-submit ngram_lm_train.py -order 5 -input ../corpus/en_sample.txt -lm ./en_sample.5lm -count ./en_sample.5count
-order -> the ngram order for LM training
-input -> the input text for LM training
-count -> the output file for ngram counts
-lm -> the output LM file
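To give a feel for how the counting step parallelizes, here is a minimal PySpark sketch of distributed ngram counting. It is illustrative only; ngram_lm_train.py also performs the discounting and LM estimation, and the paths below are placeholders:

```python
from pyspark import SparkContext

def line_to_ngrams(line, order):
    """Emit every 1..order-gram of one sentence as (ngram, 1) pairs."""
    tokens = ["<s>"] + line.split() + ["</s>"]
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            yield (tuple(tokens[i:i + n]), 1)

if __name__ == "__main__":
    sc = SparkContext(appName="ngram-count-sketch")
    order = 5
    counts = (sc.textFile("../corpus/en_sample.txt")              # one sentence per line
                .flatMap(lambda line: line_to_ngrams(line, order))
                .reduceByKey(lambda a, b: a + b))                 # sum counts per ngram
    counts.saveAsTextFile("./en_sample.5count_sketch")
```

reduceByKey combines counts map-side before the shuffle, which is what lets the counting scale to corpora that do not fit on one machine.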
Neural network LM (PyTorch): will start soon.