Ref: 52nlp
-
Download Wiki Dump:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
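The dump is a multi-gigabyte file, so it is safer to stream it to disk than to read it into memory. A minimal sketch using only the Python 3 standard library; the function name `download` and the chunk size are illustrative assumptions, not part of the referenced scripts.

```python
import shutil
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"


def download(url, dest, chunk_size=1 << 20):
    """Stream the resource at url into the file dest, chunk by chunk."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return dest
```

Calling download(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2") would fetch the file named in the next step.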
-
Preprocess:
python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
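process_wiki.py itself is not reproduced in these notes. As a rough sketch only, assuming it follows the common gensim approach (WikiCorpus streams the compressed dump and each article is written as one line of space-separated tokens), it might look like this. The function names are hypothetical, and the code is written for Python 3 with gensim installed.

```python
def write_articles(articles, out_path):
    """Write each article (an iterable of tokens) as one space-separated line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in articles:
            # Some older gensim releases yield bytes tokens; decode them if so.
            words = [t.decode("utf-8") if isinstance(t, bytes) else t
                     for t in tokens]
            out.write(" ".join(words) + "\n")
            count += 1
    return count


def process_wiki(dump_path, out_path):
    # Imported lazily so write_articles can be used without gensim.
    from gensim.corpora import WikiCorpus

    # Assumption: passing dictionary={} skips the vocabulary-building pass.
    wiki = WikiCorpus(dump_path, dictionary={})
    return write_articles(wiki.get_texts(), out_path)
```

Under these assumptions, process_wiki("enwiki-latest-pages-articles.xml.bz2", "wiki.en.text") corresponds to the command above.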
To keep some basic punctuation:
replace
PAT_ALPHABETIC = re.compile('.*', re.UNICODE)
in ~/.local/lib/python2.7/site-packages/gensim/utils.py with
PAT_ALPHABETIC = re.compile('((\w|,|;|\.|\?|!|:|\'|\"|-|\\(|\\))+)', re.UNICODE)
and then run
python nltk_tokenizer.py wiki.en.text wiki.en.text.new
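To see what the replacement pattern keeps, it can be tried on a short string. This sketch assumes gensim's tokenizer takes each full regex match as one token; the sample sentence is made up, and the pattern is written as a raw string that is equivalent to the one above.

```python
import re

PAT_ALPHABETIC = re.compile(r"((\w|,|;|\.|\?|!|:|'|\"|-|\(|\))+)", re.UNICODE)


def tokens(text):
    """Return every full match of the pattern, in order."""
    return [m.group() for m in PAT_ALPHABETIC.finditer(text)]


print(tokens("Hello, world! (It's a test.)"))
# ['Hello,', 'world!', "(It's", 'a', 'test.)']
```

Punctuation stays attached to the neighbouring word, which is presumably why the NLTK tokenizer pass follows. Since \w also matches digits, numbers are kept as well.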
-
Train:
python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
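train_word2vec_model.py is not shown here either. A minimal sketch, assuming it trains gensim's Word2Vec on the one-article-per-line file and writes both a gensim model and a word2vec-format text file. The class and function names are hypothetical, and the hyperparameters are illustrative placeholders rather than the script's actual values.

```python
class LineCorpus:
    """Restartable iterable over a one-article-per-line text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Word2Vec makes several passes, so the file is reopened each time.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                words = line.split()
                if words:
                    yield words


def train(corpus_path, model_path, vector_path):
    # Imported lazily so LineCorpus can be used without gensim.
    from gensim.models import Word2Vec

    # gensim 4.x names the dimension vector_size; 3.x called it size.
    model = Word2Vec(LineCorpus(corpus_path), vector_size=100, window=5,
                     min_count=5, workers=4)
    model.save(model_path)  # full model, can be reloaded and trained further
    model.wv.save_word2vec_format(vector_path, binary=False)  # plain-text vectors
    return model
```

Under these assumptions, train("wiki.en.text", "wiki.en.text.model", "wiki.en.text.vector") corresponds to the command above.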