ysmiao/word2vec

Ref: 52nlp

  1. Download Wiki Dump:

    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  2. Preprocess:

     python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
    

    To keep some basic punctuation:

    replace

     PAT_ALPHABETIC = re.compile('.*', re.UNICODE)
    

    in ~/.local/lib/python2.7/site-packages/gensim/utils.py with

     PAT_ALPHABETIC = re.compile('((\w|,|;|\.|\?|!|:|\'|\"|-|\\(|\\))+)', re.UNICODE)
    

    and then run

     python nltk_tokenizer.py wiki.en.text wiki.en.text.new
    
  3. Train:

     python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
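
To see what the punctuation-preserving pattern from step 2 changes, the sketch below compares it with a digit-excluding word pattern on one sample sentence. `DEFAULT_PAT` is an assumption about what the stock `PAT_ALPHABETIC` looks like (check your installed `gensim/utils.py`), `find_tokens` is a hypothetical helper, and `PUNCT_PAT` is the replacement pattern rewritten as a raw string.

```python
import re

# Assumption: a digit-excluding word pattern, believed to mirror the stock
# PAT_ALPHABETIC in gensim/utils.py. Verify against your installed version.
DEFAULT_PAT = re.compile(r"(((?![\d])\w)+)", re.UNICODE)

# The replacement pattern from step 2, written as a raw string so the
# escapes reach the regex engine unchanged.
PUNCT_PAT = re.compile(r"""((\w|,|;|\.|\?|!|:|'|"|-|\(|\))+)""", re.UNICODE)


def find_tokens(pattern, text):
    """Hypothetical helper: return every non-overlapping match as a token."""
    return [match.group() for match in pattern.finditer(text)]


sample = "Hello, world! (It's 2024.)"
default_tokens = find_tokens(DEFAULT_PAT, sample)
punct_tokens = find_tokens(PUNCT_PAT, sample)
print(default_tokens)
print(punct_tokens)
```

With the replacement pattern, punctuation and digits stay attached to the neighbouring word, which is presumably why the NLTK tokenizer pass follows: it can split tokens such as `Hello,` into the word and the punctuation mark.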
    
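
The training step reads a text file with one document per line. The internals of `train_word2vec_model.py` are not shown here, so the following is only a standard-library sketch of how such a file can be streamed without loading it into memory. `LineCorpus` is a hypothetical name, and the UTF-8 encoding is an assumption.

```python
import io


class LineCorpus(object):
    """Hypothetical helper: yield one whitespace-split token list per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # more than once, as multi-epoch training requires.
        with io.open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

Any restartable iterable of token lists like this one can be passed as the sentences argument of gensim's `Word2Vec`; gensim also ships its own `LineSentence` class for the same purpose.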
