+Tools for computing distributed representation of words
+-------------------------------------------------------

-We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG).
+We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

-Given a text corpus, the word2vec program learns a vector for every word using the Continuous
-Bag-of-Words or the Skip-Gram model. The user needs to specify the following:
+Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
+Bag-of-Words or the Skip-Gram neural network architectures. The user needs to specify the following:
 - desired vector dimensionality
 - the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
- - Whether hierarchical sampling is used
- - Whether negative sampling is used, and if so, how many negative samples should be used
- - A threshold for downsampling frequent words
- - Number of threads to use
- - Whether to save the vectors in a text format or a binary format
+ - training algorithm: hierarchical softmax and / or negative sampling
+ - threshold for downsampling the frequent words
+ - number of threads to use
+ - the format of the output word vector file (text or binary)

+Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for different training sets; an example training command is shown below.

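+As a sketch of a typical invocation (the text8 corpus and the exact flag values here are one reasonable
+choice rather than required settings), the following trains a 200-dimensional CBOW model with a window of
+size 5, 5 negative samples, a 1e-3 downsampling threshold, 12 threads, and binary output:
+
+./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -sample 1e-3 -threads 12 -binary 1
+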
-Thus the programs require a very modest number of parameter. In particular, learning rates
-need not be selected.
+The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training
+is finished, the user can interactively explore the similarity of the words.

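+For instance (assuming vectors.bin is the output file written by the training step above), word similarities
+can be queried interactively with:
+
+./distance vectors.bin
+
+The distance tool then prompts for a word and prints the words whose vectors are closest to it.
+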
-The file demo-word.sh downloads a small (100MB) text corpus, and trains a 200-dimensional CBOW model
-with a window of size 5, negative sampling with 5 negative samples, a downsampling of 1e-3, 12 threads, and binary files.
-
-./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -sample 1e-3 -threads 12 -binary 1
-
-
-Then, to evaluate the fidelity of our vectors, we can run the command, which will run
-a battery of tests on the vectors to determine their fidelity. The tests evaluate
-the vectors' ability to perform linear analogies.
-
-./distance vectors.bin
+More information about the scripts is provided at https://code.google.com/p/word2vec/
