Commit 2442024

removed some specific information from README, as I just changed the scritps
git-svn-id: http://word2vec.googlecode.com/svn/trunk@28 c84ef02e-58a5-4c83-e53e-41fc32d635eb
1 parent 3c464c2 commit 2442024

1 file changed: +13 -21 lines changed

README.txt

+13-21
@@ -1,29 +1,21 @@
+Tools for computing distributed representtion of words
+------------------------------------------------------
 
-We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG).
+We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.
 
-Given a text corpus, the word2vec program learns a vector for every word using the Continuous
-Bag-of-Words or the Skip-Gram model. The user needs to specify the following:
+Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
+Bag-of-Words or the Skip-Gram neural network architectures. The user should to specify the following:
 - desired vector dimensionality
 - the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
-- Whether hierarchical sampling is used
-- Whether negative sampling is used, and if so, how many negative samples should be used
-- A threshold for downsampling frequent words
-- Number of threads to use
-- Whether to save the vectors in a text format or a binary format
+- training algorithm: hierarchical softmax and / or negative sampling
+- threshold for downsampling the frequent words
+- number of threads to use
+- the format of the output word vector file (text or binary)
 
+Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.
 
-Thus the programs require a very modest number of parameter. In particular, learning rates
-need not be selected.
+The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training
+is finished, the user can interactively explore the similarity of the words.
 
-The file demo-word.sh downloads a small (100MB) text corpus, and trains a 200-dimensional CBOW model
-with a window of size 5, negative sampling with 5 negative samples, a downsampling of 1e-3, 12 threads, and binary files.
-
-./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -sample 1e-3 -threads 12 -binary 1
-
-
-Then, to evaluate the fidelity of our vectors, we can run the command, which will run
-a battery of tests on the vectors to determine their fidelity. The tests evaluate
-the vectors' ability to perform linear analogies.
-
-./distance vectors.bin
+More information about the scripts is provided at https://code.google.com/p/word2vec/
 

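The parameters listed in the README map onto command-line flags of the word2vec tool. The sketch below assembles the invocation that this commit removed from the README, with one shell variable per parameter. The variable names are illustrative assumptions and are not taken from demo-word.sh; the flag names and values are copied from the removed line. The command is printed rather than executed, so the sketch runs without the word2vec binary or the text8 corpus.

```shell
#!/bin/sh
# Illustrative sketch only: variable names are assumptions; flags and values
# come from the command line removed from README.txt in this commit.
SIZE=200       # desired vector dimensionality
WINDOW=5       # size of the context window
CBOW=1         # 1 selects the CBOW model (the removed README text describes a CBOW model)
HS=0           # hierarchical softmax off
NEGATIVE=5     # number of negative samples
SAMPLE=1e-3    # threshold for downsampling frequent words
THREADS=12     # number of threads to use
BINARY=1       # 1 writes the vectors in binary format

CMD="./word2vec -train text8 -output vectors.bin -cbow $CBOW -size $SIZE -window $WINDOW -negative $NEGATIVE -hs $HS -sample $SAMPLE -threads $THREADS -binary $BINARY"

# Print the assembled command instead of running it.
echo "$CMD"
```

To actually train, the printed command would be run from a directory containing the compiled word2vec binary and the text8 corpus, both of which are assumed here rather than provided.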