word2vec

Keras implementation of word2vec, including the full data processing pipeline. The implementation closely follows the TensorFlow (TF) tutorial.

The implementation is primarily intended for building intuition about both Keras and word2vec. It includes both the data processing and model estimation pipelines.

To run the data processing, run the submit.py script. The script reads in a text file (set with the path_to_text_file parameter), then runs a grid search over the parameter grids defined at the top of the script.

To test a single combination of parameter values, use the commands inside the for loop, which run preprocessing and model estimation by calling run.py with a fixed set of parameter values.
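The grid search described above can be sketched as follows. This is a minimal illustration, not the code in submit.py: the grid values, the command-line flag names passed to run.py, the corpus path, and the helper names iter_combinations and build_command are all assumptions made for the example.

```python
import itertools
import sys

# Hypothetical parameter grids; the real grids live at the top of submit.py.
PARAM_GRID = {
    "NUM_NS": [4, 8],
    "WINDOW_SIZE": [2, 5],
    "EMBEDDING_DIM": [128],
}


def iter_combinations(grid):
    """Yield one dict per combination of values in the grid."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def build_command(params, path_to_text_file, script="src/run.py"):
    """Build an argv list for one run; the flag names are illustrative only."""
    cmd = [sys.executable, script, "--path_to_text_file", path_to_text_file]
    for name, value in params.items():
        cmd += [f"--{name.lower()}", str(value)]
    return cmd


if __name__ == "__main__":
    for params in iter_combinations(PARAM_GRID):
        # "data/corpus.txt" is a placeholder path. The command is printed
        # rather than executed; subprocess.run(cmd, check=True) would launch it.
        print(" ".join(build_command(params, "data/corpus.txt")))
```

To try a single combination, build_command can be called once with a hand-written dict instead of looping over the grid.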

submit.py lets users set a series of parameters for word2vec runs. There are three sets of parameters: Data Processing, word2vec, and Model Tuning.

  • Data Processing parameters control the amount of training data used in each batch, as well as the performance of pre-fetching data for each batch as the model runs.
  • word2vec parameters deal with word2vec-specific settings such as the number of negative samples.
  • Model Tuning parameters deal with more general model performance issues such as the number of epochs for each model run.

DATA PROCESSING PARAMETERS

BATCH_SIZE : number of training samples in each batch
BUFFER_SIZE : size of buffer to be filled while prior batch is running
AUTOTUNE : tuning parameter for number of data elements to fetch into buffer
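As a rough sketch of where these three parameters typically appear in a tf.data input pipeline: in the TF tutorial, BUFFER_SIZE is passed to shuffle, BATCH_SIZE to batch, and AUTOTUNE to prefetch. Whether this repository wires them the same way is an assumption, and the random arrays below are stand-ins for real (target, context, label) training examples.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 4      # illustrative values, not the repository defaults
BUFFER_SIZE = 16
AUTOTUNE = tf.data.AUTOTUNE
NUM_NS = 2

# Random stand-ins for (target, context, label) training examples.
rng = np.random.default_rng(0)
n = 10
targets = rng.integers(1, 50, size=(n,))
contexts = rng.integers(1, 50, size=(n, NUM_NS + 1))
labels = np.tile([1] + [0] * NUM_NS, (n, 1))  # one positive, NUM_NS negatives

dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
# Shuffle within a buffer, then group examples into fixed-size batches.
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
# Let tf.data decide how many batches to prepare ahead of the training step.
dataset = dataset.prefetch(buffer_size=AUTOTUNE)

for (t, c), y in dataset.take(1):
    print(t.shape, c.shape, y.shape)
```

With drop_remainder=True, the 10 examples yield two batches of 4 and the last 2 examples are discarded in each pass.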

WORD2VEC PARAMETERS

NUM_NS : number of negative samples used in the Noise Contrastive Estimation procedure
T_PARAM : threshold value used to down-weight high frequency words (see equation 5)
VOCAB_SIZE : number of words in vocabulary
WINDOW_SIZE : number of words before and after target word to include in context
EMBEDDING_DIM : number of hidden units in the intermediate layer (i.e., the dimension of each word embedding vector)
SEQUENCE_LENGTH : length of each sentence
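To make WINDOW_SIZE, NUM_NS, and T_PARAM concrete, here is a small pure-Python sketch. It is not the repository's code. It assumes that "equation 5" refers to the subsampling formula of Mikolov et al. (2013), where a word with corpus frequency f(w) is discarded with probability 1 - sqrt(t / f(w)). Negatives are drawn uniformly to keep the example short, which is a simplification, and word id 0 is assumed to be reserved for padding. All function names are hypothetical.

```python
import math
import random


def keep_probability(word_fraction, t_param):
    """Probability of keeping a word under frequency subsampling.

    Assumes the discard probability 1 - sqrt(t / f(w)), so words with
    f(w) > t are kept with probability sqrt(t / f(w)) and rarer words
    are always kept.
    """
    return min(1.0, math.sqrt(t_param / word_fraction))


def skipgram_pairs(token_ids, window_size):
    """Return (target, context) pairs within window_size on either side."""
    pairs = []
    for i, target in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, token_ids[j]))
    return pairs


def draw_negatives(context, vocab_size, num_ns, rng):
    """Draw num_ns distinct word ids that differ from the true context word.

    Uniform sampling over ids 1..vocab_size-1; requires
    vocab_size - 2 >= num_ns so the loop can terminate.
    """
    negatives = set()
    while len(negatives) < num_ns:
        candidate = rng.randrange(1, vocab_size)
        if candidate != context:
            negatives.add(candidate)
    return sorted(negatives)


if __name__ == "__main__":
    rng = random.Random(42)
    sentence = [3, 7, 1, 5]  # toy sentence of word ids
    for target, context in skipgram_pairs(sentence, window_size=1):
        negatives = draw_negatives(context, vocab_size=10, num_ns=2, rng=rng)
        print(target, context, negatives)
    print(keep_probability(word_fraction=0.01, t_param=1e-5))
```

Each positive (target, context) pair is accompanied by NUM_NS negative word ids, so every training example contains NUM_NS + 1 candidate context words.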

MODEL TUNING

EPOCHS : number of full passes over the training data
SEED : random seed
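In Keras, an epoch is one full pass over the training data, so the total number of gradient updates depends on the dataset size and BATCH_SIZE as well as on EPOCHS. The short sketch below works through that arithmetic and shows the effect of fixing a seed; all values are illustrative, not the repository defaults, and num_examples is a hypothetical dataset size.

```python
import random

EPOCHS = 20            # illustrative values, not the repository defaults
BATCH_SIZE = 1024
SEED = 42
num_examples = 65_536  # hypothetical number of training examples

# Batches per pass over the data (any incomplete final batch is dropped).
steps_per_epoch = num_examples // BATCH_SIZE
# One parameter update per batch, repeated for every epoch.
total_updates = EPOCHS * steps_per_epoch

# Re-seeding with the same value reproduces the same random sequence.
random.seed(SEED)
first_draw = random.random()
random.seed(SEED)
second_draw = random.random()

print(steps_per_epoch, total_updates, first_draw == second_draw)
# prints: 64 1280 True
```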

Folder Structure

submit.py runs run.py from the src folder and assumes the same folder structure as this repository.