HDP
The Hierarchical Dirichlet Process can be used to form a nonparametric topic model as discussed in Teh et al. 2005. TMA currently uses Chong Wang's HDP Implementation with some minor changes for compatibility.
# Input
- max iterations
- initial number of topics
- sample hyperparameters
- gamma shape
- gamma scale
- alpha shape
- alpha scale
- topic Dirichlet parameter
- perform split merge
# max iterations
The maximum number of Gibbs sampling iterations.
# initial number of topics
The initial number of topics to use for the HDP model -- set to 0 in order to use no prior information.
# sample hyperparameters
Whether to sample the first- and second-level concentration parameters -- see Escobar and West, 1995 (p. 285) for the sampling technique.
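As a rough illustration of the Escobar and West auxiliary-variable update, the sketch below resamples a single Dirichlet process concentration parameter under a Gamma prior. It is not taken from Wang's code: the function name `sample_concentration` and the example values are hypothetical, and the second Gamma parameter is treated as a rate (as in Escobar and West), so a scale would need to be inverted first. Applying it to the first-level parameter with n as the total number of tables and k as the number of topics is an assumption about how the update is used here.

```python
import numpy as np

def sample_concentration(alpha, n, k, shape, rate, rng):
    """One Escobar-West auxiliary-variable update for a DP concentration parameter.

    alpha: current value; n: number of items; k: number of occupied clusters;
    shape, rate: Gamma prior parameters (rate = 1 / scale).
    """
    # Auxiliary variable eta ~ Beta(alpha + 1, n).
    eta = rng.beta(alpha + 1.0, n)
    post_rate = rate - np.log(eta)  # log(eta) < 0, so the rate increases
    # Odds of choosing the Gamma(shape + k, .) component over Gamma(shape + k - 1, .).
    odds = (shape + k - 1.0) / (n * post_rate)
    pi = odds / (1.0 + odds)
    post_shape = shape + k if rng.random() < pi else shape + k - 1.0
    # NumPy's gamma sampler takes a scale, so pass the reciprocal of the rate.
    return rng.gamma(post_shape, 1.0 / post_rate)

rng = np.random.default_rng(0)
gamma = 1.0
for _ in range(100):
    # Hypothetical state: 500 tables shared among 20 topics.
    gamma = sample_concentration(gamma, n=500, k=20, shape=1.0, rate=1.0, rng=rng)
```

The second-level parameter is shared across documents and needs a multi-group variant of this update, which the sketch does not cover.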
# gamma shape
The Gamma shape parameter used for drawing the first-level concentration parameter.
# gamma scale
The Gamma scale parameter used for drawing the first-level concentration parameter.
# alpha shape
The Gamma shape parameter used for drawing the second-level concentration parameter.
# alpha scale
The Gamma scale parameter used for drawing the second-level concentration parameter.
# topic Dirichlet parameter
The topic Dirichlet parameter is the concentration parameter used for the topic draws, i.e. \( \phi_i \sim \mbox{Dir}(\eta \mathbf{v}) \), where \( \phi_i \) is topic i, \( \eta \) is the topic Dirichlet parameter, and \( \mathbf{v} \) is the vocabulary vector.
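To make the role of the topic Dirichlet parameter concrete, here is a minimal sketch of a single topic draw with NumPy. It assumes the vocabulary vector is an all-ones vector of vocabulary length (a symmetric Dirichlet); the vocabulary size and parameter value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 6           # hypothetical vocabulary size
eta = 0.5                # topic Dirichlet parameter
v = np.ones(vocab_size)  # assumption: v is an all-ones vector over the vocabulary

# One topic: a probability distribution over the vocabulary terms.
phi = rng.dirichlet(eta * v)
```

Smaller values of eta tend to concentrate each topic on fewer terms, while larger values give flatter topics.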
# perform split merge
Use split-merge MCMC as described in A Split-Merge MCMC Algorithm for the Hierarchical Dirichlet Process (2012).
# Output
The TMA output is described in TMA Pages. The output from Wang's HDP code is described below:
# mode
The mode.bin file contains a binary representation of the HDP model at the mode of the model's likelihood. This binary file can be used for inference on new data.
# mode-word-assignments
The mode-word-assignments.dat file contains the topic assignment of each word in the format d w z t, where:
- d: document id
- w: word id
- z: topic index
- t: table index
# mode-topics
The mode-topics.dat file contains a K x W matrix (topics x vocabulary) of word counts for each topic from the mode HDP model.
# state
Information to monitor the MCMC, e.g. model likelihood.
# Relationships
The data relationships determined by HDP are described generally in Data Relationships. The two necessary components of these relationship calculations are a document x topics probability matrix and a topics x terms probability matrix.
#### document-topic
Each entry (i,j) of the document x topic probability matrix is the normalized number of terms of topic j in document i, as obtained from the topic index (z) of the mode-word-assignments.dat output file.
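The counting and normalization can be sketched as follows. This is an illustration rather than TMA's actual code: the function name `document_topic_matrix` is hypothetical, and it assumes 0-based integer ids, an optional non-numeric header line, and normalization of each document row so that it sums to 1.

```python
import numpy as np

def document_topic_matrix(lines):
    """Build a row-normalized document x topic matrix from 'd w z t' rows."""
    rows = []
    for line in lines:
        parts = line.split()
        # Skip blank lines and a non-numeric header such as "d w z t".
        if len(parts) != 4 or not parts[0].isdigit():
            continue
        d, w, z, t = map(int, parts)
        rows.append((d, z))  # only the document id and topic index are needed
    pairs = np.array(rows)
    counts = np.zeros((pairs[:, 0].max() + 1, pairs[:, 1].max() + 1))
    # Count one term for each (document, topic) assignment.
    np.add.at(counts, (pairs[:, 0], pairs[:, 1]), 1)
    return counts / counts.sum(axis=1, keepdims=True)

# Made-up assignments: document 0 has three words, document 1 has one.
sample = ["d w z t", "0 3 0 0", "0 5 1 1", "0 3 0 0", "1 2 1 0"]
doc_topic = document_topic_matrix(sample)
```

A document id with no assigned words would produce a row of zeros and a division by zero, so real input would need a guard for that case.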
#### topic-term
Each entry (i,j) of the topic x term probability matrix is the normalized number of times that term j was assigned to topic i as obtained from the mode-topics.dat output file.
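A minimal sketch of the topic-term normalization, assuming the K x W counts have already been read into an array (for a whitespace-delimited text file, something like `np.loadtxt` would likely work, though the exact file layout is an assumption). The function name `topic_term_matrix` is hypothetical, and each topic row is normalized to sum to 1.

```python
import numpy as np

def topic_term_matrix(counts):
    """Normalize each row of a topics x terms count matrix to sum to 1."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Topics with no assigned words are left as all-zero rows.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Made-up counts for 2 topics over a 3-term vocabulary.
counts = [[4, 0, 1], [0, 3, 3]]
topic_term = topic_term_matrix(counts)
```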
# Perplexity
WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.
Determining the held-out test perplexity is described generally in Perplexity. For the HDP, we determine the held-out perplexity by evaluating the held-out data likelihood via test mode with Wang's HDP code [TODO expand this section with more detail].
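For reference, the usual definition of perplexity is the exponential of the negative per-token log-likelihood. The sketch below shows only that final step. It assumes the held-out log-likelihood is a natural log summed over all held-out tokens, and it does not reproduce how Wang's code computes that likelihood. The function name and numbers are illustrative.

```python
import math

def perplexity(log_likelihood, num_tokens):
    """Perplexity from a total natural-log likelihood over num_tokens tokens."""
    return math.exp(-log_likelihood / num_tokens)

# Made-up values: total log-likelihood of -6000 over 1000 held-out tokens.
ppl = perplexity(log_likelihood=-6000.0, num_tokens=1000)
```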