cjrd edited this page Jun 13, 2012 · 19 revisions

The Hierarchical Dirichlet Process (HDP) can be used to form a nonparametric topic model, as discussed in Teh et al. 2005. TMA currently uses Chong Wang's HDP implementation with some minor changes for compatibility.

# Input

# max iterations

The maximum number of Gibbs sampling iterations.

# initial number of topics

The initial number of topics to use for the HDP model -- set to 0 in order to use no prior information.

# sample hyperparameters

Sample the first- and second-level concentration parameters -- see Escobar and West, 1995 (p.285) for the sampling technique.
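As a rough illustration of the kind of update Escobar and West describe, the sketch below resamples the concentration parameter of a single Dirichlet process using their auxiliary-variable scheme. It is not taken from Wang's code: the function name `resample_concentration`, its arguments, and the conversion of a Gamma scale to a rate (rate = 1 / scale) are assumptions made for this example, and the HDP applies analogous updates at each level rather than this exact single-DP step.

```python
import math
import random


def resample_concentration(alpha, n, k, shape, scale, rng=random):
    """One auxiliary-variable update of a DP concentration parameter.

    alpha: current concentration value
    n:     number of observations grouped by the DP
    k:     number of occupied clusters
    shape, scale: Gamma prior on alpha (assumed shape/scale parameterization)
    """
    rate = 1.0 / scale  # Escobar and West write the prior with a rate
    # Auxiliary variable drawn from Beta(alpha + 1, n)
    aux = rng.betavariate(alpha + 1.0, n)
    post_rate = rate - math.log(aux)
    # Mixture weight between the two Gamma components
    odds = (shape + k - 1.0) / (n * post_rate)
    weight = odds / (1.0 + odds)
    post_shape = shape + k if rng.random() < weight else shape + k - 1.0
    # random.gammavariate expects (shape, scale)
    return rng.gammavariate(post_shape, 1.0 / post_rate)


rng = random.Random(0)
new_alpha = resample_concentration(alpha=1.0, n=100, k=5,
                                   shape=1.0, scale=1.0, rng=rng)
```

Repeating this update once per Gibbs iteration lets the concentration parameter adapt to the data instead of staying fixed at its initial value.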

# gamma shape

The Gamma shape parameter used for drawing the first-level concentration parameter.

# gamma scale

The Gamma scale parameter used for drawing the first-level concentration parameter.

# alpha shape

The Gamma shape parameter used for drawing the second-level concentration parameter.

# alpha scale

The Gamma scale parameter used for drawing the second-level concentration parameter.

# topic Dirichlet parameter

The topic Dirichlet parameter is the concentration parameter used for the topic draws, i.e. \( \phi_i \sim \mbox{Dir}(\eta \mathbf{v}) \), where \( \phi_i \) is topic i, \( \eta \) is the topic Dirichlet parameter, and \( \mathbf{v} \) is the vocabulary vector.
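To make the role of this parameter concrete, here is a minimal sketch of drawing one topic from a Dirichlet prior. It assumes the vocabulary vector is a vector of ones, so every word receives the same pseudo-count (a symmetric Dirichlet); the function name `draw_topic` and the example values are hypothetical and not part of TMA or Wang's code.

```python
import random


def draw_topic(eta, vocab_size, rng=random):
    """Draw one topic (a distribution over words) from Dir(eta * ones)."""
    # A Dirichlet draw is a vector of independent Gamma(eta, 1) draws,
    # normalized so that the entries sum to one.
    gammas = [rng.gammavariate(eta, 1.0) for _ in range(vocab_size)]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
phi = draw_topic(eta=0.5, vocab_size=5, rng=rng)
```

Smaller values of the parameter tend to concentrate each topic on fewer words, while larger values give flatter topics.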

# perform split merge

Use split-merge MCMC as described in "A Split-Merge MCMC Algorithm for the Hierarchical Dirichlet Process" (2012).

# Output

The TMA output is described in TMA Pages. The output from Wang's HDP code is described below:

# mode

The mode.bin file contains a binary representation of the HDP model at the mode of the model's likelihood. This binary file can be used for inference on new data.

# mode-word-assignments

The mode-word-assignments.dat file contains the topic assignment of each word in the format d w z t, where:

- d: document id
- w: word id
- z: topic index
- t: table index
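The following sketch shows one way to read this format. It assumes the four fields are whitespace-separated integers, one word per line, and it skips any line that does not fit that shape (for example a possible `d w z t` header line). The function name `parse_word_assignments` is hypothetical.

```python
def parse_word_assignments(lines):
    """Parse 'd w z t' rows into (doc_id, word_id, topic, table) tuples."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, header lines, and anything else non-numeric.
        if len(fields) != 4 or not all(f.isdigit() for f in fields):
            continue
        d, w, z, t = (int(f) for f in fields)
        rows.append((d, w, z, t))
    return rows


sample = ["d w z t", "0 5 1 0", "0 7 2 1", "1 5 1 0"]
assignments = parse_word_assignments(sample)
```

For a file on disk, the same function can be called on an open file handle, since iterating over a file yields its lines.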

# mode-topics

The mode-topics.dat file is a K x W matrix (topics x vocabulary) of word counts for each topic from the mode HDP model.

# state

Information for monitoring the MCMC run, e.g. the model likelihood.

# Relationships

The data relationships determined by HDP are described generally in Data Relationships. The two necessary components of these relationship calculations are a document x topic probability matrix and a topic x term probability matrix.

#### document-topic

Each entry (i,j) of the document x topic probability matrix is the normalized number of terms of topic j in document i, as obtained from the topic index of the mode-word-assignments.dat output file.
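A minimal sketch of this calculation is given below. It assumes the assignments are already available as (d, w, z, t) tuples and that "normalized" means each document's topic counts are divided by that document's total word count, so each row sums to one. The function name `document_topic_matrix` and the nested-dictionary layout are choices made for this example, not TMA's actual data structures.

```python
from collections import Counter, defaultdict


def document_topic_matrix(assignments):
    """Return {doc_id: {topic: probability}} from (d, w, z, t) tuples."""
    counts = defaultdict(Counter)
    for d, _w, z, _t in assignments:
        counts[d][z] += 1  # one more word of topic z in document d
    matrix = {}
    for d, topic_counts in counts.items():
        total = sum(topic_counts.values())
        matrix[d] = {z: c / total for z, c in topic_counts.items()}
    return matrix


rows = [(0, 5, 1, 0), (0, 7, 2, 1), (0, 5, 1, 0), (1, 3, 2, 0)]
doc_topic = document_topic_matrix(rows)
```

Topics that never occur in a document are simply absent from that document's dictionary, which corresponds to a probability of zero.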

#### topic-term

Each entry (i,j) of the topic x term probability matrix is the normalized number of times that term j was assigned to topic i, as obtained from the mode-topics.dat output file.
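The normalization can be sketched as follows, assuming the counts file holds one topic per line with whitespace-separated integer counts, and that each row is divided by its own total. The function name `topic_term_matrix` is hypothetical, and returning a row of zeros for an empty topic is a choice made for this example.

```python
def topic_term_matrix(lines):
    """Turn rows of per-topic word counts into rows of probabilities."""
    matrix = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        counts = [int(f) for f in fields]
        total = sum(counts)
        # Guard against an empty topic to avoid dividing by zero.
        matrix.append([c / total if total else 0.0 for c in counts])
    return matrix


topic_term = topic_term_matrix(["2 1 1", "0 0 4"])
```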

# Perplexity

WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.

Determining the held-out test perplexity is described generally in Perplexity. For the HDP, we determine the held-out perplexity by evaluating the held-out data likelihood using the test mode of Wang's HDP code [TODO expand this section with more detail].
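For reference, the standard definition of perplexity is the exponential of the negative average log-likelihood per held-out word. The sketch below encodes only that general definition; it is not a description of how TMA currently computes the value, and the function name and arguments are hypothetical.

```python
import math


def perplexity(log_likelihood, num_tokens):
    """Perplexity from a total natural-log likelihood over num_tokens words."""
    return math.exp(-log_likelihood / num_tokens)


# Four held-out words, each assigned probability 0.25 by the model.
example = perplexity(4 * math.log(0.25), 4)
```

Lower values indicate that the model assigns higher probability to the held-out words.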
