HDP

The Hierarchical Dirichlet Process (HDP) can be used to form a nonparametric topic model, as discussed in Teh et al. (2005). TMA currently uses Chong Wang's HDP implementation with minor changes for compatibility.

# Input

# max iterations

The maximum number of Gibbs sampling iterations.

# initial number of topics

The initial number of topics to use for the HDP model -- set to 0 in order to use no prior information.

# sample hyperparameters

Sample the first and second level concentration parameters -- see Escobar and West, 1995 (p.285) for sampling technique.
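As a rough illustration of the kind of update this option turns on, the sketch below performs one auxiliary-variable resampling step for the concentration parameter of a single Dirichlet process under a Gamma prior, in the style of Escobar and West. It is not taken from Wang's code: the function name and arguments are hypothetical, it covers only the single-DP case, and it assumes the Gamma prior is supplied as shape and scale (the scale is converted to a rate internally).

```python
import math
import random


def resample_concentration(alpha, num_clusters, num_points, shape, scale, rng=random):
    """One auxiliary-variable update of a DP concentration parameter.

    Hypothetical helper for illustration; assumes a Gamma(shape, scale) prior.
    """
    rate = 1.0 / scale  # the update below is written in terms of a rate
    # Auxiliary variable: eta ~ Beta(alpha + 1, n).
    eta = rng.betavariate(alpha + 1.0, num_points)
    post_rate = rate - math.log(eta)
    # Mixture odds: (shape + k - 1) / (n * (rate - log eta)).
    odds = (shape + num_clusters - 1.0) / (num_points * post_rate)
    weight = odds / (1.0 + odds)
    # Choose between Gamma(shape + k, .) and Gamma(shape + k - 1, .).
    post_shape = shape + num_clusters
    if rng.random() >= weight:
        post_shape -= 1.0
    # random.gammavariate takes (shape, scale), so invert the rate.
    return rng.gammavariate(post_shape, 1.0 / post_rate)


random.seed(0)
new_alpha = resample_concentration(
    alpha=1.0, num_clusters=5, num_points=100, shape=1.0, scale=1.0
)
```

In a sampler, a step like this would typically be repeated once per Gibbs iteration, using the current number of clusters and observations.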

# gamma shape

The Gamma shape parameter used for drawing the first-level concentration parameter.

# gamma scale

The Gamma scale parameter used for drawing the first-level concentration parameter.

# alpha shape

The Gamma shape parameter used for drawing the second-level concentration parameter.

# alpha scale

The Gamma scale parameter used for drawing the second-level concentration parameter.

# topic Dirichlet parameter

The topic Dirichlet parameter is the concentration parameter used for the topic draws, i.e. \( \phi_i \sim \mathrm{Dir}(\eta \mathbf{v}) \), where \( \phi_i \) is topic \( i \), \( \eta \) is the topic Dirichlet parameter, and \( \mathbf{v} \) is the vocabulary vector.
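To make the role of the parameter concrete, here is a minimal sketch of drawing one topic from a Dirichlet prior using only the Python standard library. It assumes the vocabulary vector is a vector of ones (a symmetric Dirichlet), which the text above does not state explicitly; `draw_topic` and its arguments are hypothetical names.

```python
import random


def draw_topic(eta, vocab_size, rng=random):
    """Draw one topic (a distribution over words) from Dir(eta * ones).

    Assumes a symmetric Dirichlet; not taken from Wang's implementation.
    """
    # A Dirichlet draw is a set of independent Gamma(eta, 1) draws,
    # normalized so that they sum to one.
    gammas = [rng.gammavariate(eta, 1.0) for _ in range(vocab_size)]
    total = sum(gammas)
    return [g / total for g in gammas]


random.seed(1)
phi = draw_topic(eta=0.5, vocab_size=10)
```

Smaller values of the parameter tend to concentrate each topic's mass on fewer words, while larger values give flatter topics.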

# perform split merge

Use split-merge MCMC as described in "A Split-Merge MCMC Algorithm for the Hierarchical Dirichlet Process" (2012).

# Output

The TMA output is described in TMA Pages. The output from Wang's HDP code is described below:

# mode

The mode.bin file contains a binary representation of the HDP model at the mode of the model's likelihood. This binary file can be used for inference on new data.

# mode-word-assignments

The mode-word-assignments.dat file contains the topic assignment of each word in the format d w z t, where:

- d: document id
- w: word id
- z: topic index
- t: table index
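A file in this format could be read with a small parser such as the sketch below. It assumes whitespace-separated integer fields with one assignment per line, and it skips any line that does not parse as four integers (for example a header line, if one is present). The function name is hypothetical and not part of TMA or Wang's code.

```python
def parse_word_assignments(lines):
    """Return a list of (d, w, z, t) integer tuples from an iterable of lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # blank or malformed line
        try:
            rows.append(tuple(int(f) for f in fields))
        except ValueError:
            continue  # non-numeric line, e.g. a "d w z t" header
    return rows


# Made-up sample data for illustration only.
sample = ["d w z t", "0 12 3 0", "0 7 1 1", "1 12 3 0"]
rows = parse_word_assignments(sample)
```

For a real run, an open file handle for mode-word-assignments.dat can be passed in place of the `sample` list.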

# mode-topics

The mode-topics.dat file is a K x W matrix (topics x vocabulary) of word counts for each topic from the mode HDP model.

# state

Information to monitor the MCMC, e.g. model likelihood.

# Relationships

The data relationships determined by HDP are described generally in Data Relationships. The two necessary components of these relationship calculations are a document x topics probability matrix and a topics x terms probability matrix.

#### document-topic

Each entry (i,j) of the document x topic probability matrix is the normalized number of terms of topic j in document i, as obtained from the topic index (z) in the mode-word-assignments.dat output file.
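The document-topic calculation can be sketched as follows. This is an illustrative reading, not TMA's actual code: it assumes "normalized" means each document's topic counts are divided by that document's total, so every row sums to one, and it takes the assignments as (d, w, z, t) tuples. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def document_topic_matrix(assignments):
    """Map (d, w, z, t) tuples to {document id: {topic index: probability}}."""
    counts = defaultdict(Counter)
    for d, _w, z, _t in assignments:
        counts[d][z] += 1  # one more word of topic z in document d
    matrix = {}
    for d, topic_counts in counts.items():
        total = sum(topic_counts.values())
        matrix[d] = {z: n / total for z, n in topic_counts.items()}
    return matrix


# Made-up assignments: document 0 has two words in topic 3 and one in topic 1.
toy = [(0, 12, 3, 0), (0, 7, 1, 1), (0, 5, 3, 0), (1, 12, 3, 0)]
doc_topic = document_topic_matrix(toy)
```

A nested dictionary is used here because the number of topics is not fixed in advance for the HDP; topics that never occur in a document are simply absent from its row.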

#### topic-term

Each entry (i,j) of the topic x term probability matrix is the normalized number of times that term j was assigned to topic i, as obtained from the mode-topics.dat output file.
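A minimal sketch of the topic-term normalization is shown below. It assumes the counts are available as whitespace-separated text with one topic per line, and that each topic's row is normalized to sum to one; both are assumptions about the file layout and the normalization rather than documented facts. The function names are hypothetical.

```python
def parse_counts(lines):
    """Parse whitespace-separated count rows, skipping blank lines."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def topic_term_matrix(count_rows):
    """Normalize each topic's word counts into a probability row."""
    probs = []
    for row in count_rows:
        total = sum(row)
        if total > 0:
            probs.append([c / total for c in row])
        else:
            probs.append([0.0] * len(row))  # empty topic: leave as zeros
    return probs


# Made-up counts for 2 topics over a 3-word vocabulary.
topic_term = topic_term_matrix(parse_counts(["2 1 1", "0 0 4"]))
```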

# Perplexity

WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.

Determining the held-out test perplexity is described generally in Perplexity. For the HDP, we determine the held-out perplexity by evaluating the held-out data likelihood via test mode with Wang's HDP code [TODO expand this section with more detail].
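For orientation only, the usual definition of perplexity as the exponentiated negative per-token log-likelihood can be sketched as below. This is the generic formula, not a description of how TMA or Wang's code computes it; it assumes the held-out log-likelihood is a natural logarithm and that the total held-out token count is known. The function name is hypothetical.

```python
import math


def perplexity(log_likelihood, num_tokens):
    """Perplexity = exp(-log-likelihood / number of held-out tokens)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(-log_likelihood / num_tokens)


# If every one of 10 tokens had probability 1/4, the perplexity is 4.
example = perplexity(log_likelihood=10 * math.log(0.25), num_tokens=10)
```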
