cjrd edited this page Jun 13, 2012 · 15 revisions

The Correlated Topic Model (CTM) is an LDA-like model that uses draws from a logistic-normal distribution instead of a Dirichlet in order to allow for correlations between the topics: e.g., an article that contains a "marriage" topic is more likely to also contain a "love" topic than a "Gibbs sampling" topic. See Blei and Lafferty, 2007 for a thorough discussion of the CTM. TMA uses Dave Blei's [CTM implementation](http://www.cs.princeton.edu/~blei/lda-c/index.html).

# Input

#### number of topics

The number of topics (thematic groups) present in the analysis data. Several techniques exist to set this parameter; a common technique is to evaluate the held-out likelihood or perplexity.

#### topic initialization

This parameter describes how the topics will be initialized:

* "random" initializes each topic randomly
* "seeded" initializes each topic to a distribution smoothed from a randomly chosen document

#### covariance estimation technique

This parameter determines whether to estimate the covariance matrix of the Gaussian by maximum likelihood (MLE) or by a regression-based 'shrinkage' approach; both methods are discussed in Blei and Lafferty, 2007.

#### maximum variational iterations

The maximum number of iterations of coordinate ascent variational inference for a single document as described in Blei et al. 2003. Providing a value of -1 indicates that TMA should perform "full" variational inference, until the variational convergence criterion is met.

#### variational convergence threshold

The convergence criterion for variational inference: stop if (score_old - score) / abs(score_old) is less than this value (or after the maximum number of iterations). Note that the score is the lower bound on the likelihood for a particular document.

#### maximum EM iterations

The maximum number of iterations of variational EM.

#### EM convergence threshold

The convergence criterion for variational EM: stop if (score_old - score) / abs(score_old) is less than this value (or after the maximum number of iterations). Note that "score" is the lower bound on the likelihood for the whole corpus.
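Both thresholds apply the same relative-change stopping rule. A minimal sketch in Python (the score values are hypothetical; the sketch compares the magnitude of the change, since the literal score_old - score is negative whenever the bound improves between iterations):

```python
def has_converged(score_old, score, threshold):
    # Relative change of the variational lower bound between iterations.
    # We compare magnitudes: the bound increases as optimization proceeds,
    # so score_old - score itself is negative on an improving step.
    return abs(score_old - score) / abs(score_old) < threshold

# Hypothetical per-iteration lower bounds for one document
scores = [-1200.0, -1100.0, -1090.0, -1089.999]
for prev, cur in zip(scores, scores[1:]):
    print(has_converged(prev, cur, 1e-4))  # False, False, True
```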

# Output

From the README in Blei's CTM code:

Once EM has converged, the model directory will be populated with several files that can be used to examine the resulting model fit, for example to make topic graph figures or compute similarity between documents. All the files are stored in row major format. They can be read into R with the command:

    x <- matrix(scan(FILENAME), byrow=T, nrow=NR, ncol=NC)

where FILENAME is the file, NR is the number of rows, and NC is the number of columns.

Below, K is the number of topics, D is the number of documents, and V is the number of terms in the vocabulary. The relevant output files are as follows:

#### final-cov

This file contains the final (K-1) x (K-1) covariance matrix of the multivariate Gaussian.

#### final-inv-cov

This file contains the final inverse of the (K-1) x (K-1) covariance matrix of the multivariate Gaussian.

#### final-lambda

This file contains the D x K matrix of variational mean parameters for each document's topic proportions. In other words, it contains the mean parameters of the independent multivariate Gaussian for each document.

#### final-log-beta

This is a K x V topic matrix. The ith row contains the log probabilities of the words for the ith topic.

#### final-mu

This file contains the K-1 mean vector of the logistic normal over the topic proportions, as opposed to the final-lambda file, which contains the variational mean parameter for each document's topic proportions.

#### final-nu

This file contains the D x K matrix of variational variance parameters for each document in the corpus, where each row corresponds to the diagonal of the corresponding covariance matrix.
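As a quick check of a fitted model, the top words of each topic can be read off final-log-beta. A sketch (the vocabulary list is an assumption, not a file Blei's code writes; it must be the term list, in column order, used when the corpus was prepared):

```python
import numpy as np

def top_words(log_beta, vocab, n=10):
    # log_beta: K x V matrix, one row of log word probabilities per topic.
    # vocab: list of V terms aligned with the columns (an assumption here).
    tops = []
    for row in log_beta:
        idx = np.argsort(row)[::-1][:n]  # indices of the n largest entries
        tops.append([vocab[i] for i in idx])
    return tops
```

Here log_beta would be loaded from final-log-beta with the row-major read described above and reshaped to K rows and V columns.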

# Relationships

The data relationships determined by CTM are described generally in Data Relationships. The two necessary components of these relationship calculations are a document x topic probability matrix and a topic x term probability matrix.

#### document-topic

The document-topic probability matrix used to compute the appropriate Data Relationships is obtained by sampling from the variational distribution in order to approximate the posterior expectation of the document-level multinomial distributions, referred to as \( \theta_n \) for document n. As described in Data Relationships, we compute the Hellinger distance between the document-level multinomials in order to determine document-document and document-topic relationships. We therefore compute the expectation of \( \sqrt{\theta_n} \) for each document n = 1, 2, ..., D. Taking \( \lambda_n \) (the variational mean parameter) and \( \nu_n \) (the variational covariance diagonal) from the corresponding output files (final-lambda and final-nu), we have:

\[ E[\sqrt{\theta_n}] \approx \frac{1}{K} \sum_{k=1}^{K} \sqrt{\theta_n^{(k)}}, \qquad \theta_n^{(k)} \sim \mathrm{LogisticNormal}\left(\lambda_n, \mathrm{diag}(\nu_n)\right) \]

for k = 1, 2, ..., K samples from the logistic-normal distribution. From an empirical test of convergence, we found that K = 100 samples produces stable results.
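The sampling step above can be sketched as follows (a sketch, not TMA's implementation; lam and nu stand for the rows of final-lambda and final-nu for one document):

```python
import numpy as np

def expected_sqrt_theta(lam, nu, n_samples=100, seed=None):
    # Draw eta ~ N(lambda_n, diag(nu_n)): independent Gaussians per dimension.
    rng = np.random.default_rng(seed)
    eta = rng.normal(loc=lam, scale=np.sqrt(nu), size=(n_samples, len(lam)))
    # Logistic transform maps each eta sample onto the topic simplex
    # (max subtracted for numerical stability).
    theta = np.exp(eta - eta.max(axis=1, keepdims=True))
    theta /= theta.sum(axis=1, keepdims=True)
    # Monte Carlo average of sqrt(theta) over the samples.
    return np.sqrt(theta).mean(axis=0)
```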

#### topic-term

The topic-term matrix used to determine the data relationships is taken directly from the final-log-beta output file.
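The Hellinger distance used for these relationship calculations is straightforward to compute; a sketch between two discrete distributions:

```python
import numpy as np

def hellinger(p, q):
    # H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Because only the square roots of the distributions enter the formula, the precomputed expectations of \( \sqrt{\theta_n} \) can be compared directly.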

# Perplexity

WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.

Determining the held-out test perplexity is described generally in Perplexity. For CTM, we determine the held-out perplexity by evaluating the held-out data likelihood via inf (inference) mode with Blei's CTM code [TODO expand this section with more detail].
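Given per-document held-out log likelihoods from inference mode, the perplexity itself is the exponentiated negative per-word log likelihood. A sketch (log_liks and doc_lengths are assumed inputs, not files CTM writes: the held-out log likelihood and word count of each test document):

```python
import numpy as np

def perplexity(log_liks, doc_lengths):
    # exp(- total held-out log likelihood / total number of held-out words)
    return np.exp(-np.sum(log_liks) / np.sum(doc_lengths))
```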
