
Quantitative Analysis

cjrd edited this page Jun 12, 2012 · 29 revisions

# Likelihood

Likelihood is the combined [log] probability of the data given the topic model. See the specific topic models for the exact implementation of the likelihood statistics presented in the model table. Generally, the likelihood for topic k is calculated as

[ \sum_{i} \log p(w_i \mid \mathbf{z}_k) ]

where ( \{w_i\} ) is the set of terms listed in the table.

  • Note: less negative (larger positive) likelihood scores indicate the topic model fits the given data better
  • Note: negative likelihood scores indicate that the log-likelihood is used.
  • Note: it is possible to obtain -infinity likelihood scores in the TMA
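The likelihood sum above can be sketched as follows. This is a minimal illustration, not the TMA's implementation; the input is assumed to be the per-term probabilities ( p(w_i | \mathbf{z}_k) ) for the terms in the table, however the specific model defines them.

```python
import math

def topic_log_likelihood(word_probs):
    """Sum of log p(w_i | z_k) over a topic's terms.

    word_probs: iterable of p(w_i | z_k) values (hypothetical input
    format; the exact set of terms depends on the specific topic model).
    """
    # The log of a product is the sum of logs; a zero probability yields
    # -infinity, matching the note above that -infinity scores can occur.
    return sum(math.log(p) if p > 0 else float("-inf") for p in word_probs)
```

A less negative result means the model assigns higher probability to the observed terms.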

# Topic Coherence

"Topic Coherence" is a metric formulated in the paper, Optimizing Semantic Coherence in Topic Models (equation 1), to automatically evaluate the coherence of the M most probable words for a given topic via:

[ \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D_{m,l} + 1}{D_l} ]

summed over the M choose 2 pairs of terms for the topic, where ( D_{m,l} ) is the number of documents in the corpus that contain both term 'm' and term 'l', and ( D_l ) is the number of documents in the corpus that contain term 'l'.

  • Note: less negative scores indicate a better topic
  • Note: this coherence score works best if the data has been preprocessed so that rarely co-occurring terms are removed and we don't have, e.g., ( D_{m,l} = D_l = 1 ).
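The double sum above can be sketched directly from the definition. This is an illustrative implementation only; the corpus is represented here as a list of per-document term sets, which is an assumption about the input format, not the TMA's internal representation.

```python
import math

def umass_coherence(top_terms, docs):
    """Topic coherence (equation 1 of Mimno et al.) for the M most
    probable terms of a topic.

    top_terms: the M terms, ordered from most to least probable.
    docs: list of sets of terms, one set per document (illustrative
    corpus representation).
    """
    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            # D_{m,l}: documents containing both terms
            d_ml = sum(1 for d in docs if top_terms[m] in d and top_terms[l] in d)
            # D_l: documents containing term l (nonzero when the terms
            # actually occur in the corpus)
            d_l = sum(1 for d in docs if top_terms[l] in d)
            score += math.log((d_ml + 1) / d_l)
    return score
```

The +1 smoothing in the numerator keeps the log finite when a pair never co-occurs.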

# Wikipedia Coherence

"Wikipedia Coherence" is a metric formulated in the paper, Automatic Evaluation of Topic Coherence (see 4.2 Wikipedia: "Term Co-occurrence" (PMI)), which uses the median of:

[ \log \frac{N_{w_1,w_2} \, N}{N_{w_1} N_{w_2}} ]

as a measure of topic coherence for the most probable terms in a topic. ( N ) is the number of Wikipedia document windows in the corpus, ( N_{w_1,w_2} ) is the number of Wikipedia document windows that contain both term 'w1' and term 'w2', and ( N_{w_i} ) (i = 1, 2) is the number of Wikipedia document windows that contain term 'wi'. In the original publication, the authors use a sliding 10-word window over all Wikipedia documents as their "Wikipedia document windows." To avoid the arbitrary nature of a ten-word window, we instead search for co-occurrence in the entire abstracts of all Wikipedia documents. See TODO (insert database URL) for the Wikipedia database.

  • Note: larger scores indicate a better topic
  • Note: a -1 indicates that a term in the topic was not found in the wikipedia abstracts and therefore a valid coherence score could not be determined.
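The median-PMI computation can be sketched as below. This is an illustrative version under stated assumptions: the windows are passed in as sets of terms (here, whole abstracts), and pairs with zero co-occurrence are skipped, which is a guess at the intended handling rather than the TMA's documented behavior.

```python
import math
from itertools import combinations
from statistics import median

def wikipedia_coherence(top_terms, windows):
    """Median PMI over all pairs of a topic's most probable terms.

    windows: list of sets of terms, one per Wikipedia document window
    (in our setup, one per abstract; illustrative input format).
    Returns -1 if any topic term never appears, per the note above.
    """
    n = len(windows)
    counts = {t: sum(1 for w in windows if t in w) for t in top_terms}
    if any(c == 0 for c in counts.values()):
        return -1
    pmis = []
    for w1, w2 in combinations(top_terms, 2):
        n12 = sum(1 for w in windows if w1 in w and w2 in w)
        if n12 == 0:
            continue  # assumption: pairs that never co-occur are skipped
        pmis.append(math.log(n12 * n / (counts[w1] * counts[w2])))
    return median(pmis) if pmis else -1
```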

# Bing Coherence

"Bing Coherence" is a metric based from a similar metric formulated in the paper, Automatic Evaluation of Topic Coherence (see 4.3 Search engine-based similarity: "Google title matches (Titles)") --- the central difference is that the authors of the paper used Google where we use Microsoft's Bing. This method queries Bing for the topic in its entirety, after first de-stemming the terms if necessary, and the Bing coherence for each topic is simply the number of cooccerences of unqiue topic terms in the top 50 search results.

  • Note: Larger scores indicate a better topic.
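The counting step can be sketched as follows. Querying Bing itself is outside this sketch (the texts of the top 50 results are assumed to be given), and counting each result's present term pairs is our reading of "number of co-occurrences of unique topic terms," not a confirmed specification.

```python
def bing_coherence(topic_terms, result_texts):
    """Count co-occurrences of unique (de-stemmed) topic terms in the
    top search results.

    result_texts: texts of the top 50 Bing results for the topic query
    (fetching them from Bing is not shown here).
    """
    unique_terms = set(topic_terms)
    score = 0
    for text in result_texts:
        present = unique_terms & set(text.lower().split())
        # assumption: each pair of topic terms appearing together in a
        # result counts as one co-occurrence
        score += len(present) * (len(present) - 1) // 2
    return score
```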

# Perplexity

WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.

Perplexity -- a metric commonly used in language modeling -- is a monotonically decreasing function of the held-out test likelihood. Perplexity is equivalent to the inverse of the geometric mean of the per-word likelihood, and can be computed on a per-word, per-document, or per-corpus basis (there are other types of perplexity as well; see e.g. wikipedia). We compute the log-perplexity over the entire held-out test data [fold] as follows:

[ \mbox{log(perplexity)} = -\frac{1}{W}\sum_m \sum_n \log p(w_{m,n} \mid \mbox{training model}) ]

where W is the total number of tokens in the test documents, the first summation is over the M test documents, and the second summation is over the tokens ( w_{m,n} ) of test document m. The determination of the conditional probabilities varies from model to model and can be found in the appropriate Topic-Models subpage.

  • Note: a lower perplexity score indicates better generalization performance of the topic model
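Given the warning above, the following is purely illustrative: a sketch of the log-perplexity formula assuming the per-token conditional log-probabilities have already been computed by the model (the input format is hypothetical).

```python
import math

def log_perplexity(log_probs_per_doc):
    """log(perplexity) over a held-out test fold.

    log_probs_per_doc: for each test document m, a list of
    log p(w_{m,n} | training model) values, one entry per token
    (illustrative input; the conditional probabilities are
    model-specific).
    """
    # W: total number of tokens across all test documents
    W = sum(len(doc) for doc in log_probs_per_doc)
    total_log_prob = sum(sum(doc) for doc in log_probs_per_doc)
    # negated average per-word log-likelihood: lower means better
    # generalization performance
    return -total_log_prob / W
```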
