Data Relationships

cjrd edited this page Jun 12, 2012 · 1 revision

This page describes the data relationships computed in TMA. The relationship scores are derived from the following matrices, which are obtained from the topic models (see the appropriate topic model page for details on how these matrices are determined):

  • topic x term probability matrix
  • document x topic probability matrix
  • document x term count matrix

The following scores are used to assign relationships in TMA. All scores are designed so that a higher (more positive or less negative) score indicates a stronger relationship between the given entities.

# Doc-Doc Score

The document-document similarity score is the inverse of the Hellinger distance between the two documents' topic distributions: select the two document rows from the document x topic probability matrix, compute the Hellinger distance between these two multinomials, and assign a score of 1/Hellinger-Distance.
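The following is a minimal sketch of the inverse Hellinger score, not TMA's actual implementation. It assumes the normalized form of the Hellinger distance (with the 1/√2 factor, so distances fall in [0, 1]); the source does not say which form TMA uses. The function names and the small `doc_topic` matrix are hypothetical, and mapping a zero distance to infinity is an illustrative choice for the undefined 1/0 case.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Assumption: the normalized form, sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

def inverse_hellinger_score(p, q):
    """Score of 1/Hellinger-Distance.

    Assumption: identical distributions (distance 0) are mapped to infinity.
    """
    distance = hellinger_distance(p, q)
    return math.inf if distance == 0.0 else 1.0 / distance

# Hypothetical document x topic probability matrix: one row per document.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

# Documents 0 and 1 have similar topic mixtures, so they score higher
# than documents 0 and 2.
score_01 = inverse_hellinger_score(doc_topic[0], doc_topic[1])
score_02 = inverse_hellinger_score(doc_topic[0], doc_topic[2])
```

Because the score is a reciprocal, it grows without bound as two distributions converge, so any code consuming it needs to handle very large or infinite values.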

# Doc-Term Score

The document-term similarity score is the raw, unnormalized count of the given term in the given document.

# Doc-Topic Score

The document-topic similarity score is determined directly from the probabilities in the document x topic probability matrix.

# Term Score

The stand-alone score of a term is simply the number of occurrences of the term in the corpus.
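As a rough illustration, both count-based scores can be read off a document x term count matrix. This sketch assumes the matrix covers every document in the corpus, so that a column sum equals the corpus-wide count. The `terms` list, the matrix values, and the function names are hypothetical.

```python
# Hypothetical document x term count matrix: rows are documents, columns are terms.
terms = ["model", "topic", "corpus"]
doc_term_counts = [
    [3, 0, 1],
    [1, 2, 0],
    [0, 4, 2],
]

def doc_term_score(counts, doc_index, term_index):
    """Raw count of one term in one document (a direct matrix lookup)."""
    return counts[doc_index][term_index]

def term_score(counts, term_index):
    """Occurrences of a term in the corpus, assuming the matrix spans all documents."""
    return sum(row[term_index] for row in counts)

# Stand-alone score for every term, keyed by the term string.
term_scores = {term: term_score(doc_term_counts, j) for j, term in enumerate(terms)}
```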

# Topic Score

The stand-alone topic score is the conditional probability of the topic given all of the documents. It is computed by summing the topic's column of the document x topic probability matrix when that matrix stores log-likelihoods; if the matrix stores likelihoods instead, the sum is replaced with a product.
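The relationship between the sum and product forms can be sketched as follows. This is an illustration under assumptions, not TMA's code: the matrix values and function names are hypothetical, and every probability is assumed to be strictly positive so that its logarithm is defined.

```python
import math

# Hypothetical document x topic probability matrix (likelihood form).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

def topic_score_from_probabilities(matrix, topic_index):
    """Product over documents of the topic's column (likelihood form)."""
    return math.prod(row[topic_index] for row in matrix)

def topic_score_from_log_probabilities(log_matrix, topic_index):
    """Sum over documents of the topic's column (log-likelihood form)."""
    return sum(row[topic_index] for row in log_matrix)

# Assumption: all entries are > 0, otherwise math.log raises ValueError.
log_doc_topic = [[math.log(p) for p in row] for row in doc_topic]

product_score = topic_score_from_probabilities(doc_topic, 0)
log_score = topic_score_from_log_probabilities(log_doc_topic, 0)
```

Exponentiating `log_score` recovers `product_score` up to floating-point error. The log form is usually preferable for large corpora, where a product of many small probabilities can underflow to zero.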

# Topic-Term Score

The topic-term similarity score is determined directly from the probabilities in the topic x term probability matrix.

# Topic-Topic Score

The topic-topic similarity score is the inverse of the Hellinger distance between the two topics' term distributions: select the two topic rows from the topic x term probability matrix, compute the Hellinger distance between these two multinomials, and assign a score of 1/Hellinger-Distance.
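One way such a score might be used is to rank the topics most related to a given topic. The sketch below again assumes the normalized Hellinger distance and maps a zero distance to infinity. The `topic_term` values and the `rank_related_topics` helper are hypothetical and are not part of TMA.

```python
import math

def hellinger_distance(p, q):
    """Normalized Hellinger distance (assumed form) between two distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Hypothetical topic x term probability matrix: one row per topic.
topic_term = [
    [0.5, 0.4, 0.1, 0.0],
    [0.4, 0.5, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]

def rank_related_topics(matrix, topic_index):
    """Return (other_topic, score) pairs sorted from strongest to weakest."""
    scored = []
    for other, row in enumerate(matrix):
        if other == topic_index:
            continue  # skip the topic itself
        distance = hellinger_distance(matrix[topic_index], row)
        score = math.inf if distance == 0.0 else 1.0 / distance
        scored.append((other, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Topic 1 shares most of its probability mass with topic 0, so it ranks first.
ranking = rank_related_topics(topic_term, 0)
```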
