TMA Interface
The TMA analysis interface was designed to allow both simplicity and full control over the desired topic model analysis.
# Analysis
The Analysis section of the TMA interface is divided into two parts:
- Required parameters, e.g. topic model, number of topics (for parametric models)
- Advanced (optional) parameters, e.g. convergence criteria, initialization options
See the Topic Models page of the desired model for detailed information about its input parameters.
# Data
#### example
The example datasets provide appropriately sized real-world datasets for experimenting with TMA. The test data are described in detail under the Example Datasets page.
#### upload
TMA accepts a zip archive or a txt file (files must carry the appropriate suffix). Uploaded files may not exceed 20 MB for the online interface. A zip archive can contain pdf or txt files (again with the appropriate suffix).
TMA currently accepts only raw data, in two formats (TMA handles the processing itself, so just supply the actual txt or pdf documents):
- Individual files: each document is represented by an individual file, e.g. a zip file could contain 500 text files, where each text file is an article from the Washington Post. WARNING: input files must use Unix EOL markers (LF, not Windows CR+LF)
- Paragraphs (lines): each document is represented as a paragraph/line in the uploaded txt or pdf files, e.g. a single txt file could be uploaded where each line is an article from the Washington Post
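The two upload formats can be prepared with a few lines of Python. This is a minimal sketch and not part of TMA: the helper names (`normalize_eol`, `build_archive`, `as_lines_file`) are hypothetical, UTF-8 encoding is an assumption, and the 20 MB limit is interpreted here as 20 * 1024 * 1024 bytes.

```python
import io
import zipfile

# Assumption: "20 MB" is read as 20 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024


def normalize_eol(text: str) -> str:
    """Convert Windows (CR+LF) and bare CR line endings to Unix LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def build_archive(documents: dict) -> bytes:
    """Individual-files format: pack {file name: text} into an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in documents.items():
            if not name.endswith(".txt"):
                raise ValueError(f"{name!r} must carry a .txt suffix")
            # Assumption: documents are stored as UTF-8.
            archive.writestr(name, normalize_eol(text).encode("utf-8"))
    data = buffer.getvalue()
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("archive exceeds the 20 MB upload limit")
    return data


def as_lines_file(documents: list) -> str:
    """Paragraphs/lines format: one document per line, inner newlines flattened."""
    lines = (" ".join(normalize_eol(doc).split()) for doc in documents)
    return "\n".join(lines) + "\n"
```

Flattening inner whitespace in `as_lines_file` is a deliberate choice: in the paragraphs/lines format a stray newline inside an article would otherwise split it into two documents.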
#### url
TMA accepts pdf data directly from other servers: simply provide the url. This was designed to let research groups easily upload their publications to TMA, e.g. by entering a url such as http://www.cs.princeton.edu/~blei/publications.html that has direct pdf links to the documents.
- Note: TMA respects robots.txt settings
- Note: TMA has a pdf download limit of 100 MB, and individual files cannot exceed 5 MB
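To check in advance which documents a page exposes, the direct pdf links can be listed and filtered against robots.txt rules. The sketch below uses only the Python standard library; it illustrates the idea and is not TMA's actual downloader. The names `PdfLinkParser`, `direct_pdf_links` and `allowed_links` are hypothetical, and treating any `href` ending in `.pdf` as a direct link is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a "direct pdf link" is any href ending in .pdf.
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def direct_pdf_links(html, base_url):
    """Return the pdf links found in an HTML page, resolved against base_url."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def allowed_links(links, robots_txt, user_agent="*"):
    """Keep only the links that the given robots.txt text permits."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [link for link in links if rules.can_fetch(user_agent, link)]
```

Both functions take already-downloaded text, so they can be tried on a saved copy of a publications page without touching the network.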
#### arXiv
Search arXiv.org for publications by the specified authors. Separate multiple authors or various spellings using 'OR', e.g. Michael I. Jordan OR Michael Jordan OR David Blei OR David M. Blei, to search for the publications of the two authors with various spellings.
- Note: it is also possible to restrict the author search to specific fields using the 'subjects' selector provided (hold ctrl to select multiple subjects).
- Note: pdfs are downloaded in a random order from the arXiv query
# Processing
The processing options are applied in an 'OR' style: for example, a term is removed if its tf-idf score is below the threshold OR it does not appear in enough documents.
NOTE: In the online implementation, TMA keeps a maximum of 5000 terms, ranked by tf-idf. This limit can be changed in the settings file for the offline implementation.
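The 'OR' rule and the term cap can be sketched as follows. This is an illustration under stated assumptions, not TMA's code: the function name `filter_terms`, its parameters, and the per-term inputs (a maximum tf-idf score and a document frequency for each term) are all hypothetical.

```python
def filter_terms(max_tfidf, doc_freq, tfidf_threshold, min_docs, max_terms=5000):
    """Drop a term if EITHER criterion rejects it, then cap the vocabulary.

    max_tfidf: {term: highest tf-idf score of the term in any document}
    doc_freq:  {term: number of documents containing the term}
    """
    kept = [
        term
        for term in max_tfidf
        # 'OR' style: failing one test is enough to remove the term.
        if not (max_tfidf[term] < tfidf_threshold or doc_freq.get(term, 0) < min_docs)
    ]
    # Rank the survivors by tf-idf and keep at most max_terms of them.
    kept.sort(key=lambda term: max_tfidf[term], reverse=True)
    return kept[:max_terms]
```

For instance, a term with a high tf-idf score that occurs in only one document is still removed when `min_docs` is 2, because the two criteria are not required to agree.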
TMA provides the following processing options:
- Select either paragraphs/lines or document representation of the uploaded/downloaded data: "paragraphs/lines" means that each line of the uploaded/downloaded data is one document (CAUTION: pdf-to-text paragraph conversion is unpredictable); "document" means that each uploaded/downloaded file is exactly one document, not a collection of documents with one per line.
- Minimum number of words per document: the minimum number of words that constitute a document (documents whose word count falls below this threshold are removed).
- valid tf-idf fraction: sort the terms in the vocabulary by their maximum TF-IDF scores and keep the top 'tf-idf fraction'; this technique removes uninformative terms. Set the tf-idf fraction > 1.0 to not remove any terms (note: the online implementation allows a maximum of 5000 terms).
- stem words: stem the words using the Porter 2 stemming algorithm
- remove case: remove capitalization from the words
- remove stop words: remove the words that appear in the stop word list (future versions will allow users to specify their own stop word list)
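The options above can be pictured as a small pipeline. The following is a rough sketch, not TMA's implementation: the function `preprocess`, the tiny `STOP_WORDS` set, the whitespace tokenizer, the order of the steps, and the tf-idf formula (relative term frequency times log of inverse document frequency) are all assumptions. Stemming is left out to keep the sketch free of third-party dependencies.

```python
import math
from collections import Counter

# Tiny illustrative list; TMA's actual stop word list is not reproduced here.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def preprocess(docs, min_words=2, tfidf_fraction=0.5,
               remove_case=True, remove_stop_words=True):
    """Return (filtered token lists, kept vocabulary) for a list of raw documents."""
    tokenized = []
    for doc in docs:
        words = doc.split()  # assumption: whitespace tokenization
        if remove_case:
            words = [w.lower() for w in words]
        if remove_stop_words:
            words = [w for w in words if w.lower() not in STOP_WORDS]
        # Assumption: the minimum word count is checked after stop word removal.
        if len(words) >= min_words:
            tokenized.append(words)

    n_docs = len(tokenized)
    doc_freq = Counter(term for words in tokenized for term in set(words))

    # Maximum tf-idf score of each term over all documents.
    max_tfidf = {}
    for words in tokenized:
        for term, count in Counter(words).items():
            score = (count / len(words)) * math.log(n_docs / doc_freq[term])
            max_tfidf[term] = max(max_tfidf.get(term, 0.0), score)

    # Rank by score (ties broken alphabetically) and keep the top fraction.
    ranked = sorted(max_tfidf, key=lambda t: (-max_tfidf[t], t))
    if tfidf_fraction >= 1.0:
        n_keep = len(ranked)
    else:
        n_keep = math.ceil(tfidf_fraction * len(ranked))
    vocabulary = set(ranked[:n_keep])

    filtered = [[w for w in words if w in vocabulary] for words in tokenized]
    return filtered, vocabulary
```

With this formula a term that occurs in every document scores zero and is among the first to be dropped, which matches the stated purpose of removing uninformative terms.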