`goldi` is a tool for identifying key terms in text. It was developed to identify ontological labels in free-form text, with specific application to finding Gene Ontology terms in the biomedical literature under strict canonical NLP quality control.
The package is currently checked on R-oldrel (v3.3.3), R-release (v3.4.0), and R-devel (v3.5.0).
`goldi` can be installed from CRAN with:
install.packages("goldi")
Or, you may choose to install the latest stable development version with
devtools::install_github("Chris1221/goldi")
`goldi` attempts to identify terms in free text through semantic similarity. This means that the more words a term and a sentence share, the more likely the sentence is to be discussing the term.
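As a rough illustration of this idea (not `goldi`'s actual implementation, which works from term document matrices), the overlap between a term and a sentence can be counted in base R. The helper names `tokenize` and `shared_words` are hypothetical, and the tokenization is deliberately naive.

```r
# Hypothetical helper: lower-case a string and split it into unique words.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  unique(words[nzchar(words)])
}

# Hypothetical helper: number of distinct words a term shares with a sentence.
shared_words <- function(term, sentence) {
  length(intersect(tokenize(term), tokenize(sentence)))
}

term <- "ribosomal chaperone activity"
s1 <- "In this sentence we will talk about ribosomal chaperone activity."
s2 <- "In this sentence we will talk about nothing."

shared_words(term, s1)  # all three words of the term appear
shared_words(term, s2)  # none of the term's words appear
```

The first sentence shares every word with the term and the second shares none, which is the signal the matching step relies on.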
Given the following input text and the included pre-computed term document matrix for approximately 10,000 Gene Ontology molecular function terms, we can find which terms are discussed in our text.
# Load the package
library(goldi)
# Give the free-form text
doc <- "In this sentence we will talk about ribosomal chaperone activity. In this sentence we will talk about nothing. Here we discuss obsolete molecular terms."
# Load the included term document matrix for the terms
data("TDM.go.df")
# Pipe output and log to /dev/null
output <- "/dev/null"
log <- "/dev/null"
# Run the function and return the result as an R object
goldi(doc = doc,
      term_tdm = TDM.go.df,
      output = output,
      log = log,
      object = TRUE)
Note that in the above example we implement a few other options. First, we don't want to see the output or the log for this example, so we pipe them to `/dev/null`. Second, we would like to return the output as an R object instead of writing it to a file, so we specify `object = TRUE`.
This will output the following table:
Term | Context |
---|---|
ribosomal_chaperone_activity | In this sentence we will talk about ribosomal chaperone activity |
This gives each term identified and the context in the free-form text where it was identified. This table forms the basis for all further analysis.
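The example above sends the output and log to `/dev/null`, which exists only on Unix-like systems. A minimal sketch of a portable alternative, assuming the `output` and `log` arguments accept any writable path (as the example suggests); `null_path` is a name introduced here for illustration:

```r
# Pick the platform's null device: "NUL" on Windows, "/dev/null" elsewhere.
null_path <- if (.Platform$OS.type == "windows") "NUL" else "/dev/null"

output <- null_path
log <- null_path
```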
Q: This is all really confusing, where can I read more about this package?
A: Please see the preprint of our paper.
Q: How does `goldi` match terms to sentences?
A: `goldi` counts the words that a term and a sentence have in common and compares this count to a user-defined acceptance function A(n), where n is the length of the term in words. The default function may be represented as a vector in R, in which the n-th entry gives A(n):
lims <- c(1,2,3,3,4,5,6,6,7,8,9)
If the number of words present equals or exceeds the value of this function, then a match is declared. You are encouraged to experiment and find the acceptance function that works for you.
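The decision rule can be sketched as follows. This is an illustration under stated assumptions rather than `goldi`'s internal code: `is_match` is a hypothetical helper, and terms longer than the vector are assumed to have no defined threshold. For the eleven default values, the thresholds happen to coincide with n - floor(n/4).

```r
# Default acceptance thresholds: lims[n] is the minimum number of shared
# words needed to match a term that is n words long.
lims <- c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9)

# Hypothetical helper: TRUE when the shared-word count reaches the threshold.
is_match <- function(n_shared, term_length, lims) {
  if (term_length < 1 || term_length > length(lims)) {
    stop("no threshold defined for a term of this length")
  }
  n_shared >= lims[term_length]
}

is_match(3, 3, lims)  # TRUE: a 3-word term needs all 3 words
is_match(3, 4, lims)  # TRUE: a 4-word term tolerates one missing word
is_match(2, 4, lims)  # FALSE: two shared words are too few

# Observation: the default values equal n - floor(n / 4) for n = 1..11.
n <- seq_along(lims)
all(lims == n - n %/% 4)  # TRUE
```

A stricter or more lenient rule only requires a different `lims` vector of the same form.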
Q: What if I don't have my text in R, but instead as a text or PDF file?
A: `goldi` has four distinct methods for importing text locally; please see the wiki article on the subject.
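Independently of `goldi`'s own import methods, a plain text file can be read into a single string with base R and then passed as `doc`. A minimal sketch; `read_text_file` is a hypothetical helper, and the temporary file stands in for your own document:

```r
# Hypothetical helper: read a plain text file into one character string.
read_text_file <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = " ")
}

# Write a small example file so that the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("In this sentence we will talk about ribosomal chaperone activity.",
             "In this sentence we will talk about nothing."), path)

doc <- read_text_file(path)
```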
Q: Installation from CRAN is not working and it says something about `slam`, what's going on?
A: Newer versions of the `tm` package, which is a dependency of `goldi`, require a package named `slam`, which needs to be compiled from Fortran. Try the following, and if it doesn't work, raise an issue on the repository and we'll get it fixed! Type the following into a terminal (on Mac OS X):
curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /
Install `slam`:
install.packages("slam")
Reinstall `goldi`:
install.packages("goldi")
Q: When I install the package, I get messages about `libc` or `gcc` versions. What's happening?
A: The most likely scenario is that your `gcc` compiler (which compiles the C++ code) is out of date, especially if you are on an older Linux distribution such as the CentOS releases found on some cluster systems. Contact your system administrator and try to update `gcc`.
Q: How can I work with abstracts from pubmed?
A: We recommend the `RISmed` package.
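A minimal sketch of retrieving abstracts with `RISmed`, assuming the package is installed and a network connection is available. `fetch_abstracts` is a hypothetical wrapper and the query is only an example; the returned text could then be passed to `goldi` as `doc`.

```r
# Hypothetical wrapper: fetch the abstracts of the first `n` PubMed hits.
fetch_abstracts <- function(query, n = 5) {
  # Search PubMed, then download the matching records.
  search <- RISmed::EUtilsSummary(query, type = "esearch", db = "pubmed", retmax = n)
  records <- RISmed::EUtilsGet(search)
  RISmed::AbstractText(records)
}

# Example call (requires internet access):
# abstracts <- fetch_abstracts("ribosomal chaperone activity")
```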
Q: Where can I see some examples of this package in use?
A: Please see the included vignettes, especially the overexpression analysis implemented in the paper.
Q: I am looking for a project to work on with `goldi`, do you have any ideas?
A: Please see here.
Q: Nothing is working, who can I complain to?
A: Please raise an issue on this repository; that is the route most likely to get an answer.
Cole, Christopher B., et al. "Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi." bioRxiv (2016): 073460.
@article{cole2016semi,
title={Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi},
author={Cole, Christopher B and Patel, Sejal and French, Leon and Knight, Jo},
journal={bioRxiv},
pages={073460},
year={2016},
publisher={Cold Spring Harbor Labs Journals}
}