easyPubMed is an open-source R interface to the Entrez Programming Utilities aimed at allowing programmatic access to PubMed in the R environment. The package is suitable for downloading large numbers of records, and includes a collection of functions to perform basic processing of the Entrez/PubMed query responses. The library supports either XML or TXT ("medline") format.
- Simplified Pipeline. The process of retrieving and analyzing PubMed records has been updated and simplified. The revised pipeline includes three steps: 1) submit a query; 2) fetch records; 3) extract information. The corresponding functions are discussed below in this vignette.
- Automatic Job Splitting into Sub-Queries. The Entrez server imposes a strict limit of 10,000 records that can be programmatically retrieved from a single query. Whenever possible, the easyPubMed library automatically attempts to split queries returning large numbers of records into lists of smaller, manageable queries.
- The easyPubMed S4 Class. Here we introduce a new S4 class (easyPubMed) that was designed to better store and manage query information, retrieved records and associated meta-data. This S4 class comes with a series of methods and is aimed at improving data handling and analysis reproducibility. Raw or processed data can be obtained from an easyPubMed object via the appropriate getter functions (as discussed below).
- Additional Parsed Fields. The new epm_parse() function supports extraction of additional information from PubMed records (compared to previous versions of this R library). Extracted fields now include mesh_codes, grant_ids, references and conflict of interest statements (cois), among others. For more information, see the examples below.
- Compact Output. Unlike previous versions of easyPubMed, it is now possible to collapse author information (i.e., author names) into a single string. This way, the output (parsed data.frame) only includes one row per record.
- Improved Support for Book Document Records. The revised epm_fetch() function now supports the download and the identification of raw Book Document Records in either xml or medline format. Note that epm_parse() is still incompatible with this kind of record.
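The idea behind splitting a large job into sub-queries can be illustrated with a small, offline sketch. The source does not describe how easyPubMed actually splits queries, so the strategy below (recursively bisecting a publication-date range until each piece falls under the limit) is only an assumption chosen for illustration. Both `count_records()` and `split_date_range()` are hypothetical names that are not part of easyPubMed, and the record counts are simulated rather than obtained from the Entrez server.

```r
# Hypothetical stand-in for a server-side record count. A real implementation
# would ask the Entrez server how many records match within [from, to].
# Here we simply pretend that 150 records are published every day.
count_records <- function(from, to) {
  as.integer(to - from + 1) * 150L
}

# Recursively bisect a date range until every piece is at or below `limit`.
# This is an illustrative strategy, not necessarily the one used by easyPubMed.
split_date_range <- function(from, to, limit = 10000L,
                             counter = count_records) {
  n <- counter(from, to)
  if (n <= limit || from >= to) {
    # Small enough, or a single day that cannot be split any further.
    return(data.frame(from = from, to = to, n = n))
  }
  # Midpoint of the range, rounded down to a whole day.
  mid <- from + floor(as.numeric(to - from) / 2)
  rbind(
    split_date_range(from, mid, limit, counter),
    split_date_range(mid + 1, to, limit, counter)
  )
}

# One row per sub-query: start date, end date, simulated record count.
jobs <- split_date_range(as.Date("2020-01-01"), as.Date("2020-12-31"))
print(jobs)
```

Each row of `jobs` describes a date window that could be submitted as its own query. The windows are contiguous and non-overlapping, so together they cover the original range exactly once.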
The latest version (3.1.3) of the library is hosted on GitHub, and you can install it using the devtools R library as follows.
devtools::install_github("dami82/easyPubMed")
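After installing, it may be useful to confirm that the installed version is at least the one mentioned above. The following is a minimal sketch that relies on base R only; `has_min_version()` is a hypothetical helper defined here for illustration and is not part of easyPubMed.

```r
# Hypothetical helper (not part of easyPubMed): returns TRUE when `pkg` is
# installed at `min_version` or later, and FALSE otherwise.
has_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}

if (!has_min_version("easyPubMed", "3.1.3")) {
  message("easyPubMed >= 3.1.3 not found; install it from GitHub as shown above.")
}
```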
- easyPubMed is open-source software released under the GPL-3 license, and comes with ABSOLUTELY NO WARRANTY. For more information about the GPL-3 license terms, see www.gnu.org/licenses.
- This R library was written based on the information included in the Entrez Programming Utilities Help manual authored by Eric Sayers, PhD and available on the NCBI Bookshelf (NBK25500).
- This R library is NOT endorsed, supported, or maintained by NCBI, NOR is it affiliated with NCBI.
- There is only one person maintaining this R package: I work on code updates in my spare time and for free. Take-home message: please be patient.