Program description
The PUMA pipeline itself is built using Python. It begins by querying a Zotero library to get a list of publications; from there it uses DOIs, PubMed IDs and Scopus IDs to collate metadata from external APIs into a common metadata structure. This metadata is then used to build HTML web pages and output data for further analysis. An overview of the pipeline is at Infrastructure-overview.
The pipeline is split up into different functional sections. In each case the emphasis is on dealing with the core metadata object (described in Data-object), this is either built, amended, analysed or displayed in each phase. This page details each of the main sections in the source.
The main entry point of the code: it starts the pipeline and calls each of the phases outlined below.
Parses the config file to set up global variables like paths and API keys. There are a lot of options that can be set in the config file - see Configuration.
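A minimal sketch of this step, assuming Python's standard `configparser`; the section and option names here are illustrative, not PUMA's actual keys (see Configuration for those):

```python
import configparser

def load_config(path):
    """Read the config file and return the settings the pipeline needs.

    Section/option names below are hypothetical examples.
    """
    config = configparser.ConfigParser()
    config.read(path)
    return {
        "html_dir": config.get("paths", "html_dir", fallback="./html"),
        "cache_dir": config.get("paths", "cache_dir", fallback="./cache"),
        "scopus_api_key": config.get("keys", "scopus", fallback=""),
    }
```

Using `fallback=` means a missing option (or a whole missing section) degrades to a sensible default rather than raising.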
Tidy the old file tree (the config file may set flags to delete some caches etc.), and build any parts of the new file tree that are missing.
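A sketch of the tidy-and-rebuild step; the directory names are illustrative, not PUMA's actual layout:

```python
import os
import shutil

def prepare_tree(base_dir, clear_cache=False):
    """Optionally clear the cache, then make sure the output dirs exist.

    Subdirectory names are hypothetical examples.
    """
    cache_dir = os.path.join(base_dir, "cache")
    if clear_cache and os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)
    # exist_ok means already-present directories are left alone
    for sub in ("cache", "html", "data"):
        os.makedirs(os.path.join(base_dir, sub), exist_ok=True)
```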
Get all the metadata from the various external sources (via their respective APIs) and store them in the raw section of the Data-object.
Parse what is in the raw section of the Data-object into the clean section. Some of this parsing is straightforward: the title, for example, is unambiguous and will (generally) always be present in the raw section. Some of it is not. Mostly it is a case of trying field A of one data source in the raw section; if that fails, trying field B; and if that doesn't work, trying a different data source (e.g. try the raw data from PubMed first, then the raw DOI data, and so on) until a reasonable value is found that can be copied into the clean section. The cases that deviate from this pattern are:
Names of institutes change all the time (especially over the lifetime of long studies like cohort studies), people put many different spellings and formats of an institution name on papers, and institutes may be named differently in different languages. To merge all of these together, a CSV file maps each spelling to a canonical institute name. Generating this list is a manual step, and will probably require running the program a few times, adding to the list incrementally.
The first step is to look at the first author's email address: if its domain belongs to an institute we understand (i.e. one in the CSV lookup), use that; if not, fall back to doing some regexing on the free-text affiliation itself.
Some authors don't have an institute specified in the raw metadata we get, so we can add it manually in Zotero - see Fixing-errors.
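The institute cleaning described above could be sketched as follows; the CSV column names, the domain map, and the regex are all assumptions for illustration, not PUMA's actual formats:

```python
import csv
import re

def load_institute_map(fh):
    """Map each known spelling (lower-cased) to a canonical institute name.

    Column names "spelling" and "canonical" are hypothetical.
    """
    return {row["spelling"].strip().lower(): row["canonical"]
            for row in csv.DictReader(fh)}

def guess_institute(email, affiliation, domain_map, name_map):
    """First try the author's email domain, then regex the affiliation text."""
    domain = email.rsplit("@", 1)[-1].lower() if email else ""
    if domain in domain_map:
        return domain_map[domain]
    # fall back to pattern-matching the free-text affiliation
    match = re.search(r"(?:university|institute) of [\w ]+",
                      affiliation or "", re.I)
    if match:
        # normalise the regex hit through the same canonical lookup
        return name_map.get(match.group(0).strip().lower(), match.group(0))
    return None
```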
Dates are hard. Different providers call dates by different names, and have different formats. Then there is the very definition of what a date on a publication should be (see PubMed-notes#dates). This cleaning tries very hard to make sure there is a reasonable date in the clean section.
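The date cleaning follows the same try-in-order pattern as the rest of the parsing. A minimal sketch, with hypothetical provider key names and formats:

```python
from datetime import datetime

# Provider-specific field names and formats to try, in order of preference.
# Both lists are illustrative assumptions, not PUMA's actual lists.
DATE_KEYS = ["pub_date", "epubdate", "created"]
DATE_FORMATS = ["%Y-%m-%d", "%Y %b %d", "%Y/%m/%d", "%Y"]

def clean_date(raw_record):
    """Return the first parseable date found, or None if nothing works."""
    for key in DATE_KEYS:
        value = raw_record.get(key)
        if not value:
            continue
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).date()
            except ValueError:
                pass  # wrong format, keep trying
    return None
```

Note the bare `"%Y"` fallback: a year-only date parses to 1 January of that year, which is a deliberate "some date is better than no date" choice.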
Automatically convert the institute name into coordinates and get the country and city name. This is done with wikidata SPARQL queries.
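A plausible shape for such a SPARQL query (the actual query PUMA sends may differ; here `wdt:P625` is Wikidata's coordinate-location property, `wdt:P17` the country, and `wdt:P131` the containing administrative entity):

```python
def build_institute_query(institute_name):
    """Build a Wikidata SPARQL query looking up an institute by English label.

    A real run would POST this to https://query.wikidata.org/sparql and
    request application/sparql-results+json.
    """
    return """
    SELECT ?inst ?coord ?countryLabel ?cityLabel WHERE {
      ?inst rdfs:label "%s"@en ;
            wdt:P625 ?coord .
      OPTIONAL { ?inst wdt:P17 ?country . }
      OPTIONAL { ?inst wdt:P131 ?city . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """ % institute_name
```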
Use the DOI and PMID to look up the number of citations on Scopus and PubMed Europe.
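For Europe PMC, citation data is available from its public REST API; a sketch of building the request URL (the endpoint shape follows the public Europe PMC docs, but treat the details as an assumption rather than PUMA's exact call):

```python
import urllib.parse

def europepmc_citation_url(pmid):
    """URL for the Europe PMC citations endpoint for a MEDLINE (PubMed) ID.

    The JSON response would then be fetched and its hit count read off.
    """
    return ("https://www.ebi.ac.uk/europepmc/webservices/rest/"
            "MED/%s/citations?format=json" % urllib.parse.quote(str(pmid)))
```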
Look at the whole Data-object and do some basic analytics:
- How many times each author has a first author paper.
- How many times each author appears on any paper.
- How many times each journal is published in.
- Frequencies of each MeSH term.
- etc
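These are simple frequency tallies; a sketch with `collections.Counter`, over illustrative records rather than the real Data-object:

```python
from collections import Counter

# Hypothetical example records; the real pipeline reads these from the Data-object.
papers = [
    {"authors": ["Smith", "Jones"], "journal": "PLOS ONE", "mesh": ["Humans"]},
    {"authors": ["Jones", "Lee"], "journal": "PLOS ONE", "mesh": ["Humans", "Aging"]},
]

first_authors = Counter(p["authors"][0] for p in papers)          # first-author papers
any_author = Counter(a for p in papers for a in p["authors"])     # any authorship
journals = Counter(p["journal"] for p in papers)                  # papers per journal
mesh_terms = Counter(t for p in papers for t in p["mesh"])        # MeSH term frequencies
```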
This builds a network data structure of the authors, which can be plotted to show how each author is connected to each other author based on being co-authors on publications.
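A dependency-free sketch of that network as a weighted edge list (a library such as networkx could then lay it out and plot it); the record shape is illustrative:

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(papers):
    """Count how many papers each pair of authors shares.

    Each unordered author pair on a paper gets its edge weight incremented.
    """
    edges = Counter()
    for paper in papers:
        # sort so (A, B) and (B, A) collapse to one edge
        for a, b in combinations(sorted(set(paper["authors"])), 2):
            edges[(a, b)] += 1
    return edges
```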
Some simple plotting routines for the output of the analyse step.
Output some HTML pages showing the list of papers, with links to them, and various metrics. This is the kind of thing you could put on your external-facing website, if you wanted. It is also entirely self-contained HTML/CSS/JS, meaning you can just look at it on your computer without the need for a web server.
Output the Data-object in BibTeX format for insertion to your favourite reference manager.
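A minimal sketch of emitting one BibTeX entry from a cleaned record; the record's field names (and the citation key) are illustrative assumptions:

```python
def to_bibtex(record):
    """Render a cleaned record as a BibTeX @article entry.

    Empty fields are omitted rather than written as blank values.
    """
    fields = {
        "title": record.get("title", ""),
        "author": " and ".join(record.get("authors", [])),
        "journal": record.get("journal", ""),
        "year": str(record.get("year", "")),
        "doi": record.get("doi", ""),
    }
    body = ",\n".join("  %s = {%s}" % (k, v) for k, v in fields.items() if v)
    return "@article{%s,\n%s\n}" % (record.get("key", "unknown"), body)
```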