Program description
The PUMA pipeline itself is built using Python. It begins by querying a Zotero library to get a list of publications; from there it uses DOIs, PubMed IDs and Scopus IDs to collate metadata from external APIs into a common metadata structure. This metadata is then used to build HTML web pages and output data for further analysis. An overview of the pipeline is at Infrastructure-overview.
The pipeline is split up into different functional sections. In each case the emphasis is on dealing with the core metadata object (described in Data-object), this is either built, amended, analysed or displayed in each phase. This page details each of the main sections in the source.
The main entry point of the code: it starts the pipeline and calls each of the phases outlined below.
Parses the config file to set up global variables like paths and API keys. There are a lot of options that can be set in the config file - see Configuration.
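A minimal sketch of this step, assuming Python's standard `configparser`; the section and option names here are illustrative, not PUMA's actual keys (see Configuration for those):

```python
import configparser

def load_config(path):
    """Read the config file and return the settings the pipeline needs.

    Section/option names below are hypothetical examples.
    """
    config = configparser.ConfigParser()
    config.read(path)
    return {
        "html_dir": config.get("paths", "html_dir", fallback="./html"),
        "cache_dir": config.get("paths", "cache_dir", fallback="./cache"),
        "scopus_api_key": config.get("keys", "scopus", fallback=""),
    }
```

Using `fallback=` means a missing option (or a whole missing section) degrades to a sensible default rather than raising.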
Tidy the old file tree (the config file may set flags to delete some caches etc.), and build any parts of the new file tree that are missing.
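A sketch of the tidy-and-rebuild step; the directory names are illustrative, not PUMA's actual layout:

```python
import os
import shutil

def prepare_tree(base_dir, clear_cache=False):
    """Optionally clear the cache, then make sure the output dirs exist.

    Subdirectory names are hypothetical examples.
    """
    cache_dir = os.path.join(base_dir, "cache")
    if clear_cache and os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)
    # exist_ok means already-present directories are left alone
    for sub in ("cache", "html", "data"):
        os.makedirs(os.path.join(base_dir, sub), exist_ok=True)
```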
Get all the metadata from the various external sources (via their respective APIs) and store them in the raw section of the Data-object.
Parse what is in the raw section of the Data-object into the clean section. Some of this parsing is straightforward: the title, for example, is unambiguous and will (generally) always be present in the raw section. Some of it is not. Mostly it is a case of trying field A of one data source in the raw section; if that fails, trying field B; and if that doesn't work, trying a different data source (e.g. try the raw data from PubMed first, then the raw DOI data, and so on) until a reasonable value is found that can be copied into the clean section. The cases that deviate from this pattern are:
Names of institutes change all the time (especially over the lifetime of long studies like cohort studies), people put many different spellings and formats of an institution name on papers, and institutes may be named differently in different languages. To merge all of these together, a CSV file maps each spelling to a canonical institute name. Generating this list is a manual step, and will probably require running the program a few times, adding to the list incrementally.
The first step is to look at the first author's email address: if its domain belongs to an institute we understand (i.e. one in the CSV lookup), use that; if not, fall back to doing some regexing on the free-text affiliation itself.
Some authors don't have an institute specified in the raw metadata we get, so we can add it manually in Zotero - see Fixing-errors.
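The institute cleaning described above could be sketched as follows; the CSV column names, the domain map, and the regex are all assumptions for illustration, not PUMA's actual formats:

```python
import csv
import re

def load_institute_map(fh):
    """Map each known spelling (lower-cased) to a canonical institute name.

    Column names "spelling" and "canonical" are hypothetical.
    """
    return {row["spelling"].strip().lower(): row["canonical"]
            for row in csv.DictReader(fh)}

def guess_institute(email, affiliation, domain_map, name_map):
    """First try the author's email domain, then regex the affiliation text."""
    domain = email.rsplit("@", 1)[-1].lower() if email else ""
    if domain in domain_map:
        return domain_map[domain]
    # fall back to pattern-matching the free-text affiliation
    match = re.search(r"(?:university|institute) of [\w ]+",
                      affiliation or "", re.I)
    if match:
        # normalise the regex hit through the same canonical lookup
        return name_map.get(match.group(0).strip().lower(), match.group(0))
    return None
```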
Dates are hard. Different providers call dates by different names, and have different formats. Then there is the very definition of what a date on a publication should be (see PubMed-notes#dates). This cleaning tries very hard to make sure there is a reasonable date in the clean section.
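The date cleaning follows the same try-in-order pattern as the rest of the parsing. A minimal sketch, with hypothetical provider key names and formats:

```python
from datetime import datetime

# Provider-specific field names and formats to try, in order of preference.
# Both lists are illustrative assumptions, not PUMA's actual lists.
DATE_KEYS = ["pub_date", "epubdate", "created"]
DATE_FORMATS = ["%Y-%m-%d", "%Y %b %d", "%Y/%m/%d", "%Y"]

def clean_date(raw_record):
    """Return the first parseable date found, or None if nothing works."""
    for key in DATE_KEYS:
        value = raw_record.get(key)
        if not value:
            continue
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).date()
            except ValueError:
                pass  # wrong format, keep trying
    return None
```

Note the bare `"%Y"` fallback: a year-only date parses to 1 January of that year, which is a deliberate "some date is better than no date" choice.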
Automatically convert the institute name into coordinates and get the country and city name. This is done with wikidata SPARQL queries.
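A plausible shape for such a SPARQL query (the actual query PUMA sends may differ; here `wdt:P625` is Wikidata's coordinate-location property, `wdt:P17` the country, and `wdt:P131` the containing administrative entity):

```python
def build_institute_query(institute_name):
    """Build a Wikidata SPARQL query looking up an institute by English label.

    A real run would POST this to https://query.wikidata.org/sparql and
    request application/sparql-results+json.
    """
    return """
    SELECT ?inst ?coord ?countryLabel ?cityLabel WHERE {
      ?inst rdfs:label "%s"@en ;
            wdt:P625 ?coord .
      OPTIONAL { ?inst wdt:P17 ?country . }
      OPTIONAL { ?inst wdt:P131 ?city . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """ % institute_name
```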
Use the DOI and PMID to look up the number of citations on Scopus and PubMed Europe.
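For Europe PMC, citation data is available from its public REST API; a sketch of building the request URL (the endpoint shape follows the public Europe PMC docs, but treat the details as an assumption rather than PUMA's exact call):

```python
import urllib.parse

def europepmc_citation_url(pmid):
    """URL for the Europe PMC citations endpoint for a MEDLINE (PubMed) ID.

    The JSON response would then be fetched and its hit count read off.
    """
    return ("https://www.ebi.ac.uk/europepmc/webservices/rest/"
            "MED/%s/citations?format=json" % urllib.parse.quote(str(pmid)))
```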
Look at the whole Data-object and do some basic analytics:
- How many times each author has a first author paper.
- How many times each author appears on any paper.
- How many times each journal is published in.
- Frequencies of each MeSH term.
- etc
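These are simple frequency tallies; a sketch with `collections.Counter`, over illustrative records rather than the real Data-object:

```python
from collections import Counter

# Hypothetical example records; the real pipeline reads these from the Data-object.
papers = [
    {"authors": ["Smith", "Jones"], "journal": "PLOS ONE", "mesh": ["Humans"]},
    {"authors": ["Jones", "Lee"], "journal": "PLOS ONE", "mesh": ["Humans", "Aging"]},
]

first_authors = Counter(p["authors"][0] for p in papers)          # first-author papers
any_author = Counter(a for p in papers for a in p["authors"])     # any authorship
journals = Counter(p["journal"] for p in papers)                  # papers per journal
mesh_terms = Counter(t for p in papers for t in p["mesh"])        # MeSH term frequencies
```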
This builds a network data structure of the authors, which can be plotted to show how each author is connected to each other author based on being co-authors on publications.
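A dependency-free sketch of that network as a weighted edge list (a library such as networkx could then lay it out and plot it); the record shape is illustrative:

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(papers):
    """Count how many papers each pair of authors shares.

    Each unordered author pair on a paper gets its edge weight incremented.
    """
    edges = Counter()
    for paper in papers:
        # sort so (A, B) and (B, A) collapse to one edge
        for a, b in combinations(sorted(set(paper["authors"])), 2):
            edges[(a, b)] += 1
    return edges
```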
Some simple plotting routines for the output of the analyse step.
Output some HTML pages showing the list of papers, with links to them, and various metrics. This is the kind of thing you could put on your external-facing website, if you wanted. It is also entirely self-contained HTML/CSS/JS, meaning you can just look at it on your computer without the need for a web server.
Output the Data-object in BibTeX format for insertion to your favourite reference manager.
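A minimal sketch of emitting one BibTeX entry from a cleaned record; the record's field names (and the citation key) are illustrative assumptions:

```python
def to_bibtex(record):
    """Render a cleaned record as a BibTeX @article entry.

    Empty fields are omitted rather than written as blank values.
    """
    fields = {
        "title": record.get("title", ""),
        "author": " and ".join(record.get("authors", [])),
        "journal": record.get("journal", ""),
        "year": str(record.get("year", "")),
        "doi": record.get("doi", ""),
    }
    body = ",\n".join("  %s = {%s}" % (k, v) for k, v in fields.items() if v)
    return "@article{%s,\n%s\n}" % (record.get("key", "unknown"), body)
```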