Fixing errors

Because we harvest data from many sources, there are bound to be missing pieces of data that need to be dealt with. We've built a web page that helps to find and fix them. The cache folder will contain a file called coverage_report.html, which contains a large table.

The table highlights where data is missing. An item that may genuinely be unavailable (e.g. because there is no metadata available from doi.org) is highlighted orange. An item that really should be present (e.g. because the paper has an address but geocoding failed due to a gap in the config files) is highlighted red.

Since it is very helpful to have both a DOI and a PMID, there is a clickable link to search PubMed based on a DOI, etc. Where new information like this is found, it can be added to Zotero.

There are cases where external data really is missing but can still be fixed. Common missing data errors are:

Institute X not on Wikidata
Date missing
Clean institute missing

If you can't find what you're looking for on this page, check here for information on the local data required to run the script. You may get errors if some required local data is missing.

Institute X not on Wikidata

The Wikidata lookup uses a SPARQL query to get the coordinates of an institution. The name of the institution on Wikidata must exactly match the clean institution name for it to be found. The SPARQL query returns the Wikidata ID for the institution, which is then used in an API request to get all the data for the item. The property of interest is P625, the coordinate location of the Wikidata item. Sometimes the location of the institution is not its own statement, and the P625 property is instead part of the headquarters (P159) statement.
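The two places where coordinates can live are easier to see in code. The following is a minimal sketch, not the script's actual implementation: it assumes the entity has already been fetched as JSON in the shape returned by Wikidata's entity API (a `claims` dictionary keyed by property ID), and the function name `extract_coordinates` is hypothetical. It first looks for a P625 statement on the item itself and then falls back to a P625 qualifier on a P159 statement.

```python
from typing import Optional, Tuple


def extract_coordinates(entity: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a Wikidata entity dict, or None.

    Hypothetical helper: the real script may structure this differently.
    """
    claims = entity.get("claims", {})

    # Case 1: the item has its own coordinate location (P625) statement.
    for claim in claims.get("P625", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value:
            return (value["latitude"], value["longitude"])

    # Case 2: the coordinates are attached to the headquarters (P159)
    # statement as a P625 qualifier.
    for claim in claims.get("P159", []):
        for qualifier in claim.get("qualifiers", {}).get("P625", []):
            value = qualifier.get("datavalue", {}).get("value")
            if value:
                return (value["latitude"], value["longitude"])

    # Neither location is present, so the institute cannot be geocoded.
    return None


# Illustrative entity with made-up coordinates on the headquarters statement.
example = {
    "claims": {
        "P159": [
            {
                "qualifiers": {
                    "P625": [
                        {"datavalue": {"value": {"latitude": 48.1, "longitude": 11.6}}}
                    ]
                }
            }
        ]
    }
}
print(extract_coordinates(example))  # (48.1, 11.6)
```

If this kind of lookup returns nothing for an institute that does exist on Wikidata, it is worth checking on the item page whether either of these two statements is actually present.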

New Wikidata items can be added through a link on the left sidebar (New Wikidata Item). Some of the Wikidata items we have added have been removed. This might be because admins did not think that the descriptions, aliases or references were good enough, so make sure you add as much detail as possible. Wikidata items are made up of statements, e.g. coordinate location. Statements can be added by scrolling to the bottom of the statements section on the item page, clicking "add" and then typing "coordinate location" or "P625" into the property box. If the institution has an existing headquarters statement, it might be better to add the coordinates to that statement. This is also a good idea if the institute is spread out over a city rather than being on one campus.

In the paper data object the coordinates can be accessed using the index ['Extras']['LatLong'].

There can be problems when the names of institutions differ between languages. For example, the Technische Universität München was originally listed on Wikidata under its German name, but on English-language Google it is referred to as the Technical University of Munich. This is a problem because the query attempts an exact match with the English label. The script should be able to handle special characters, such as ü, but you need to be careful when deciding what name to put into the institute cleaning file.
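To make the exact-match problem concrete, here is a sketch of how such a query might be built. This is an assumption about the shape of the query, not a copy of the one the script uses, and `build_label_query` is a hypothetical name. It relies only on the `rdfs:label` predicate with an `@en` language tag, which compares the whole string.

```python
def build_label_query(clean_name: str) -> str:
    """Build a SPARQL query matching an item by its exact English label.

    Hypothetical sketch: the script's real query may select more fields.
    """
    # Escape backslashes and double quotes so the name is a valid SPARQL string.
    escaped = clean_name.replace("\\", "\\\\").replace('"', '\\"')
    return 'SELECT ?item WHERE { ?item rdfs:label "%s"@en . }' % escaped


# The two names for the same university produce different queries, so only
# the one that matches the English label on Wikidata will return a result.
print(build_label_query("Technische Universität München"))
print(build_label_query("Technical University of Munich"))
```

Because the match is on the full label, even small differences (a missing "the", different punctuation) in the clean institute name would be enough to return no results.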

Sometimes adding institutes to Wikidata can be a problem. More information on this can be found here.

Zotero Extra

Some data can be missing from the data sources (for example, dates). This is solved by entering structured data into the Extra field in Zotero, which is then parsed by the script. The data needs to be in the format:

<key>:<value>\n<key>:<value>

Key-value pairs are separated by a newline (\n). This string is parsed into the paper object at [merged][zotero_extras] during the clean phase.

Allowed key-value pairs (with examples) are:

date:01/01/1970
clean_institute:University of Bristol
clean_first_author: Smith J
keywords: science|hats|cats

Note that clean_institute is here but the first author affiliation is not; I figure this makes sense. Note the format of clean_first_author (surname followed by initial, as in the example). Also note that these values are a last resort when trying to work out a value: THEY DO NOT OVERRIDE VALUES FROM ELSEWHERE.
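The format above can be parsed in a few lines. The following is a minimal sketch rather than the script's own parser: the function names are hypothetical, and stripping whitespace around keys and values, ignoring unknown keys, and splitting keywords on "|" are all assumptions based on the examples listed above. The second helper illustrates the "last resort" rule: a value from the extra field is only used when nothing was found elsewhere.

```python
ALLOWED_KEYS = {"date", "clean_institute", "clean_first_author", "keywords"}


def parse_zotero_extra(extra: str) -> dict:
    """Parse newline-separated <key>:<value> pairs from the Zotero extra field."""
    parsed = {}
    for line in extra.splitlines():
        if ":" not in line:
            continue  # not a key-value line
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if key not in ALLOWED_KEYS or not value:
            continue  # assumption: unknown keys and empty values are ignored
        if key == "keywords":
            # Assumption: keywords are pipe-separated, as in the example.
            parsed[key] = [k.strip() for k in value.split("|") if k.strip()]
        else:
            parsed[key] = value
    return parsed


def with_fallback(value_from_sources, extras: dict, key: str):
    """Use the extra-field value only when no other source supplied one."""
    return value_from_sources if value_from_sources else extras.get(key)


extra = "date:01/01/1970\nclean_first_author: Smith J\nkeywords: science|hats|cats"
extras = parse_zotero_extra(extra)
print(extras["keywords"])                     # ['science', 'hats', 'cats']
print(with_fallback("2001", extras, "date"))  # 2001
print(with_fallback(None, extras, "date"))    # 01/01/1970
```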

If you update values in Zotero then you need to delete the corresponding Zotero file in the cache//raw/zotero directory (sometimes it is easier to just delete them all).

Multiple Papers Found for DOI

This is a frustrating one. Scopus ignores punctuation in queries, which makes DOIs ambiguous in some cases. For example, the results for 10.1038/ng.714 and 10.1038/ng714 are returned together. To fix this, when multiple results are returned only papers with an exactly matching DOI are counted. If several results still match, because there are multiple records for the same paper, the citation counts are added together if the titles also match.
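The disambiguation rule can be sketched as follows. This is an illustration under assumptions, not the script's code: the record fields `doi`, `title` and `citations` and the function name are hypothetical, titles are compared after lower-casing and trimming (an assumption), and exact-DOI records whose title differs from the first match are simply left out of the total, since the behaviour for that case is not described above.

```python
from typing import List, Optional


def citations_for_doi(doi: str, records: List[dict]) -> Optional[int]:
    """Return the citation count for a DOI from a list of search results."""
    # Keep only records whose DOI matches exactly, punctuation included.
    exact = [r for r in records if r.get("doi") == doi]
    if not exact:
        return None

    # Duplicate records for the same paper: add up the counts of those
    # whose title matches the first exact-DOI record.
    first_title = exact[0]["title"].strip().lower()
    same_paper = [r for r in exact if r["title"].strip().lower() == first_title]
    return sum(r["citations"] for r in same_paper)


# Made-up results, as if both DOIs had been returned for one query.
results = [
    {"doi": "10.1038/ng.714", "title": "Paper A", "citations": 10},
    {"doi": "10.1038/ng.714", "title": "paper a", "citations": 5},
    {"doi": "10.1038/ng714", "title": "Paper B", "citations": 99},
]
print(citations_for_doi("10.1038/ng.714", results))  # 15
print(citations_for_doi("10.1038/ng714", results))   # 99
```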

Errors in PubMed data

Sometimes there are errors in the raw data from PubMed. If we can fix these we should:

http://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Typographical_Errors
