---
title: "Basic overview of the ridigbio package"
output:
  html_document:
    code_folding: show
    df_print: kable
---

_Code here written by [Erica Krimmel](https://orcid.org/0000-0003-3192-0080) for a workshop at the [2020 Digital Data conference](https://bit.ly/DigiData4). You can find the RMarkdown (editable/runnable) version of this document [here](https://biodiversity-specimen-data.github.io/specimen-data-use-case/solution/demo_ridigbio_overview.Rmd)._

You can use the [iDigBio API](https://github.com/idigbio/idigbio-search-api/wiki) to find specimen records using the same search parameters available in the [iDigBio Portal](https://www.idigbio.org/portal/search). Wrappers like [ridigbio](https://cran.r-project.org/web/packages/ridigbio/index.html), which we cover in this demo, provide a simple way to use the iDigBio API in the context of your research pipeline. If you already use, or are considering using, R for data exploration or analysis, it makes sense to bring data into R directly from iDigBio via the API. This demo gives a brief overview of the fundamental functions in the ridigbio package that you can use to make your research pipeline more reproducible.

We will cover how to:

1. Write a query to search for specimens using `idig_search_records`
2. Quickly get a count of how many specimens match a query using `idig_count_records`
3. Discover the most frequent values for a field using `idig_top_records`

```{r message=FALSE}
# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)
# Load library for making nice HTML output
library(kableExtra)
```

## Write a query to search for specimens using `idig_search_records`

When you use an interface like the iDigBio Portal, you are already writing a query to search for specimens. If you are new to coding, it can be helpful to begin by figuring out your query in a user-friendly interface such as the Portal, then translating it to code in R once you understand what you want to search for. One of the hardest parts of using ridigbio to search for specimen records is knowing the name of the field you want to search. The iDigBio API provides a list of field names [here](https://search.idigbio.org/v2/meta/fields/records), but you will need to reference other sources, like [documentation for the Darwin Core standard](https://dwc.tdwg.org/terms/), to understand what kind of information these fields typically contain.

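
If you would rather look up field names without leaving R, ridigbio also appears to wrap the field list endpoint linked above. The chunk below is a minimal sketch: it assumes your installed version of ridigbio exports `idig_meta_fields()` with `type` and `subset` arguments, and that the names of the returned list are the indexed field names. The variable names `indexed_fields` and `indexed_field_names` are only illustrative.

```{r}
# Ask the API for the fields available on specimen records; `subset = "indexed"`
# is assumed to return only the iDigBio-indexed fields (not the raw ones)
indexed_fields <- ridigbio::idig_meta_fields(type = "records",
                                             subset = "indexed")

# The field names are assumed to be the names of the returned list
indexed_field_names <- sort(names(indexed_fields))

# Peek at the first few names; these are what you would use inside `rq`
head(indexed_field_names, 20)
```

You would still need a source like the Darwin Core documentation to learn what each field typically contains; this only tells you what the fields are called.
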
```{r}
# Let's start with a simple search introducing the primary arguments for the
# function `idig_search_records`
records_1A <- idig_search_records(
  # `rq` is where you adjust your record query
  rq = list(genus = "shortia"),
  # `fields` is where you adjust what fields you want returned by the API
  fields = c("uuid",
             "family",
             "genus",
             "specificepithet",
             "scientificname",
             "stateprovince"),
  # `limit` is where you can set a limit on the number of records to return in
  # order to speed up your query; max is 100000
  limit = 10,
  # `sort` is where you can specify fields for sorting
  sort = c("stateprovince",
           "scientificname"))

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(records_1A) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```
```{r}
# Now let's repeat the same search but remove all arguments other than `rq` to
# see what the defaults for the other arguments look like
records_1B <- idig_search_records(
  rq = list(genus = "shortia"))

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(records_1B) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```
```{r}
# In the example above, we are only using one parameter in `rq` to define our
# query, but now let's search by multiple parameters
records_2A <- idig_search_records(
  rq = list(basisofrecord = "fossilspecimen",
            # Use `type = "exists"` to search for rows where there is a value
            # present in this field; the inverse of this is `type = "missing"`
            geopoint = list(type = "exists")),
  limit = 10)

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(records_2A) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```
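
It can help to see how an `rq` list relates to the JSON query format described in the iDigBio API wiki. The sketch below uses `jsonlite::toJSON()` to print the query from the chunk above as JSON. Two assumptions are baked in: that the jsonlite package is installed, and that ridigbio serializes `rq` in roughly this way when it talks to the API. Nothing is sent to iDigBio here, and `rq_fossils` and `rq_json` are illustrative names.

```{r}
# Same query as `records_2A` above, stored on its own so we can inspect it
rq_fossils <- list(basisofrecord = "fossilspecimen",
                   geopoint = list(type = "exists"))

# `auto_unbox = TRUE` writes length-one vectors as plain values, not arrays
rq_json <- jsonlite::toJSON(rq_fossils, auto_unbox = TRUE)

# Print the JSON text that corresponds to the R list
rq_json
```

The nested `list(type = "exists")` becomes a nested JSON object, which is why parameters like `geopoint` take a list rather than a plain string.
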
```{r}
# What if we wanted to see more fields than the default provides? Using the same
# search as above, we can retrieve all indexed fields with `fields = "all"`
records_2B <- idig_search_records(
  rq = list(basisofrecord = "fossilspecimen",
            geopoint = list(type = "exists")),
  fields = "all",
  limit = 10)

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(records_2B) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```
```{r}
# But wait, there are even more fields available than just those we retrieved
# in the query above! Using the same search, we can choose exactly what fields
# to retrieve from indexed and raw data if we call the fields out by name in
# the `fields` argument; raw data fields are prefaced by "data.dwc:" and use
# camelCase in their naming convention (vs. lowercase for iDigBio fields)
records_2C <- idig_search_records(
  rq = list(basisofrecord = "fossilspecimen",
            geopoint = list(type = "exists")),
  # Here is where we are explicitly asking for specific fields
  fields = c("uuid",
             "recordset",
             "institutioncode", "data.dwc:institutionCode",
             "country", "data.dwc:country",
             "countrycode", "data.dwc:countryCode",
             "stateprovince", "data.dwc:stateProvince",
             "locality", "data.dwc:locality",
             "geopoint", "data.dwc:decimalLongitude", "data.dwc:decimalLatitude"),
  limit = 10)

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(records_2C) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```
```{r}
# You may be curious what the difference is between indexed and raw data such as
# that we saw in the search above. Indexed data has been altered by iDigBio
# (often in an attempt to standardize and/or correct values), and raw data is
# what was provided to iDigBio by the data provider, i.e. the natural history
# collection. Here we will do a new search on a data quality flag to view
# differences between indexed and raw data
records_3A <- idig_search_records(
  # Data quality flags are a way for iDigBio to communicate how data was altered
  # during its quality control process, i.e. how the indexed and raw data differ
  rq = list(flags = "rev_geocode_lat_sign"),
  fields = c("uuid",
             "institutioncode", "data.dwc:institutionCode",
             "country", "data.dwc:country",
             "countrycode", "data.dwc:countryCode",
             "stateprovince", "data.dwc:stateProvince",
             "locality", "data.dwc:locality",
             "geopoint", "data.dwc:decimalLongitude", "data.dwc:decimalLatitude"),
  limit = 10)

# Let's format our results to be more readable by renaming and reordering columns
records_3A <- records_3A %>%
  rename_at(vars(starts_with("data.dwc:")),
            ~str_replace(., "data.dwc:", "raw_")) %>%
  select(uuid,
         indexed_decimalLatitude = geopoint.lat,
         raw_decimalLatitude,
         indexed_decimalLongitude = geopoint.lon,
         raw_decimalLongitude,
         everything())

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(records_3A) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```

## Quickly get a count of how many specimens match a query using `idig_count_records`

Sometimes the number of records matching a query is more important to your purposes than the records themselves, for instance if you are trying to calculate how many fossil specimens in iDigBio have geographic coordinate data. You can use the same query format for `idig_count_records` as you can for `idig_search_records` to answer the question "How many records will this query return?" more quickly. This function is also useful when you suspect that your query might return more than 100,000 records, which is the limit for any single iteration of `idig_search_records`.

```{r}
# Let's test out a search using parameters we know would retrieve many records
count_1A <- idig_count_records(
  rq = list(basisofrecord = "fossilspecimen",
            geopoint = list(type = "exists")))

# We can reformat our result to be more readable
count_1A <- format(count_1A, big.mark = ",")

# This number shows how many records in iDigBio have a value of "fossilspecimen"
# as well as geographic coordinate data
count_1A
```
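
One way to put the count to work is to check it before running a search. The chunk below is a minimal sketch of that pattern, not part of ridigbio: `search_if_small()` is a hypothetical helper, and its `max_records`, `count_fn`, and `search_fn` arguments are names made up for this example. It counts first and only searches when the result would fit within the 100,000-record limit mentioned above.

```{r}
# Hypothetical helper (not part of ridigbio): count the matches first, then
# search only if they fit within a single call to `idig_search_records`
search_if_small <- function(rq, ...,
                            max_records = 100000,
                            count_fn = ridigbio::idig_count_records,
                            search_fn = ridigbio::idig_search_records) {
  # Ask how many records match before requesting any of them
  n_matches <- count_fn(rq = rq)

  if (n_matches > max_records) {
    # Too many records for one search; tell the user and return nothing
    message("Query matches ",
            format(n_matches, big.mark = ",", scientific = FALSE),
            " records, more than the limit of ",
            format(max_records, big.mark = ",", scientific = FALSE),
            "; try narrowing `rq` before searching")
    return(invisible(NULL))
  }

  # Small enough: run the search, passing along `fields`, `limit`, `sort`, etc.
  search_fn(rq = rq, ...)
}
```

For example, `search_if_small(rq = list(genus = "shortia"), limit = 10)` would count the matching records and, if there are few enough, pass `limit = 10` through to `idig_search_records`. The `count_fn` and `search_fn` arguments exist only so the helper can be tried out with stand-in functions without contacting the API.
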

## Discover the most frequent values for a field using `idig_top_records`

If you are having trouble understanding what kind of information lives in a particular field, it may be useful to look at some of the most common values that exist in that field. The `idig_top_records` function can show you this. Again, this function uses the same basic `rq` argument to define the query.

```{r}
# Let's go back to our first simple search and see what the top values are for
# `scientificname` where the genus is "shortia"
top_1A <- idig_top_records(
  # `rq` is where you adjust your record query
  rq = list(genus = "shortia"),
  # `top_fields` is where you adjust what fields you want to see summarized
  top_fields = "scientificname",
  # `count` is where you can set a limit on the number of top values to return
  # in order to speed up your query; max is 1000
  count = 10)

# We need to convert our results from a nested list into a more readable format
top_1A <- as_tibble(top_1A$scientificname) %>%
  pivot_longer(everything(), names_to = "scientificname", values_to = "count")

# Display the data frame we just created above in a nice pretty table for HTML
knitr::kable(top_1A) %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(height = "300px")
```
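
If the nested list returned by `idig_top_records` is hard to picture, the sketch below rebuilds a small stand-in for `top_1A$scientificname` by hand and flattens it with base R. The values and counts are made up for illustration, and the `itemCount` element name is an assumption about how ridigbio labels each count; check `str()` on your own result before relying on it.

```{r}
# Made-up stand-in for `top_1A$scientificname`: one named element per value,
# each assumed to hold its count in an element called `itemCount`
top_example <- list(
  "taxon a" = list(itemCount = 120),
  "taxon b" = list(itemCount = 45),
  "taxon c" = list(itemCount = 30))

# Pull each count out of the nested list; `vapply` guarantees one number each
top_counts <- vapply(top_example,
                     function(x) as.numeric(x$itemCount),
                     numeric(1))

# Combine the value names and their counts into a plain two-column data frame
top_table <- data.frame(scientificname = names(top_example),
                        count = unname(top_counts))
top_table
```

This version keeps `count` as an ordinary numeric column, which can be convenient if you want to sort or plot the values afterwards.
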