[dataset]: Animal Satellite Telemetry data #145
The netCDF specification will be documented at https://ioos.github.io/ioos-atn-data/
We need to decide on a decimation strategy. The frequency of observations varies from every 2 minutes to multiple days.
The decimation strategy that ETN and OTN are working on for acoustic telemetry data is down to a lot of hard work by Peter Desmet and Jonas Mortelmans, and is based on some of Peter's work on camtrap-dp and with other satellite-tagged animals. It employs an aggregation strategy of 'take the first detection/location per hour', with other Darwin Core fields like dataGeneralizations helping characterize the summarization by indicating how many detections have been obfuscated by the aggregation. The benefit of this method is that each detection is a real point in space and time where the animal was observed, and it puts a hard upper bound per tag on how many occurrences a single individual/tag can generate. There's a lot of background information and ancillary decisions about how to characterize things like coordinateUncertainty in inbo/etn#256, and the logic for the decimation of the events themselves is here: https://github.com/inbo/etn/blob/main/inst/sql/dwc_occurrence.sql. I've got more code coming that deals with pulling together an Event Core version, with the Occurrences still being generated in a decimated way like this, but with tag attachments and listening station deployments handled as Events and more things reported as Extended Measurement or Facts.
I created an example DwC-A package in this PR: ioos/ioos_code_lab@e58b2b5. The template still isn't finalized so I don't want to go too far down the road, but @albenson-usgs gave some great feedback on the initial package to start addressing:
For reference, below is a table of the data available (dumped from the netCDF file), followed by the netCDF header of the metadata available. THESE ARE EXAMPLE DATA and therefore I have redacted some information about the PI. I think we can address all of the comments above from the available data and metadata.
data table:
netCDF metadata:
@albenson-usgs I'm poking around in this now. For … but maybe that's only for the tagging event? Now that I'm fiddling with the data more, I'm wondering if there should be two or three events.
cc @mmckinzie
Maybe https://github.com/tdwg/dwc-for-biologging/wiki/Movebank-GPS-data#darwin-core-recommendation is the right way?
This is what I understand from the text on Movebank GPS data:

```mermaid
flowchart LR
    A([Deployment])
    B([Tag attachment])
    C([GPS positions])
    A --parentEventID--> B
    A --parentEventID--> C
    subgraph parent event
    A
    end
    subgraph child events
    B
    C
    end
```
I worked through some reorganizing after discussion on the Slack space. I think I have addressed most of the comments in #145 (comment). It was decided to go with occurrence and emof (no event). Here are the files and notebook for review:
I am most curious about additional information we could be porting into the … We also have a few flag variables (time, speed, location, and rollup) and a bunch of metadata that could be stuck somewhere.
ATN data are now being archived at NCEI. For the notebook I'm working on here, I would like to pull the source data from this archival information package: https://www.ncei.noaa.gov/archive/accession/0282699
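For anyone following along, here is a minimal sketch of what pulling the source file could look like in R; the data-file URL below is a hypothetical placeholder (the real path lives under the accession's data-access links), and ncdf4 is just one way to read the header.

```r
# Sketch only: fetch the archived netCDF and inspect its header/attributes.
# The URL is a HYPOTHETICAL placeholder for a file within accession 0282699.
library(ncdf4)

nc_url  <- "https://www.ncei.noaa.gov/path/to/atn_example_trajectory.nc"  # placeholder
nc_file <- basename(nc_url)

download.file(nc_url, nc_file, mode = "wb")

nc <- nc_open(nc_file)
print(nc)                    # variable and dimension summary, similar to `ncdump -h`
globals <- ncatt_get(nc, 0)  # global (ACDD-style) attributes as a list
nc_close(nc)
```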
@sformel-usgs will handle the next review on this. Also, I know that @jdpye published some (lots?) of data to OBIS somewhat recently and might have some words of wisdom to share.
We did! I looked over Mat's shoulder briefly at the IOOS DMAC, but I would gently recommend we further align this with the standard that OTN and ETN have worked out for all our satellite and acoustic telemetry data publishing, if possible. Just a bit of summarization of the occurrences to keep the row count manageable when our datasets get included in general queries against OBIS in the future.
Here is the mapping table for the occurrence record:
And for the measurement or fact file:
@MathewBiddle I'm still getting up to speed on this. Does anything need review right now?
@jdpye From #145 (comment), my understanding is the decimation strategy for these satellite telemetry observations should be:
So, I will work on taking my occurrence table and decimating it to the first detection each hour. Does that sound reasonable?
@sformel-usgs Yes! If you don't mind taking a look at the csv files I reference in #145 (comment), that will help us in the overarching organization of these data. I think the decimation strategy will simply limit the number of rows from what we have above.
Yep! With this, you can add to dataGeneralizations a string like 'first of # records' to indicate there are more records in the raw dataset to be discovered by the super-curious.
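A minimal sketch of that suggestion in R, assuming an `occurrencedf` data frame with an ISO 8601 `eventDate` column (as in the notebook); the exact wording of the dataGeneralizations string is up for grabs.

```r
# Keep the first record in each hour and note how many raw rows that hour had.
library(dplyr)
library(lubridate)

occurrencedf <- occurrencedf %>%
  arrange(eventDate) %>%
  mutate(eventDateHrs = floor_date(ymd_hms(eventDate), unit = "hour")) %>%
  group_by(eventDateHrs) %>%
  mutate(dataGeneralizations = paste("first of", n(), "records")) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(-eventDateHrs)
```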
I just finished prototyping a DwC-archive-to-lonboard/Deck.gl visualization tool, so I will attempt to eat your DwC archive with it when I get time!
Here's a stab at filtering the occurrence record down to the first occurrence per hour (in Python): https://gist.github.com/MathewBiddle/d434ac2b538b2728aa80c6a7945f94be Now to write that in R...
Figured out how to do it in R (hacky but works for now):

```r
library(lubridate)
library(dplyr)  # needed for %>%, arrange(), and distinct()

# sort by date
occurrencedf <- occurrencedf %>% arrange(eventDate)

# create a column of the date truncated to the hour, which will be our decimation key
occurrencedf$eventDateHrs <- format(as.POSIXct(occurrencedf$eventDate, format = "%Y-%m-%dT%H:%M:%SZ"), "%Y-%m-%dT%H")

# filter the table to only unique date + hour, picking the first row and keeping all the columns
occurrencedf <- distinct(occurrencedf, eventDateHrs, .keep_all = TRUE)

# nuke the invented column
occurrencedf$eventDateHrs <- NULL

occurrencedf
```

**Filtering by data quality codes**

In these data we also have additional information about the Location Quality Code from the ARGOS satellite system and QARTOD tests. Below are the codes and their meanings.

ARGOS Codes
Since codes … Also, create a mapping table for …
QARTOD Codes
The QARTOD tests are:
I'm not sure what to do here. My preference would be to include all rows where …
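For what it's worth, here is a hedged sketch of the kind of filter being discussed; the column names (`location_class`, `qc_rollup`) and the choice of which Argos classes to keep are assumptions for illustration, not a decision.

```r
# Keep rows whose Argos class carries a position error estimate and whose
# QARTOD rollup flag passes (QARTOD convention: 1 = pass).
library(dplyr)

occurrencedf <- occurrencedf %>%
  filter(location_class %in% c("3", "2", "1", "0"),  # assumed column name
         qc_rollup == 1)                             # assumed rollup flag column
```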
@sformel-usgs @jdpye I've updated the notebook (and on nbviewer) to include this decimation strategy, as well as adding some initial filtering based on location class and the inclusion of …
If you don't mind taking a look when you get a chance, it would be much appreciated! I think there are some additional details we can add to the occurrence/emof from the netCDF files; I'm just not sure what.
@MathewBiddle here are some thoughts. I'm still feeling like I don't have a good grasp on all the moving parts, so please ping me here or in Slack if there is anything I didn't address specifically, no matter how small. I don't see any big issues; what you've derived works as a DwC-A. But I'm going to dig through the data a little more and see if there is anything else I think could be included.
I think I can help find your P01 codes for the measurements. Sorry, I didn't look at the emof file on the first pass. I'll look at this today!
For the coordinateUncertaintyInMeters distance for Argos location class 0, this paper suggests an upper bound of ~10 km: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0063051 From that paper, this quote:

does not fill my heart with joy, so the upper bound of the estimate is probably a safer value to include.
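One way the mapping table mentioned above could look in R. The 250/500/1500 m bands are the nominal Argos accuracy classes, the ~10 km value for class 0 follows the upper bound from the paper above, and leaving A/B/Z as NA (no error estimate) is an assumption, as is the `location_class` column name.

```r
# Join a per-class uncertainty lookup onto the occurrence table.
library(dplyr)
library(tibble)

argos_uncertainty <- tribble(
  ~location_class, ~coordinateUncertaintyInMeters,
  "3",               250,
  "2",               500,
  "1",              1500,
  "0",             10000,
  "A",          NA_real_,
  "B",          NA_real_,
  "Z",          NA_real_
)

occurrencedf <- occurrencedf %>%
  left_join(argos_uncertainty, by = "location_class")
```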
Thanks for taking a look! I should have mentioned the EML section of the notebook is a work in progress. It should reference the same netCDF file that is used to generate the DwC files (the one from NCEI); I just haven't updated it in a few months. Something to discuss is whether generating the EML is even necessary. Would OBIS-USA generate the EML? Is there a way for a provider to upload an EML xml file? How should we deal with this, given the expectation that we might want to automate the process?
If everyone has filled in their metadata for the NetCDF files in the same way, we should get a simple EML template for this flavour of data, map our incoming data to it, and submit that to your OBIS publication endpoint along with the data as an initial pass of the metadata for the archive. Amendments can be made after the initial metadata harvest from the source NetCDF, but we should have a good start from there. If we build a simple eml.xml and zip it up, the metadata pre-populates and will save your OBIS data manager a bit of headache :D
@MathewBiddle the IPT is all fat fingers. So, the more EML you can generate programmatically, the less time it will take and the less chance of human error. But just do the easy stuff; don't worry about getting every detail.
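A minimal sketch of generating EML programmatically from the netCDF global attributes with the ncdf4 and EML R packages. The attribute names (title, summary, creator_name, creator_email, id) follow ACDD conventions and are assumed to exist in the ATN template; a real record for the IPT/OBIS would need more fields than this.

```r
library(ncdf4)
library(EML)

nc  <- nc_open("atn_example_trajectory.nc")  # hypothetical file name
att <- ncatt_get(nc, 0)                      # global attributes as a named list
nc_close(nc)

creator <- list(
  individualName        = list(surName = att$creator_name),
  electronicMailAddress = att$creator_email
)

# Map a few easy global attributes into a minimal EML document.
doc <- list(
  packageId = att$id,
  system    = "uuid",
  dataset   = list(
    title    = att$title,
    abstract = att$summary,
    creator  = creator,
    contact  = creator
  )
)

write_eml(doc, "eml.xml")
eml_validate("eml.xml")  # schema check; a doc this minimal may need more fields to pass
```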
Since these are satellite telemetry observations, our …
No, that's fine that they are the same value.
I have added min/max depth to the occurrence file: https://github.com/MathewBiddle/ioos_code_lab/blob/r_nc2dwc/jupyterbook/content/code_gallery/data_management_notebooks/atn_45866_occurrence.csv

I've merged the optimizations @sformel-usgs proposed and cleaned up some of the comments.

As far as the metadata goes, the source netCDF files are built via an automated pipeline, so we know what content is going where and how much (or little) it will be standardized. It's merely a mapping exercise to get the information into EML for the records. However, I am curious to get @mmckinzie to weigh in on the granularity of the "datasets" for OBIS. Right now, we are archiving at NCEI on a deployment-by-deployment basis; is that too granular for OBIS? Obviously, it would be much simpler to have one ATN dataset that is updated with new deployments as they make it to NCEI, but we lose some granularity in the credits at OBIS when we do that. Some items to consider:
Maybe there are lessons learned from the CREMP datasets we should explore? I think answering those questions will help us decide what needs to be mapped into the metadata record.
Should we also include …
@MathewBiddle sorry if I'm overlooking it in the above comments, but could you point me to an example metadata record from ATN/NCEI? I don't have a sense of what is included, how many people are credited, or how often it's updated.
Here is the NCEI landing page for this dataset: https://www.ncei.noaa.gov/archive/accession/0282699 That metadata record is built at NCEI directly from the netCDF file, plus any additional NCEI metadata. My hope would be that we would build the EML metadata directly from the netCDF file instead of harvesting from another source. But I'm open to suggestions.

In a perfect world, these data wouldn't have updates. The archive packages will be updated only when there are additions of other observing methods, like profile observations or modeled tracks (foieGras analysis), which would be added in separate files. So, the satellite telemetry data files would be static. But we all know that perfect worlds are hard to come by, so building in an update process would behoove us.

As for the number of people credited, that could be anywhere from 1 to n; some of these will have one PI, others could have ten. It's highly variable.

Note: ATN and NCEI are still working out the authorship and acknowledgements in the files and resultant NCEI metadata, as some pieces weren't mapped correctly. That should be addressed very soon.
I got confused with the files in different repos. So, I've added the mobilization notebook here as a PR and converted it to .Rmd:

```r
rmarkdown:::convert_ipynb('atn_satellite_telemetry_netCDF2DwC.ipynb', "atn_satellite_telemetry_netCDF2DwC.Rmd")
```

The .Rmd, source data, and resultant DwC can be found in this directory:
I like samplingProtocol as 'satellite telemetry'. We were talking with the rest of the TDWG MOBS group about deciding on a controlled or suggested vocabulary for samplingProtocol, and any steps we take towards that will help us down the line. I would argue strongly for creating granular datasets: first because attribution can be precise and comprehensive without over-attributing researchers to unrelated tracks held at ATN, but also because that would allow individual researchers to revise/update/extend their program or their individual track data as needed without triggering a major update of some ATN-wide archive.
Is there a place in Darwin Core where we could have a link that goes to the NCEI-archived raw data? @laurabrenskelle was looking into this.
We would also want to do this for passive acoustic data, pointing back to the raw audio files at NCEI.
Created an issue to discuss this in the DwC Q&A repo: tdwg/dwc-qa#207
I agree; if the identifier for the observation record can stay consistent across service endpoints, that would be ideal. And @MathewBiddle, for PAM and the raw audio, I think we should use …
I don't want to close this just yet. TODO:
@laurabrenskelle can you take a look at the .Rmd and see how we can add the NCEI URL into the DwC archive? See https://github.com/ioos/bio_data_guide/tree/main/datasets/atn_satellite_telemetry/ And …
@MathewBiddle Are we just wanting to add the link to the landing page to …?
Good question. By granular, what do you mean? I don't think we can get much more granular than that from NCEI, unless we're talking about the specific URL to the data file? I think either including the URL to the landing page (e.g. https://www.ncei.noaa.gov/archive/accession/0282699) or one of the identifiers from the landing page (screenshot below) would suffice.
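In the meantime, a trivial sketch of what that could look like; whether `references` (dcterms) or `associatedReferences` is the right Darwin Core home for it is exactly what tdwg/dwc-qa#207 is meant to settle, so the field name here is an assumption.

```r
# Attach the NCEI accession landing page to every occurrence row (assumed field name).
occurrencedf$references <- "https://www.ncei.noaa.gov/archive/accession/0282699"
```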
Sorry, I guess because this is just one dataset from one shark's track, the landing page should suffice. Is that the case for all ATN data, or are they ever aggregated with data from multiple animals mixed together?
They will be archived on a deployment-by-deployment basis, so it should be one animal for each netCDF file.
An "old" but still ATN-relevant conversation from the TDWG Darwin Core Q&A issues: tdwg/dwc-qa#173 I thought it was worth dropping here for future reference. |
Thanks, @laurabrenskelle! I think I forgot to include …
**Contact details**
[email protected]

**Dataset Title**
ATN satellite telemetry data

**Describe your dataset and any specific challenges or blockers you have or anticipate.**
We are very close to a final netCDF template for ATN's satellite trajectory deployment files:
https://github.com/ioos/ioos-atn-data/blob/main/templates/atn_trajectory_template.cdl

Last year, I developed an R script to read in the template and start creating a DwC-A package. This year I'd like to finish that work, assuming we finish the template and create some example files.
https://github.com/MathewBiddle/ioos_code_lab/blob/r_nc2dwc/jupyterbook/content/code_gallery/data_management_notebooks/DRAFT-R-netCDF2DwC.ipynb

xref:

**Link to "raw" Data Files.**
https://github.com/ioos/ioos-atn-data/tree/main/data