Data citation made reproducible.
As dataset change over time queries on these datasets are often invalided or not reproducible because of the changes in the dataset. A possible solution is to version the datasets as well as the queries and save the version of the dataset the query was run on.
This is a simple prototype that saves and versions the datasets and queries in Git, a distributed version control system. Furthermore it allows to:
- Test queries on existing datasets and save them with a corresponding persistent identifier.
- Rerun queries that can be selected from their persistent identifier.
Prerequisite: Git must be installed on your computer.
At the moment datasets can only be registered via Git directly!
At the moment datasets can only be in CSV with ; as field separator!
You can use e.g. sed to replace the field separators.
_Although git
is very good in terms of storage usage many changes can bloat the repository. Ideally you should use
git gc
to start the git
garbage collection.
Prepare a git repository that automatically collects your data. E.g. create a git repository by running:
cd /home/pi/MetroData/Database
git init
Create a script that crawls and commits your data:
#!/bin/bash
cd /home/pi/MetroData/
# Get the data from an open data portal
# (in this case meteorological data from Austria)
curl --retry 5 -L -o tawes1h.csv http://www.zamg.ac.at/ogd/
# Delete the header so it doesn't corrupt your data
sed -i -e '1d' tawes1h.csv
# Pipe the data into a dataset in the database folder
cat tawes1h.csv >> Database/ZAMG-MetroData.csv
rm tawes1h.csv
cd Database/
# Commit the change to the database
git checkout master
message=`date +%Y-%m-%d.%H:%M`
git commit -am "ZAMG-MetroData $message"
# For a smaller storage footprint use the following command
git gc
Create a cronjob that runs your script automatically.
Prerequisite: Maven must be installed on your computer.
Download the ReCitable repository from Github.
Change to the DOWLOAD_DIR
and run the following command:
mvn clean install
Then change to resources directory of the web application by running:
cd DOWLOAD_DIR/webapp/src/main/resources
At the moment only a single repository for datasets is supported!
Change the parameter databaseLocation
to point to the repository you created before e.g.
/home/pi/MetroData/Database
.
Change to the web application directory and run the Jetty server with the following command:
cd DOWLOAD_DIR/webapp
mvn jetty:run
You can now access the web application at the URL http://localhost:8080.
There you can:
- Select a dataset and try different queries on it. Standard SQL can be used in the text area, but you always need to
provide the dataset name as table reference e.g.
SELECT * FROM ZAMG-MetroData
. - Assign a PID and a description to the query and save it.
- Rerun a query that was saved before exactly the way it was run before.
- Returns to the start any time you want by using the logo link.
Be aware that this is a prototype and you can easily destroy your database by altering the datasets!