ReCitable

Data citation made reproducible.

Getting started!

Because datasets change over time, queries on them often become invalid or cannot be reproduced. A possible solution is to version both the datasets and the queries, and to record the version of the dataset each query was run on.

This is a simple prototype that stores and versions datasets and queries in Git, a distributed version control system. It also lets you:

  • Test queries on existing datasets and save them with a corresponding persistent identifier.
  • Rerun queries that can be selected from their persistent identifier.
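The versioning idea can be illustrated with plain Git commands. The following is only a sketch under stated assumptions, not necessarily how ReCitable stores its data internally: it uses a throwaway repository and a hypothetical file `demo.csv`, records the commit a query would have been run on, and later retrieves the dataset exactly as it was at that commit.

```shell
#!/bin/bash
set -e
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.org"

# Version 1 of a hypothetical dataset.
printf '11035;12.3\n' > demo.csv
git add demo.csv
git commit -q -m "demo dataset v1"

# Remember which version of the dataset a query was run on.
version=$(git rev-parse HEAD)

# The dataset changes later.
printf '11036;9.8\n' >> demo.csv
git commit -q -am "demo dataset v2"

# Recover the dataset exactly as it was when the query ran.
git show "$version:demo.csv"
```

Storing the commit hash next to the query is what makes a later rerun see the same rows, even after the dataset has grown.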

How to test?

Data collection

Prerequisite: Git must be installed on your computer.

At the moment, datasets can only be registered via Git directly!

At the moment, datasets must be CSV files with ; as the field separator! You can use a tool such as sed to replace the field separators.

Although Git is storage-efficient, many changes can bloat the repository. Ideally, run git gc from time to time to trigger Git's garbage collection.
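As a minimal sketch of the separator replacement with sed, assuming a comma-separated input in which no quoted field itself contains a comma (the function name `to_semicolon` and the file names in the usage comment are hypothetical):

```shell
#!/bin/bash
# Convert comma-separated input to semicolon-separated output.
# Caution: this naive replacement also rewrites commas inside quoted fields.
to_semicolon() {
  sed 's/,/;/g'
}

# Usage with hypothetical file names:
#   to_semicolon < input.csv > Database/MyDataset.csv
printf 'station,temp,wind\n11035,12.3,4\n' | to_semicolon
```

For data with quoted fields, a CSV-aware tool is a safer choice than a plain text substitution.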

Prepare a Git repository that automatically collects your data. For example, create a repository by running:

cd /home/pi/MetroData/Database
git init

Create a script that crawls and commits your data:

#!/bin/bash
cd /home/pi/MetroData/ || exit 1
# Get the data from an open data portal
# (in this case meteorological data from Austria)
curl --retry 5 -L -o tawes1h.csv http://www.zamg.ac.at/ogd/
# Delete the header so it doesn't corrupt your data
sed -i -e '1d' tawes1h.csv
# Append the data to the dataset in the database folder
cat tawes1h.csv >> Database/ZAMG-MetroData.csv
rm tawes1h.csv
cd Database/ || exit 1
# Commit the change to the database
git checkout master
# Stage the dataset explicitly; "git commit -a" alone skips untracked files,
# so the very first run would otherwise commit nothing
git add ZAMG-MetroData.csv
message=$(date +%Y-%m-%d.%H:%M)
git commit -m "ZAMG-MetroData $message"
# For a smaller storage footprint use the following command
git gc

Create a cronjob that runs your script automatically.
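A cron entry for this could look like the sketch below. The script path `/home/pi/MetroData/collect.sh` and the hourly schedule are assumptions for illustration; use the location where you saved the script and an interval that matches how often the source data is updated. The script must be executable (chmod +x).

```shell
#!/bin/bash
# Hypothetical path of the collection script; adjust to where you saved it.
script=/home/pi/MetroData/collect.sh

# Build a crontab entry that runs the given script at minute 0 of every hour.
cron_entry() {
  printf '0 * * * * %s\n' "$1"
}

# Print the entry for review.
cron_entry "$script"

# To install it, append it to the current user's crontab:
#   (crontab -l 2>/dev/null; cron_entry "$script") | crontab -
```

Printing the entry first and installing it in a separate step makes it easy to check the schedule before it goes live.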

Starting the application

Prerequisite: Maven must be installed on your computer.

Download the ReCitable repository from GitHub.

Change to the DOWNLOAD_DIR and run the following command:

mvn clean install

Then change to the resources directory of the web application by running:

cd DOWNLOAD_DIR/webapp/src/main/resources

At the moment only a single repository for datasets is supported!

Change the parameter databaseLocation to point to the repository you created before, e.g. /home/pi/MetroData/Database.

Change to the web application directory and run the Jetty server with the following commands:

cd DOWNLOAD_DIR/webapp
mvn jetty:run

You can now access the web application at the URL http://localhost:8080.

There you can:

  • Select a dataset and try different queries on it. Standard SQL can be used in the text area, but you always need to provide the dataset name as the table reference, e.g. SELECT * FROM ZAMG-MetroData.
  • Assign a PID and a description to the query and save it.
  • Rerun a query that was saved before exactly the way it was run before.
  • Return to the start page at any time by clicking the logo link.

Be aware that this is a prototype and you can easily destroy your database by altering the datasets!