RENCI/map-pipeline


Running the pipeline standalone (i.e., not in the pipeline Docker container)

install sbt

https://www.scala-sbt.org/download.html

compile the code using the sbt-assembly plugin to create a project JAR file that bundles all of its dependencies

sbt assembly

the generated JAR file is at target/scala-2.11/TIC preprocessing-assembly-0.1.0.jar (note the space in the file name; quote or escape it in shell commands)

process data

spark-submit --driver-memory=2g --executor-memory=2g --master <spark host> --class tic.Transform <sbt assembly output> --mapping_input_file <mapping file> --data_input_file <data file> --data_dictionary_input_file <data dictionary file> --data_mapping_files <data mapping files> --output_dir <output dir> [--redcap_application_token <token>]
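
For reference, a filled-in invocation might look like the following; the master URL, input file names, and output directory are illustrative placeholders, not values from this repo:

spark-submit --driver-memory=2g --executor-memory=2g --master "local[*]" \
  --class tic.Transform "target/scala-2.11/TIC preprocessing-assembly-0.1.0.jar" \
  --mapping_input_file mapping.json \
  --data_input_file data.csv \
  --data_dictionary_input_file data_dictionary.json \
  --data_mapping_files auxiliary_mapping.csv \
  --output_dir data/output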

install csvkit

pip install csvkit
pip install psycopg2-binary

create db

create user <uid> with password '<pwd>';
create database "<db>";
grant all on database "<db>" to <uid>;
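
These statements can be run from psql as a database superuser; the user, password, and database names below are placeholders:

psql -U postgres -h localhost -c "create user ctmd_user with password 'ctmd_pass';"
psql -U postgres -h localhost -c 'create database "ctmd";'
psql -U postgres -h localhost -c 'grant all on database "ctmd" to ctmd_user;'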

populate db

In the output dir, execute

csvsql --db "postgresql://<uid>:<pwd>@<host>/<db>" --insert --no-create -p \\ -e utf8 --date-format "%Y-%m-%d" tables/*
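
With the placeholder credentials from the create db step above, the command would read (values shown are illustrative only):

csvsql --db "postgresql://ctmd_user:ctmd_pass@localhost/ctmd" --insert --no-create -p \\ -e utf8 --date-format "%Y-%m-%d" tables/*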

Running the pipeline in the pipeline Docker container

All of the dependencies listed in the standalone section above are encapsulated in the pipeline Dockerfile, so running the pipeline in a Docker container is enough to streamline the whole data transformation pipeline and populate the CTMD database. When the pipeline code changes, rebuild the pipeline image from the tic-map-pipeline-script repo and push the rebuilt image to the txscience organization on Docker Hub. When the changes are significant enough to warrant a new version of the pipeline image, update the image reference in the ctmd-dashboard repo, in docker-compose.yml for the local development environment and in docker-compose.prod.yml for the non-local environments (e.g., production, stage, and dev), so they pick up the updated image version. The commands involved are sketched below for easy reference.
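
A typical rebuild-and-push sequence, run from a checkout of tic-map-pipeline-script, might look like the following; the image tag is an assumption for illustration, so check the compose files in ctmd-dashboard for the actual image reference:

cd tic-map-pipeline-script
docker build -t txscience/tic-map-pipeline-script:v2.0 .
docker push txscience/tic-map-pipeline-script:v2.0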

If any of the inputs required to run the pipeline change, for example the mapping.json input file, the REDCap input data, or the REDCap data dictionary, the code in the map-pipeline-schema repo, mainly HEALMapping.hs, needs to be updated accordingly so that the REDCap data fields are mapped into the database tables with the correct types. Rebuilding the pipeline Docker image then creates an updated tables.sql file by running the updated pipeline schema code. When the CTMD dashboard runs, the mapping.json file is mounted into the pipeline container; the updated tables.sql will correspond to the updated mapping.json, and the data transformation in the pipeline should run through.
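
As a minimal sketch of the mount described above, assuming a hypothetical image tag and container path (the real wiring lives in the ctmd-dashboard compose files):

docker run --rm -v "$(pwd)/mapping.json:/mapping.json" txscience/tic-map-pipeline-script:v2.0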
