In this repository you will find a synthetic implementation, on the Apache Spark framework, of the Risers Fatigue Analysis (RFA) scientific workflow, based on a real case study in the Oil and Gas domain. The implementation uses the natively available Process library to call external black-box applications from Spark.
The Risers Fatigue Analysis (RFA) workflow is composed of seven activities that receive input tuples, perform complex calculations on them, and transform them into resulting output tuples (a minimal sketch of this pipeline follows the activity list below).
Activities
- Uncompress Input Dataset - split one tuple into many tuples
- Preprocessing - map
- Analyze Risers - map
- Calculate Wear and Tear - filter
- Analyze Position - filter
- Join Results - join
- Compress Results - reduce tuples
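For orientation, here is a minimal sketch of how these seven activities might map onto Spark RDD operations in Scala, with each step delegating to an external black-box program through scala.sys.process (a plausible reading of "the natively available Process library"). All executable names, the join key, and the empty-output filter convention are illustrative assumptions, not the repository's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.sys.process._

object RfaSketch {

  // Hypothetical wrapper: run an external black-box executable via
  // scala.sys.process and return its stdout. The program names used
  // below are placeholders, not the repository's actual binaries.
  def blackBox(cmd: String, args: String*): String = (cmd +: args).!!.trim

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RFA-Spark-sketch"))
    val input = sc.textFile("input.dataset")

    // 1. Uncompress Input Dataset: one tuple becomes many (flatMap).
    val tuples = input.flatMap(t => blackBox("./uncompress", t).split("\n"))

    // 2-3. Preprocessing and Analyze Risers: one tuple in, one out (map).
    val analyzed = tuples.map(t => blackBox("./preprocess", t))
                         .map(t => blackBox("./analyze_risers", t))

    // 4-5. Calculate Wear and Tear / Analyze Position: keep only tuples the
    // black-box accepts; empty stdout is read here as "discard" (filter).
    val kept = analyzed.filter(t => blackBox("./wear_and_tear", t).nonEmpty)
                       .filter(t => blackBox("./analyze_position", t).nonEmpty)

    // 6. Join Results: sketched as joining surviving tuples back to the
    // analyzed tuples on the entry ID (first field); the real join
    // semantics live inside the black-box program.
    val joined = kept.map(t => (t.split(";")(0), t))
      .join(analyzed.map(t => (t.split(";")(0), t)))
      .values.map { case (a, b) => a + ";" + b }

    // 7. Compress Results: fold many tuples into few entries (reduce).
    val compressed = joined.reduce((a, b) => blackBox("./compress", a, b))

    println(compressed)
    sc.stop()
  }
}
```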
Clone the repository:
$ git clone https://github.com/hpcdb/RFA-Spark.git
$ cd RFA-Spark
- Edit the input dataset:
$ vi input.dataset
- Example:
ID;SPLITMAP;SPLITFACTOR;MAP1;MAP2;FILTER1;F1;FILTER2;F2;REDUCE;REDUCEFACTOR
1;5;8;5;5;5;50;5;50;5;4
- Fields:
- ID: Entry identifier
- SPLITMAP: Average Task Cost in Uncompress activity (seconds)
- SPLITFACTOR: Number of entries in the input dataset after uncompression
- MAP1: Average Task Cost in Pre-Processing activity (seconds)
- MAP2: Average Task Cost in Analyze Risers activity (seconds)
- FILTER1: Average Task Cost in Calculate Wear and Tear activity (seconds)
- F1: Percentage of entries that pass the Calculate Wear and Tear filter (i.e., the percentage that continues in the flow)
- FILTER2: Average Task Cost in Analyze Position activity (seconds)
- F2: Percentage of entries that pass the Analyze Position filter (i.e., the percentage that continues in the flow)
- REDUCE: Average Task Cost in Compress Results activity (seconds)
- REDUCEFACTOR: Number of compressed output entries
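To make the layout concrete, here is a minimal, hypothetical parser for one such entry; the case class and field names are illustrative, not part of the project's code:

```scala
// Illustrative parser for a single input.dataset entry (names are assumptions).
case class RfaEntry(id: Int, splitMap: Int, splitFactor: Int,
                    map1: Int, map2: Int,
                    filter1: Int, f1: Int,
                    filter2: Int, f2: Int,
                    reduce: Int, reduceFactor: Int)

def parseEntry(line: String): RfaEntry = {
  // Fields are semicolon-separated, in the order documented above.
  val Array(id, sm, sf, m1, m2, fl1, f1, fl2, f2, r, rf) =
    line.split(";").map(_.trim.toInt)
  RfaEntry(id, sm, sf, m1, m2, fl1, f1, fl2, f2, r, rf)
}

// Example: parseEntry("1;5;8;5;5;5;50;5;50;5;4").splitFactor == 8
```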
- Start the Apache Spark cluster
- Set SPARK_HOME environment variable
$ export SPARK_HOME=/path/to/spark
- Change directory to RFA-Spark home:
$ cd RFA-Spark
- Run:
$ ./run.sh <spark-master-url> <num-executors> <total-executor-cores>
Where:
- spark-master-url: The master URL for the cluster
- num-executors: Number of Apache Spark executors requested on the cluster
- total-executor-cores: Total number of cores requested on the cluster
- Example:
$ ./run.sh spark://hostname:7077 1 2
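For reference, run.sh presumably wraps a spark-submit invocation along the following lines; the jar path, the flags, and how the two executor arguments are mapped onto Spark settings are assumptions here, so check the script itself for the authoritative command:

```sh
# Hypothetical sketch of what run.sh might submit; the jar name and the
# flag mapping are assumptions -- see run.sh for the real invocation.
"$SPARK_HOME"/bin/spark-submit \
  --master "$1" \
  --conf spark.executor.instances="$2" \
  --total-executor-cores "$3" \
  rfa-spark-project/target/rfa-spark.jar
```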
- Change directory to rfa-spark-project:
$ cd RFA-Spark/rfa-spark-project
- Build with Maven (by default, the packaged jar is written under target/):
$ mvn package