A data pipeline built to filter closed-order data and push it to Hive and HBase on a schedule, with notifications for the success or failure of the pipeline. The pipeline steps (sketched as an Airflow DAG after this list) are:
- Fetching the orders data from an S3 bucket
- Creating a customers info table in MySQL by loading the data from the customers info file [link]
- Loading the customers information from the MySQL database into Hive using Sqoop
- Filtering the orders data for closed orders by processing it with Spark
- Creating a table for the closed orders in Hive
- Joining the closed orders table with the customers table in Hive (the joined data is stored in HBase, which is possible because of the Hive-HBase integration)
- Sending a success or failure notification to a Slack channel
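
Below is a minimal sketch of how these steps could be chained as an Airflow DAG that runs everything on the EC2 instance over SSH. The DAG id, task ids, connection id (`ssh_ec2`), file paths, and the exact Sqoop/Spark/Hive commands are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch: assumes an SSH connection id "ssh_ec2" pointing at the EC2
# instance where Hadoop, Hive, Sqoop, Spark, and HBase are installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="closed_orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the raw orders file from S3 onto the EC2 instance (bucket/paths are placeholders)
    fetch_orders = SSHOperator(
        task_id="fetch_orders_from_s3",
        ssh_conn_id="ssh_ec2",
        command="aws s3 cp s3://<orders-bucket>/orders.csv /home/ubuntu/data/orders.csv",
    )

    # Import the customers table from MySQL into Hive using Sqoop
    load_customers = SSHOperator(
        task_id="sqoop_customers_to_hive",
        ssh_conn_id="ssh_ec2",
        command=(
            "sqoop import --connect jdbc:mysql://localhost/<db> "
            "--username <user> --password-file <hdfs-path-to-password> "
            "--table customers --hive-import --hive-table customers"
        ),
    )

    # Filter the orders data down to closed orders with a Spark job
    filter_closed_orders = SSHOperator(
        task_id="filter_closed_orders",
        ssh_conn_id="ssh_ec2",
        command="spark-submit /home/ubuntu/scripts/filter_closed_orders.py",
    )

    # Join closed orders with customers in Hive; the target table is assumed to be
    # created with STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    # so the joined rows land in HBase through the Hive-HBase integration
    join_and_store = SSHOperator(
        task_id="join_closed_orders_with_customers",
        ssh_conn_id="ssh_ec2",
        command="hive -f /home/ubuntu/scripts/join_closed_orders_customers.hql",
    )

    fetch_orders >> filter_closed_orders >> join_and_store
    load_customers >> join_and_store
```

The Slack success/failure notification tasks sketched further below would be attached downstream of `join_and_store`.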
- Installing Hadoop, Hive, MySQL, and HBase on an EC2 instance and creating a connection id in Airflow for executing the pipeline on the EC2 instance
i. Refer to this article for Hadoop installation on Ubuntu
ii. Refer to this article for Hive installation on Ubuntu
iii. Refer to this article for MySQL installation on Ubuntu
iv. Refer to this article for HBase installation on Ubuntu
- Installing Docker for running the Airflow container
i. Refer to this documentation for installing Docker
ii. To run the Airflow container, use the docker-compose.yaml: clone the repo, open a terminal, cd into the repo folder, and run `docker-compose up`
- Create an SSH connection id for connecting to the EC2 instance; refer to this article (Note: go to airflow-ui to access the Airflow UI)
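
For reference, once the SSH connection exists it is only referenced by its connection id. The sketch below assumes an id of `ssh_ec2` (use whatever id you entered in the Airflow UI); the host, user, and key path are placeholders.

```python
# Sketch only: "ssh_ec2" is an assumed connection id; host, user, and key path are placeholders.
# The connection can also be supplied as an environment variable in URI form, e.g. in
# docker-compose.yaml: AIRFLOW_CONN_SSH_EC2="ssh://ubuntu@<ec2-public-dns>:22?key_file=/opt/keys/ec2.pem"
from airflow.providers.ssh.operators.ssh import SSHOperator

check_cluster = SSHOperator(
    task_id="check_hdfs_is_up",
    ssh_conn_id="ssh_ec2",     # the SSH connection id created in the Airflow UI
    command="hdfs dfs -ls /",  # any quick command that confirms the EC2 services respond
)
```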
- Create a Slack webhook integration and configure a Slack webhook connection in Airflow (to create a connection, go to the Admin section in the Airflow UI and click on Connections)
i. Refer to this article for creating a Slack webhook
ii. Refer to this article for configuring the Slack webhook connection in Airflow
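
For reference, the sketch below shows one way the success and failure notifications could be sent once the webhook connection is configured. The connection id `slack_webhook` is an assumption (use the id you created under Admin -> Connections), and the operator's parameter names differ slightly between versions of the apache-airflow-providers-slack package.

```python
# Sketch only: "slack_webhook" is an assumed connection id for the Slack webhook.
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

notify_success = SlackWebhookOperator(
    task_id="notify_success",
    slack_webhook_conn_id="slack_webhook",
    message="Closed-orders pipeline succeeded",
    trigger_rule="all_success",  # fires only when every upstream task succeeded
)

notify_failure = SlackWebhookOperator(
    task_id="notify_failure",
    slack_webhook_conn_id="slack_webhook",
    message="Closed-orders pipeline failed",
    trigger_rule="one_failed",   # fires as soon as any upstream task fails
)

# Attach both downstream of the final pipeline task, e.g.:
# join_and_store >> [notify_success, notify_failure]
```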