This project demonstrates an automated ETL (Extract, Transform, Load) pipeline built on AWS services to process daily delivery data for a hypothetical company called SpeedDash. The goal is a simple, scalable, and efficient solution for near real-time, event-driven data processing.
The Speed-Dash Data Pipeline is a serverless data processing solution that leverages AWS services to automate the extraction, transformation, and loading of daily delivery data. When a JSON file containing delivery records is uploaded to a designated S3 bucket, an AWS Lambda function is triggered. The function filters the records based on delivery status and stores the processed data in a different S3 bucket. Additionally, Amazon SNS is used to send notifications regarding the status of data processing, ensuring stakeholders are informed in real time.
While Data Engineering projects can often seem complex and daunting, this project showcases a straightforward approach to building a functional data pipeline. By leveraging managed AWS services, the focus remains on solving the problem rather than dealing with infrastructure management. The project is inspired by similar real-world scenarios and is designed to be both educational and practical.
- **Amazon S3**: Stores both the raw input files and the processed output files, serving as the data lake for this project. [Learn More About Amazon S3](https://aws.amazon.com/s3/)
- **AWS Lambda**: A serverless compute service that runs the data processing logic whenever a new file is uploaded to the S3 bucket. It is configured to filter records based on delivery status and save the filtered data to another S3 bucket. [Learn More About AWS Lambda](https://aws.amazon.com/lambda/)
- **Amazon SNS (Simple Notification Service)**: Sends notifications about the processing status to subscribed users via email. [Learn More About Amazon SNS](https://aws.amazon.com/sns/)
- **AWS CodeBuild**: A fully managed build service that automates deployment of the Lambda function and its dependencies from a GitHub repository. [Learn More About AWS CodeBuild](https://aws.amazon.com/codebuild/)
- AWS Account
- Amazon S3 buckets: `speed-dash-landing-zone` (for raw files) and `speed-dash-target-zone` (for processed files)
- AWS Lambda
- Amazon SNS
- AWS IAM roles (for managing permissions)
- AWS CodeBuild (for CI/CD)
- GitHub (for version control)
- Python, including the `pandas` library
- Email subscription for SNS notifications
1. **Set Up S3 Buckets:**
   - Create two S3 buckets: `speed-dash-landing-zone` (for incoming raw JSON files) and `speed-dash-target-zone` (for processed files). A minimal creation script is sketched below.
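   A minimal sketch of creating both buckets with `boto3`, assuming the default `us-east-1` region (other regions require a `CreateBucketConfiguration` argument):

   ```python
   import boto3

   s3 = boto3.client("s3")

   # Create the landing (raw) and target (processed) buckets
   for bucket in ("speed-dash-landing-zone", "speed-dash-target-zone"):
       s3.create_bucket(Bucket=bucket)
   ```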
2. **Create Sample JSON File:**
   - Prepare a sample JSON file (e.g., `2024-03-09-raw_input.json`) containing delivery records with various statuses (e.g., "cancelled," "delivered," "order placed"). An illustrative file is shown below.
   - Upload daily JSON files to the `speed-dash-landing-zone` bucket, named in the format `yyyy-mm-dd-raw_input.json`.
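   An illustrative raw input file; the `order_id` and `customer` fields are assumptions for the example, and only the `status` field is required by the pipeline:

   ```json
   [
     {"order_id": 101, "customer": "A. Patel", "status": "delivered"},
     {"order_id": 102, "customer": "J. Smith", "status": "cancelled"},
     {"order_id": 103, "customer": "M. Chen", "status": "order placed"}
   ]
   ```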
3. **Set Up Amazon SNS Topic:**
   - Create an SNS topic to send notifications about the processing status.
   - Subscribe an email address to the topic to receive notifications (see the sketch below).
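   A sketch of both steps with `boto3`; the topic name and email address are placeholders, and the recipient must click the confirmation link SNS emails them before notifications are delivered:

   ```python
   import boto3

   sns = boto3.client("sns")

   # Create the topic (idempotent: returns the existing ARN if it already exists)
   topic = sns.create_topic(Name="speed-dash-notifications")

   # Subscribe a stakeholder email; SNS sends a confirmation link to this address
   sns.subscribe(
       TopicArn=topic["TopicArn"],
       Protocol="email",
       Endpoint="stakeholder@example.com",
   )
   ```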
4. **Create IAM Role for Lambda:**
   - Define an IAM role with the necessary permissions to read from `speed-dash-landing-zone`, write to `speed-dash-target-zone`, and publish messages to the SNS topic (a minimal policy sketch follows).
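   A minimal permissions policy along these lines, with placeholder region, account ID, and topic name in the SNS ARN; the role should also have the `AWSLambdaBasicExecutionRole` managed policy attached so the function can write CloudWatch logs:

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       { "Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::speed-dash-landing-zone/*" },
       { "Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::speed-dash-target-zone/*" },
       { "Effect": "Allow", "Action": "sns:Publish",
         "Resource": "arn:aws:sns:us-east-1:123456789012:speed-dash-notifications" }
     ]
   }
   ```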
5. **Create and Configure AWS Lambda Function:**
   - Develop a Lambda function using Python. Include the `pandas` library by either packaging it with the function code or using a Lambda Layer.
   - Configure the function to trigger when files are uploaded to `speed-dash-landing-zone`. The function should:
     - Read the JSON file into a `pandas` DataFrame.
     - Filter records where `status` is "delivered."
     - Write the filtered data to a new JSON file in `speed-dash-target-zone`.
     - Send a success or failure notification via SNS (see the handler sketch below).
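   A minimal sketch of such a handler, assuming the SNS topic ARN is supplied through an `SNS_TOPIC_ARN` environment variable and that the output file reuses the input object key (both choices are illustrative, not fixed by the project):

   ```python
   import os
   import urllib.parse

   import boto3
   import pandas as pd

   s3 = boto3.client("s3")
   sns = boto3.client("sns")

   TARGET_BUCKET = "speed-dash-target-zone"
   SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]  # illustrative: ARN passed in via configuration

   def lambda_handler(event, context):
       # The S3 put event carries the bucket and (URL-encoded) key of the uploaded file
       record = event["Records"][0]["s3"]
       bucket = record["bucket"]["name"]
       key = urllib.parse.unquote_plus(record["object"]["key"])
       try:
           # Read the raw JSON file into a DataFrame
           obj = s3.get_object(Bucket=bucket, Key=key)
           df = pd.read_json(obj["Body"])

           # Keep only records whose status is "delivered"
           delivered = df[df["status"] == "delivered"]

           # Write the filtered records to the target bucket under the same key
           s3.put_object(
               Bucket=TARGET_BUCKET,
               Key=key,
               Body=delivered.to_json(orient="records"),
           )
           sns.publish(
               TopicArn=SNS_TOPIC_ARN,
               Subject="Speed-Dash: processing succeeded",
               Message=f"Wrote {len(delivered)} delivered records from {key} to {TARGET_BUCKET}.",
           )
           return {"statusCode": 200}
       except Exception as exc:
           sns.publish(
               TopicArn=SNS_TOPIC_ARN,
               Subject="Speed-Dash: processing failed",
               Message=f"Error while processing {key}: {exc}",
           )
           raise
   ```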
6. **Set Up AWS CodeBuild for CI/CD:**
   - Host the Lambda function code on GitHub.
   - Create an AWS CodeBuild project linked to the GitHub repository.
   - Use the `buildspec.yml` file to automate the deployment of Lambda function code updates (a sketch follows).
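   A minimal `buildspec.yml` along these lines; the function name `speed-dash-processor` and the Python runtime version are placeholders, and the CodeBuild service role needs `lambda:UpdateFunctionCode` permission:

   ```yaml
   version: 0.2

   phases:
     install:
       runtime-versions:
         python: 3.12
     build:
       commands:
         # Bundle the handler and its dependencies (e.g., pandas) into a zip
         - pip install -r requirements.txt -t package/
         - cp scripts/*.py package/
         - cd package && zip -r ../deployment.zip . && cd ..
         # Push the new package to the existing Lambda function
         - aws lambda update-function-code --function-name speed-dash-processor --zip-file fileb://deployment.zip
   ```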
7. **Testing and Verification:**
   - Upload a sample JSON file to `speed-dash-landing-zone` and ensure the Lambda function is triggered automatically.
   - Verify that the processed file is correctly stored in `speed-dash-target-zone`.
   - Check for email notifications to confirm the processing status (a quick end-to-end check is sketched below).
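   A quick end-to-end check with `boto3`, assuming the sample file sits in the current directory:

   ```python
   import boto3

   s3 = boto3.client("s3")
   key = "2024-03-09-raw_input.json"

   # Upload the sample file; this should trigger the Lambda function
   s3.upload_file(key, "speed-dash-landing-zone", key)

   # After the function runs, the filtered output should appear in the target bucket
   resp = s3.list_objects_v2(Bucket="speed-dash-target-zone")
   print([obj["Key"] for obj in resp.get("Contents", [])])
   ```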
- **Project Screenshots**: Contains snapshots of the architecture diagram and other relevant screenshots.
- **scripts**: Contains scripts used for the AWS Lambda function and other processing logic.
- **buildspec.yml**: Configuration file for AWS CodeBuild to automate deployment.
- **requirements.txt**: Contains Python dependencies for the Lambda function, such as `pandas`.