- Introduction
- Project Architecture
- Requirements
- Environment Setup
- Repository Structure
- Steps
- Figures
- Acknowledgements
End-to-End Data Engineering Project using Kaggle YouTube Trending Dataset
This is a data engineering project simulating a request from a customer launching a new data-driven campaign, with YouTube as the main advertising channel.
The goal of this project is to help our Analytics team answer the following questions:
- How do we categorize videos based on their comments and statistics?
- What factors affect how popular a YouTube video will be?
How will the customer measure success?
- Data Ingestion - Ingest data, both as one-off loads and incrementally
- Data Lake - Design and build a new Data Lake architecture
- AWS Cloud - AWS as the cloud provider
- ETL Design - Extract, transform and load data efficiently
- Scalability - The data architecture should scale efficiently
- Reporting - Build a Business Intelligence tier, including Dashboards (In Progress)
- Build a data lake from scratch in AWS S3 (joining semi-structured and structured data).
- Design lake house architecture while implementing best practices (cost and performance).
- Design the data lake in layers, partitioned for cost-performance (e.g., landing, cleansed as the SSOT, reporting for BI users), with WORM (Write Once Read Many) in mind.
- Perform ETL in AWS Glue using Spark jobs, with Amazon SNS for alerting, and SQL via Amazon Athena and Spark SQL (i.e., showing the impact of querying the optimized data layers).
- Ingest changes incrementally and manage schema evolution.
- Connect to BI dashboards. (To be continued)
- Top trending videos on YouTube
- What is "Trending"?: YouTube uses a combination of factors, including user interactions (e.g., number of views, shares, comments, and likes); trending is not simply the most-viewed videos per calendar year.
- Source:
- Kaggle
- Data collected using Trending YouTube Scraper
The reason for building the Reporting / Analytics layer is to avoid large, complex join queries at analysis time. We can combine the data in the ETL pipeline and hand only the final version of the table to the Data Analyst or Data Scientist, which makes it easier to quickly analyze the data and build machine learning models. Compare the two queries below:
Using Cleansed Layer
```sql
SELECT ref.snippet_title, stats.title
FROM raw_statistics stats
INNER JOIN cleansed_statistics_reference_data ref ON (stats.category_id = ref.id)
WHERE ref.id = 2;
```
Using Reporting Layer
```sql
SELECT snippet_title, title
FROM rpt_youtube_statistics_categories
WHERE id = 2;
```
Languages
- Bash
- Python 3.8
- Pandas
Technologies
- Spark 3.1
- AWS Athena
- AWS Glue 3.0 - Supports Spark 3.1, Scala 2, Python 3
- AWS Lambda
- AWS Identity and Access Management (IAM)
Third-Party Libraries
- AWS CLI
- Install and configure AWS CLI
- Steps to get our data:
  - Download from Kaggle
  - Create an Amazon S3 bucket for our landing (raw) area
    - Use the naming convention: s3://company-raw-awsregion-awsaccountID-env/source/source_region
      - env = dev, test, or prod
      - source = name or indicator of the data source
      - source_region = region of the data source
    - In our case, the bucket name is: deproject-on-youtube-raw-useast1-dev/
  - Copy the data to S3 using the AWS CLI commands in ./bash/ (an equivalent boto3 sketch follows below)
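For illustration, here is a minimal boto3 sketch of what the CLI commands in ./bash/ do; it assumes the Kaggle archive was extracted to ./data/, and the local filenames (e.g., CAvideos.csv, CA_category_id.json) follow the Kaggle dataset's naming. The key layout mirrors the prefixes used later in the Steps section.

```python
# A minimal boto3 sketch of the CLI uploads (assumes the Kaggle archive was
# extracted to ./data/; filenames like CAvideos.csv follow the dataset's naming).
import glob
import os

import boto3

BUCKET = "deproject-on-youtube-raw-useast1-dev"  # landing (raw) bucket from above
s3 = boto3.client("s3")

# Reference data (JSON): all regions under a single prefix.
for path in glob.glob("data/*_category_id.json"):
    key = f"youtube/raw_statistics_reference_data/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)

# Statistics (CSV): Hive-style partitioned by region for cost-performance in Athena.
for path in glob.glob("data/*videos.csv"):
    region = os.path.basename(path)[:2].lower()  # e.g., CAvideos.csv -> "ca"
    key = f"youtube/raw_statistics/region={region}/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)
```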
./athena/
contains SQL for viewing tables on Athena
./bash/
contains the CLI commands to copy files to S3 bucket
./data/
contains raw data from Kaggle and sample output data from Athena
./glue/
contains the Spark scripts to launch Spark jobs using Glue
./lambda/
contains Python scripts for the ETL job in Lambda
./static/
contains images used for this project
1. Create an IAM user account
2. Install the AWS CLI (see Environment Setup above)
3. Create the S3 bucket (Raw Area) and upload the data using the CLI commands
4. Build a Glue Crawler and Catalog for the JSON data (semi-structured); see the boto3 sketch after this list
   - s3://deproject-on-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/
5. Use Athena and SQL to query the data
6. Solve any errors that come up during preprocessing
7. Configure the desired schema changes in the Glue console
8. Write the ETL job in Lambda and clean the data; see the handler sketch after this list
   - Set the target to the cleansed S3 bucket (Cleansed / Enriched)
   - Add a trigger to the function so that it runs whenever a new object is added to the landing S3 bucket (Raw)
9. Query the clean data on Athena
10. Repeat steps 5 to 9, but for the CSV data (structured)
    - s3://deproject-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=ca/
11. Join the semi-structured and structured data using the Glue join script; see the PySpark sketch after this list
    - This creates a table (final_analytics) in the Data Catalog and, on subsequent runs, updates the schema and adds new partitions
    - Set the target to the analytics S3 bucket (Reporting / Analytics)
12. Use Athena and SQL to query the final_analytics table; a Python/Athena sketch follows this list
13. Connect to a BI tool of choice for visualizing the data (To be continued)
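The sketches below illustrate steps 4, 8, 11, and 12 in code. First, step 4: a minimal boto3 sketch of the crawler over the raw JSON prefix. The crawler name, IAM role ARN, and Catalog database name are placeholders, not values from this repo.

```python
# A sketch of step 4, assuming a crawler name, IAM role, and Catalog database;
# substitute your own values.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="youtube-raw-reference-data-crawler",             # assumed name
    Role="arn:aws:iam::111122223333:role/glue-s3-access",  # assumed IAM role
    DatabaseName="db_youtube_raw",                         # assumed Catalog database
    Targets={
        "S3Targets": [
            {"Path": "s3://deproject-on-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/"}
        ]
    },
)
glue.start_crawler(Name="youtube-raw-reference-data-crawler")
```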
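Next, step 8: a sketch of the Lambda handler that flattens the nested JSON and writes Parquet to the cleansed bucket. It assumes the awswrangler (AWS SDK for pandas) Lambda layer; the environment-variable names and write mode are illustrative, and the actual handler lives in ./lambda/.

```python
# A sketch of the Lambda ETL for step 8, assuming the awswrangler layer and
# these (illustrative) environment-variable names.
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

CLEANSED_PATH = os.environ["s3_cleansed_layer"]  # e.g., s3://...-cleansed-.../youtube/
GLUE_DATABASE = os.environ["glue_catalog_db_name"]
GLUE_TABLE = os.environ["glue_catalog_table_name"]
WRITE_MODE = os.environ["write_data_operation"]  # e.g., "append"


def lambda_handler(event, context):
    # The S3 trigger (step 8) passes the new object's bucket and key in the event.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    # Read the raw JSON and flatten the nested "items" array into columns.
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed bucket and register the table in the Catalog.
    return wr.s3.to_parquet(
        df=df_flat,
        path=CLEANSED_PATH,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```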
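Step 11's join, sketched as a Glue PySpark job. The Catalog database/table names and the analytics bucket name are assumptions (the project's actual script is in ./glue/); getSink with enableUpdateCatalog=True is what lets subsequent runs update the final_analytics schema and add new partitions.

```python
# A sketch of the Glue Spark join for step 11; database, table, and bucket
# names are assumptions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Cleansed CSV statistics and cleansed JSON reference data, read via the Catalog.
stats = glue_context.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="raw_statistics"
)
refs = glue_context.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="cleaned_statistics_reference_data"
)

# Join statistics to reference data on the category id.
joined = stats.join(paths1=["category_id"], paths2=["id"], frame2=refs)

# Write partitioned Parquet to the analytics bucket; enableUpdateCatalog keeps
# the final_analytics schema current and adds new partitions on subsequent runs.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://deproject-on-youtube-analytics-useast1-dev/",  # assumed bucket
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["region", "category_id"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="db_youtube_analytics", catalogTableName="final_analytics")
sink.writeFrame(joined)
```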
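Finally, step 12 can also be run from Python rather than the Athena console: a minimal boto3 sketch, where the database name, column names, and query-results bucket are assumptions.

```python
# A sketch of step 12 from Python; database, columns, and the results bucket
# are assumptions.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString=(
        "SELECT snippet_title, title "
        "FROM final_analytics WHERE category_id = 2 LIMIT 10;"
    ),
    QueryExecutionContext={"Database": "db_youtube_analytics"},
    ResultConfiguration={"OutputLocation": "s3://deproject-on-youtube-athena-results/"},
)
```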
Inspiration, code snippets, tutorials, etc.