- Introduction
- Project Architecture
- Requirements
- Environment Setup
- Repository Structure
- Steps
- Figures
- Acknowledgements
End-to-End Data Engineering Project using Kaggle YouTube Trending Dataset
This is a data engineering project simulating a request from a customer launching a new data-driven campaign, with YouTube as the main advertising channel.
The goal of this project is to help our Analytics team answer the following questions:
- How do we categorize videos based on their comments and statistics?
- What factors affect how popular a YouTube video will be?
How will the customer measure success?
- Data Ingestion - Ingest data, both as one-off loads and incrementally
- Data Lake - Design and build a new Data Lake architecture
- AWS Cloud - AWS as the cloud provider
- ETL Design - Extract, transform and load data efficiently
- Scalability - The data architecture should scale efficiently
- Reporting - Build a Business Intelligence tier, including Dashboards (In Progress)
- Build a data lake from scratch in AWS S3 (joining semi-structured and structured data).
- Design lake house architecture while implementing best practices (cost and performance).
- Design the data lake in layers, partitioned for cost-performance (e.g., landing, cleansed as the SSOT, reporting for BI users), with WORM (Write Once Read Many) in mind.
- Perform ETL in AWS Glue using Spark jobs, with Amazon SNS for alerting, and SQL via Amazon Athena and Spark SQL (i.e., showing the impact of querying the optimized data layers).
- Ingest changes incrementally and manage schema evolution.
- Connect to BI dashboards. (To be continued)
- Top trending videos on YouTube
- What is "Trending"?: YouTube uses a combination of factors, including user interactions (e.g., number of views, shares, comments, and likes); trending is not simply the most-viewed videos per calendar year.
- Source:
- Kaggle
- Data collected using Trending YouTube Scraper
The reason for building the Reporting / Analytics layer is to avoid large, complex join queries at analysis time. We can combine the data in the ETL pipeline and hand only the final version of the table to the Data Analyst or Data Scientist, which makes it easier to quickly analyze the data and build machine learning models. Compare the two queries below:
Using Cleansed Layer
```sql
SELECT ref.snippet_title, stats.title
FROM raw_statistics stats
INNER JOIN cleansed_statistics_reference_data ref ON (stats.category_id = ref.id)
WHERE ref.id = 2;
```
Using Reporting Layer
```sql
SELECT snippet_title, title
FROM rpt_youtube_statistics_categories
WHERE id = 2;
```
Languages
- Bash
- Python 3.8
- Pandas
Technologies
- Spark 3.1
- AWS Athena
- AWS Glue 3.0 - Supports Spark 3.1, Scala 2, Python 3
- AWS Lambda
- AWS Identity and Access Management (IAM)
Third-Party Libraries
- AWS CLI
- Install and configure AWS CLI
- Steps to get our data:
  - Download from Kaggle
  - Create an Amazon S3 bucket for our landing (raw) area
    - Use the naming convention: s3://company-raw-awsregion-awsaccountID-env/source/source_region
      - env = dev, test, or prod
      - source = name or indicator of the data source
      - source_region = region of the data source
    - In our case, the bucket name is: deproject-on-youtube-raw-useast1-dev/
  - Copy the data to S3 using the AWS CLI commands in ./bash/ (an equivalent boto3 sketch follows below)
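For illustration, here is a minimal boto3 sketch of what the CLI commands in ./bash/ do; it assumes the Kaggle archive was extracted to ./data/, and the local filenames (e.g., CAvideos.csv, CA_category_id.json) follow the Kaggle dataset's naming. The key layout mirrors the prefixes used later in the Steps section.

```python
# A minimal boto3 sketch of the CLI uploads (assumes the Kaggle archive was
# extracted to ./data/; filenames like CAvideos.csv follow the dataset's naming).
import glob
import os

import boto3

BUCKET = "deproject-on-youtube-raw-useast1-dev"  # landing (raw) bucket from above
s3 = boto3.client("s3")

# Reference data (JSON): all regions under a single prefix.
for path in glob.glob("data/*_category_id.json"):
    key = f"youtube/raw_statistics_reference_data/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)

# Statistics (CSV): Hive-style partitioned by region for cost-performance in Athena.
for path in glob.glob("data/*videos.csv"):
    region = os.path.basename(path)[:2].lower()  # e.g., CAvideos.csv -> "ca"
    key = f"youtube/raw_statistics/region={region}/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)
```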
./athena/
contains SQL for viewing tables on Athena
./bash/
contains the CLI commands to copy files to S3 bucket
./data/
contains raw data from Kaggle and sample output data from Athena
./glue/
contains the Spark scripts to launch Spark jobs using Glue
./lambda/
contains Python scripts for the ETL job in Lambda
./static/
contains images used for this project
1. Create an IAM user account
2. Install the AWS CLI (see Environment Setup above)
3. Create the S3 bucket (Raw Area) and upload the data using the CLI commands
4. Build a Glue Crawler and Catalog for the JSON data (semi-structured); see the boto3 sketch after this list
   - s3://deproject-on-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/
5. Use Athena and SQL to query the data
6. Solve any errors that come up during preprocessing
7. Configure the desired schema changes in the Glue console
8. Write the ETL job in Lambda and clean the data; see the handler sketch after this list
   - Set the target to the cleansed S3 bucket (Cleansed / Enriched)
   - Add a trigger to the function so that it runs whenever a new object is added to the landing S3 bucket (Raw)
9. Query the clean data on Athena
10. Repeat steps 5 to 9, but for the CSV data (structured)
    - s3://deproject-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=ca/
11. Join the semi-structured and structured data using the Glue join script; see the PySpark sketch after this list
    - This creates a table (final_analytics) in the Data Catalog and, on subsequent runs, updates the schema and adds new partitions
    - Set the target to the analytics S3 bucket (Reporting / Analytics)
12. Use Athena and SQL to query the final_analytics table; a Python/Athena sketch follows this list
13. Connect to a BI tool of choice for visualizing the data (To be continued)
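The sketches below illustrate steps 4, 8, 11, and 12 in code. First, step 4: a minimal boto3 sketch of the crawler over the raw JSON prefix. The crawler name, IAM role ARN, and Catalog database name are placeholders, not values from this repo.

```python
# A sketch of step 4, assuming a crawler name, IAM role, and Catalog database;
# substitute your own values.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="youtube-raw-reference-data-crawler",             # assumed name
    Role="arn:aws:iam::111122223333:role/glue-s3-access",  # assumed IAM role
    DatabaseName="db_youtube_raw",                         # assumed Catalog database
    Targets={
        "S3Targets": [
            {"Path": "s3://deproject-on-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/"}
        ]
    },
)
glue.start_crawler(Name="youtube-raw-reference-data-crawler")
```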
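Next, step 8: a sketch of the Lambda handler that flattens the nested JSON and writes Parquet to the cleansed bucket. It assumes the awswrangler (AWS SDK for pandas) Lambda layer; the environment-variable names and write mode are illustrative, and the actual handler lives in ./lambda/.

```python
# A sketch of the Lambda ETL for step 8, assuming the awswrangler layer and
# these (illustrative) environment-variable names.
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

CLEANSED_PATH = os.environ["s3_cleansed_layer"]  # e.g., s3://...-cleansed-.../youtube/
GLUE_DATABASE = os.environ["glue_catalog_db_name"]
GLUE_TABLE = os.environ["glue_catalog_table_name"]
WRITE_MODE = os.environ["write_data_operation"]  # e.g., "append"


def lambda_handler(event, context):
    # The S3 trigger (step 8) passes the new object's bucket and key in the event.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    # Read the raw JSON and flatten the nested "items" array into columns.
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed bucket and register the table in the Catalog.
    return wr.s3.to_parquet(
        df=df_flat,
        path=CLEANSED_PATH,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```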
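Step 11's join, sketched as a Glue PySpark job. The Catalog database/table names and the analytics bucket name are assumptions (the project's actual script is in ./glue/); getSink with enableUpdateCatalog=True is what lets subsequent runs update the final_analytics schema and add new partitions.

```python
# A sketch of the Glue Spark join for step 11; database, table, and bucket
# names are assumptions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Cleansed CSV statistics and cleansed JSON reference data, read via the Catalog.
stats = glue_context.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="raw_statistics"
)
refs = glue_context.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="cleaned_statistics_reference_data"
)

# Join statistics to reference data on the category id.
joined = stats.join(paths1=["category_id"], paths2=["id"], frame2=refs)

# Write partitioned Parquet to the analytics bucket; enableUpdateCatalog keeps
# the final_analytics schema current and adds new partitions on subsequent runs.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://deproject-on-youtube-analytics-useast1-dev/",  # assumed bucket
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["region", "category_id"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="db_youtube_analytics", catalogTableName="final_analytics")
sink.writeFrame(joined)
```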
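Finally, step 12 can also be run from Python rather than the Athena console: a minimal boto3 sketch, where the database name, column names, and query-results bucket are assumptions.

```python
# A sketch of step 12 from Python; database, columns, and the results bucket
# are assumptions.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString=(
        "SELECT snippet_title, title "
        "FROM final_analytics WHERE category_id = 2 LIMIT 10;"
    ),
    QueryExecutionContext={"Database": "db_youtube_analytics"},
    ResultConfiguration={"OutputLocation": "s3://deproject-on-youtube-athena-results/"},
)
```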
Inspiration, code snippets, tutorials, etc.