# project-dezoomcamp

Project for the Data Engineering Zoomcamp (dezoomcamp) with DataTalks.Club.

## Background Issue

(This is a hypothetical and synthetic requirement formulated for the zoomcamp project).

Build a data pipeline from a Spotify tracks dataset, derive popularity metrics for tracks and artists, and surface that information in a dashboard for stakeholders.

## Project high-level design

This project produces a pipeline which:

  1. Builds the cloud infrastructure using Terraform
  2. Pulls the raw data into GCP cloud storage
  3. Transforms the raw data
  4. Joins the artists and tracks tables to derive popularity metrics and writes them back into BigQuery (a sketch of the join appears right after this list)
  5. Produces dashboard tiles in Google Data Studio

This allows analysts to view the combined track and artist popularity information for quick review.
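Step 4 is the core transformation. Below is a minimal sketch of what the join might look like, expressed as a BigQuery query run from the shell; the dataset name (`spotify`) and the table and column names (`tracks`, `artists`, `id_artists`, `followers`) are assumptions for illustration, not the project's actual schema. In the project itself the transformation runs as a PySpark job on Dataproc; the SQL only illustrates the shape of the join.

```shell
# Hypothetical sketch of the artists/tracks join (step 4).
# Dataset, table, and column names are illustrative only.
bq query --use_legacy_sql=false \
'CREATE OR REPLACE TABLE spotify.track_artist_popularity AS
 SELECT
   t.name       AS track_name,
   t.popularity AS track_popularity,
   a.name       AS artist_name,
   a.followers  AS artist_followers
 FROM spotify.tracks t
 JOIN spotify.artists a
   ON a.id = t.id_artists'
```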

*(dataflow diagram)*

## Dataset

Spotify Dataset

## Technology choices

  1. Cloud: GCP
  2. Data lake: GCS bucket
  3. Infrastructure as code (IaC): Terraform
  4. Workflow orchestration: Airflow
  5. Data warehouse: BigQuery
  6. Transformation: Google Cloud Dataproc

## Installation: Google Cloud Infrastructure Using Terraform

```shell
# Refresh the service account's auth token for this session
gcloud auth application-default login --no-launch-browser

# Initialize the state file (.tfstate)
cd terraform/
terraform init

# Preview the changes in the new infra plan
terraform plan -var="project=<your project id>" \
  -var="region=<your region>" \
  -var="BQ_DATASET=<dataset name on BigQuery>" \
  -var="DATAPROC_CLUSTERNAME=<dataproc cluster name>" \
  -var="SERVICE_ACCOUNT=<service account from IAM>"

# Create the new infra
terraform apply -var="project=<your project id>" \
  -var="region=<your region>" \
  -var="BQ_DATASET=<dataset name on BigQuery>" \
  -var="DATAPROC_CLUSTERNAME=<dataproc cluster name>" \
  -var="SERVICE_ACCOUNT=<service account from IAM>"

# Delete the infra after your work, to avoid costs from running services
# (pass the same -var flags so destroy can resolve all required variables)
terraform destroy -var="project=<your project id>" \
  -var="region=<your region>" \
  -var="BQ_DATASET=<dataset name on BigQuery>" \
  -var="DATAPROC_CLUSTERNAME=<dataproc cluster name>" \
  -var="SERVICE_ACCOUNT=<service account from IAM>"
```
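After `terraform apply` succeeds, you can verify the provisioned resources from the same shell. A quick check, assuming the standard `gsutil`, `bq`, and `gcloud` tools are installed and authenticated:

```shell
# The data lake bucket should appear in the project's bucket list
gsutil ls

# The BigQuery dataset you passed as BQ_DATASET should appear here
bq ls

# The Dataproc cluster should be listed in your region
gcloud dataproc clusters list --region=<your region>
```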

## Installation: Google Dataproc

I tried building Dataproc with Terraform; the cluster was created successfully, but the PySpark job still failed on it. When I built the Dataproc cluster using the wizard in the GCP console, the PySpark job ran successfully, so I recommend creating the Dataproc cluster via the GCP console wizard.

*(dataproc-setup screenshot)*
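If you prefer the command line to the console wizard, a roughly equivalent cluster can usually be created with `gcloud`. This is an untested sketch for this project; the `--single-node` sizing and machine type are placeholder assumptions:

```shell
# Hypothetical CLI equivalent of the console wizard (untested for this project)
gcloud dataproc clusters create <dataproc cluster name> \
  --region=<your region> \
  --single-node \
  --master-machine-type=n1-standard-2

# Submit the PySpark transformation job to the cluster
gcloud dataproc jobs submit pyspark <path/to/job.py> \
  --cluster=<dataproc cluster name> \
  --region=<your region>
```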

## Setup Airflow

Don't forget to set your Google project in `.env` first:

```shell
GCP_PROJECT_ID=
GCP_GCS_BUCKET=
```

Then start Airflow:

```shell
docker-compose up
```

The Airflow webserver is available at http://localhost:8090 (user: `admin`, password: `admin`).
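Once the containers are up, you can confirm that Airflow has picked up the project's DAGs from the CLI. A small sketch, assuming the webserver service is named `airflow-webserver` as in the stock Airflow docker-compose file, with `<dag_id>` as a placeholder:

```shell
# List the DAGs the scheduler has parsed (service name is an assumption)
docker-compose exec airflow-webserver airflow dags list

# Trigger a run manually; replace <dag_id> with this project's DAG id
docker-compose exec airflow-webserver airflow dags trigger <dag_id>
```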

## Dashboard

  1. Total number of tracks
  2. Total number of artists
  3. Most popular song, by popularity
  4. Most popular artist, by followers (a sample query for this tile is sketched after this list)
  5. Artists with the most tracks
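As an illustration, the BigQuery query behind a tile like "Most popular artist, by followers" might look as follows; the `spotify.artists` table and its `name`/`followers` columns are assumed names, not the project's verified schema:

```shell
# Hypothetical query for the "most popular artist" tile (names are illustrative)
bq query --use_legacy_sql=false \
'SELECT name, followers
 FROM spotify.artists
 ORDER BY followers DESC
 LIMIT 10'
```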

*(View the dashboard on Google Data Studio)*