Sparkify, a startup, wants to analyze user activity based on the data collected via its new music streaming app. The collected data is spread across several JSON files, which makes it inconvenient to query and analyze. The analytics team is particularly interested in understanding listening trends, i.e. what songs users are listening to.
The project builds an ETL pipeline that creates a Postgres database with tables and loads the data from the JSON files into those tables, making the data easier to query and analyze. The whole pipeline is implemented in Python and SQL.
Sparkify's music streaming app collects song and user activity data in JSON files, located in the following two directories:
- "data/log_data"
- "data/song_data"
The format of the JSON files is as follows:
- Song Data
{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
- Log Data
{
"artist":"Des'ree",
"auth":"Logged In",
"firstName":"Kaylee",
"gender":"F",
"itemInSession":1,
"lastName":"Summers",
"length":246.30812,
"level":"free",
"location":"Phoenix-Mesa-Scottsdale, AZ",
"method":"PUT",
"page":"NextSong",
"registration":1540344794796.0,
"sessionId":139,
"song":"You Gotta Be",
"status":200,
"ts":1541106106796,
"userAgent":""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"",
"userId":"8"
}
The JSON files are read using the read_json() function from the pandas library.
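As an illustration, here is a minimal sketch of how a single file from each directory could be loaded into a pandas DataFrame. The file paths are placeholders, and the sketch assumes each song file holds a single JSON object on one line while each log file holds one JSON object per line (hence lines=True):

```python
import pandas as pd

# Placeholder paths; real files live under data/song_data and data/log_data.
song_filepath = "data/song_data/example_song.json"
log_filepath = "data/log_data/example_log.json"

# Each song file is assumed to contain a single JSON object on one line.
song_df = pd.read_json(song_filepath, lines=True)

# Each log file contains one JSON object per line, so lines=True is required.
log_df = pd.read_json(log_filepath, lines=True)

print(song_df[["song_id", "title", "artist_id", "year", "duration"]].head())
print(log_df[["ts", "userId", "level", "song", "sessionId"]].head())
```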
The data is organised into the following star schema:
The Fact Table:
- Table: songplays
Column Name | Type | Constraints |
---|---|---|
songplay_id | SERIAL | PRIMARY KEY |
start_time | TIMESTAMP | |
user_id | INT | |
level | VARCHAR | |
song_id | VARCHAR | |
artist_id | VARCHAR | |
session_id | INT | |
location | VARCHAR | |
user_agent | VARCHAR | |
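To make the fact table definition concrete, below is a hedged sketch of the DDL it implies, written as a query string the way psycopg2 expects it; the variable name is illustrative and the statement actually used by create_tables.py may differ:

```python
# Illustrative sketch of the songplays fact-table DDL; the statement actually
# executed by create_tables.py may differ in details such as constraints.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP,
    user_id     INT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```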
The Dimension Tables:
- Table: users
Column Name | Type | Constraints |
---|---|---|
user_id | INT | PRIMARY KEY |
first_name | VARCHAR | NOT NULL |
last_name | VARCHAR | NOT NULL |
gender | CHAR(1) | NOT NULL |
level | VARCHAR | NOT NULL |
- Table: songs
Column Name | Type | Constraints |
---|---|---|
song_id | VARCHAR | PRIMARY KEY |
title | VARCHAR | NOT NULL |
artist_id | VARCHAR | NOT NULL |
year | INT | NOT NULL |
duration | FLOAT | NOT NULL |
- Table: artists
Column Name | Type | Constraints |
---|---|---|
artist_id | VARCHAR | PRIMARY KEY |
name | VARCHAR | |
location | VARCHAR | |
latitude | FLOAT | |
longitude | FLOAT | |
- Table: time
Column Name | Type | Constraints |
---|---|---|
start_time | TIMESTAMP | NOT NULL |
hour | INT | NOT NULL |
day | INT | NOT NULL |
week | INT | NOT NULL |
month | INT | NOT NULL |
year | INT | NOT NULL |
weekday | INT | NOT NULL |
The tables songs and artists are populated from the song data. From the log data, the dimension tables time and users are populated, as well as the fact table songplays.
The adopted schema ensures that all the required information can be retrieved using simple queries. Moreover, the ETL process ensures that the database contains data relevant to the analysis.
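As an illustration of how the log data feeds the time and users tables, here is a minimal sketch (not the project's actual etl.py) of the transformation step with pandas; the file path is a placeholder and the real script may differ:

```python
import pandas as pd

# Placeholder path; real log files live under data/log_data.
log_df = pd.read_json("data/log_data/example_log.json", lines=True)

# Only "NextSong" events correspond to actual song plays.
log_df = log_df[log_df["page"] == "NextSong"]

# "ts" is a Unix timestamp in milliseconds; convert it to a datetime
# and derive the columns of the time dimension table from it.
t = pd.to_datetime(log_df["ts"], unit="ms")
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})

# The users dimension comes directly from the log columns.
user_df = log_df[["userId", "firstName", "lastName", "gender", "level"]]
```

The resulting rows are then inserted into the corresponding tables with psycopg2, which is what etl.py does for the full data set.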
The project is implemented using Python 3 and SQL. The following Python libraries are used in the project:
- psycopg2
- pandas
To run the project locally, make sure the required libraries are installed.
Run the following script to create the database tables:
python create_tables.py
Run the following script to perform the ETL process and populate the tables in the database:
python etl.py
Data in the database can be verified using the test.ipynb Jupyter notebook.
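In addition to the notebook, a quick sanity check can be run from Python; the connection parameters below are assumptions and should be adjusted to the local setup:

```python
import psycopg2

# Connection parameters are assumptions; adjust them to your local setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Count the rows loaded into each table as a basic sanity check.
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])

conn.close()
```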