Skip to content

Data transformation and analysis using SQL . Visualisation using Gnuplot.

Notifications You must be signed in to change notification settings

alecfei/imdb_data

Repository files navigation

flowchart

Information courtesy of IMDb (https://www.imdb.com). Used with permission.


Table of Contents

  1. Introduction
  2. Import Data into DuckDB
  3. Quick Visualisation with YouPlot
  4. Data Migration to PostgreSQL
  5. Data Transforamtion and Analysis in PostgreSQL
  6. Visualisation using Gnuplot
  7. References

Introduction

In this project, the data released by IMDb, an extremely popular source for movie, TV and celebrity content, were analysed using SQL (DuckDB and PostgreSQL). Visualisations were presented by Gnuplot and Youplot (an interesting and convenient tool for visualing via DuckDB CLI).

Import data into DuckDB

./duckdb your_database.duckdb
CREATE TABLE your_tables AS
  SELECT * FROM read_csv('/your_directory/*.*.tsv',
    delim='\t',
    null_padding=true,
    sample_size=-1);
.tables
SELECT * FROM your_tables;

Quick visualisation using Youplot

title type counts

./duckdb your_database.duckdb -s \
"COPY (SELECT titleType, count(titleType) AS total_counts \
FROM your_table GROUP BY 1 ORDER BY 2 DESC) \
TO '/dev/stdout' WITH (format 'csv', header)"  \
| uplot bar -d, -H -c blue -t "Counts of the Title Types"

barplot

movie production trend

./duckdb your_database.duckdb -s \
"COPY (SELECT startYear AS year, COUNT(titleType) AS 'numbers of productions' \
FROM your_table WHERE titleType = 'tvMovie' AND startYear IS NOT NULL \
GROUP BY startYear) TO '/dev/stdout' WITH (format 'csv', header)" \
| uplot line -d, -w 55 -h 15 -t "Movie Production Changes Over the Years" \
--xlim 1920,2026 --ylim 0,4000 -c blue

lineplot

average rating distribution

./duckdb your_database.duckdb -s \
"COPY (SELECT averageRating FROM your_table) \
TO '/dev/stdout' WITH (format 'csv', header)" \
| uplot hist --nbins 20 -c blue --title "Distribution of the Averageratings"

lineplot

running time distribution

./duckdb your_database.duckdb -s "
COPY (
    SELECT 
        runtimeMinutes
    FROM your_table 
    WHERE runtimeMinutes IS NOT NULL) TO '/dev/stdout' WITH (format 'csv', header)
" | uplot boxplot -H -d, -t 'Overall Running Time Distribution' 
-c blue --xlabel 'minutes' --xlim 0,150
./duckdb your_database.duckdb -s "
COPY (
    SELECT 
        CASE WHEN titleType = 'movie' THEN runtimeMinutes ELSE NULL END AS movie,
    FROM your_table 
    WHERE runtimeMinutes IS NOT NULL
) TO '/dev/stdout' WITH (format 'csv', header)
" | cut -f1 -d, | uplot boxplot -H -d, -t 'Movie Running Time Distribution' 
-c blue --xlabel 'minutes' --xlim 0,150

overall V.S. movie

lineplot lineplot

Data migration from DuckDB to PostgreSQL

Access database through DuckDB CLI

./duckdb your_databse.duckdb

Load postgres extension

#INSTALL postgres;
LOAD postgres;

Create connection with PostgreSQL Database

CREATE SECRET (
      TYPE POSTGRES,
      HOST '127.0.0.1',
      PORT 5432,
      DATABASE your_database,
      USER 'your_user',
      PASSWORD 'your_password'
  );

ATTACH '' AS postgres_db (TYPE POSTGRES);

Create table in PostgresSQL through the connection

CREATE TABLE postgres_db.your_tables (
      *** VARCHAR,
      *** VARCHAR,
      *** INT,
      *** INT
);

Verify the existence of the table in Postgres

#SHOW ALL TABLES;
SELECT * FROM postgres_db.your_tables;

Copy from DuckDB to Postgres

INSERT INTO postgres_db.your_tables
  SELECT * FROM your_tables;

Data transformation and analysis in PostgreSQL

Data transformation

Data analysis 🔗

Visualisation using Gnuplot

1. Profession analysis

profession_count

2. Age distribution

age_distribution_all

Comparing to the distribution excluding people who are still alive

age_distribution

Key takeaways:

  • Most people who are currently working or have worked in the industry are between the ages of 40 and 50.
  • The average lifespan of people who worked in the industry is around 80 years.

3. Localised counts

localised_quantity

4. Most voted productions

most_voted

5. Genre counts of the productions

genre_count

References

IMDb Non-Commercial Datasets
Youplot
DuckDB
PostgreSQL
Gnuplot
Country Codes

About

Data transformation and analysis using SQL . Visualisation using Gnuplot.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published