
Information courtesy of IMDb (https://www.imdb.com). Used with permission.
- Introduction
- Import Data into DuckDB
- Quick Visualisation with YouPlot
- Data Migration to PostgreSQL
- Data Transformation and Analysis in PostgreSQL
- Visualisation using Gnuplot
- References
In this project, the datasets released by IMDb, a widely used source for movie, TV and celebrity content, were analysed using SQL (DuckDB and PostgreSQL). Visualisations were produced with Gnuplot and YouPlot (a convenient tool for plotting directly from the DuckDB CLI).
./duckdb your_database.duckdb
CREATE TABLE your_tables AS
SELECT * FROM read_csv('/your_directory/*.*.tsv',
delim='\t',
null_padding=true,
sample_size=-1);
.tables
SELECT * FROM your_tables;
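Before plotting, it can help to check what the import produced. The following is a minimal sketch in the DuckDB CLI, reusing the placeholder table name `your_tables` from above; it only inspects the result and changes nothing.

```sql
-- Show the column names and the types DuckDB inferred during the import.
DESCRIBE your_tables;

-- Count the imported rows as a quick sanity check.
SELECT count(*) AS row_count FROM your_tables;
```

The IMDb files mark missing values with `\N`. If numeric columns such as `startYear` show up as `VARCHAR`, passing `nullstr='\N'` to `read_csv` may be worth trying, though whether it is needed depends on the files and the DuckDB version.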
./duckdb your_database.duckdb -s \
"COPY (SELECT titleType, count(titleType) AS total_counts \
FROM your_table GROUP BY 1 ORDER BY 2 DESC) \
TO '/dev/stdout' WITH (format 'csv', header)" \
| uplot bar -d, -H -c blue -t "Counts of the Title Types"
./duckdb your_database.duckdb -s \
"COPY (SELECT startYear AS year, COUNT(titleType) AS 'numbers of productions' \
FROM your_table WHERE titleType = 'tvMovie' AND startYear IS NOT NULL \
GROUP BY startYear) TO '/dev/stdout' WITH (format 'csv', header)" \
| uplot line -d, -w 55 -h 15 -t "Movie Production Changes Over the Years" \
--xlim 1920,2026 --ylim 0,4000 -c blue
./duckdb your_database.duckdb -s \
"COPY (SELECT averageRating FROM your_table) \
TO '/dev/stdout' WITH (format 'csv', header)" \
| uplot hist --nbins 20 -c blue --title "Distribution of the Average Ratings"
./duckdb your_database.duckdb -s "
COPY (
SELECT
runtimeMinutes
FROM your_table
WHERE runtimeMinutes IS NOT NULL) TO '/dev/stdout' WITH (format 'csv', header)
" | uplot boxplot -H -d, -t 'Overall Running Time Distribution' \
-c blue --xlabel 'minutes' --xlim 0,150
./duckdb your_database.duckdb -s "
COPY (
SELECT
CASE WHEN titleType = 'movie' THEN runtimeMinutes ELSE NULL END AS movie,
FROM your_table
WHERE runtimeMinutes IS NOT NULL
) TO '/dev/stdout' WITH (format 'csv', header)
" | cut -f1 -d, | uplot boxplot -H -d, -t 'Movie Running Time Distribution' \
-c blue --xlabel 'minutes' --xlim 0,150
Overall vs. movie
./duckdb your_database.duckdb
-- INSTALL postgres;
LOAD postgres;
CREATE SECRET (
TYPE POSTGRES,
HOST '127.0.0.1',
PORT 5432,
DATABASE your_database,
USER 'your_user',
PASSWORD 'your_password'
);
ATTACH '' AS postgres_db (TYPE POSTGRES);
CREATE TABLE postgres_db.your_tables (
*** VARCHAR,
*** VARCHAR,
*** INT,
*** INT
);
-- SHOW ALL TABLES;
SELECT * FROM postgres_db.your_tables;
INSERT INTO postgres_db.your_tables
SELECT * FROM your_tables;
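After the insert, a row-count comparison is one simple way to confirm that the migration was complete. This is a sketch run from the same DuckDB session, using the placeholder names `your_tables` and `postgres_db` from above.

```sql
-- Compare the row counts of the local DuckDB table and the attached PostgreSQL table.
SELECT
    (SELECT count(*) FROM your_tables)             AS duckdb_rows,
    (SELECT count(*) FROM postgres_db.your_tables) AS postgres_rows;
```

If the two numbers differ, the insert was probably interrupted or was run more than once.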
- Name transformation: split the professions and the known-for titles of the people into new tables
- Genre transformation: split the genres into a new table
- Crew member transformation: split the directors and writers of the programs into new tables
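The transformations above all turn a comma-separated column into one row per value. A minimal PostgreSQL sketch for the genre case follows. The table and column names (`title_basics`, `title_genres`, `tconst`, `genres`) are assumptions for illustration, and the sketch assumes missing genres were loaded as NULL; it is not necessarily the exact statement used in this project.

```sql
-- Assumed source table: title_basics(tconst, genres),
-- where genres holds values such as 'Action,Comedy'.
CREATE TABLE title_genres AS
SELECT
    tconst,
    unnest(string_to_array(genres, ',')) AS genre  -- one output row per genre
FROM title_basics
WHERE genres IS NOT NULL;
```

The same pattern should carry over to the professions, known-for titles, directors and writers columns, since IMDb stores each of them as a comma-separated list.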
Data analysis
1. Profession analysis
2. Age distribution
Compared with the distribution excluding people who are still alive
Key takeaways:
- Most people who are currently working or have worked in the industry are between the ages of 40 and 50.
- The average lifespan of people who worked in the industry is around 80 years.
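An age distribution like the one summarised above could be computed along the following lines. This is a PostgreSQL sketch under stated assumptions: a hypothetical table `name_basics` with integer columns `birth_year` and `death_year`, and a missing death year is taken to mean the person is still alive, which the dataset does not guarantee.

```sql
-- Assumed table: name_basics(nconst, birth_year, death_year).
-- Age is taken at death, or at the current year when death_year is NULL.
SELECT
    (COALESCE(death_year, EXTRACT(YEAR FROM CURRENT_DATE)::int) - birth_year) / 10 * 10 AS age_band,
    COUNT(*) AS people
FROM name_basics
WHERE birth_year IS NOT NULL
-- AND death_year IS NOT NULL  -- uncomment to exclude people who are still alive
GROUP BY age_band
ORDER BY age_band;
```

Integer division groups the ages into ten-year bands (40 covers ages 40 to 49). Records with a very old birth year and no death year produce implausible ages, so an upper bound on age may be needed in practice.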
3. Localised counts
4. Most voted productions
5. Genre counts of the productions
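Once the genres sit in their own table, the genre counts reduce to a simple aggregation. This PostgreSQL sketch assumes a hypothetical table `title_genres(tconst, genre)` with one row per title and genre; the names are for illustration only.

```sql
-- Count how many productions carry each genre, most common first.
SELECT
    genre,
    COUNT(*) AS productions
FROM title_genres
GROUP BY genre
ORDER BY productions DESC;
```

Because a title can have several genres, the counts add up to more than the number of titles.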
IMDb Non-Commercial Datasets
YouPlot
DuckDB
PostgreSQL
Gnuplot
Country Codes