IMDB to TMDB, Distributed SQS+Lambda+S3 Downloader

Summary

IMDB is arguably the best-known cinema database available on the internet. It is so common that many IMDB-related datasets can be found online, e.g. Kaggle's IMDB 5000 Movie Dataset.

The Problem

IMDB provides snapshots of its databases on titles, cast, etc. However, it does not provide user reviews. Furthermore, scraping its webpages in any form is against its Terms of Use.

TMDB, an Alternative to IMDB

TMDB (The Movie Database), on the other hand, does provide user reviews through its API. It is even possible to look up a film by its imdb_id.
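As a rough illustration of that lookup, the sketch below builds the request URLs and reads the response, assuming TMDB's v3 "find by external ID" endpoint and its movie reviews endpoint. The helper names (build_find_url, build_reviews_url, first_movie_id) are hypothetical and are not taken from this repository; no network call is made here.

```python
from urllib.parse import urlencode

# Base URL of the TMDB v3 API.
TMDB_API_BASE = "https://api.themoviedb.org/3"


def build_find_url(imdb_id: str, api_key: str) -> str:
    """Build the URL that looks a title up by its IMDB id (e.g. 'tt0111161')."""
    query = urlencode({"api_key": api_key, "external_source": "imdb_id"})
    return f"{TMDB_API_BASE}/find/{imdb_id}?{query}"


def build_reviews_url(tmdb_id: int, api_key: str) -> str:
    """Build the URL that lists user reviews for a TMDB movie id."""
    query = urlencode({"api_key": api_key})
    return f"{TMDB_API_BASE}/movie/{tmdb_id}/reviews?{query}"


def first_movie_id(find_payload: dict):
    """Return the TMDB id of the first movie match, or None if nothing matched."""
    results = find_payload.get("movie_results") or []
    return results[0]["id"] if results else None
```

The two URLs would be fetched in sequence: first the find call to translate the IMDB id into a TMDB id, then the reviews call with that id.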

However, if for any reason you must stick to IMDB as your base dataset and collect information for a good portion of IMDB's 6,782,091 entries, you are doomed.

10% of 6,782,091 amounts to 678,209 API requests and, even if you are not rate limited, issuing them one at a time will still take days.
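To make the "days" estimate concrete, here is a back-of-the-envelope calculation. The half-second per request is purely an assumption for illustration (real latency depends on your network and on TMDB), and the workers parameter is a hypothetical stand-in for the level of parallelism.

```python
TOTAL_IMDB_ENTRIES = 6_782_091
SAMPLE_FRACTION = 0.10
ASSUMED_SECONDS_PER_REQUEST = 0.5  # assumption, not a measured value
SECONDS_PER_DAY = 86_400


def estimate_days(total_entries: int, fraction: float,
                  seconds_per_request: float, workers: int = 1):
    """Return (number of requests, estimated days) for a given parallelism."""
    requests_needed = int(total_entries * fraction)
    total_seconds = requests_needed * seconds_per_request / workers
    return requests_needed, total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for workers in (1, 100):
        count, days = estimate_days(TOTAL_IMDB_ENTRIES, SAMPLE_FRACTION,
                                    ASSUMED_SECONDS_PER_REQUEST, workers)
        print(f"{count} requests, {workers} worker(s): {days:.2f} days")
```

Under this assumption a single sequential worker needs roughly four days, while spreading the same requests over many concurrent workers brings that down proportionally.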

Solution

I've therefore created this script, which downloads TMDB movies by their IMDB ID with a good level of parallelism.

Apart from the extra data that TMDB makes available (the full release date, for example), we attach the matching IMDB ID (as idIMDB) to the TMDB movie JSON and save it in S3.
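A minimal sketch of that enrich-and-save step might look like the following. It assumes a boto3 S3 client is passed in (only its put_object call is used), and the object key layout tmdb/movies/<imdb_id>.json is a hypothetical choice, not necessarily the one this repository uses.

```python
import json


def attach_imdb_id(tmdb_movie: dict, imdb_id: str) -> dict:
    """Return a copy of the TMDB movie JSON with the IMDB id added as idIMDB."""
    enriched = dict(tmdb_movie)
    enriched["idIMDB"] = imdb_id
    return enriched


def save_movie(s3_client, bucket: str, tmdb_movie: dict, imdb_id: str) -> str:
    """Serialize the enriched movie and upload it; returns the object key."""
    key = f"tmdb/movies/{imdb_id}.json"  # hypothetical key layout
    body = json.dumps(attach_imdb_id(tmdb_movie, imdb_id)).encode("utf-8")
    # put_object is the standard boto3 S3 client call for a single upload.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

In real use the client would come from boto3.client("s3"); passing it in as a parameter keeps the function easy to exercise locally with a stub.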

Components & Resources

This solution is composed of the following components:

AWS SQS Queue that will receive all the requests to trigger the TMDB data download
AWS Lambda Function that will consume the messages from SQS and perform the download
AWS S3 Bucket, required for dumping the downloaded data
Fleet Launcher Jupyter Notebook, which prepares the messages and sends them to SQS
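The way these components fit together can be sketched roughly as below. This is an illustration under assumptions, not the repository's code: the message format ({"imdb_ids": [...]}), the function names, and the download callback are all hypothetical. What is standard is that SQS send_message_batch takes at most 10 entries per call, and that a Lambda triggered by SQS receives the message text under event["Records"][i]["body"].

```python
import json
from typing import Callable, Iterator, List


def chunked(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_sqs_entries(imdb_ids: List[str], ids_per_message: int) -> List[dict]:
    """Launcher side: one SQS entry per partition of IMDB ids (hypothetical format)."""
    return [
        {"Id": str(index), "MessageBody": json.dumps({"imdb_ids": partition})}
        for index, partition in enumerate(chunked(imdb_ids, ids_per_message))
    ]


def send_entries(sqs_client, queue_url: str, entries: List[dict]) -> None:
    """Launcher side: send entries in batches of 10, the SQS per-call limit."""
    for batch in chunked(entries, 10):
        sqs_client.send_message_batch(QueueUrl=queue_url, Entries=batch)


def handler(event: dict, context, download: Callable[[str], None] = print) -> dict:
    """Lambda side: read each SQS record and process every IMDB id in it.

    `download` is a hypothetical hook standing in for the real TMDB download.
    """
    processed = 0
    for record in event["Records"]:
        for imdb_id in json.loads(record["body"])["imdb_ids"]:
            download(imdb_id)
            processed += 1
    return {"processed": processed}
```

Because each message carries a whole partition of ids, the number of Lambda invocations (and therefore the degree of parallelism) can be tuned through ids_per_message.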

Small "CLI" (Command Line Interface)

The code also includes a small command-line interface that covers what is needed to develop and run this program:

python tdd development: creates the local config.ini that is responsible for keeping your TMDB API key and your data lake bucket name
python tdd deploy: interactively installs the AWS infrastructure for you
python tdd download: launches the downloader locally and downloads a single item, ideal for debugging
python tdd simulate: sends a message to SQS to download a single partition, ideal for testing the system in AWS
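For readers curious what a config.ini of this kind could look like in code, here is a small sketch using Python's standard configparser. The section and option names (TMDB/API_KEY, DATALAKE/BUCKET_NAME) are assumptions made for illustration; the file that python tdd development actually writes may be laid out differently.

```python
import configparser


def write_config(path: str, tmdb_api_key: str, bucket_name: str) -> None:
    """Persist the two settings to an INI file (hypothetical layout)."""
    # interpolation=None keeps '%' characters in values from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config["TMDB"] = {"API_KEY": tmdb_api_key}
    config["DATALAKE"] = {"BUCKET_NAME": bucket_name}
    with open(path, "w", encoding="utf-8") as handle:
        config.write(handle)


def read_config(path: str):
    """Load the settings back; returns (tmdb_api_key, bucket_name)."""
    config = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read returns the list of files it managed to parse.
    if not config.read(path, encoding="utf-8"):
        raise FileNotFoundError(f"config file not found: {path}")
    return config["TMDB"]["API_KEY"], config["DATALAKE"]["BUCKET_NAME"]
```

Keeping the API key in a local, git-ignored file like this avoids hard-coding secrets in the source.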

How to Run

Setup Development Environment

Clone the repository locally, and
Set up your development environment: you will be prompted for information such as your TMDB_API_KEY and your S3_BUCKET_NAME (for your data lake)

cd ~/[workspace_path]
git clone git@github.com:hudsonmendes/lambda-tmdb-distributed-downloader.git
cd lambda-tmdb-distributed-downloader
python tdd development

Deploy the lambda to your AWS account

Important: this step requires that you have already run aws configure.

Run the deployment code,
Tell where you want the system to be deployed (parameters), and
Check to see if the resources were properly created

# you must be connected to your amazon account
# aws configure

# here we will deploy the components to lambda
python tdd deploy

Contributions

Want to help make this better?

Send me a Pull Request
Ping me on twitter: http://twitter.com/hudsonmendes
