IMDB is arguably the most widely used cinema database available on the internet. It is so common that many IMDB-related datasets can be found online, e.g. Kaggle's IMDB 5000 Movie Dataset.
IMDB provides snapshots of its databases on titles, casting, etc. However, it does not provide user reviews. Furthermore, scraping its webpages in any form is against its Terms of Use.
TMDB (The Movie Database), on the other hand, does provide user reviews through its API. It is even possible to look up a film by its imdb_id.
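As a minimal sketch of that lookup, the snippet below builds the request URL for TMDB's v3 find endpoint with external_source=imdb_id and picks the first movie out of a response. The helper names (build_find_url, first_movie) are illustrative and not taken from this repository, and the sample response is hand-written for the example rather than real API output.

```python
from typing import Optional
from urllib.parse import quote, urlencode

TMDB_API_BASE = "https://api.themoviedb.org/3"


def build_find_url(imdb_id: str, api_key: str) -> str:
    """Build the URL that asks TMDB for the entry matching an IMDB id."""
    query = urlencode({"api_key": api_key, "external_source": "imdb_id"})
    return f"{TMDB_API_BASE}/find/{quote(imdb_id)}?{query}"


def first_movie(find_response: dict) -> Optional[dict]:
    """Return the first movie in a find response, or None if there is none."""
    movies = find_response.get("movie_results") or []
    return movies[0] if movies else None


# Hand-written sample shaped like a find response (not real API output).
sample = {"movie_results": [{"id": 550, "title": "Fight Club"}], "tv_results": []}

url = build_find_url("tt0137523", "YOUR_API_KEY")
movie = first_movie(sample)
```

Once the TMDB id is known, the user reviews can be requested from the movie's reviews endpoint (/movie/{movie_id}/reviews) with the same API key.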
However, if for any reason you must stick to IMDB as your base dataset and collect information for a good portion of IMDB's 6,782,091 entries, you are doomed.
10% of 6,782,091 amounts to 678,209 API requests, and even if you are not rate limited, running them one at a time will still take days.
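To make the "days" figure concrete, here is a rough back-of-the-envelope estimate. The half-second average per request and the worker count are assumptions chosen for illustration, not measurements of the TMDB API.

```python
TOTAL_ENTRIES = 6_782_091    # IMDB entry count quoted above
SECONDS_PER_REQUEST = 0.5    # ASSUMPTION: average round trip per API request
SECONDS_PER_DAY = 86_400


def sequential_days(requests: int, seconds_per_request: float) -> float:
    """Days needed when requests are issued one after another."""
    return requests * seconds_per_request / SECONDS_PER_DAY


def parallel_days(requests: int, seconds_per_request: float, workers: int) -> float:
    """Days needed when the same work is split evenly across workers."""
    return sequential_days(requests, seconds_per_request) / workers


requests_needed = TOTAL_ENTRIES // 10  # 10% of the entries, rounded down
one_worker = sequential_days(requests_needed, SECONDS_PER_REQUEST)
hundred_workers = parallel_days(requests_needed, SECONDS_PER_REQUEST, workers=100)
```

Under these assumptions a single worker needs just under four days, while one hundred concurrent workers would finish in about an hour, which is the motivation for distributing the download.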
I therefore created this script, which downloads TMDB movies by their IMDB id with a good level of parallelism.
Apart from the extra data that TMDB makes available (such as the full release date), we attach the IMDB ID that was found (as idIMDB) to the TMDB movie JSON and save it in S3.
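The enrich-and-save step could look roughly like the sketch below. The idIMDB field name comes from the description above; the function names and the S3 key layout (tmdb/movies/<id>.json) are hypothetical choices for this example, and the client is passed in so that any object with a boto3-style put_object method works.

```python
import json


def attach_imdb_id(tmdb_movie: dict, imdb_id: str) -> dict:
    """Return a copy of the TMDB movie JSON with the IMDB id stored as idIMDB."""
    return {**tmdb_movie, "idIMDB": imdb_id}


def save_movie(s3_client, bucket: str, movie: dict) -> str:
    """Upload the movie as a JSON object and return the key that was used.

    The key layout below is a hypothetical choice, not the repository's.
    """
    key = f"tmdb/movies/{movie['idIMDB']}.json"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(movie).encode("utf-8"),
        ContentType="application/json",
    )
    return key


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   movie = attach_imdb_id(tmdb_movie, "tt0137523")
#   save_movie(boto3.client("s3"), "my-data-lake-bucket", movie)
```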
This solution is composed of the following components:
- AWS SQS Queue that receives all the requests that trigger the TMDB data download
- AWS Lambda Function that consumes the messages from SQS and performs the download
- AWS S3 Bucket, required for storing the downloaded data
- Fleet Launcher Jupyter Notebook that prepares the messages and sends them to SQS
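The way these components fit together can be sketched as follows: the launcher splits the IMDB ids into partitions, sends one SQS message per partition, and the Lambda function reads each message body and downloads the ids it contains. The message format ({"imdb_ids": [...]}), the function names, and the download callback are assumptions made for this example; the actual repository may structure its messages differently.

```python
import json
from typing import Callable, Iterator, List


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_messages(imdb_ids: List[str], ids_per_message: int) -> List[str]:
    """One message body per partition of IMDB ids (hypothetical format)."""
    return [json.dumps({"imdb_ids": part}) for part in chunk(imdb_ids, ids_per_message)]


def send_messages(sqs_client, queue_url: str, bodies: List[str]) -> None:
    """Send the bodies with send_message_batch, which takes at most 10 entries."""
    for batch in chunk(bodies, 10):
        entries = [{"Id": str(i), "MessageBody": body} for i, body in enumerate(batch)]
        sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)


def lambda_handler(event: dict, context, download: Callable[[str], None] = print) -> dict:
    """Consume SQS records; `download` stands in for the real per-id download."""
    count = 0
    for record in event["Records"]:
        for imdb_id in json.loads(record["body"])["imdb_ids"]:
            download(imdb_id)
            count += 1
    return {"downloaded": count}
```

Because each message is independent, Lambda can process many partitions concurrently, which is where the parallelism comes from.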
The current code also includes a small command-line tool that covers what is needed to develop and run this program:
- python tdd development: creates the local config.ini that is responsible for keeping your TMDB API key and your data lake bucket name
- python tdd deploy: interactively installs the AWS infrastructure for you
- python tdd download: launches the downloader locally and downloads a single item, ideal for debugging
- python tdd simulate: sends the message to SQS to download a single partition, ideal for testing the system in AWS
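Running python tdd <command> works because Python executes the __main__.py of a directory passed to it. A command dispatcher of this shape could be sketched with argparse as below; this is an illustration of the pattern, not the repository's actual implementation, and the help texts are paraphrased from the list above.

```python
import argparse
from typing import List, Optional

# Sub-command names taken from the list above; help texts are paraphrased.
COMMANDS = {
    "development": "create the local configuration file",
    "deploy": "install the AWS infrastructure",
    "download": "download a single item locally",
    "simulate": "send the message for a single partition to SQS",
}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one sub-command per supported action."""
    parser = argparse.ArgumentParser(prog="tdd")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        subparsers.add_parser(name, help=help_text)
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    """Parse the arguments and return the chosen command name."""
    args = build_parser().parse_args(argv)
    return args.command
```

A real entry point would map each command name to the function that performs it; here main only returns the name so the parsing can be tried in isolation.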
- Clone the repository locally, and
- Set up your development environment: you will be prompted for information such as your TMDB_API_KEY and your S3_BUCKET_NAME (for your data lake)
cd ~/[workspace_path]
git clone git@github.com:hudsonmendes/lambda-tmdb-distributed-downloader.git
cd lambda-tmdb-distributed-downloader
python tdd development
Important: this step requires that you have already run aws configure.
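The values collected during the development step can then be read back by the code. The sketch below assumes they are stored in standard INI format; the [tdd] section name and the exact file layout are hypothetical, and only the two setting names come from the prompt described above.

```python
import configparser

# Hypothetical file contents; the real config.ini layout may differ.
SAMPLE_CONFIG = """
[tdd]
TMDB_API_KEY = abc123
S3_BUCKET_NAME = my-data-lake
"""


def load_settings(text: str) -> dict:
    """Parse INI text and return the two settings the downloader needs."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["tdd"]  # hypothetical section name
    return {
        "tmdb_api_key": section["TMDB_API_KEY"],
        "s3_bucket_name": section["S3_BUCKET_NAME"],
    }


settings = load_settings(SAMPLE_CONFIG)
```

To read an actual file instead of a string, parser.read("config.ini") can replace read_string.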
- Run the deployment code,
- Tell it where you want the system to be deployed (parameters), and
- Check that the resources were properly created
# you must be connected to your amazon account
# aws configure
# here we will deploy the components to lambda
python tdd deploy
Want to help make this better?
- Send me a Pull Request
- Ping me on Twitter: http://twitter.com/hudsonmendes