This tutorial will introduce you to the Apache Iceberg table format and walk through how to use it from Python to work with large amounts of data stored in object storage.
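To give a flavour of where we're heading, here is a minimal sketch of reading an Iceberg table with pyiceberg. The catalog URI, MinIO endpoint, credentials, and table name are assumptions for illustration; the actual values depend on the setup below.

```python
# A minimal sketch of reading an Iceberg table from Python with pyiceberg.
# The catalog URI, credentials, and table name are hypothetical -- substitute
# the values from your own setup.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "uri": "http://localhost:8181",          # hypothetical REST catalog
        "s3.endpoint": "http://localhost:9000",  # hypothetical MinIO endpoint
        "s3.access-key-id": "minio",
        "s3.secret-access-key": "minio1234",
    },
)

table = catalog.load_table("lake.reviews")  # hypothetical namespace.table
df = table.scan(limit=10).to_pandas()       # read a handful of rows into pandas
print(df.head())
```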
To be able to follow along, make sure you've cloned the repository locally.
You will need to have Docker and Docker Compose installed to run the required images.
Additionally, if you want to follow the AWS examples, you will need an AWS account and a set of access keys. Make a local copy of `.env.example`, rename it to `.env`, and add the required secrets.
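As a quick sanity check, you can verify from Python that the secrets are picked up. This is just a sketch: the variable names below are a guess, so check `.env.example` for the actual keys, and it assumes the python-dotenv package is installed.

```python
# Sketch: confirm the .env secrets load. The variable names are a guess --
# see .env.example for the real keys; requires the python-dotenv package.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    assert os.getenv(key), f"{key} is missing from .env"
```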
You can use the Terraform files found in the `tf` folder to create the necessary AWS infrastructure.
Start the Docker Compose containers. This might take a while the first time, as the various images have to be downloaded:

```sh
docker compose up -d
```
Docker Compose will spin up several services; log in with the following credentials:

- MinIO
  - Username: minio
  - Password: minio1234
- Dremio
  - Username: dremio
  - Password: dremio123
While the notebooks themselves run within a dockerized environment, this tutorial includes a CLI to help with the setup.
With your favourite virtualenv manager (I recommend uv), create a virtualenv and activate it, following the guide for your OS. With the venv activated, install the `demo` CLI:

```sh
pip install .
```
The `demo` command will now be available:

```sh
demo --help
```
We are using the Steam Reviews dataset that can be found on Kaggle. This dataset is around 13 GB unzipped and contains 80,000 game reviews scraped from a Steam API endpoint.
There's also an API endpoint for mapping a `game_id` to a name: https://api.steampowered.com/ISteamApps/GetAppList/v2
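If you're curious, building that mapping from Python takes only a few lines. A sketch, assuming the endpoint returns the usual `{"applist": {"apps": [...]}}` shape and that the requests package is available:

```python
# Sketch: build a game_id -> name lookup from the Steam app list.
# Assumes the response shape {"applist": {"apps": [{"appid": ..., "name": ...}]}}
# and that the requests package is installed.
import requests

resp = requests.get(
    "https://api.steampowered.com/ISteamApps/GetAppList/v2", timeout=30
)
resp.raise_for_status()

apps = resp.json()["applist"]["apps"]
id_to_name = {app["appid"]: app["name"] for app in apps}

print(id_to_name.get(570))  # 570 is Dota 2's app id
```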
To download the data, run:

```sh
demo data download
```
To set up the lake, run:

```sh
demo lake setup
```
We're dealing with a lot of files, so this might take a while.
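If you want to confirm the setup wrote data into MinIO, here is a hedged sketch using boto3 against MinIO's S3 API; the endpoint port, credentials, and bucket name are assumptions, so adjust them to whatever your compose file and setup actually use.

```python
# Sketch: peek at the lake bucket in MinIO via its S3-compatible API.
# The endpoint, credentials, and bucket name are assumptions -- adjust to
# match your docker-compose setup. Requires the boto3 package.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # hypothetical MinIO S3 endpoint
    aws_access_key_id="minio",
    aws_secret_access_key="minio1234",
)

for obj in s3.list_objects_v2(Bucket="lake", MaxKeys=5).get("Contents", []):
    print(obj["Key"], obj["Size"])
```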
If you want to follow along with the AWS part, make sure you have the AWS CLI installed and configured. Additionally, you will need Terraform:

```sh
terraform -chdir=tf init
terraform -chdir=tf apply
```
Uploading the full dataset to an S3 bucket will take some time. The easiest way to ensure that all the data is uploaded is to use the `aws s3 sync` command:

```sh
aws s3 sync data/SteamReviews2024 s3://<name of your bucket>/extract/reviews
```
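Once the sync finishes, a rough way to double-check from Python is to count the objects under the prefix with boto3 (a sketch; fill in your own bucket name):

```python
# Sketch: count the uploaded objects to sanity-check the sync.
# Fill in your bucket name; requires boto3 and configured AWS credentials.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
for page in paginator.paginate(
    Bucket="<name of your bucket>", Prefix="extract/reviews"
):
    total += page.get("KeyCount", 0)

print(f"{total} objects under extract/reviews")
```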