andersbogsnes/pydata_iceberg_demo

Pydata Copenhagen Apache Iceberg Demo

This tutorial will introduce you to the Apache Iceberg table format and walk through how we can use it from a Python point-of-view to work with large amounts of data stored in object storage.

Prerequisites

To be able to follow along, make sure you've cloned the repository locally.

Docker

You will need to have Docker and Docker Compose installed to run the required images.

[Optional] AWS Account

Additionally, if you want to follow the AWS examples, you will need an AWS account and a set of access keys. Make a local copy of .env.example, rename it to .env, and add the required secrets.
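As a rough sketch, the finished .env might look something like the snippet below. The exact variable names come from .env.example in the repository, so treat the keys here as illustrative, not authoritative:

```
# Illustrative .env -- check .env.example for the actual variable names
AWS_ACCESS_KEY_ID=<your access key id>
AWS_SECRET_ACCESS_KEY=<your secret access key>
AWS_DEFAULT_REGION=<your region, e.g. eu-west-1>
```

Keep this file out of version control, as it contains credentials.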

You can use the Terraform files found in the tf folder to create the necessary AWS infrastructure.

Start the notebooks

Start the Docker Compose containers - this might take a while the first time, as the various images have to be downloaded:

docker compose up -d

Handy dandy links

Here are the links to the various services that Docker will spin up.

Jupyterlab

http://localhost:8080

Minio Console

http://localhost:9001

  • Username: minio
  • Password: minio1234

Dremio

http://localhost:9047

  • Username: dremio
  • Password: dremio123

Python Environment

While the notebooks themselves run within a dockerized environment, this tutorial includes a CLI to help with the setup.

With your favourite virtualenv manager, create a virtualenv and activate it, following the guide for your OS.

(I recommend uv)

With the venv activated, install the demo CLI:

pip install .

The demo command will now be available:

demo --help

Dataset

We are using the Steam Review dataset that can be found on Kaggle. This dataset is around 13 GB unzipped and contains 80,000 game reviews scraped from a Steam API endpoint.

There's also an API endpoint for mapping game_id to a name: https://api.steampowered.com/ISteamApps/GetAppList/v2
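That mapping can be sketched in Python using only the standard library. The helper names (`build_app_name_lookup`, `fetch_app_names`) are invented for this example, and the response shape assumed below (`{"applist": {"apps": [{"appid": ..., "name": ...}]}}`) matches what the endpoint returns at the time of writing, but verify it against the live API:

```python
import json
from urllib.request import urlopen

APP_LIST_URL = "https://api.steampowered.com/ISteamApps/GetAppList/v2"


def build_app_name_lookup(payload: dict) -> dict[int, str]:
    """Map each appid to its name, given the GetAppList JSON payload.

    Assumes the shape {"applist": {"apps": [{"appid": ..., "name": ...}, ...]}}.
    """
    return {app["appid"]: app["name"] for app in payload["applist"]["apps"]}


def fetch_app_names() -> dict[int, str]:
    # Network call - requires internet access to the Steam API.
    with urlopen(APP_LIST_URL) as response:
        return build_app_name_lookup(json.load(response))
```

Separating the parsing from the HTTP call keeps the lookup logic testable without network access.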

CLI

To download the data, run

demo data download

To set up the lake, run

demo lake setup

We're dealing with a lot of files, so this might take a while.

Optional: AWS Setup

If you want to follow along with the AWS part, make sure you have the AWS CLI installed and configured. Additionally, you will need Terraform.

Create the AWS infrastructure with Terraform:

terraform -chdir=tf init
terraform -chdir=tf apply

Upload data to S3 bucket

Uploading the full dataset to an S3 bucket will take some time due to its size. The easiest way to ensure that all the data is uploaded is to use the aws s3 sync command:

aws s3 sync data/SteamReviews2024 s3://<name of your bucket>/extract/reviews
