This tutorial will introduce you to the Apache Iceberg table format and walk through how to use it from Python to work with large amounts of data stored in object storage.
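To give a flavour of where we're heading, here is a minimal sketch of reading an Iceberg table with pyiceberg. The catalog URI, MinIO endpoint, credentials, and table name are assumptions for illustration; the actual values depend on the setup below.

```python
# A minimal sketch of reading an Iceberg table from Python with pyiceberg.
# The catalog URI, credentials, and table name are hypothetical -- substitute
# the values from your own setup.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "uri": "http://localhost:8181",          # hypothetical REST catalog
        "s3.endpoint": "http://localhost:9000",  # hypothetical MinIO endpoint
        "s3.access-key-id": "minio",
        "s3.secret-access-key": "minio1234",
    },
)

table = catalog.load_table("lake.reviews")  # hypothetical namespace.table
df = table.scan(limit=10).to_pandas()       # read a handful of rows into pandas
print(df.head())
```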
To be able to follow along, make sure you've cloned the repository locally.
You will need to have Docker and Docker Compose installed to run the required images.
Additionally, if you want to follow the AWS examples, you will need an AWS account and a set of access keys. Make a local copy of `.env.example`, rename it to `.env`, and add the required secrets.
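As a quick sanity check, you can verify from Python that the secrets are picked up. This is just a sketch: the variable names below are a guess, so check `.env.example` for the actual keys, and it assumes the python-dotenv package is installed.

```python
# Sketch: confirm the .env secrets load. The variable names are a guess --
# see .env.example for the real keys; requires the python-dotenv package.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    assert os.getenv(key), f"{key} is missing from .env"
```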
You can use the Terraform files found in the `tf` folder to create the necessary AWS infrastructure.
Start the Docker Compose containers. This might take a while the first time, as the various images have to be downloaded:

```sh
docker compose up -d
```
Docker Compose will spin up several services; log in with the following credentials:

- MinIO
  - Username: minio
  - Password: minio1234
- Dremio
  - Username: dremio
  - Password: dremio123
While the notebooks themselves run within a dockerized environment, this tutorial includes a CLI to help with the setup.
With your favourite virtualenv manager (I recommend uv), create a virtualenv and activate it, following the guide for your OS. With the venv activated, install the `demo` CLI:

```sh
pip install .
```
The `demo` command will now be available:

```sh
demo --help
```
We are using the Steam Reviews dataset that can be found on Kaggle. This dataset is around 13 GB unzipped and contains 80,000 game reviews scraped from a Steam API endpoint.
There's also an API endpoint for mapping a `game_id` to a name: https://api.steampowered.com/ISteamApps/GetAppList/v2
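If you're curious, building that mapping from Python takes only a few lines. A sketch, assuming the endpoint returns the usual `{"applist": {"apps": [...]}}` shape and that the requests package is available:

```python
# Sketch: build a game_id -> name lookup from the Steam app list.
# Assumes the response shape {"applist": {"apps": [{"appid": ..., "name": ...}]}}
# and that the requests package is installed.
import requests

resp = requests.get(
    "https://api.steampowered.com/ISteamApps/GetAppList/v2", timeout=30
)
resp.raise_for_status()

apps = resp.json()["applist"]["apps"]
id_to_name = {app["appid"]: app["name"] for app in apps}

print(id_to_name.get(570))  # 570 is Dota 2's app id
```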
To download the data, run:

```sh
demo data download
```
To set up the lake, run:

```sh
demo lake setup
```
We're dealing with a lot of files, so this might take a while.
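If you want to confirm the setup wrote data into MinIO, here is a hedged sketch using boto3 against MinIO's S3 API; the endpoint port, credentials, and bucket name are assumptions, so adjust them to whatever your compose file and setup actually use.

```python
# Sketch: peek at the lake bucket in MinIO via its S3-compatible API.
# The endpoint, credentials, and bucket name are assumptions -- adjust to
# match your docker-compose setup. Requires the boto3 package.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # hypothetical MinIO S3 endpoint
    aws_access_key_id="minio",
    aws_secret_access_key="minio1234",
)

for obj in s3.list_objects_v2(Bucket="lake", MaxKeys=5).get("Contents", []):
    print(obj["Key"], obj["Size"])
```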
If you want to follow along with the AWS part, make sure you have the AWS CLI installed and configured. Additionally, you will need Terraform:

```sh
terraform -chdir=tf init
terraform -chdir=tf apply
```
Uploading the full dataset to an S3 bucket will take some time. The easiest way to ensure that all the data is uploaded is to use the `aws s3 sync` command:

```sh
aws s3 sync data/SteamReviews2024 s3://<name of your bucket>/extract/reviews
```
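Once the sync finishes, a rough way to double-check from Python is to count the objects under the prefix with boto3 (a sketch; fill in your own bucket name):

```python
# Sketch: count the uploaded objects to sanity-check the sync.
# Fill in your bucket name; requires boto3 and configured AWS credentials.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
for page in paginator.paginate(
    Bucket="<name of your bucket>", Prefix="extract/reviews"
):
    total += page.get("KeyCount", 0)

print(f"{total} objects under extract/reviews")
```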