-
Notifications
You must be signed in to change notification settings - Fork 29
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit e4f86bd
Showing
6 changed files
with
5,366 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Modern Data Lake Storage Layers | ||
|
||
This repository contains supporting assets for my research in modern Data Lake storage layers like Apache Hudi, Apache Iceberg, and Delta Lake. | ||
|
||
Specifically, there's a [CloudFormation template](cloudformation/emr-studio-cluster.cfn.yaml) to create an EMR cluster and EMR Studio with the necessary requirements and Jupyter notebooks with the example walkthroughs. | ||
|
||
## Pre-requisites | ||
|
||
You'll need an AWS Account in which you have administrator privileges and the ability to deploy a CloudFormation template. The template **will** create an EMR Cluster and S3 bucket that will incur charges - be sure to either shut down the cluster when done or delete the CloudFormation stack. In order to delete the CloudFormation stack, you'll need to: | ||
|
||
- Manually delete any EMR Studio Workspaces you created | ||
- Manually empty the S3 bucket created by CloudFormation | ||
- Manually delete the VPC created by CloudFormation due to auto-created rules | ||
|
||
## Overview | ||
|
||
The included CloudFormation template creates a new VPC and EMR Cluster for you to be able to run the notebooks. An EMR Studio is also created and you can find the Studio URL in the `Outputs` tab of your CloudFormation Stack. | ||
|
||
Once the stack is done creating, you'll need to navigate to EMR Studio and create a new workspace attached to the "data-lakes" cluster. | ||
|
||
 | ||
|
||
Inside the workspace you either upload each notebook individually from the [notebooks/](notebooks/) folder or simply connect to this repository by using the "Git" icon on the left-hand side. |
Oops, something went wrong.