
Add Getting Started Guide for S3 #145

Merged: 4 commits, Jul 30, 2024
84 changes: 81 additions & 3 deletions docs/getting_started.md → docs/getting_started.mdx
@@ -8,14 +8,89 @@ tags:
- Iceberg
sidebar_position: 2
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

<Tabs>
<TabItem value="s3" label="S3" default>
# OpenHouse with Spark & S3

In this guide, we will quickly set up a running environment and experiment with some simple SQL commands. Our
environment will include all the core OpenHouse services, such as the [Catalog Service](./intro.md#catalog-service),
[House Table service](./intro.md#house-table-service), and [others](./intro.md#control-plane-for-tables),
[a Spark 3.1 engine](https://spark.apache.org/releases/spark-release-3-1-1.html), and
a [MinIO S3 instance](https://min.io/docs/minio/container/index.html).
In this walkthrough, we will create some tables on OpenHouse, insert data into them, and query the data.
For more information on the various Docker environments and how to set them up,
please see the [SETUP.md](https://github.com/linkedin/openhouse/blob/main/SETUP.md) guide.

In the optional section that follows, you can learn more about some simple GRANT/REVOKE commands and how
OpenHouse manages access control.
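
As a quick preview, such a statement looks roughly like the following sketch, run from the spark-shell set up
below (`openhouse.db.tb` and `user_a` are placeholder names; the exact grammar is covered in that section):

```shell
scala> spark.sql("GRANT SELECT ON TABLE openhouse.db.tb TO user_a")
```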

### Prerequisites
- [Docker CLI](https://docs.docker.com/get-docker/)
- [Docker Compose CLI](https://github.com/docker/compose-cli/blob/main/INSTALL.md)

## Create and write to OpenHouse Tables
### Get environment ready
First, clone the [OpenHouse GitHub repository](https://github.com/linkedin/openhouse) and
run the `./gradlew build` command at the root directory. After the command succeeds, you should see a `BUILD SUCCESSFUL`
message.
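
A sketch of the clone step, assuming HTTPS access to GitHub:

```shell
git clone https://github.com/linkedin/openhouse.git
cd openhouse
```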

```shell
openhouse$main> ./gradlew build
```

Next, execute the following command to bring up the Docker containers for the OpenHouse services, Spark, and S3:
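
```shell
openhouse$main> docker compose -f infra/recipes/docker-compose/oh-s3-spark/docker-compose.yml up -d --build
```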

### Run SQL commands
Let us execute some basic SQL commands to create a table, add data, and query it.

First, log in to the driver node and start the spark-shell.
```shell
oh-hadoop-spark$main> docker exec -it local.spark-master /bin/bash

openhouse@0a9ed5853291:/opt/spark$ bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18 \
--jars openhouse-spark-runtime_2.12-*-all.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000 \
--conf spark.sql.catalog.openhouse.s3.access-key-id=admin \
--conf spark.sql.catalog.openhouse.s3.secret-access-key=password \
--conf spark.sql.catalog.openhouse.s3.path-style-access=true \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) \
--conf spark.sql.catalog.openhouse.cluster=LocalS3Cluster
```
:::note
The configuration `spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080` points to the Docker container
running the [OpenHouse Catalog Service](./intro.md#catalog-service).
:::
:::note
The configuration `spark.sql.catalog.openhouse.io-impl` is set to `org.apache.iceberg.aws.s3.S3FileIO` in order to
enable IO operations on S3. Parameters for this connection are configured via the `spark.sql.catalog.openhouse.s3.*` prefix.
:::
:::note
You can access the MinIO UI at `http://localhost:9871` on your host machine and inspect the state of the objects
created for your table. The username is `admin` and the password is `password` in the MinIO Docker setup.
:::

</TabItem>
<TabItem value="hdfs" label="HDFS">
# OpenHouse with Spark & HDFS

In this guide, we will quickly set up a running environment and experiment with some simple SQL commands. Our
environment will include all the core OpenHouse services, such as the [Catalog Service](./intro.md#catalog-service),
[House Table service](./intro.md#house-table-service), and [others](./intro.md#control-plane-for-tables),
[a Spark 3.1 engine](https://spark.apache.org/releases/spark-release-3-1-1.html), and
an [HDFS namenode and datanode](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes).
In this walkthrough, we will create some tables on OpenHouse, insert data into them, and query the data.
For more information on the various Docker environments and how to set them up,
please see the [SETUP.md](https://github.com/linkedin/openhouse/blob/main/SETUP.md) guide.

In the optional section that follows, you can learn more about some simple GRANT/REVOKE commands and how
OpenHouse manages access control.

@@ -59,6 +134,9 @@ openhouse@0a9ed5853291:/opt/spark$ bin/spark-shell --packages org.apache.iceber
The configuration `spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080` points to the Docker container
running the [OpenHouse Catalog Service](./intro.md#catalog-service).
:::
</TabItem>
</Tabs>


Once the spark-shell is up, we run the following command to create a simple table.
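
A minimal sketch of what that can look like, assuming the `openhouse` catalog configured above (the `db.tb`
table and its columns are illustrative):

```shell
scala> spark.sql("CREATE TABLE openhouse.db.tb (id INT, name STRING)")
scala> spark.sql("INSERT INTO openhouse.db.tb VALUES (1, 'foo')")
scala> spark.sql("SELECT * FROM openhouse.db.tb").show()
```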
