## Overview

* Persist data to PostgreSQL.
* Monitor changes to data using the Debezium Connector.
* Stream data from a Kafka topic using PySpark (Spark Streaming).
* Convert the streaming data to Delta Lake format.
* Write the Delta Lake data to MinIO (S3-compatible object storage).
* Query the data with Trino.
* Display the results in DBeaver.

## System Architecture

<p align="center">
  <img src="assets/architecture.png" alt="workflow">
</p>
## Prerequisites

Before running this project, ensure you have the following installed.\
**Note**: The project was set up on Ubuntu 22.04.

* Ubuntu 22.04 (preferred, but Ubuntu 20.04 also works)
* Python 3.10
* Apache Spark (installed locally)
* Apache Airflow
* Confluent containers (ZooKeeper, Kafka, Schema Registry, Connect, Control Center)
* Docker
* MinIO
* Trino, DBeaver CE
* Delta Lake
* Debezium, Debezium UI
## Environment setup

1. **Clone the repository**

```bash
$ git clone https://github.com/VuBacktracking/stream-data-processing.git
$ cd stream-data-processing
```

2. **Start the data streaming infrastructure**

```bash
$ sudo service docker start
$ docker compose -f storage-docker-compose.yaml -f stream-docker-compose.yaml up -d
```

3. **Set up the Python environment**

```bash
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
```

Create a `.env` file and set your MinIO keys and `SPARK_HOME` in it:

```ini
# MinIO
MINIO_ACCESS_KEY='minio_access_key'
MINIO_SECRET_KEY='minio_secret_key'
MINIO_ENDPOINT='http://localhost:9000'
BUCKET_NAME='datalake'

# PostgreSQL
POSTGRES_DB='v9'
POSTGRES_USER='v9'
POSTGRES_PASSWORD='v9'

# Spark
SPARK_HOME=""

# Data
TABLE_NAME="products"
DATA_FILE='./data/products.csv'
```
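If you prefer not to depend on a third-party package such as `python-dotenv`, a minimal standard-library loader for the simple `KEY='value'` format above could look like this (`load_env` is a hypothetical helper for illustration, not part of the repo):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY='value' lines from a .env file and export them."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding single or double quotes from the value
            env[key.strip()] = value.strip().strip("'\"")
    os.environ.update(env)
    return env
```

After calling `load_env()`, the values are available via `os.environ`, so the Spark and MinIO client code can read them the usual way.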

4. **Services**

* Postgres is accessible on the default port 5432.
* Debezium UI: http://localhost:8080
* Kafka Control Center: http://localhost:9021
* Trino: http://localhost:8085
* MinIO: http://localhost:9001

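Starting the containers only brings up the infrastructure; Debezium still needs a PostgreSQL source connector registered with Kafka Connect before change events flow into Kafka. The connector name, topic prefix, and table list below are assumptions for illustration and should be adjusted to match this repo's actual connector config:

```json
{
  "name": "products-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "v9",
    "database.password": "v9",
    "database.dbname": "v9",
    "topic.prefix": "v9",
    "table.include.list": "public.products"
  }
}
```

POST this JSON to Kafka Connect's REST API (by default `http://localhost:8083/connectors`), or create the connector through the Debezium UI at http://localhost:8080.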
## How to use
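The streaming job described in the overview (Kafka → Delta Lake → MinIO) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repo's actual script: the topic name, checkpoint path, and S3A settings are placeholders, and the `spark-sql-kafka` and Delta Lake packages must be on Spark's classpath.

```python
# Sketch of the Kafka -> Delta Lake -> MinIO streaming job (names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("stream-to-delta")
    # MinIO is addressed through the S3A filesystem; endpoint/keys come from .env
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio_access_key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio_secret_key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the Debezium change-event topic (topic name is an assumption)
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "v9.public.products")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers payloads as bytes; cast to string before parsing the JSON further
events = raw.select(col("value").cast("string").alias("json"))

# Write the stream as a Delta table in the MinIO bucket
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3a://datalake/checkpoints/products")
    .start("s3a://datalake/products")
)
query.awaitTermination()
```

Once the stream is running, the Delta table under `s3a://datalake/products` can be queried from Trino and browsed in DBeaver, as listed in the overview.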