## Overview

* Persist data to PostgreSQL.
* Monitor changes to the data with the Debezium connector (change data capture).
* Stream data from a Kafka topic using PySpark (Spark Streaming).
* Convert the streaming data to Delta Lake format.
* Write the Delta Lake data to MinIO (S3-compatible object storage).
* Query the data with Trino.
* Display the results in DBeaver.

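The change-capture step above is normally wired up by registering a Debezium Postgres connector with Kafka Connect. A hypothetical registration sketch (the Connect port 8083, the connector name, the hostnames, and the table list are assumptions; adapt them to your compose files and `.env` values):

```bash
# Hypothetical Debezium connector config; all values below are assumptions
# based on the .env defaults in this README, not the project's actual setup.
CONNECTOR_CONFIG='{
  "name": "products-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "v9",
    "database.password": "v9",
    "database.dbname": "v9",
    "topic.prefix": "cdc",
    "table.include.list": "public.products"
  }
}'

# Register it with the Kafka Connect REST API (8083 is the usual default port).
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d "$CONNECTOR_CONFIG" || echo "Kafka Connect is not reachable yet"
```

Once registered, row changes in the watched table appear as messages on a Kafka topic, which the Spark Streaming job can then consume.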
## System Architecture

<p align="center">
  <img src="assets/architecture.png" alt="workflow">
</p>

## Prerequisites

Before running this project, ensure you have the following installed.\
**Note**: The project was set up on Ubuntu 22.04.

* Ubuntu 22.04 (preferred, but Ubuntu 20.04 also works)
* Python 3.10
* Apache Spark (installed locally)
* Apache Airflow
* Confluent containers (ZooKeeper, Kafka, Schema Registry, Connect, Control Center)
* Docker
* MinIO
* Trino, DBeaver CE
* Delta Lake
* Debezium, Debezium UI

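A quick way to confirm that the core command-line tools are on your `PATH` before going further (a sketch; the tool names mirror the prerequisites above):

```bash
# Report which of the core tools are installed; prints "MISSING" for any
# that are not found rather than failing.
for tool in python3 docker spark-submit airflow; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```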
## Setup environment

1. **Clone the repository**

   ```bash
   $ git clone https://github.com/VuBacktracking/stream-data-processing.git
   $ cd stream-data-processing
   ```

2. **Start the data streaming infrastructure**

   ```bash
   $ sudo service docker start
   $ docker compose -f storage-docker-compose.yaml -f stream-docker-compose.yaml up -d
   ```

3. **Set up the Python environment**

   ```bash
   $ python3 -m venv .venv
   $ source .venv/bin/activate
   $ pip install -r requirements.txt
   ```

   Create a `.env` file and set your MinIO keys and `SPARK_HOME` in it:

   ```ini
   # MinIO
   MINIO_ACCESS_KEY='minio_access_key'
   MINIO_SECRET_KEY='minio_secret_key'
   MINIO_ENDPOINT='http://localhost:9000'
   BUCKET_NAME='datalake'

   # PostgreSQL
   POSTGRES_DB='v9'
   POSTGRES_USER='v9'
   POSTGRES_PASSWORD='v9'

   # Spark
   SPARK_HOME=""

   # Data
   TABLE_NAME="products"
   DATA_FILE='./data/products.csv'
   ```
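   Jobs tend to fail in unhelpful ways when one of these keys is missing, so a small helper like the following can fail fast after `source .env` (the function is an illustration, not part of the project):

   ```bash
   # Hypothetical helper: return non-zero if a required variable is unset
   # or empty. Uses bash indirect expansion, so run it under bash.
   require_var() {
     local name="$1"
     if [ -z "${!name}" ]; then
       echo "ERROR: $name is not set (check your .env)" >&2
       return 1
     fi
   }

   # Example usage after `source .env`:
   # require_var MINIO_ACCESS_KEY && require_var MINIO_SECRET_KEY
   ```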

4. **Services**

   * Postgres is accessible on the default port 5432.
   * Debezium UI: http://localhost:8080
   * Kafka Control Center: http://localhost:9021
   * Trino: http://localhost:8085
   * MinIO: http://localhost:9001
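To quickly see which of these web UIs are already reachable, a small sketch (assumes `curl` is installed; the ports mirror the list above):

```bash
# Probe a URL and report whether it responds; prints "down" for anything
# that is not up yet instead of aborting.
check_service() {
  if curl -sf -o /dev/null --max-time 2 "$1"; then
    echo "$1: up"
  else
    echo "$1: down"
  fi
}

check_service http://localhost:8080   # Debezium UI
check_service http://localhost:9021   # Kafka Control Center
check_service http://localhost:8085   # Trino
check_service http://localhost:9001   # MinIO
```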
## How to use?