Streaming data processing pipeline using Spark, PostgreSQL, Debezium, Kafka, MinIO, Delta Lake, Trino, and DBeaver.


DATA STREAM PROCESSING

Overview

  • Persist data to PostgreSQL.
  • Monitor changes to data using the Debezium Connector.
  • Stream data from a Kafka topic using PySpark (Spark Streaming).
  • Convert the streaming data to Delta Lake format.
  • Write the Delta Lake data to MinIO (S3-compatible object storage).
  • Query the data with Trino.
  • Display the results in DBeaver.

System Architecture

[Workflow diagram]

Prerequisites

Before running this project, ensure you have the following installed.
Note: the project was set up on Ubuntu 22.04.

  • Ubuntu 22.04 (preferred, but Ubuntu 20.04 also works)
  • Python 3.10
  • Apache Spark (installed locally)
  • Apache Airflow
  • Confluent Containers (Zookeeper, Kafka, Schema Registry, Connect, Control Center)
  • Docker
  • MinIO
  • Trino, DBeaver CE
  • Delta Lake
  • Debezium, Debezium UI

Start

  1. Clone the repository
$ git clone https://github.com/VuBacktracking/stream-data-processing.git
$ cd stream-data-processing
  2. Start the data streaming infrastructure
$ sudo service docker start
$ docker compose -f storage-docker-compose.yaml -f stream-docker-compose.yaml up -d
  3. Set up the Python environment
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

Create a .env file and add your MinIO keys and SPARK_HOME to it.

# MinIO
MINIO_ACCESS_KEY='minio_access_key'
MINIO_SECRET_KEY='minio_secret_key'
MINIO_ENDPOINT='http://localhost:9000'
BUCKET_NAME='datalake'

# PostgreSQL
POSTGRES_DB='v9'
POSTGRES_USER='v9'
POSTGRES_PASSWORD='v9'

# Spark
SPARK_HOME=""
  4. Services

Once the containers are up, the web UIs used in the steps below include the Debezium UI at http://localhost:8085 and the MinIO console at http://localhost:9001.

How to use?

  • Step 1. Register the Debezium connector
cd debezium
bash run-cdc.sh register_connector conf/products-cdc-config.json

You should see the connector in the RUNNING state in the Debezium UI at http://localhost:8085.
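
For reference, run-cdc.sh essentially registers the connector by POSTing its JSON definition to the Kafka Connect REST API. Below is a minimal Python sketch of that registration; the connector name, hostname, and config values are illustrative assumptions, and the real settings live in conf/products-cdc-config.json.

import requests  # third-party: pip install requests

# Hypothetical connector definition; the actual one is in
# conf/products-cdc-config.json and may use different values.
connector = {
    "name": "products-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",   # Postgres hostname on the compose network (assumed)
        "database.port": "5432",
        "database.user": "v9",
        "database.password": "v9",
        "database.dbname": "v9",
        "topic.prefix": "v9",              # Kafka topics become <prefix>.<schema>.<table>
        "table.include.list": "public.products",
        "plugin.name": "pgoutput",
    },
}

# Kafka Connect's REST API (assumed at localhost:8083) accepts new
# connectors via POST /connectors.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())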

  • Step 2. Create the table and insert data into the database
python3 database-operations/create_table.py
python3 database-operations/insert_table.py

In your PostgreSQL connection, you should now see the database v9 with the products table populated.
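
For orientation, a minimal sketch of the table creation with psycopg2 is shown below; connection values come from the .env example above, and the column layout mirrors the Trino DDL later in this README (the actual database-operations scripts may differ).

import psycopg2  # pip install psycopg2-binary

# Connection values match the .env defaults above.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="v9", user="v9", password="v9"
)
with conn, conn.cursor() as cur:
    # Columns mirror the Trino table created later in this README.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id TEXT PRIMARY KEY,
            name TEXT,
            original_price DOUBLE PRECISION,
            price DOUBLE PRECISION,
            fulfillment_type TEXT,
            brand TEXT,
            review_count INTEGER,
            rating_average DOUBLE PRECISION,
            favourite_count INTEGER,
            current_seller TEXT,
            number_of_images INTEGER,
            category TEXT,
            quantity_sold INTEGER,
            discount DOUBLE PRECISION
        );
    """)
conn.close()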

  • Step 3. Start streaming data to MinIO
python3 stream_processing/delta-to-minio.py

Once data is being written, open the MinIO console at http://localhost:9001 to verify the Delta files in the datalake bucket.
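
Conceptually, stream_processing/delta-to-minio.py is a Spark Structured Streaming job: read the Debezium change topic, pull out the "after" image of each change event, and append it to a Delta table on MinIO. The sketch below is a simplified illustration; the broker address, topic name, and package setup are assumptions, and the actual script may be configured differently.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)

# Requires the Delta Lake, Kafka-source, and S3A packages on the Spark
# classpath (e.g. via spark-submit --packages). Credentials match the
# .env example above.
spark = (
    SparkSession.builder.appName("delta-to-minio")
    # Point the S3A filesystem at MinIO instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio_access_key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio_secret_key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Row schema mirrors the products table.
row_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("original_price", DoubleType()),
    StructField("price", DoubleType()),
    StructField("fulfillment_type", StringType()),
    StructField("brand", StringType()),
    StructField("review_count", IntegerType()),
    StructField("rating_average", DoubleType()),
    StructField("favourite_count", IntegerType()),
    StructField("current_seller", StringType()),
    StructField("number_of_images", IntegerType()),
    StructField("category", StringType()),
    StructField("quantity_sold", IntegerType()),
    StructField("discount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "v9.public.products")            # assumed CDC topic name
    .option("startingOffsets", "earliest")
    .load()
)

# Each Kafka message is a Debezium change event; keep only the "after" image.
rows = (
    raw.select(get_json_object(col("value").cast("string"),
                               "$.payload.after").alias("after_json"))
    .select(from_json(col("after_json"), row_schema).alias("after"))
    .select("after.*")
)

# Append the change stream to a Delta table in the MinIO bucket.
query = (
    rows.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://datalake/checkpoints/products")
    .start("s3a://datalake/products")
)
query.awaitTermination()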

Read streaming data with Trino and DBeaver

Connect to Trino in DBeaver
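
In DBeaver, create a new connection of type Trino. The JDBC URL typically looks like the line below, assuming Trino is published on its default port 8080 and the catalog is named lakehouse (check the compose file for the actual port):

jdbc:trino://localhost:8080/lakehouse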

Query with DBeaver

Create your Trino schema and table in DBeaver:

-- Create the schema if it doesn't exist
CREATE SCHEMA IF NOT EXISTS lakehouse.products
WITH (location = 's3://datalake/');

-- Create the products table
CREATE TABLE IF NOT EXISTS lakehouse.products.products (
    id VARCHAR,
    name VARCHAR,
    original_price DOUBLE,
    price DOUBLE,
    fulfillment_type VARCHAR,
    brand VARCHAR,
    review_count INTEGER,
    rating_average DOUBLE,
    favourite_count INTEGER,
    current_seller VARCHAR,
    number_of_images INTEGER,
    category VARCHAR,
    quantity_sold INTEGER,
    discount DOUBLE
) WITH (
    location = 's3://datalake/products/'
);
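
After creating the table, a quick sanity check that streamed rows are visible (column names come from the DDL above):

-- Row counts per category, as a smoke test
SELECT category, COUNT(*) AS product_count
FROM lakehouse.products.products
GROUP BY category
ORDER BY product_count DESC;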
