Spark Structured Streaming is Spark's newer stream processing approach, available since Spark 2.0 and considered stable since Spark 2.2. The Structured Streaming processing engine is built on the Spark SQL engine, and the two share the same high-level API.
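Because the two share one API, the same DataFrame transformations run unchanged in batch and in streaming mode. A minimal illustration of that symmetry (the events/ directory and status column are hypothetical, not part of this project):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-api-demo").getOrCreate()

# Batch: a bounded DataFrame, read once.
batch_df = spark.read.json("events/")
batch_counts = batch_df.groupBy("status").count()

# Streaming: an unbounded DataFrame over the same source; the
# transformation is the identical API call (file sources just need
# an explicit schema when streaming).
stream_df = spark.readStream.schema(batch_df.schema).json("events/")
stream_counts = stream_df.groupBy("status").count()
```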
- Used versions: spark-2.4.0 and kafka_2.12-2.2.0
# After downloading the appropriate zipped files
tar xzf spark-2.4.0-bin-hadoop2.7.tgz
ln -sf spark-2.4.0-bin-hadoop2.7 spark
tar xzf kafka_2.12-2.2.0.tgz
ln -sf kafka_2.12-2.2.0 kafka
- To set Spark's log level to WARN
cp spark/conf/log4j.properties.template spark/conf/log4j.properties
sed -i -e 's/log4j.rootCategory=INFO/log4j.rootCategory=WARN/g' spark/conf/log4j.properties
- Install project dependencies
source env/bin/activate # python virtual environment
pip install -r requirements.txt
- Start the ZooKeeper and Kafka servers, then create two Kafka topics: taxirides and taxifares
kafka/bin/zookeeper-server-start.sh -daemon kafka/config/zookeeper.properties
kafka/bin/kafka-server-start.sh -daemon kafka/config/server.properties
kafka/bin/kafka-topics.sh \
--create --zookeeper localhost:2181 --replication-factor 1 \
--partitions 1 --topic taxirides
kafka/bin/kafka-topics.sh \
--create --zookeeper localhost:2181 --replication-factor 1 \
--partitions 1 --topic taxifares
- Ingestion of the two datasets: each is pushed to Kafka through a console producer and will later be consumed by the streaming job
( zcat data/nycTaxiFares.gz \
| split -l 10000 --filter="kafka/bin/kafka-console-producer.sh \
--broker-list localhost:9092 --topic taxifares; sleep 0.5" \
> /dev/null ) &
( zcat data/nycTaxiRides.gz \
| split -l 10000 --filter="kafka/bin/kafka-console-producer.sh \
--broker-list localhost:9092 --topic taxirides; sleep 0.2" \
> /dev/null ) &
Notice the difference in sleep times: since the rides dataset is larger than the fares dataset, the two topics must not be produced at the same rate, otherwise matching events would drift apart in event time and stream-stream joins would no longer be possible (each chunk holds 10000 messages).
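The underlying reason is that Spark buffers unmatched rows for a stream-stream join only within an event-time watermark: if one side is produced much faster, the slower side's would-be matches get evicted from the join state before they arrive. A self-contained toy illustration of the mechanism, using Spark's built-in rate source instead of the taxi topics (all names here are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("join-watermark-demo").getOrCreate()

# Two synthetic streams standing in for rides and fares.
left = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        .withColumnRenamed("timestamp", "leftTime")
        .withWatermark("leftTime", "30 seconds")
        .alias("l"))
right = (spark.readStream.format("rate").option("rowsPerSecond", 2).load()
         .withColumnRenamed("timestamp", "rightTime")
         .withWatermark("rightTime", "30 seconds")
         .alias("r"))

# The join condition bounds how far apart matching events may be in
# event time; rows outside the bound (or past the watermark) are
# dropped from state, which is why the two producers must not drift
# too far apart.
joined = left.join(
    right,
    expr("""
        l.value = r.value AND
        r.rightTime BETWEEN l.leftTime - INTERVAL 20 SECONDS
                        AND l.leftTime + INTERVAL 20 SECONDS
    """))

query = joined.writeStream.format("console").start()
query.awaitTermination()
```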
- To verify that the data was successfully written to the specified topics
kafka/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 --topic taxirides --from-beginning
kafka/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 --topic taxifares --from-beginning
- Run the pipeline with:
spark/bin/spark-submit --master local --driver-memory 4g \
  --num-executors 2 --executor-memory 4g \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  sstreaming-spark-out.py
Note that the --packages coordinate has to match the Spark version and its Scala build; the prebuilt spark-2.4.0 binaries use Scala 2.11.
(The data streams should be restarted once the first micro-batch, "Batch: 0", appears in the console.)
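The pipeline script itself is not reproduced here. As a rough starting point, the sketch below shows a minimal Structured Streaming job of the same shape: it subscribes to the taxirides topic and echoes the raw lines to the console. Everything beyond the topic name and broker address is an assumption; the actual sstreaming-spark-out.py presumably parses the fields and joins the two streams, which this sketch does not.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("taxi-pipeline-sketch").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# Subscribe to the taxirides topic from the beginning.
rides_raw = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "taxirides")
             .option("startingOffsets", "earliest")
             .load())

# Kafka hands each record over as binary key/value; the value is the
# original dataset line, so cast it to a string before any parsing.
rides = rides_raw.select(col("value").cast("string").alias("line"))

query = (rides.writeStream
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```

With the console sink, "Batch: 0" is the first micro-batch printed once the query starts, which makes it a convenient signal that the job is up and consuming and the producers can be (re)started.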