Hello,
We currently have several Kubernetes clusters that receive and execute jobs via the Spark operator. We have built a Spark image (3.5.4) with Delta (3.3.0, Scala 2.12) and implemented a simple Structured Streaming pipeline on it. This implementation reads from an Event Hub (source) and writes the data to disk in Delta format (sink) with a trigger interval of 5 seconds (output mode: append).
log("Start Bronze Streaming")
query = (
    stream_df.writeStream
    .queryName(stream_name)
    .format("delta")
    .partitionBy("datetime")
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", check_path)
    .start(data_path)
)
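For context, stream_df is created roughly as follows. This is a minimal sketch that assumes the Event Hub is consumed through its Kafka-compatible endpoint; the namespace, topic, and SASL options are placeholders, not our actual configuration.

stream_df = (
    spark.readStream
    .format("kafka")  # Event Hubs exposes a Kafka-compatible endpoint on port 9093
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "<eventhub-name>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", "<JAAS config built from the Event Hub connection string>")
    .option("startingOffsets", "latest")
    .load()
)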
The problem is the following: when we start the job for the first time (fresh start), processing a micro-batch takes about 1 second (see screenshot).
If we wait one day, the batch runtime grows to about 9 seconds (see screenshot).
After a few days it levels off. We have set the following SQL configuration to minimize any overhead caused by shuffles or snapshots (see screenshot).
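The settings are along the following lines. The exact keys and values are in the screenshot; the two below are only illustrative examples of the kind of options we mean (shuffle parallelism and Delta snapshot reconstruction), not a verbatim copy of our configuration.

# Illustrative only - see the screenshot for the real configuration.
spark.conf.set("spark.sql.shuffle.partitions", "8")               # keep shuffle overhead small for tiny micro-batches
spark.conf.set("spark.databricks.delta.snapshotPartitions", "2")  # fewer partitions when reconstructing the Delta snapshot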
We have already ported the application to PySpark/Scala and tested it on AKS (Azure) and on dedicated K8s hardware; unfortunately, the behavior is the same. We have also compared the DAGs between native Spark and Databricks (identical Structured Streaming application) and found differences (read JSON from disk vs. cache, and SQL subqueries not available). We also tested the storage by using plain Parquet as a sink; there was no problem in that case (see the sketch below). Why this happens is not clear to us.
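The Parquet storage test mentioned above was simply the same write with the sink format swapped; a sketch (paths and query name are suffixed here only to keep it separate from the Delta output):

query_parquet = (
    stream_df.writeStream
    .queryName(stream_name + "_parquet")
    .format("parquet")                      # same pipeline, Parquet instead of Delta
    .partitionBy("datetime")
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", check_path + "_parquet")
    .start(data_path + "_parquet")
)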
Hard facts:
In Spark's Structured Streaming UI, we have identified that addBatch takes the most time. The durations in the Spark SQL tab look weird: every root query takes a huge amount of time:
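For reference, this is how we read the per-batch timings that point at addBatch; a small sketch using the standard StreamingQuery progress API:

progress = query.lastProgress              # progress of the most recent micro-batch (a dict in PySpark)
if progress is not None:
    durations = progress["durationMs"]     # contains addBatch, getBatch, walCommit, triggerExecution, ...
    print(stream_name, durations)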
Unfortunately, we don't know what to do at the moment. It would be nice if someone had an idea. I would be happy to provide further information.
Greetings