Hello,
We currently have several Kubernetes clusters that receive and execute jobs via the Spark operator. We have built a Spark image (3.5.4) with Delta (3.3.0, Scala 2.12) and implemented a simple Structured Streaming pipeline on it. This implementation reads from an Event Hub (source) and writes the data to disk in Delta format (sink) with a trigger interval of 5 seconds (output mode: append).
log("Start Bronze Streaming")
query = (
    stream_df.writeStream
    .queryName(stream_name)
    .format("delta")
    .partitionBy("datetime")
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", check_path)
    .start(data_path)
)
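For context, stream_df is created roughly as follows. This is a minimal sketch that assumes the Event Hub is consumed through its Kafka-compatible endpoint; the namespace, topic, and SASL options are placeholders, not our actual configuration.

stream_df = (
    spark.readStream
    .format("kafka")  # Event Hubs exposes a Kafka-compatible endpoint on port 9093
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "<eventhub-name>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", "<JAAS config built from the Event Hub connection string>")
    .option("startingOffsets", "latest")
    .load()
)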
The problem is the following: when we start the job for the first time (fresh start), processing a micro-batch takes about 1 second (see screenshot).
If we wait one day, the batch runtime grows to about 9 seconds (see screenshot).
After a few days it levels off. We have set the following SQL configuration to minimize any overhead caused by shuffles or snapshots (see screenshot).
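The settings are along the following lines. The exact keys and values are in the screenshot; the two below are only illustrative examples of the kind of options we mean (shuffle parallelism and Delta snapshot reconstruction), not a verbatim copy of our configuration.

# Illustrative only - see the screenshot for the real configuration.
spark.conf.set("spark.sql.shuffle.partitions", "8")               # keep shuffle overhead small for tiny micro-batches
spark.conf.set("spark.databricks.delta.snapshotPartitions", "2")  # fewer partitions when reconstructing the Delta snapshot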
We have already ported the application to PySpark/Scala and tested it on AKS (Azure) and on dedicated K8s hardware; unfortunately, the behavior is the same. We have also compared the DAGs between native Spark and Databricks (identical Structured Streaming application) and found differences (read JSON from disk vs. cache, and SQL subqueries not available). We also tested the storage by using plain Parquet as a sink; there was no problem in that case (see the sketch below). Why this happens is not clear to us.
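The Parquet storage test mentioned above was simply the same write with the sink format swapped; a sketch (paths and query name are suffixed here only to keep it separate from the Delta output):

query_parquet = (
    stream_df.writeStream
    .queryName(stream_name + "_parquet")
    .format("parquet")                      # same pipeline, Parquet instead of Delta
    .partitionBy("datetime")
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", check_path + "_parquet")
    .start(data_path + "_parquet")
)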
Hard facts:
In Spark's Structured Streaming UI, we have identified that addBatch takes the most time. The durations in the Spark SQL tab look weird: every root query takes a huge amount of time:
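For reference, this is how we read the per-batch timings that point at addBatch; a small sketch using the standard StreamingQuery progress API:

progress = query.lastProgress              # progress of the most recent micro-batch (a dict in PySpark)
if progress is not None:
    durations = progress["durationMs"]     # contains addBatch, getBatch, walCommit, triggerExecution, ...
    print(stream_name, durations)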
Unfortunately, we don't know what to do at the moment. It would be nice if someone had an idea. I would be happy to provide further information.
Greetings