[SPARK-54335][SQL] Reducing skew in the number of file splits per partition #53040
What changes were proposed in this pull request?
Currently, when Spark partitions the input files in a table scan, it first sorts the input splits and then coalesces adjacent splits into partitions. If the split size distribution is uneven, some partitions end up with only a few splits while others contain many.
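For context, a simplified sketch of the size-only coalescing described above (an illustration, not the actual `FilePartition` code in Spark):

```scala
// Simplified sketch of the current size-only packing: splits are sorted by
// length and adjacent splits are coalesced until a partition reaches
// maxSplitBytes. Small files can pile up in whichever partition they land in.
case class Split(path: String, length: Long)

def packBySize(splits: Seq[Split], maxSplitBytes: Long): Seq[Seq[Split]] = {
  val partitions = scala.collection.mutable.ArrayBuffer.empty[Seq[Split]]
  val current = scala.collection.mutable.ArrayBuffer.empty[Split]
  var currentSize = 0L

  // Sort descending by size, then greedily close a partition once it is "full".
  splits.sortBy(-_.length).foreach { split =>
    if (currentSize + split.length > maxSplitBytes && current.nonEmpty) {
      partitions += current.toSeq
      current.clear()
      currentSize = 0L
    }
    current += split
    currentSize += split.length
  }
  if (current.nonEmpty) partitions += current.toSeq
  partitions.toSeq
}
```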
We observed that this file partitioning strategy can slow down reads when some tasks process many more files than others, especially in Gluten's native Parquet reader. To address this performance issue, we propose a new partitioning strategy that takes both partition size and file count into account and spreads small files across partitions to avoid skew.
The strategy is designed with the following steps:
- If `spark.sql.files.maxPartitionNum` is set, the smaller of that value and the computed partition count is used as the output partition number.
- The total size of small files can be configured using `spark.gluten.sql.columnar.smallFileThreshold`, which specifies the percentage of the total input file size represented by small files.
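A minimal sketch of one way the small-file redistribution could work, reusing `Split` and `packBySize` from the sketch above; the fraction-based cutoff and the round-robin placement are illustrative assumptions, not the PR's exact implementation:

```scala
// Illustrative sketch only (reuses Split and packBySize from the sketch
// above): pack the larger splits by size as before, then spread the small
// splits round-robin across the resulting partitions so no single partition
// accumulates a long tail of tiny files.
def distributeSmallFiles(
    splits: Seq[Split],
    maxSplitBytes: Long,
    smallFileThreshold: Double): Seq[Seq[Split]] = {
  val totalSize = splits.map(_.length).sum
  // Splits whose cumulative size (smallest first) stays within this budget
  // are treated as "small files".
  val smallBudget = (totalSize * smallFileThreshold).toLong

  val sortedAsc = splits.sortBy(_.length)
  val cumulative = sortedAsc.scanLeft(0L)(_ + _.length).tail
  val (small, large) =
    sortedAsc.zip(cumulative).partition { case (_, cum) => cum <= smallBudget }

  // Size-based packing for the large splits, then round-robin the small ones
  // to even out per-partition file counts.
  val packed = packBySize(large.map(_._1), maxSplitBytes).map(_.toBuffer)
  if (packed.isEmpty) return Seq(small.map(_._1))
  small.map(_._1).zipWithIndex.foreach { case (split, i) =>
    packed(i % packed.size) += split
  }
  packed.map(_.toSeq)
}
```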
Why are the changes needed?
As described in the previous section, skew in the number of files per partition can slow down table scans. End users and projects like Apache Gluten can benefit from this change.
Does this PR introduce any user-facing change?
Yes. New configurations are added for this enhancement:
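A hedged usage example, using the configuration keys mentioned above; the values are placeholders, and the exact keys, value formats, and defaults are defined by the PR itself:

```scala
// Illustrative only: cap the number of output partitions (existing Spark
// config) and set the small-file fraction discussed above. The values shown
// here are placeholders, not recommended defaults.
spark.conf.set("spark.sql.files.maxPartitionNum", "2000")
spark.conf.set("spark.gluten.sql.columnar.smallFileThreshold", "0.1")
```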
How was this patch tested?
Unit test
Was this patch authored or co-authored using generative AI tooling?
No