[SPARK-54335][SQL] Reducing skew in the number of file splits per partition #53040
What changes were proposed in this pull request?
Currently, when Spark partitions the input files in a table scan, it first sorts the input splits and then coalesces adjacent splits into partitions. If the split size distribution is uneven, some partitions end up with only a few splits while others contain many.
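For context, a simplified sketch of the size-only coalescing described above (an illustration, not the actual `FilePartition` code in Spark):

```scala
// Simplified sketch of the current size-only packing: splits are sorted by
// length and adjacent splits are coalesced until a partition reaches
// maxSplitBytes. Small files can pile up in whichever partition they land in.
case class Split(path: String, length: Long)

def packBySize(splits: Seq[Split], maxSplitBytes: Long): Seq[Seq[Split]] = {
  val partitions = scala.collection.mutable.ArrayBuffer.empty[Seq[Split]]
  val current = scala.collection.mutable.ArrayBuffer.empty[Split]
  var currentSize = 0L

  // Sort descending by size, then greedily close a partition once it is "full".
  splits.sortBy(-_.length).foreach { split =>
    if (currentSize + split.length > maxSplitBytes && current.nonEmpty) {
      partitions += current.toSeq
      current.clear()
      currentSize = 0L
    }
    current += split
    currentSize += split.length
  }
  if (current.nonEmpty) partitions += current.toSeq
  partitions.toSeq
}
```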
We observed that this file partitioning strategy can slow down reads when some tasks process many more files than others, especially in Gluten's native Parquet reader. To address this performance issue, we propose a new partitioning strategy that takes both partition size and file count into account and spreads small files across partitions to avoid skew.
The strategy is designed with the following steps:
- If `spark.sql.files.maxPartitionNum` is set, the smaller of that value and the computed partition count is used as the output partition number.
- The total size of small files can be configured using `spark.gluten.sql.columnar.smallFileThreshold`, which specifies the percentage of the total input file size represented by small files.
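A minimal sketch of one way the small-file redistribution could work, reusing `Split` and `packBySize` from the sketch above; the fraction-based cutoff and the round-robin placement are illustrative assumptions, not the PR's exact implementation:

```scala
// Illustrative sketch only (reuses Split and packBySize from the sketch
// above): pack the larger splits by size as before, then spread the small
// splits round-robin across the resulting partitions so no single partition
// accumulates a long tail of tiny files.
def distributeSmallFiles(
    splits: Seq[Split],
    maxSplitBytes: Long,
    smallFileThreshold: Double): Seq[Seq[Split]] = {
  val totalSize = splits.map(_.length).sum
  // Splits whose cumulative size (smallest first) stays within this budget
  // are treated as "small files".
  val smallBudget = (totalSize * smallFileThreshold).toLong

  val sortedAsc = splits.sortBy(_.length)
  val cumulative = sortedAsc.scanLeft(0L)(_ + _.length).tail
  val (small, large) =
    sortedAsc.zip(cumulative).partition { case (_, cum) => cum <= smallBudget }

  // Size-based packing for the large splits, then round-robin the small ones
  // to even out per-partition file counts.
  val packed = packBySize(large.map(_._1), maxSplitBytes).map(_.toBuffer)
  if (packed.isEmpty) return Seq(small.map(_._1))
  small.map(_._1).zipWithIndex.foreach { case (split, i) =>
    packed(i % packed.size) += split
  }
  packed.map(_.toSeq)
}
```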
Why are the changes needed?
As described in the previous section, skew in the number of files per partition can slow down table scans. End users and projects like Apache Gluten can benefit from this change.
Does this PR introduce any user-facing change?
Yes. New configurations are added for this enhancement:
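A hedged usage example, using the configuration keys mentioned above; the values are placeholders, and the exact keys, value formats, and defaults are defined by the PR itself:

```scala
// Illustrative only: cap the number of output partitions (existing Spark
// config) and set the small-file fraction discussed above. The values shown
// here are placeholders, not recommended defaults.
spark.conf.set("spark.sql.files.maxPartitionNum", "2000")
spark.conf.set("spark.gluten.sql.columnar.smallFileThreshold", "0.1")
```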
How was this patch tested?
Unit test
Was this patch authored or co-authored using generative AI tooling?
No