@marin-ma
Contributor

What changes were proposed in this pull request?

Currently, when Spark partitions the input files in a table scan, it first sorts the input splits, then adjacent splits are coalesced into a single partition. If the input split size distribution is uneven, some partitions will have only a few splits while others will have many.
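To illustrate the skew, here is a simplified sketch of the sort-then-coalesce behavior. It is not Spark's actual code: the function name, the parameters, and the input sizes are hypothetical, and the per-file open cost is modeled as a plain additive constant.

```python
def next_fit_partitions(split_sizes, max_split_bytes, open_cost=0):
    """Pack splits, sorted by size descending, into partitions in order."""
    partitions, current, current_size = [], [], 0
    for size in sorted(split_sizes, reverse=True):
        # Close the current partition if adding this split would exceed the cap.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost
    if current:
        partitions.append(current)
    return partitions


# Hypothetical input: 2 large splits and 8 small ones, with a cap of 100 bytes.
splits = [100, 100] + [10] * 8
parts = next_fit_partitions(splits, max_split_bytes=100)
print([len(p) for p in parts])  # file count per partition -> [1, 1, 8]
```

With this input, all eight small splits land in the last partition, so one task opens eight files while the other two open one each.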

We observed that this file partitioning strategy can slow down the reading process if some tasks are reading more files than others, especially in Gluten’s native Parquet reader. To address this performance issue, we are proposing a new partitioning strategy that takes both partition size and file count into account and distributes the small files across different partitions to avoid skew.

The strategy consists of the following steps:

  1. Get the number of output partitions from Spark's original logic, FilePartition.getFilePartitions. If spark.sql.files.maxPartitionNum is set, use the smaller of the two values.
  2. Assign the small files, starting from the smallest, each to the partition with the minimum file count, then the minimum total file size.
  3. Assign the remaining files, starting from the largest, each to the partition with the minimum total file size, then the minimum file count.
    The total size of the small files can be configured using spark.sql.files.smallFileThreshold, which specifies the fraction of the total input file size that small files account for.
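The steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `assign_files`, `num_partitions`, and `small_file_fraction` are hypothetical names, and it assumes that the small-file set is built by taking files in ascending size order until their cumulative size would exceed the configured fraction of the total input size.

```python
def assign_files(file_sizes, num_partitions, small_file_fraction):
    """Distribute files over a fixed number of partitions in two passes."""
    files = sorted(file_sizes)
    budget = sum(files) * small_file_fraction  # size budget for small files
    counts = [0] * num_partitions
    totals = [0] * num_partitions
    buckets = [[] for _ in range(num_partitions)]

    def place(size, key):
        # Pick the partition that is minimal under the given ordering.
        idx = min(range(num_partitions), key=key)
        buckets[idx].append(size)
        counts[idx] += 1
        totals[idx] += size

    # Pass 1: small files, smallest first, ordered by (file count, total size).
    used, i = 0, 0
    while i < len(files) and used + files[i] <= budget:
        place(files[i], key=lambda p: (counts[p], totals[p]))
        used += files[i]
        i += 1

    # Pass 2: remaining files, largest first, ordered by (total size, file count).
    for size in reversed(files[i:]):
        place(size, key=lambda p: (totals[p], counts[p]))
    return buckets


# Hypothetical input: 2 large files and 8 small ones, 3 output partitions.
buckets = assign_files([100, 100] + [10] * 8, num_partitions=3,
                       small_file_fraction=0.3)
print([len(b) for b in buckets])  # file count per partition -> [4, 3, 3]
```

In this example the eight small files are spread across all three partitions, so no single task has to open most of them.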

Why are the changes needed?

The motivation is described in the previous section. End users and projects such as Apache Gluten can benefit from this change.

Does this PR introduce any user-facing change?

Yes. Two new configurations are added for this enhancement:

  • spark.sql.files.partitionStrategy
  • spark.sql.files.smallFileThreshold

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 13, 2025