feat(rust/sedona-pointcloud): add optional round robin partitioning and parallel statistics extraction#648

Merged
paleolimbot merged 7 commits into apache:main from b4l:tuning
Mar 2, 2026

@b4l b4l commented Feb 20, 2026

This contains two optional features that greatly improve the performance of the LAS/LAZ listing table provider.

  • Round-robin partitioning: By default, a dataset is partitioned for parallel reading by DataFusion by splitting files by byte range into as many pieces as there are target partitions. For selective queries on (partially) ordered datasets that support pruning, this can result in unequal resource use, as all the work is done on one partition while the rest are pruned. Additionally, this breaks the existing locality in the input when it is converted, as data from all partitions ends up in each output row group. This approach addresses these issues by partitioning the dataset using a round-robin scheme across sequential chunks. This improves selective query performance by more than half.
  • Parallel statistics extraction: The schema-inference method adopted from the Parquet reader uses concurrency (metadata fetch concurrency), but it is not parallel. Extracting statistics in parallel can speed up the extraction process by up to a factor of the number of available cores.
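
As a rough illustration of the second point, here is a minimal sketch of fanning per-file statistics extraction out over OS threads with `std::thread::scope`. It is not the code from this PR: `FileStats`, `extract_stats`, and `extract_all_parallel` are hypothetical names, and each "file" is just a vector of values standing in for whatever the LAS/LAZ reader actually parses.

```rust
use std::thread;

/// Hypothetical per-file statistics (a stand-in for bounds read from a file).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    min: f64,
    max: f64,
    count: usize,
}

/// Hypothetical CPU-bound extraction step for a single file.
fn extract_stats(values: &[f64]) -> FileStats {
    let min = values.iter().copied().fold(f64::INFINITY, f64::min);
    let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    FileStats { min, max, count: values.len() }
}

/// Extract statistics for all files using up to `workers` threads.
/// Results come back in the same order as `files`.
fn extract_all_parallel(files: &[Vec<f64>], workers: usize) -> Vec<FileStats> {
    let workers = workers.max(1);
    // Contiguous batches, one per worker; at least 1 so `chunks` never panics.
    let batch = ((files.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn one scoped thread per batch; borrowing `files` is safe here
        // because all threads are joined before the scope returns.
        let handles: Vec<_> = files
            .chunks(batch)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|f| extract_stats(f)).collect::<Vec<_>>())
            })
            .collect();
        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Concurrent metadata fetching overlaps I/O waits, whereas a fan-out like this also spreads the CPU-bound parsing across cores, which is where a speedup proportional to the core count would come from.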

@b4l b4l changed the title feat(rust/sedona-pointcloud) add optional round robin partitioning and parallel statistics extraction feat(rust/sedona-pointcloud): add optional round robin partitioning and parallel statistics extraction Feb 20, 2026

@paleolimbot paleolimbot left a comment

Thank you for continuing to work on this!

At a high level, I think DataFusion automatically applies round robin partitioning if it thinks that it will benefit the query plan. The built-in Parquet reader doesn't do this and I would be surprised if you need to explicitly do anything here unless I'm not understanding what is going on here.

This will also need tests that enable the various pieces you've added here. It would also benefit other members of the community to have a PR description with a brief summary / justification of the functionality being added.

@paleolimbot

Ah, I see your point about the partitioning...the partitioning already occurred at the data source but there are just a lot of empty partitions. This is probably also an issue we have with the pruning in the GeoParquet reader (and perhaps DataFusion's built in Parquet reader since I just copied how they do pruning).
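
The empty-partition effect can be sketched with a toy model. This is only an illustration under stated assumptions, not the PR's implementation: `contiguous`, `round_robin`, and `busy_partitions` are hypothetical helpers, chunks are plain indices, and pruning is modeled as a predicate over chunk indices.

```rust
/// Assign sequential chunks to partitions as contiguous ranges
/// (similar in spirit to splitting a file by byte ranges).
fn contiguous(num_chunks: usize, partitions: usize) -> Vec<Vec<usize>> {
    let partitions = partitions.max(1);
    let per = (num_chunks + partitions - 1) / partitions;
    let mut out = vec![Vec::new(); partitions];
    for c in 0..num_chunks {
        out[c / per].push(c);
    }
    out
}

/// Assign sequential chunks to partitions in round-robin order.
fn round_robin(num_chunks: usize, partitions: usize) -> Vec<Vec<usize>> {
    let partitions = partitions.max(1);
    let mut out = vec![Vec::new(); partitions];
    for c in 0..num_chunks {
        out[c % partitions].push(c);
    }
    out
}

/// Count partitions that still have at least one chunk left after pruning.
/// `keep` returns true for chunks that survive pruning.
fn busy_partitions(assignment: &[Vec<usize>], keep: impl Fn(usize) -> bool) -> usize {
    assignment
        .iter()
        .filter(|chunks| chunks.iter().any(|&c| keep(c)))
        .count()
}
```

With 16 chunks, 4 partitions, and a selective query on an ordered dataset that keeps only chunks 4 through 7, the contiguous assignment leaves a single partition with work, while the round-robin assignment gives every partition one surviving chunk.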

@paleolimbot paleolimbot left a comment

Sorry it took me a while to circle back to this...I had a release TODO list and lost track of a few things. Feel free to ping me if this happens again!

This looks great! I added some optional suggestions for where you could put the nice text from the PR description into the code for future readers.

}

fn repartitioned(
&self,
Member

Can you add some of the text you have in this PR to this section so that future readers have some background on why this is necessary?

Contributor Author

done in c6f34a3

Comment on lines +83 to +85
pub parallel_statistics_extraction: bool, default = false
pub persist_statistics: bool, default = false
pub round_robin_partitioning: bool, default = false
Member

All of these would benefit from a brief summary docstring of when these values should be modified (e.g., use round robin partitioning when running queries with selective workloads that benefit from parallelization).

Contributor Author

done in c6f34a3

b4l commented Mar 2, 2026

@paleolimbot, no worries, I guess you have a full plate already. Added some documentation and reworked the options to be self-contained in the las module for now, which seems more concise.

@paleolimbot paleolimbot left a comment

Thank you!

@paleolimbot paleolimbot merged commit 1637efe into apache:main Mar 2, 2026
17 checks passed
@b4l b4l deleted the tuning branch March 3, 2026 09:01