feat(rust/sedona-pointcloud): add optional round robin partitioning and parallel statistics extraction#648
Conversation
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for continuing to work on this!
At a high level, I think DataFusion automatically applies round robin partitioning if it thinks that it will benefit the query plan. The built-in Parquet reader doesn't do this and I would be surprised if you need to explicitly do anything here unless I'm not understanding what is going on here.
This will also need tests that enable the various pieces you've added here. It would also benefit other members of the community to have a PR description with brief summary / justification of the functionality being added.
|
Ah, I see your point about the partitioning...the partitioning already occurred at the data source but there are just a lot of empty partitions. This is probably also an issue we have with the pruning in the GeoParquet reader (and perhaps DataFusion's built in Parquet reader since I just copied how they do pruning). |
paleolimbot
left a comment
There was a problem hiding this comment.
Sorry it took me a while to circle back to this...I had a release TODO list and lost track of a few things. Feel free to ping me if this happens again!
This looks great! I added some optional suggestions of where you put the nice text that you have in the PR description into the code for future readers.
| } | ||
|
|
||
| fn repartitioned( | ||
| &self, |
There was a problem hiding this comment.
Can you add some of the text you have in this PR to this section so that future readers have some background on why this is necessary?
| pub parallel_statistics_extraction: bool, default = false | ||
| pub persist_statistics: bool, default = false | ||
| pub round_robin_partitioning: bool, default = false |
There was a problem hiding this comment.
All of these would benefit from a brief summary docstring of when these values should be modified (e.g., use round robin partitioning when running queries with selective workloads that benefit from parallelization).
|
@paleolimbot, no worries, I guess you have a full plate already. Added some documentation and reworked the options to be self-contained in the las module for now, which seems more concise. |
This contains two optional features that greatly improve the performance of the LAS/LAZ listing table provider.