feat(rust/sedona-spatial-join): Support partitioned KNN join to handle larger than memory object side#573
Kontinuation wants to merge 3 commits into apache:main
    }

    fn partition_no_multi(&self, _bbox: &BoundingBox) -> Result<SpatialPartition> {
        let idx = self.counter.fetch_add(1, Ordering::Relaxed);
This round-robin partitioner is nondeterministic because its output depends on the order in which concurrent tasks are scheduled. We will address this by making all partitioners non-sync, so that each async task has its own partitioner with its own mutable internal state. This is a relatively large refactoring, so I'll leave it to the next PR.
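To illustrate the direction described above, here is a minimal sketch of a round-robin partitioner that is owned by a single task and mutated through `&mut self` instead of a shared atomic counter. The names `LocalRoundRobin` and `PartitionId` are hypothetical and are not taken from this PR; the actual refactoring may look quite different.

```rust
/// Hypothetical stand-in for the crate's partition type (an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PartitionId(pub u32);

/// A round-robin partitioner meant to be owned by exactly one task.
/// Plain mutable state replaces the shared atomic counter, so the
/// sequence of assignments no longer depends on task scheduling.
pub struct LocalRoundRobin {
    next: u32,
    num_partitions: u32,
}

impl LocalRoundRobin {
    pub fn new(num_partitions: u32) -> Self {
        assert!(num_partitions > 0, "need at least one partition");
        Self { next: 0, num_partitions }
    }

    /// Returns the partition for the next item and advances the cursor.
    pub fn partition(&mut self) -> PartitionId {
        let id = PartitionId(self.next);
        self.next = (self.next + 1) % self.num_partitions;
        id
    }
}
```

With one instance per task, two tasks that process the same input produce the same assignment sequence regardless of how they are interleaved.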
Pull request overview
This PR adds support for partitioned KNN (K-Nearest Neighbors) spatial joins to handle object sides larger than available memory by splitting the object data into smaller partitions and maintaining nearest-so-far results across partitions.
Changes:
- Implements a KNN results merger that spills intermediate results to disk and merges them across partitions
- Adds round-robin and broadcast partitioners for distributing KNN join data
- Updates query methods to track and filter distances alongside join indices
- Adds fuzz tests to verify the correctness of the merger
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/probe/knn_results_merger.rs | New implementation of KNN results merger with spilling and merging logic |
| rust/sedona-spatial-join/src/stream.rs | Integration of KNN results merger into the spatial join stream processing |
| rust/sedona-spatial-join/src/utils/join_utils.rs | Extended filter function to handle distance filtering |
| rust/sedona-spatial-join/src/partitioning/round_robin.rs | New round-robin partitioner for even data distribution |
| rust/sedona-spatial-join/src/partitioning/broadcast.rs | New broadcast partitioner for probe side distribution |
| rust/sedona-spatial-join/src/prepare.rs | Updated to use new partitioners for KNN joins |
| rust/sedona-spatial-join/src/index/spatial_index.rs | Added distance tracking to KNN query methods |
| rust/sedona-spatial-join/src/exec.rs | Added comprehensive tests for partitioned KNN joins |
| rust/sedona-spatial-join/src/probe.rs | Module registration for KNN results merger |
| rust/sedona-spatial-join/src/partitioning.rs | Module registration for new partitioners |
This patch breaks the object side of KNN(query, object, K) into smaller partitions and spills the results of KNN queries to a set of spill files. For each partition, we merge the KNN query results with the nearest-so-far results obtained from the previously processed partitions. The global KNN result is produced after all partitions have been processed.
The tricky part, spilling and merging the nearest-so-far KNN query results, is implemented in knn_results_merger.rs. It has to handle many edge cases correctly, so we were very cautious when implementing it. Fuzz tests were added to verify the correctness of the KNN results merger.
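The core merge step can be sketched in memory as follows. This is an illustration only, not the code in this PR: `Neighbor` and `merge_top_k` are hypothetical names, spilling to disk is omitted, and ties are broken by object index purely to keep the output deterministic. The real merger may treat ties and other edge cases differently.

```rust
/// Hypothetical candidate: an object-side row index and its distance
/// to the query geometry (names are assumptions for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Neighbor {
    pub object_idx: usize,
    pub distance: f64,
}

/// Merges the nearest-so-far results for one query row with the
/// candidates found in a newly processed object partition, keeping
/// only the `k` closest neighbors.
pub fn merge_top_k(so_far: &[Neighbor], incoming: &[Neighbor], k: usize) -> Vec<Neighbor> {
    let mut merged: Vec<Neighbor> = so_far.iter().chain(incoming.iter()).copied().collect();
    // Sort by distance, then by index so that ties resolve deterministically.
    merged.sort_by(|a, b| {
        a.distance
            .total_cmp(&b.distance)
            .then(a.object_idx.cmp(&b.object_idx))
    });
    merged.truncate(k);
    merged
}
```

Applying `merge_top_k` once per object partition, with the previous output as `so_far`, yields the global K nearest neighbors after the last partition, because a neighbor dropped at any step can never re-enter the top K.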