Skip to content

feat(rust/sedona-spatial-join): Support spatial partitioned spatial join for non-knn predicates#563

Merged
Kontinuation merged 4 commits intoapache:mainfrom
Kontinuation:pr-partitioned-sj
Feb 4, 2026
Merged

feat(rust/sedona-spatial-join): Support spatial partitioned spatial join for non-knn predicates#563
Kontinuation merged 4 commits intoapache:mainfrom
Kontinuation:pr-partitioned-sj

Conversation

@Kontinuation
Copy link
Member

This patch adds a probe-side partitioned stream provider for repartitioning the probe side, and create probe streams for specified partitions. This partitioned stream provider is integrated into the spatial join execution flow to support larger-than-memory dataset by breaking the data into smaller partitions, where each partition could be fully loaded into memory.

Currently only non-knn joins are supported. We'll add larger-than-memory KNN join support in subsequent patches.

@Kontinuation Kontinuation requested a review from Copilot January 30, 2026 18:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements spatial partitioned stream processing for non-KNN spatial joins, enabling joins to handle larger-than-memory datasets by partitioning data into smaller chunks that can be loaded into memory. The main additions include a partitioned probe stream provider, updated stream state management for processing multiple partitions sequentially, and bitmap tracking for deduplicating probe rows across partitions in outer joins.

Changes:

  • Added probe-side partitioned stream infrastructure to repartition and spill probe data
  • Extended spatial join stream state machine to iterate through multiple spatial partitions
  • Implemented bitmap tracking for probe-side Multi partition to handle right outer joins correctly

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
rust/sedona-spatial-join/src/utils/join_utils.rs Added functions for tracking visited probe rows across partitions and adjusting indices with bitmap information
rust/sedona-spatial-join/src/stream.rs Extended state machine to process multiple spatial partitions sequentially with proper bitmap tracking
rust/sedona-spatial-join/src/probe/partitioned_stream_provider.rs New probe stream provider for repartitioning and spilling probe data to disk
rust/sedona-spatial-join/src/probe/non_partitioned_stream.rs Wrapper stream for non-partitioned probe processing with metrics
rust/sedona-spatial-join/src/probe/first_pass_stream.rs Stream implementation for first-pass processing that splits and spills probe data
rust/sedona-spatial-join/src/probe.rs Module definition with shared probe metrics structure
rust/sedona-spatial-join/src/prepare.rs Refactored to create probe stream options and spatial partitioners for probe side
rust/sedona-spatial-join/src/operand_evaluator.rs Minor style improvement in conditional logic
rust/sedona-spatial-join/src/lib.rs Added probe module to library exports
rust/sedona-spatial-join/src/index/partitioned_index_provider.rs Exposed wait_for_index method for public use
rust/sedona-spatial-join/src/exec.rs Updated to pass session config and runtime environment to stream constructor

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Kontinuation Kontinuation marked this pull request as ready for review January 31, 2026 04:34
@Kontinuation Kontinuation changed the title WIP: feat(rust/sedona-spatial-join): Support spatial partitioned spatial join for non-knn predicates feat(rust/sedona-spatial-join): Support spatial partitioned spatial join for non-knn predicates Jan 31, 2026
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 🚀 !

Comment on lines 367 to 372
fn point_wkb(x: f64, y: f64) -> Vec<u8> {
let mut buf = vec![1u8, 1, 0, 0, 0];
buf.extend_from_slice(&x.to_le_bytes());
buf.extend_from_slice(&y.to_le_bytes());
buf
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test helper appears in many of the test files and I believe sedona_geometry::wkb_factory has it as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all point_wkb functions and replaced them using wkb_point.

Comment on lines +374 to +379
fn rect_wkb(min_x: f64, min_y: f64, max_x: f64, max_y: f64) -> Vec<u8> {
let mut buf = Vec::with_capacity(1 + 4 + 4 + 4 + 5 * 16);
buf.push(1u8);
buf.extend_from_slice(&3u32.to_le_bytes());
buf.extend_from_slice(&1u32.to_le_bytes());
buf.extend_from_slice(&5u32.to_le_bytes());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one also appears in at least one other test file and if it isn't in sedona_geometry it certainly could be added.

Feel free to open a follow-up for this (deduplicate test utilities in sedona-spatial join, or something)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a wkb_rect in wkb_factory. This could be useful in lots of places. I'll submit a subsequent PR for this and do some cleaning.

Comment on lines +119 to +120
left_indices: UInt64Array,
right_indices: UInt32Array,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that left and right indices are not the same type? (Are left indices global and right indices local to a batch?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. This is a variant of DataFusion's append_right_indices, the types of left_indices and right_indices were unchanged.

Comment on lines +147 to +150
let unmatched_count = adjust_range
.clone()
.filter(|&i| !bitmap.get_bit(i + offset))
.count();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this is a bottleneck, but calling get_bit() in a loop is not particularly fast. I am not sure if arrow-rs provides an optimized iterator for this operation (in Arrow C++ there are a few options like VisitSetBitRuns that do a better job).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately Arrow's BooleanBufferBuilder does not provide optimized methods for iterating over bit ranges, other join_utils code inherited from DataFusion also did this, so I sticked to using BooleanBufferBuilder for visited bitset to be consistent with the rest of the code.

It didn't show up as a performance bottleneck when running outer joins before, perhaps the other parts of the join is far more heavy weight than bitmap traversal.

@Kontinuation Kontinuation merged commit 7a1df55 into apache:main Feb 4, 2026
15 checks passed
Kontinuation added a commit that referenced this pull request Feb 4, 2026
… their own rect_wkb functions (#572)

This is a follow up of #563 (comment). Rect WKBs are useful in various places so we'd better define a utility function for creating them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants