feat(rust/sedona-spatial-join) Spatial index supports async batch query and parallel refinement#523
Merged
Kontinuation merged 2 commits intoapache:mainfrom Jan 17, 2026
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces async batch query support for the spatial index to address performance bottlenecks during candidate refinement in spatial join operations, particularly for queries with large windows on dense datasets.
Changes:
- Added async batch query interface (
query_batch) to theSpatialIndexthat splits massive refinement workloads into smaller parallel tasks - Introduced parallel refinement capability with configurable chunk size to distribute candidate evaluation across multiple async tasks
- Modified
SpatialJoinStreamandSpatialJoinBatchIteratorto support asynchronous processing of probe batches
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/stream.rs | Refactored join stream to use async batch processing with futures and updated iterator to handle concurrent refinement |
| rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Added options field to SpatialIndex during construction |
| rust/sedona-spatial-join/src/index/spatial_index.rs | Implemented async query_batch method with parallel refinement support and comprehensive test coverage |
| rust/sedona-spatial-join/src/exec.rs | Added integration test for parallel refinement with large candidate sets |
| rust/sedona-common/src/option.rs | Added parallel_refinement_chunk_size configuration option |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Kontinuation
added a commit
that referenced
this pull request
Jan 20, 2026
… smaller ones (#525) This is a follow up of #523. When executing queries with large windows on dense datasets, each probe row may be matched with millions of indexed rows. If we don't break large result batches generated by such index probing, we'll easily overshoot the memory limit when assembling join result batches. This patch splits large joined build-probe side indices into smaller pieces and gradually assemble result batches. This will greatly reduce the amount of memory required for producing join results for "cover all" probe rows. The code for properly slicing join result indices for various join types is a bit complicated. We have added fuzz tests to verify that it works correctly. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR addresses performance bottlenecks (stragglers) observed during the candidate refinement phase of SpatialBench Q10 and Q11, particularly at higher scale factors (SF=100 and SF=1000).
When executing queries with large windows on dense datasets, a single R-Tree index query can retrieve millions of candidates. The probe partition becomes a "straggler" because it must sequentially evaluate spatial predicates for these millions of geometries. Since this bottleneck occurs within a single partition, DataFusion’s partition-level parallelism is unable to distribute the load.
This patch introduced an async batch query interface for SpatialIndex. This allows the engine to split massive refinement workloads into smaller tasks, which are then executed in parallel by an async runtime. This amortizes scheduling costs of async function calls and eliminates the single-partition bottleneck.
UPDATE: Running SpatialBench Q11 SF=10 locally took 31s after applying this patch (51s before). This optimization is indeed effective for targeted cases.