chore(rust/sedona-spatial-join): Revamp memory reservation pattern for spatial join#534
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the memory reservation pattern for spatial joins by moving memory planning earlier in the build-side collection flow. The change eliminates the concurrent reservation wrapper and shifts from incremental memory tracking during probing to upfront estimation during collection.
Changes:
- Removed the
ConcurrentReservationwrapper and associated reservation logic during probe operations - Introduced upfront memory estimation for spatial index components (R-tree, refiner, KNN components)
- Added
estimate_max_memory_usagemethod to refiners for pre-allocation planning
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/utils/concurrent_reservation.rs | Removed entire concurrent reservation wrapper implementation |
| rust/sedona-spatial-join/src/utils.rs | Removed concurrent_reservation module export |
| rust/sedona-spatial-join/src/refine/tg.rs | Added estimate_max_memory_usage method for TG refiner |
| rust/sedona-spatial-join/src/refine/geos.rs | Added estimate_max_memory_usage method for GEOS refiner |
| rust/sedona-spatial-join/src/refine/geo.rs | Added estimate_max_memory_usage method for Geo refiner |
| rust/sedona-spatial-join/src/refine.rs | Added trait method for estimating maximum memory usage |
| rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Refactored to use upfront memory estimation and removed incremental reservation growth |
| rust/sedona-spatial-join/src/index/spatial_index.rs | Removed refiner reservation tracking during query operations |
| rust/sedona-spatial-join/src/index/knn_adapter.rs | Changed from reservation-based to estimation-based memory tracking |
| rust/sedona-spatial-join/src/index/build_side_collector.rs | Added upfront memory reservation based on estimated usage |
| rust/sedona-spatial-join/src/build_index.rs | Updated collector initialization to pass spatial predicate and options |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ad5e50b to
cc5dd4b
Compare
| // TODO: This is a rough estimate of the memory usage of the tg geometry and | ||
| // may not be accurate. | ||
| // https://github.com/apache/sedona-db/issues/281 | ||
| build_stats.total_size_bytes().unwrap_or(0) as usize * 2 |
There was a problem hiding this comment.
A tg geometry can be queried for its memory size. Is that relevant here?
There was a problem hiding this comment.
We prefer to estimate the memory needed without the overhead of creating tg geom for each geometry, so we calculated the estimated size instead of measure the real size.
The estimation here is accurate enough in practice. I have also run some experiments using tg_geom_size to obtain a better formula for more accurate estimation. I'll submit a patch to refine this formula separately.
This change shifts spatial join memory planning earlier in the build-side collection flow, simplifying reservation handling while improving reliability around DataFusion’s reservation behavior. We only reserve memory when collecting the build side batches for building (probably partitioned) spatial indexes and don't reserve memory when probing the index and producing result batches.
Note: reserving memory while producing batches can trigger DataFusion reservation failures (see https://github.com/apache/datafusion/issues/17334).