Disable optimized spatial join and GeoParquet data pruning for geography types#57
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR disables optimized spatial join and GeoParquet data pruning for geography types to prevent incorrect query results. The optimization currently only works correctly with planar geometry types.
- Adds type checking to ensure spatial join optimizations only apply to geometry types, not geography types
- Disables GeoParquet data pruning for geography fields by removing bbox information from statistics
- Includes comprehensive tests to verify geography types are not optimized
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/optimizer.rs | Adds type validation to prevent spatial join optimization for geography types |
| rust/sedona-spatial-join/src/exec.rs | Updates test utilities and adds geography join test to verify optimization is disabled |
| rust/sedona-geoparquet/src/file_opener.rs | Disables data pruning for geography fields by filtering out bbox statistics |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for catching this!
A few things to consider now (or create an issue so we can follow up!)
| if is_prunable_geospatial_field(field) { | ||
| Ok(column_metadata.to_geo_statistics()) | ||
| } else { | ||
| // Bounding box based pruning does not work for geography fields, so we remove | ||
| // the bbox from statistics to ensure that they are not used for pruning. | ||
| Ok(column_metadata.to_geo_statistics().with_bbox(None)) | ||
| } |
There was a problem hiding this comment.
Should we handle this in literal_bounds() by returning a Full bounding box?
sedona-db/rust/sedona-expr/src/spatial_filter.rs
Lines 317 to 319 in 3e9b7a9
(Strictly speaking there's nothing wrong with the information in the file, it's just that we can't bound the literal)
There was a problem hiding this comment.
I found that we cannot return a full bounding box for geography. We are pushing down ST_Contains and ST_Covers predicates as SpatialFilter::Covers, which requires the bbox of the geospatial column to contain the query window. Returning a full bounding box for geography will skip everything in such cases.
I'll try a different approach. The modification will still go into spatial_filter.rs and keep the bboxes in geo-statistics retrieved from GeoParquet files intact.
| #[test] | ||
| fn test_is_spatial_predicate_supported() { | ||
| // Planar geometry field | ||
| let geom_field = WKB_GEOMETRY.to_storage_field("geom", false).unwrap(); | ||
| let schema = Arc::new(Schema::new(vec![geom_field.clone()])); |
There was a problem hiding this comment.
This would benefit from a high-level integration test in Python (i.e., make sure that a join with geography returns correct results). A join between "submodules/geoarrow-data/natural-earth/files/natural-earth_countries-geography_geo.parquet" and "submodules/geoarrow-data/natural-earth/files/natural-earth_countries-geography_geo.parquet" might be a good candidate.
There was a problem hiding this comment.
I tried to add this test case, but found that SedonaDB gives wildly different results from PostGIS. S2 is less likely to consider polygons that touch at boundaries as intersections. I can still add this Python test, but mark it as xfail.
|
I tried to use Here is the file level geo metadata of that Parquet file: {"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84 (CRS84)", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)", "id": {"authority": "EPSG", "code": 1166}}, {"name": "World Geodetic System 1984 (G730)", "id": {"authority": "EPSG", "code": 1152}}, {"name": "World Geodetic System 1984 (G873)", "id": {"authority": "EPSG", "code": 1153}}, {"name": "World Geodetic System 1984 (G1150)", "id": {"authority": "EPSG", "code": 1154}}, {"name": "World Geodetic System 1984 (G1674)", "id": {"authority": "EPSG", "code": 1155}}, {"name": "World Geodetic System 1984 (G1762)", "id": {"authority": "EPSG", "code": 1156}}, {"name": "World Geodetic System 1984 (G2139)", "id": {"authority": "EPSG", "code": 1309}}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}, {"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}]}, "scope": "Not known.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "OGC", "code": "CRS84"}}, "edges": "spherical"}}}The bbox field is: "bbox": {
"south_latitude": -90,
"west_longitude": -180,
"north_latitude": 90,
"east_longitude": 180
} |
2508be6 to
d7809f6
Compare
python/sedonadb/tests/test_sjoin.py
Outdated
| @pytest.mark.xfail(reason="https://github.com/apache/sedona-db/issues/63") | ||
| def test_spatial_join_geography(geoarrow_data): | ||
| eng_sedonadb = SedonaDB.create_or_skip() | ||
| eng_postgis = PostGIS.create_or_skip() | ||
|
|
||
| eng_sedonadb.create_table_parquet( | ||
| "sjoin_geog_all", | ||
| geoarrow_data | ||
| / "natural-earth/files/natural-earth_countries-geography_geo.parquet", | ||
| ) | ||
| test_data = eng_sedonadb.execute_and_collect( | ||
| "SELECT * FROM sjoin_geog_all LIMIT 100" | ||
| ) | ||
| eng_sedonadb.create_table_arrow("sjoin_geog", test_data) | ||
| eng_postgis.create_table_arrow("sjoin_geog", test_data) |
There was a problem hiding this comment.
I pasted this into the issue so we can track it there...in the meantime, we can either remove this or pick a less hard join (e.g., pick a small number of hard-coded or randomly generated points). This test is mostly just making sure that it works at all.
There was a problem hiding this comment.
I switched to using randomly generated data for this geography join test. The bounds of generated data are disjoint on planar surface but intersects the antimeridian on spherical surface. This distribution of data ensures that we won't pass the test when handling geography objects as planar geometry objects.
Thank you for tracking that down! I think that's the bbox of the CRS and not the top-level column metadata? (In any case there are a few battles before the behaviour of geography is sorted I think!) |
This PR fixes a regression introduced by #57, which added geography type checking to prevent optimized spatial joins from being used with unsupported geography types. The problematic patch incorrectly matches `left` and `right` expressions in `KNNPredicate` to query plans. Actually, how `left` and `right` maps to the original query plans depends on the value of `probe_side` field. This is quite misleading and may be addressed in a future PR by renaming `left` and `right` to `build` and `probe`. The function for checking if geospatial types are supported now returns `Result<bool>` to propagate errors during the field type extraction and matching process. This prevents silent matching failures due to bugs from happening.
As mentioned in #39, our GeoParquet data pruning optimization and spatial join optimization does not handle geography types correctly, so we'd better disable them to avoid generating incorrect query results.