Disable optimized spatial join and GeoParquet data pruning for geography types #57

Merged
Kontinuation merged 8 commits into apache:main from Kontinuation:disable-sj-and-pruning-for-geog
Sep 12, 2025

Conversation

@Kontinuation
Member

As mentioned in #39, our GeoParquet data pruning and spatial join optimizations do not handle geography types correctly, so we should disable them for geography to avoid producing incorrect query results.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR disables optimized spatial join and GeoParquet data pruning for geography types to prevent incorrect query results. The optimization currently only works correctly with planar geometry types.

  • Adds type checking to ensure spatial join optimizations only apply to geometry types, not geography types
  • Disables GeoParquet data pruning for geography fields by removing bbox information from statistics
  • Includes comprehensive tests to verify geography types are not optimized

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| rust/sedona-spatial-join/src/optimizer.rs | Adds type validation to prevent spatial join optimization for geography types |
| rust/sedona-spatial-join/src/exec.rs | Updates test utilities and adds geography join test to verify optimization is disabled |
| rust/sedona-geoparquet/src/file_opener.rs | Disables data pruning for geography fields by filtering out bbox statistics |

@Kontinuation Kontinuation marked this pull request as ready for review September 11, 2025 11:56
Member

@paleolimbot paleolimbot left a comment

Thank you for catching this!

A few things to consider now (or create an issue so we can follow up!)

Comment on lines 203 to 209
if is_prunable_geospatial_field(field) {
    Ok(column_metadata.to_geo_statistics())
} else {
    // Bounding box based pruning does not work for geography fields, so we remove
    // the bbox from statistics to ensure that they are not used for pruning.
    Ok(column_metadata.to_geo_statistics().with_bbox(None))
}
Member

Should we handle this in literal_bounds() by returning a Full bounding box?

fn literal_bounds(literal: &Literal) -> Result<BoundingBox> {
    let literal_field = literal.return_field(&Schema::empty())?;
    let sedona_type = SedonaType::from_storage_field(&literal_field)?;

(Strictly speaking there's nothing wrong with the information in the file, it's just that we can't bound the literal)

Member Author

Good idea.

Member Author

I found that we cannot return a full bounding box for geography. We are pushing down ST_Contains and ST_Covers predicates as SpatialFilter::Covers, which requires the bbox of the geospatial column to contain the query window. Returning a full bounding box for geography will skip everything in such cases.

I'll try a different approach. The modification will still go into spatial_filter.rs and keep the bboxes in geo-statistics retrieved from GeoParquet files intact.
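To make the problem with a full bounding box concrete, here is a minimal sketch of bbox-based pruning. The `BBox`, `Filter`, and `can_skip` names are hypothetical stand-ins invented for this illustration, not SedonaDB's actual `BoundingBox` or `SpatialFilter` types, and the real pruning logic may differ. Under these assumptions, a full query box is harmless for an intersects-style filter but makes a covers-style filter skip every file.

```rust
/// Hypothetical planar bounding box (an assumption for illustration,
/// not SedonaDB's actual `BoundingBox` type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub xmin: f64,
    pub ymin: f64,
    pub xmax: f64,
    pub ymax: f64,
}

impl BBox {
    /// A "full" box that spans every coordinate.
    pub fn full() -> Self {
        BBox {
            xmin: f64::NEG_INFINITY,
            ymin: f64::NEG_INFINITY,
            xmax: f64::INFINITY,
            ymax: f64::INFINITY,
        }
    }

    /// True when the two boxes overlap or touch.
    pub fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }

    /// True when `other` lies entirely inside `self`.
    pub fn contains(&self, other: &BBox) -> bool {
        self.xmin <= other.xmin
            && self.ymin <= other.ymin
            && self.xmax >= other.xmax
            && self.ymax >= other.ymax
    }
}

/// Hypothetical stand-in for the two kinds of pushed-down filter.
pub enum Filter {
    Intersects(BBox),
    Covers(BBox),
}

/// Returns true when a file whose column bbox is `stats` can be skipped.
pub fn can_skip(filter: &Filter, stats: &BBox) -> bool {
    match filter {
        // Skip only if the column bbox misses the query box entirely.
        Filter::Intersects(query) => !stats.intersects(query),
        // Skip unless the column bbox contains the whole query box.
        Filter::Covers(query) => !stats.contains(query),
    }
}
```

With a column bbox of (-10, -10, 10, 10), `can_skip(&Filter::Intersects(BBox::full()), &stats)` is false, so nothing is pruned, while `can_skip(&Filter::Covers(BBox::full()), &stats)` is true because no finite column bbox can contain the full box.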

Comment on lines +2470 to +2474
#[test]
fn test_is_spatial_predicate_supported() {
    // Planar geometry field
    let geom_field = WKB_GEOMETRY.to_storage_field("geom", false).unwrap();
    let schema = Arc::new(Schema::new(vec![geom_field.clone()]));
Member

This would benefit from a high-level integration test in Python (i.e., make sure that a join with geography returns correct results). A join between "submodules/geoarrow-data/natural-earth/files/natural-earth_countries-geography_geo.parquet" and "submodules/geoarrow-data/natural-earth/files/natural-earth_countries-geography_geo.parquet" might be a good candidate.

Member Author

I tried to add this test case, but found that SedonaDB gives wildly different results from PostGIS. S2 is less likely to treat polygons that touch at their boundaries as intersecting. I can still add this Python test, but mark it as xfail.

@Kontinuation
Member Author

I tried to use submodules/geoarrow-data/natural-earth/files/natural-earth_countries-geography_geo.parquet to run the GeoParquet data pruning test, but found that the bbox of this file cannot be parsed because it does not conform to the standard.

Here is the file level geo metadata of that Parquet file:

{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84 (CRS84)", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)", "id": {"authority": "EPSG", "code": 1166}}, {"name": "World Geodetic System 1984 (G730)", "id": {"authority": "EPSG", "code": 1152}}, {"name": "World Geodetic System 1984 (G873)", "id": {"authority": "EPSG", "code": 1153}}, {"name": "World Geodetic System 1984 (G1150)", "id": {"authority": "EPSG", "code": 1154}}, {"name": "World Geodetic System 1984 (G1674)", "id": {"authority": "EPSG", "code": 1155}}, {"name": "World Geodetic System 1984 (G1762)", "id": {"authority": "EPSG", "code": 1156}}, {"name": "World Geodetic System 1984 (G2139)", "id": {"authority": "EPSG", "code": 1309}}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}, {"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}]}, "scope": "Not known.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "OGC", "code": "CRS84"}}, "edges": "spherical"}}}

The bbox field is:

"bbox": {
  "south_latitude": -90,
  "west_longitude": -180,
  "north_latitude": 90,
  "east_longitude": 180
}

@Kontinuation Kontinuation force-pushed the disable-sj-and-pruning-for-geog branch from 2508be6 to d7809f6 Compare September 11, 2025 16:03
Comment on lines 81 to 95
@pytest.mark.xfail(reason="https://github.com/apache/sedona-db/issues/63")
def test_spatial_join_geography(geoarrow_data):
    eng_sedonadb = SedonaDB.create_or_skip()
    eng_postgis = PostGIS.create_or_skip()

    eng_sedonadb.create_table_parquet(
        "sjoin_geog_all",
        geoarrow_data
        / "natural-earth/files/natural-earth_countries-geography_geo.parquet",
    )
    test_data = eng_sedonadb.execute_and_collect(
        "SELECT * FROM sjoin_geog_all LIMIT 100"
    )
    eng_sedonadb.create_table_arrow("sjoin_geog", test_data)
    eng_postgis.create_table_arrow("sjoin_geog", test_data)
Member

I pasted this into the issue so we can track it there...in the meantime, we can either remove this or pick a less hard join (e.g., pick a small number of hard-coded or randomly generated points). This test is mostly just making sure that it works at all.

Member Author

I switched to using randomly generated data for this geography join test. The bounds of the generated data are disjoint on a planar surface but intersect the antimeridian on a spherical surface. This distribution ensures that the test will not pass if geography objects are handled as planar geometry objects.
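As a rough illustration of why data near the antimeridian separates planar from spherical handling, the sketch below compares the longitude extent of an edge under both interpretations. It is not the data generator used in the test. The function names are hypothetical, longitudes are assumed to be degrees in [-180, 180], and the spherical case assumes the edge follows the shorter arc between its endpoints.

```rust
/// Planar longitude extent of an edge: the min and max of its endpoints.
pub fn planar_lon_range(a: f64, b: f64) -> (f64, f64) {
    (a.min(b), a.max(b))
}

/// True if `lon` falls inside the planar extent of the edge from `a` to `b`.
pub fn planar_covers_lon(a: f64, b: f64, lon: f64) -> bool {
    let (lo, hi) = planar_lon_range(a, b);
    lo <= lon && lon <= hi
}

/// True if `lon` falls on the shorter arc between `a` and `b`,
/// which wraps across the antimeridian when the endpoints are
/// more than 180 degrees apart in longitude.
pub fn spherical_covers_lon(a: f64, b: f64, lon: f64) -> bool {
    let (lo, hi) = planar_lon_range(a, b);
    if hi - lo <= 180.0 {
        // The shorter arc does not wrap.
        lo <= lon && lon <= hi
    } else {
        // The shorter arc wraps across +/-180 degrees.
        lon >= hi || lon <= lo
    }
}
```

For an edge from longitude 170 to -170, a point at 179.5 lies outside the planar extent but on the spherical arc, while a point at 0 lies inside the planar extent but off the arc. A join that treats such geography as planar would therefore report both false negatives and false positives.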

@paleolimbot
Member

> The bbox field is:

Thank you for tracking that down!

I think that's the bbox of the CRS and not the top-level column metadata? (In any case there are a few battles before the behaviour of geography is sorted I think!)

Member

@paleolimbot paleolimbot left a comment

Thank you!

@Kontinuation Kontinuation merged commit df1e0b7 into apache:main Sep 12, 2025
12 checks passed
Kontinuation added a commit that referenced this pull request Sep 16, 2025
This PR fixes a regression introduced by #57, which added geography type checking to prevent optimized spatial joins from being used with unsupported geography types.

The problematic patch incorrectly matches the `left` and `right` expressions in `KNNPredicate` to query plans. In fact, how `left` and `right` map to the original query plans depends on the value of the `probe_side` field. This naming is misleading and may be addressed in a future PR by renaming `left` and `right` to `build` and `probe`.

The function that checks whether geospatial types are supported now returns `Result<bool>` so that errors raised while extracting and matching field types are propagated. This keeps bugs in the matching logic from silently producing a wrong answer.
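The shape of that fix can be sketched as follows. Everything here is hypothetical: `PlanSchema`, `KnnPredicate`, `Side`, and `is_knn_supported` are simplified stand-ins rather than the actual SedonaDB types, and the sketch assumes, purely for illustration, that `left` is resolved against the probe-side plan and `right` against the build-side plan. The point is only that the plan used to resolve each expression depends on `probe_side`, and that a failed lookup surfaces as an error instead of a silent `false`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EdgeType {
    Planar,
    Spherical,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Side {
    Left,
    Right,
}

/// Hypothetical schema of one join input: column name and edge type.
pub struct PlanSchema {
    pub columns: Vec<(String, EdgeType)>,
}

impl PlanSchema {
    /// Looks up a column, returning an error when it is missing.
    pub fn edge_type(&self, column: &str) -> Result<EdgeType, String> {
        self.columns
            .iter()
            .find(|c| c.0 == column)
            .map(|c| c.1)
            .ok_or_else(|| format!("column '{}' not found in plan schema", column))
    }
}

/// Hypothetical, simplified KNN predicate.
pub struct KnnPredicate {
    pub left: String,
    pub right: String,
    pub probe_side: Side,
}

/// Returns Ok(true) only when both sides are planar geometry.
pub fn is_knn_supported(
    pred: &KnnPredicate,
    left_plan: &PlanSchema,
    right_plan: &PlanSchema,
) -> Result<bool, String> {
    // ASSUMPTION for this sketch: `left` refers to the probe-side plan
    // and `right` to the build-side plan.
    let (probe_plan, build_plan) = match pred.probe_side {
        Side::Left => (left_plan, right_plan),
        Side::Right => (right_plan, left_plan),
    };
    // `?` propagates lookup failures instead of hiding them.
    let probe_type = probe_plan.edge_type(&pred.left)?;
    let build_type = build_plan.edge_type(&pred.right)?;
    Ok(probe_type == EdgeType::Planar && build_type == EdgeType::Planar)
}
```

In this sketch, resolving the expressions against the wrong plans yields an `Err` rather than a quiet "unsupported" answer, which is the behavior the `Result<bool>` return type is meant to allow.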