feat(rust/sedona-geoparquet): Support WKB validation in read_parquet()#578
feat(rust/sedona-geoparquet): Support WKB validation in read_parquet()#5782010YOUY01 wants to merge 5 commits intoapache:mainfrom
read_parquet()#578Conversation
| pub object_store: Arc<dyn ObjectStore>, | ||
| pub metadata_size_hint: Option<usize>, | ||
| pub predicate: Arc<dyn PhysicalExpr>, | ||
| pub predicate: Option<Arc<dyn PhysicalExpr>>, |
There was a problem hiding this comment.
refactor 1: before it's required since GeoParquetFileOpener is only used when there is geo column, and also there is valid spatial predicate, otherwise it fall back to inner parquet opener.
Now it also have to be used if validation is enabled, and there is no spatial predicate, so making it optional.
| pub enable_pruning: bool, | ||
| pub metrics: GeoParquetFileOpenerMetrics, | ||
| pub overrides: Option<HashMap<String, GeoParquetColumnMetadata>>, | ||
| pub options: TableGeoParquetOptions, |
There was a problem hiding this comment.
refactor 2: before overrides lives inside TableGeoParquetOptions, the new validate flag also lives inside that option, so I think only keeping the option here can make it simpler.
| metadata_size_hint: Option<usize>, | ||
| predicate: Option<Arc<dyn PhysicalExpr>>, | ||
| overrides: Option<HashMap<String, GeoParquetColumnMetadata>>, | ||
| options: TableGeoParquetOptions, |
There was a problem hiding this comment.
refactor 3: similar to another refactor in file_opener.rs
paleolimbot
left a comment
There was a problem hiding this comment.
A few optional things to consider...thank you!
Just curious...have you run into files where this is a problem? I imagine if it does happen the existing error would be rather cryptic (although I believe we're good about validating WKB payload in every operator).
| def test_read_parquet_validate_wkb_single_valid_row(con, tmp_path): | ||
| valid_wkb = bytes.fromhex("0101000000000000000000F03F0000000000000040") | ||
|
|
||
| table = pa.table({"id": [1], "geom": [valid_wkb]}) | ||
| path = tmp_path / "single_valid_wkb.parquet" | ||
| pq.write_table(table, path) |
There was a problem hiding this comment.
Optional, but there is a io/test_parquet.py that is possibly a better fit for these tests (and the geometry_columns tests) since there are now a lot of them!
| Supported validation properties: | ||
| - WKB encoding |
There was a problem hiding this comment.
| Supported validation properties: | |
| - WKB encoding | |
| Currently the only property that is validated is the WKB of input geometry | |
| columns. |
| DataType::Binary => { | ||
| let array = as_binary_array(array)?; | ||
| for (row_index, maybe_wkb) in array.iter().enumerate() { | ||
| if let Some(wkb_bytes) = maybe_wkb { | ||
| if let Err(e) = wkb::reader::read_wkb(wkb_bytes) { | ||
| return exec_err!( | ||
| "WKB validation failed for column '{}' at row {}: {}", | ||
| column_name, | ||
| row_index, | ||
| e | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
While we're here, it would be good to handle the LargeBinary case (this can happen for reasons outside a GeoParquet producer's control due to various auto-store/load of schemas)
Follow up to #560
Rationale
read_parquet()now supports an override option,geometry_columns, to cast Binary columns to geometries. This is an unsafe operation, so adding a validation step is helpful.This PR currently limits validation to WKB validity. In the long term, additional metadata properties can also be validated. If any entry fails validation, an error will be returned.
Geo columns are inferred from {(Geo)Parquet metadata, user-provided
geometry_columnsoverride options}. Thevalidateoption applies uniformly in all cases—ifvalidate = true, geometry columns are always validated. I think this approach is more general and simpler.Implementation
Propagate the
validateoption toGeoParquetFileOpener(which yields the final decoded batches as the data source output), and use it to optionally validate the decoded batches for WKB validity.