Skip to content

feat(rust/sedona-geoparquet): Support WKB validation in read_parquet()#578

Open
2010YOUY01 wants to merge 5 commits intoapache:mainfrom
2010YOUY01:read-parquet-validation
Open

feat(rust/sedona-geoparquet): Support WKB validation in read_parquet()#578
2010YOUY01 wants to merge 5 commits intoapache:mainfrom
2010YOUY01:read-parquet-validation

Conversation

@2010YOUY01
Copy link
Contributor

Follow up to #560

Rationale

read_parquet() now supports an override option, geometry_columns, to cast Binary columns to geometries. This is an unsafe operation, so adding a validation step is helpful.

This PR currently limits validation to WKB validity. In the long term, additional metadata properties can also be validated. If any entry fails validation, an error will be returned.

Geo columns are inferred from {(Geo)Parquet metadata, user-provided geometry_columns override options}. The validate option applies uniformly in all cases—if validate = true, geometry columns are always validated. I think this approach is more general and simpler.

Implementation

Propagate the validate option to GeoParquetFileOpener (which yields the final decoded batches as the data source output), and use it to optionally validate the decoded batches for WKB validity.

pub object_store: Arc<dyn ObjectStore>,
pub metadata_size_hint: Option<usize>,
pub predicate: Arc<dyn PhysicalExpr>,
pub predicate: Option<Arc<dyn PhysicalExpr>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor 1: before it's required since GeoParquetFileOpener is only used when there is geo column, and also there is valid spatial predicate, otherwise it fall back to inner parquet opener.

Now it also have to be used if validation is enabled, and there is no spatial predicate, so making it optional.

pub enable_pruning: bool,
pub metrics: GeoParquetFileOpenerMetrics,
pub overrides: Option<HashMap<String, GeoParquetColumnMetadata>>,
pub options: TableGeoParquetOptions,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor 2: before overrides lives inside TableGeoParquetOptions, the new validate flag also lives inside that option, so I think only keeping the option here can make it simpler.

metadata_size_hint: Option<usize>,
predicate: Option<Arc<dyn PhysicalExpr>>,
overrides: Option<HashMap<String, GeoParquetColumnMetadata>>,
options: TableGeoParquetOptions,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor 3: similar to another refactor in file_opener.rs

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few optional things to consider...thank you!

Just curious...have you run into files where this is a problem? I imagine if it does happen the existing error would be rather cryptic (although I believe we're good about validating WKB payload in every operator).

Comment on lines +318 to +323
def test_read_parquet_validate_wkb_single_valid_row(con, tmp_path):
valid_wkb = bytes.fromhex("0101000000000000000000F03F0000000000000040")

table = pa.table({"id": [1], "geom": [valid_wkb]})
path = tmp_path / "single_valid_wkb.parquet"
pq.write_table(table, path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, but there is a io/test_parquet.py that is possibly a better fit for these tests (and the geometry_columns tests) since there are now a lot of them!

Comment on lines +190 to +191
Supported validation properties:
- WKB encoding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Supported validation properties:
- WKB encoding
Currently the only property that is validated is the WKB of input geometry
columns.

Comment on lines +233 to +247
DataType::Binary => {
let array = as_binary_array(array)?;
for (row_index, maybe_wkb) in array.iter().enumerate() {
if let Some(wkb_bytes) = maybe_wkb {
if let Err(e) = wkb::reader::read_wkb(wkb_bytes) {
return exec_err!(
"WKB validation failed for column '{}' at row {}: {}",
column_name,
row_index,
e
);
}
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're here, it would be good to handle the LargeBinary case (this can happen for reasons outside a GeoParquet producer's control due to various auto-store/load of schemas)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants