feat(rust/sedona-geoparquet): Support geometry_columns option in read_parquet(..) to mark additional geometry columns #560
Conversation
I would love to hear feedback on the design and specification (in the PR writeup). Once we reach agreement on that, I will:
paleolimbot
left a comment
Thank you for this!
I took a look at the whole thing but I know you're still working so feel free to ignore comments that aren't in scope.
Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.
sedona-db/rust/sedona-geoparquet/src/metadata.rs
Lines 292 to 293 in 3f91e26
Exposing a HashMap<GeoParquetColumnMetadata> in the options is OK, too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.
```python
self,
table_paths: Union[str, Path, Iterable[str]],
options: Optional[Dict[str, Any]] = None,
geometry_columns: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,
```
I would probably make this just Optional[Mapping[str, Any]]. In Python a user can rather easily decode that from JSON if they need to.
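A minimal sketch of that suggestion (the helper name `encode_geometry_columns` is hypothetical, not part of the PR; only `geometry_columns` comes from the proposed API): the Python wrapper accepts any `Mapping[str, Any]` and serializes it to the JSON string the Rust binding would consume.

```python
import json
from typing import Any, Mapping, Optional

def encode_geometry_columns(geometry_columns: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Serialize a user-provided mapping to a JSON string before
    crossing into Rust (hypothetical helper for illustration)."""
    if geometry_columns is None:
        return None
    return json.dumps(dict(geometry_columns))

payload = encode_geometry_columns({"geom": {"encoding": "WKB", "crs": "EPSG:4326"}})
```

On the Rust side, serde_json can then deserialize that string in one call, which is what keeps the bindings code small.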
Suggested change:

```diff
  metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
- binary WKB columns as geometry columns. Supported keys:
+ binary WKB columns as geometry columns or correct metadata such
+ as the column CRS. Supported keys:
```
python/sedonadb/src/context.rs
Outdated
```rust
fn parse_geometry_columns<'py>(
    py: Python<'py>,
    geometry_columns: HashMap<String, PyObject>,
) -> Result<HashMap<String, GeoParquetColumnMetadata>, PySedonaError> {
```
I think this bit can be avoided by just passing a string at this point (i.e., in Python, use json.dumps() before passing to Rust).
python/sedonadb/src/context.rs
Outdated
```rust
py: Python<'py>,
table_paths: Vec<String>,
options: HashMap<String, PyObject>,
geometry_columns: Option<HashMap<String, PyObject>>,
```

Suggested change:

```diff
- geometry_columns: Option<HashMap<String, PyObject>>,
+ geometry_columns: Option<String>,
```
...I think JSON is the right format for this particular step (it reduces bindings code considerably!)
```python
src = tmp_path / "plain.parquet"
pq.write_table(table, src)

# Check metadata: geoparquet meatadata should not be available
```

Suggested change:

```diff
- # Check metadata: geoparquet meatadata should not be available
+ # Check metadata: geoparquet metadata should not be available
```
```python
out = tmp_path / "geo.parquet"
df.to_parquet(out)
metadata = pq.read_metadata(out).metadata
assert metadata is not None
geo = metadata.get(b"geo")
assert geo is not None
geo_metadata = json.loads(geo.decode("utf-8"))
print(json.dumps(geo_metadata, indent=2, sort_keys=True))
assert geo_metadata["columns"]["geom"]["crs"] == "EPSG:4326"
```
I think you can probably skip this bit of the test (verifying the geometry-ness and CRS of the input seems reasonable to me).
rust/sedona-schema/src/crs.rs
Outdated
```diff
  // Handle JSON strings "OGC:CRS84", "EPSG:4326", "{AUTH}:{CODE}" and "0"
- let crs = if LngLat::is_str_lnglat(crs_str) {
+ let crs = if crs_str == "OGC:CRS84" {
```
These changes should be reverted (there is >1 string that can represent lon/lat)
rust/sedona-schema/src/crs.rs
Outdated
```rust
    }
}

if let Some(number) = crs_value.as_number() {
```
This part is OK (but perhaps add a test)
It's a great idea to make the Rust internal API easier to use for different frontend bindings. WDYT:
Now the API looks like:

```rust
pub struct GeoParquetReadOptions<'a> {
    inner: ParquetReadOptions<'a>,
    table_options: Option<HashMap<String, String>>,
    // Keep it typed to make backend impl cleaner
    geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
}

impl GeoParquetReadOptions<'_> {
    // ...
    pub fn with_geometry_columns(
        mut self,
        // JSON config string like {"geo": {"encoding": "wkb"}}
        geometry_columns: String,
    ) -> Self {
        let parsed = parse_geometry_columns(geometry_columns);
        self.geometry_columns = Some(parsed);
        self
    }
}
```
Thank you for considering... sounds good to me!
This PR has been reworked; the TL;DR for the option semantics is:
Implementation/Key changes
paleolimbot
left a comment
Thank you!
This is great! My main remaining high-level comment is that I think we should just have the geometry_columns be a pure override and not attempt to merge anything. I think this is simpler but also potentially more useful (can be used as an escape hatch to read a wider variety of input that provides incomplete or incorrect information that is difficult to otherwise fix).
Note that there is some minor overlap with #561 ...I'd prefer to merge your PR first and I can handle whatever merge conflict may arise.
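To make the pure-override idea concrete, here is an illustrative sketch (column names and metadata values are invented for the example, and `apply_overrides` is a hypothetical helper): a user entry in `geometry_columns` replaces the file's column metadata wholesale, rather than being merged field by field.

```python
import json

# Column metadata as it might appear in a file's "geo" footer (invented values)
file_geo = {"geom": {"encoding": "WKB", "crs": "EPSG:3857", "geometry_types": ["Point"]}}

# User-supplied override for the same column
overrides = {"geom": {"encoding": "WKB", "crs": "OGC:CRS84"}}

def apply_overrides(file_columns, overrides):
    # Pure override: the user's entry replaces the file's entry entirely;
    # nothing from the original column metadata is merged back in.
    result = dict(file_columns)
    result.update(overrides)
    return result

merged = apply_overrides(file_geo, overrides)
print(json.dumps(merged, sort_keys=True))
```

Note how `geometry_types` from the file is dropped, which is exactly what lets the option act as an escape hatch for incorrect metadata.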
```rust
for metadata in &metadatas {
    if let Some(kv) = metadata.file_metadata().key_value_metadata() {
        for item in kv {
            if item.key != "geo" {
                continue;
            }
            if let Some(value) = &item.value {
                let this_geoparquet_metadata = GeoParquetMetadata::try_new(value)?;
```
Can we apply the column overrides here and eliminate the somewhat complicated logic below?
Note that in #561 this is simplified to just use try_from_parquet_metadata().
I think this should not be applicable after we change to the full overwrite semantics? This combining step only merges bbox and geometry types and rejects any other conflicts, and the overwriting step has to be done separately.
I can give this a try in the Parquet geometry/geography PR. I think we want to apply them before the call to try_update() so the overrides can be used to avoid an erroneous or difficult to work around conflict.
Thank you for the review — that makes sense. I've switched to pure override semantics. Just realized that for GeoParquet files with incorrect/missing metadata, overrides are applied as-is; for safety, adding validation support can be a follow-up.
Closes #530
Motivation
Today, converting legacy Parquet files that store geometry as raw WKB payloads inside BINARY columns into GeoParquet requires a full SQL rewrite pipeline. Users must explicitly parse WKB, assign a CRS, and reconstruct the geometry column before writing. This works, but an easier-to-use Python API would help:
This PR introduces a geometry_columns option on read_parquet() so legacy Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.
Proposed Python API
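A hedged sketch of what such a call could look like (the `geometry_columns` shape follows the PR writeup; the `sedonadb` call itself is left as a comment because the exact signature is what this PR is proposing). The WKB bytes for a point are hand-built so the example is self-contained:

```python
import json
import struct

# Hand-crafted WKB for POINT(1 2): little-endian byte-order flag (1),
# geometry type 1 (Point), then two f64 coordinates (21 bytes total).
wkb_point = struct.pack("<BIdd", 1, 1, 1.0, 2.0)

# The option maps a column name to GeoParquet-style column metadata,
# marking a plain BINARY column as WKB geometry with a known CRS.
geometry_columns = {"geom": {"encoding": "WKB", "crs": "EPSG:4326"}}

# Hypothetical usage, assuming a sedonadb context `sd` and a legacy file:
# df = sd.read_parquet("legacy.parquet", geometry_columns=geometry_columns)

print(len(wkb_point), json.dumps(geometry_columns))
```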
Demo
Specification
The key points:
crs, and we can use this API to provide more details
Key Changes
GeoParquetColumnMetadata struct
GeoParquet metadata as before; next, look at the options to add/override additional geometry columns