
feat(rust/sedona-geoparquet): Support geometry_columns option in read_parquet(..) to mark additional geometry columns #560

Merged
paleolimbot merged 10 commits into apache:main from 2010YOUY01:read-parquet-opt
Feb 3, 2026

Conversation

@2010YOUY01 (Contributor) commented Jan 29, 2026

Closes #530

Motivation

Today, converting legacy Parquet files that store geometry as raw WKB payloads inside BINARY columns into GeoParquet requires a full SQL rewrite pipeline. Users must explicitly parse WKB, assign CRS, and reconstruct the geometry column before writing:

# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32

df = sd.read_parquet("/data/geo_legacy.parquet")

df = df.to_view("t", overwrite=True)

df = sd.sql("""
  SELECT
    ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
    * EXCLUDE (geo_bin)
  FROM t
""")

df.to_parquet("geo_geoparquet.parquet")

This works, but it would be easier to have a Python API that simply says:

“Treat this binary column as a geometry column with encoding=WKB and CRS=EPSG:4326.”

This PR introduces a geometry_columns option on read_parquet() so legacy Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.


Proposed Python API

Demo

df = sd.read_parquet(
    "/data/geo_legacy.parquet",
    geometry_columns={
        "geo_bin": {
            "encoding": "WKB",
            "crs": 4326,
        }
    },
)

df.to_parquet("geo_geoparquet.parquet")

Specification

            geometry_columns: Optional mapping of column name to GeoParquet column
                metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
                binary WKB columns as geometry columns. Supported keys:
                - encoding: "WKB" (required)
                - crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326).
                  If not provided, the default CRS is OGC:CRS84
                  (https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means
                  the data in this column must be stored in longitude/latitude
                  based on the WGS84 datum.
                - edges: "planar" (default) or "spherical"
                Useful for:
                - Legacy Parquet files with Binary columns containing WKB payloads.
                - Overriding GeoParquet metadata when fields like `crs` are missing.
                Precedence:
                - If a column appears in both GeoParquet metadata and this option,
                  the geometry_columns entry takes precedence.
                Example:
                - For `geo.parquet(geo1: geometry, geo2: geometry, geo3: binary)`,
                  `read_parquet("geo.parquet", geometry_columns={"geo2": {...}, "geo3": ...})`
                  will override `geo2` metadata and treat `geo3` as a geometry column.
                Safety:
                - Columns specified here are not validated for WKB correctness.
                  Invalid WKB payloads may cause undefined behavior.

The key points:

  • The geometry columns specified in the option override what's already in the metadata. I think this can be useful when the metadata is missing some configuration such as crs; we can use this API to provide the missing details.
  • No validation for now; this can be done in a follow-on PR.

Key Changes

  1. Parse the Python option fields into the Rust GeoParquetColumnMetadata struct.
  2. In the schema inference step, first infer the schema from the GeoParquet metadata as before, then consult the option to add or override geometry columns (a minimal sketch of this combine step is shown below).
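
A minimal sketch of that combine step, with a hypothetical helper name and the column metadata treated as an opaque value type (V stands in for GeoParquetColumnMetadata); an entry supplied via geometry_columns wins over what the file declared:

use std::collections::HashMap;

// Hypothetical helper illustrating the combine step in infer_schema().
fn combine_geometry_columns<V: Clone>(
    inferred: HashMap<String, V>,   // parsed from the file's "geo" metadata
    overrides: &HashMap<String, V>, // parsed from the `geometry_columns` option
) -> HashMap<String, V> {
    let mut combined = inferred;
    for (name, meta) in overrides {
        // The option entry adds a new geometry column or replaces whatever
        // the file metadata declared for that column.
        combined.insert(name.clone(), meta.clone());
    }
    combined
}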

@2010YOUY01 (Contributor, Author) commented Jan 29, 2026

I would love to hear feedback on the design and specification (in the PR writeup).

Once we reach agreement on that, I will:

  • Add comprehensive tests
  • Polish the implementation — so far I have only validated the high-level structure; I haven’t yet reviewed the details (e.g. metadata parsing) carefully

2010YOUY01 marked this pull request as draft January 29, 2026 11:49
@paleolimbot (Member) left a comment


Thank you for this!

I took a look at the whole thing but I know you're still working so feel free to ignore comments that aren't in scope.

Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.

/// Metadata about geometry columns. Each key is the name of a geometry column in the table.
pub columns: HashMap<String, GeoParquetColumnMetadata>,

Exposing a HashMap<GeoParquetColumnMetadata> in the options is OK too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.
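
(A minimal sketch of that suggestion; ColumnMeta is a stand-in for GeoParquetColumnMetadata with field names from the GeoParquet spec, assuming the serde derive feature:)

use serde::Deserialize;
use std::collections::HashMap;

#[derive(Deserialize)]
struct ColumnMeta {
    // Stand-in for GeoParquetColumnMetadata.
    encoding: String,
    crs: Option<serde_json::Value>,
    edges: Option<String>,
}

// serde_json deserializes the whole option string in one call, so no
// hand-written parsing code is needed.
fn parse_geometry_columns(json: &str) -> serde_json::Result<HashMap<String, ColumnMeta>> {
    serde_json::from_str(json)
}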

self,
table_paths: Union[str, Path, Iterable[str]],
options: Optional[Dict[str, Any]] = None,
geometry_columns: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,
@paleolimbot (Member):

I would probably make this just Optional[Mapping[str, Any]]. In Python a user can rather easily decode that from JSON if they need to.

Comment on lines 139 to 140
metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
binary WKB columns as geometry columns. Supported keys:
@paleolimbot (Member):

Suggested change
- metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
- binary WKB columns as geometry columns. Supported keys:
+ metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
+ binary WKB columns as geometry columns or correct metadata such
+ as the column CRS. Supported keys:

Comment on lines 37 to 40
fn parse_geometry_columns<'py>(
py: Python<'py>,
geometry_columns: HashMap<String, PyObject>,
) -> Result<HashMap<String, GeoParquetColumnMetadata>, PySedonaError> {
@paleolimbot (Member):

I think this bit can be avoided by just passing a string at this point (i.e., in Python, use json.dumps() before passing to Rust).

py: Python<'py>,
table_paths: Vec<String>,
options: HashMap<String, PyObject>,
geometry_columns: Option<HashMap<String, PyObject>>,
@paleolimbot (Member):

Suggested change
- geometry_columns: Option<HashMap<String, PyObject>>,
+ geometry_columns: Option<String>,

...I think JSON is the right format for this particular step (it reduces bindings code considerably!)
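
(Sketched below with a hypothetical helper name: the binding receives the json.dumps() output as a plain String and decodes it in one serde_json call:)

use std::collections::HashMap;

// Hypothetical binding-side helper: Python sends json.dumps(geometry_columns)
// and the Rust side parses the string with serde_json.
fn decode_geometry_columns(
    geometry_columns: Option<String>,
) -> Result<Option<HashMap<String, serde_json::Value>>, serde_json::Error> {
    geometry_columns
        .map(|json| serde_json::from_str(&json))
        .transpose()
}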

src = tmp_path / "plain.parquet"
pq.write_table(table, src)

# Check metadata: geoparquet meatadata should not be available
@paleolimbot (Member):

Suggested change
- # Check metadata: geoparquet meatadata should not be available
+ # Check metadata: geoparquet metadata should not be available

Comment on lines 128 to 136
out = tmp_path / "geo.parquet"
df.to_parquet(out)
metadata = pq.read_metadata(out).metadata
assert metadata is not None
geo = metadata.get(b"geo")
assert geo is not None
geo_metadata = json.loads(geo.decode("utf-8"))
print(json.dumps(geo_metadata, indent=2, sort_keys=True))
assert geo_metadata["columns"]["geom"]["crs"] == "EPSG:4326"
@paleolimbot (Member):

I think you can probably skip this bit of the test (verifying the geometry-ness and CRS of the input seems reasonable to me).


// Handle JSON strings "OGC:CRS84", "EPSG:4326", "{AUTH}:{CODE}" and "0"
- let crs = if LngLat::is_str_lnglat(crs_str) {
+ let crs = if crs_str == "OGC:CRS84" {
@paleolimbot (Member):

These changes should be reverted (there is >1 string that can represent lon/lat)

}
}

if let Some(number) = crs_value.as_number() {
@paleolimbot (Member):

This part is OK (but perhaps add a test)

@2010YOUY01 (Contributor, Author) commented Jan 30, 2026

Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.

/// Metadata about geometry columns. Each key is the name of a geometry column in the table.
pub columns: HashMap<String, GeoParquetColumnMetadata>,

Exposing a HashMap<GeoParquetColumnMetadata> in the options is OK too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.

It's a great idea to make the Rust internal API easier to use from different frontend bindings. WDYT:

  • The builder of the Rust struct GeoParquetReadOptions takes a JSON string for the option, to make it easier to use
  • Inside GeoParquetReadOptions, keep a typed/parsed field for geometry_columns, to make the Rust backend implementation cleaner

Now the API looks like:

pub struct GeoParquetReadOptions<'a> {
    inner: ParquetReadOptions<'a>,
    table_options: Option<HashMap<String, String>>,
    // Keep it typed to make backend impl cleaner
    geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
}

impl<'a> GeoParquetReadOptions<'a> {
    // ...
    pub fn with_geometry_columns(
        mut self,
        // JSON config string like r#"{"geo": {"encoding": "wkb"}}"#
        geometry_columns: String,
    ) -> Self {
        let parsed = parse_geometry_columns(geometry_columns);
        self.geometry_columns = Some(parsed);
        self
    }
}
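
A hypothetical usage of this builder (assuming GeoParquetReadOptions implements Default; the raw string is just the documented JSON shape):

let options = GeoParquetReadOptions::default()
    // Mark the legacy binary column as WKB geometry in EPSG:4326.
    .with_geometry_columns(r#"{"geo_bin": {"encoding": "WKB", "crs": "EPSG:4326"}}"#.to_string());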

@paleolimbot (Member) commented:
Thank you for considering...sounds good to me!

@2010YOUY01 (Contributor, Author) commented Feb 2, 2026

This PR has been reworked; the TL;DR for the option semantics is:

  1. For a regular Parquet file with a binary column (physically WKB-encoded), use this option to mark the binary column as geometry:
sd.read_parquet(
    "geo_legacy.parquet",
    geometry_columns={
        "geometry": {"encoding": "WKB", "crs": "EPSG:4326", "edges": "planar"}
    },
)
  2. If a column is already a geometry column (inferred from the Parquet metadata), this option can be used to provide an optional but missing field; if a field is already inferred from the metadata and set again via the option, an error occurs. This feels safer to me, but I'm open to other opinions:
# Inferred option from metadata:
#     {"encoding": "WKB"} # "crs" is missing

# Provided 'crs' option from `geometry_columns` is allowed
sd.read_parquet(
    "geo.parquet",
    geometry_columns={
        "geometry": {"crs": "EPSG:4326"}
    },
)
# Now 'geometry' column is a geometry column with crs=4326
# Inferred option from metadata:
#     {"encoding": "WKB", "crs": "EPSG:4326"}

# Not allowed to provide option that is already inferred from metadata
sd.read_parquet(
    "geo.parquet",
    geometry_columns={
        "geometry": {"crs": "EPSG:3857"}
    },
)
# Errors...

Implementation/Key changes

(existing)
geoparquet metadata --> (per col) GeoParquetColumnMetadata --> schema

(PR)
geoparquet metadata --> (per col) GeoParquetColumnMetadata ----+
                                                               | (combine)
                                                               |
user option geometry_columns --> GeoParquetColumnMetadata -----+--> schema
  1. Parse the option with serde_json::from_str, the same as the Parquet metadata, and store the column options inside GeoParquetFormat -> TableGeoParquetOption, since the FileFormat trait is used to build the schema. When infer_schema() is called, combine the GeoParquetColumnMetadata from both the metadata and the geometry_columns option.
  2. Refactor GeoParquetColumnMetadata to make its encoding field optional. Since this is a required field in the GeoParquet spec, assertions are added to the existing deserializer to ensure it exists in the Parquet metadata (a sketch follows below).
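
A minimal sketch of change (2), using a stand-in struct: encoding becomes optional so overrides may omit it, while the path that parses the file's "geo" metadata still rejects its absence (the function name is hypothetical):

use serde::Deserialize;

#[derive(Deserialize)]
struct ColumnMeta {
    // Optional so a `geometry_columns` override may omit it; the GeoParquet
    // spec still requires it in file metadata.
    encoding: Option<String>,
    crs: Option<serde_json::Value>,
}

// Hypothetical parse path for the file's "geo" metadata.
fn column_meta_from_file(json: &str) -> Result<ColumnMeta, String> {
    let meta: ColumnMeta = serde_json::from_str(json).map_err(|e| e.to_string())?;
    if meta.encoding.is_none() {
        return Err("GeoParquet column metadata is missing required field 'encoding'".into());
    }
    Ok(meta)
}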

2010YOUY01 marked this pull request as ready for review February 2, 2026 15:09
@paleolimbot (Member) left a comment


Thank you!

This is great! My main remaining high-level comment is that I think we should just have the geometry_columns be a pure override and not attempt to merge anything. I think this is simpler but also potentially more useful (can be used as an escape hatch to read a wider variety of input that provides incomplete or incorrect information that is difficult to otherwise fix).

Note that there is some minor overlap with #561... I'd prefer to merge your PR first and I can handle whatever merge conflict may arise.

Comment on lines 322 to 329
for metadata in &metadatas {
if let Some(kv) = metadata.file_metadata().key_value_metadata() {
for item in kv {
if item.key != "geo" {
continue;
}
if let Some(value) = &item.value {
let this_geoparquet_metadata = GeoParquetMetadata::try_new(value)?;
@paleolimbot (Member):

Can we apply the column overrides here and eliminate the somewhat complicated logic below?

Note that in #561 this is simplified to just use try_from_parquet_metadata().

@2010YOUY01 (Contributor, Author):

I think this is no longer applicable after we change to the full-override semantics? This combining step only merges bbox and geometry types and rejects any other conflicts, so the overriding step has to be done separately.

@paleolimbot (Member):

I can give this a try in the Parquet geometry/geography PR. I think we want to apply them before the call to try_update() so the overrides can be used to avoid an erroneous or difficult to work around conflict.

@2010YOUY01 (Contributor, Author):
Thank you!

This is great! My main remaining high-level comment is that I think we should just have the geometry_columns be a pure override and not attempt to merge anything. I think this is simpler but also potentially more useful (can be used as an escape hatch to read a wider variety of input that provides incomplete or incorrect information that is difficult to otherwise fix).

Note that there is some minor overlap with #561 ...I'd prefer to merge your PR first and I can handle whatever merge conflict may arise.

Thank you for the review — that makes sense. I've switched to pure override semantics. I just realized that for GeoParquet files with incorrect or missing metadata, SedonaDB is likely the tool used to fix them, so we need full overrides here.

For safety, adding validation support can be a follow-up.
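
As a rough illustration only (not part of this PR), a follow-up validation could start with a cheap WKB header check; the function name is hypothetical and this does not fully parse the geometry:

// Cheap sanity check on a WKB payload's 5-byte header.
fn wkb_header_looks_valid(buf: &[u8]) -> bool {
    if buf.len() < 5 {
        return false;
    }
    // Byte 0 is the byte order: 0 = big endian, 1 = little endian.
    let le = match buf[0] {
        0 => false,
        1 => true,
        _ => return false,
    };
    let code = if le {
        u32::from_le_bytes([buf[1], buf[2], buf[3], buf[4]])
    } else {
        u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]])
    };
    // Base geometry types are 1..=7; ISO Z/M/ZM variants add 1000/2000/3000.
    code < 4000 && matches!(code % 1000, 1..=7)
}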

@paleolimbot (Member) left a comment


Thank you!


paleolimbot merged commit 6c1fc02 into apache:main Feb 3, 2026
15 checks passed
Development

Successfully merging this pull request may close these issues:

Python API to cast binary columns to WKB columns