feat(rust/sedona-geoparquet): Support geometry_columns option in read_parquet(..) to mark additional geometry columns #560
Conversation
I would love to hear feedback on the design and specification (in the PR writeup). Once we reach agreement on that, I will:
paleolimbot
left a comment
Thank you for this!
I took a look at the whole thing but I know you're still working so feel free to ignore comments that aren't in scope.
Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.
sedona-db/rust/sedona-geoparquet/src/metadata.rs
Lines 292 to 293 in 3f91e26
Exposing a HashMap<GeoParquetColumnMetadata> in the options is OK, too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.
```python
self,
table_paths: Union[str, Path, Iterable[str]],
options: Optional[Dict[str, Any]] = None,
geometry_columns: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,
```
I would probably make this just Optional[Mapping[str, Any]]. In Python a user can rather easily decode that from JSON if they need to.
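A minimal sketch of that suggestion (the helper name `encode_geometry_columns` is hypothetical, not part of the PR; only `geometry_columns` comes from the proposed API): the Python wrapper accepts any `Mapping[str, Any]` and serializes it to the JSON string the Rust binding would consume.

```python
import json
from typing import Any, Mapping, Optional

def encode_geometry_columns(geometry_columns: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Serialize a user-provided mapping to a JSON string before
    crossing into Rust (hypothetical helper for illustration)."""
    if geometry_columns is None:
        return None
    return json.dumps(dict(geometry_columns))

payload = encode_geometry_columns({"geom": {"encoding": "WKB", "crs": "EPSG:4326"}})
```

On the Rust side, serde_json can then deserialize that string in one call, which is what keeps the bindings code small.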
Suggested change:

```diff
  metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
- binary WKB columns as geometry columns. Supported keys:
+ binary WKB columns as geometry columns or correct metadata such
+ as the column CRS. Supported keys:
```
python/sedonadb/src/context.rs
Outdated
```rust
fn parse_geometry_columns<'py>(
    py: Python<'py>,
    geometry_columns: HashMap<String, PyObject>,
) -> Result<HashMap<String, GeoParquetColumnMetadata>, PySedonaError> {
```
I think this bit can be avoided by just passing a string at this point (i.e., in Python, use json.dumps() before passing to Rust).
python/sedonadb/src/context.rs
Outdated
```rust
py: Python<'py>,
table_paths: Vec<String>,
options: HashMap<String, PyObject>,
geometry_columns: Option<HashMap<String, PyObject>>,
```

Suggested change:

```diff
- geometry_columns: Option<HashMap<String, PyObject>>,
+ geometry_columns: Option<String>,
```
...I think JSON is the right format for this particular step (it reduces bindings code considerably!)
```python
src = tmp_path / "plain.parquet"
pq.write_table(table, src)

# Check metadata: geoparquet meatadata should not be available
```

Suggested change:

```diff
- # Check metadata: geoparquet meatadata should not be available
+ # Check metadata: geoparquet metadata should not be available
```
```python
out = tmp_path / "geo.parquet"
df.to_parquet(out)
metadata = pq.read_metadata(out).metadata
assert metadata is not None
geo = metadata.get(b"geo")
assert geo is not None
geo_metadata = json.loads(geo.decode("utf-8"))
print(json.dumps(geo_metadata, indent=2, sort_keys=True))
assert geo_metadata["columns"]["geom"]["crs"] == "EPSG:4326"
```
I think you can probably skip this bit of the test (verifying the geometry-ness and CRS of the input seems reasonable to me).
rust/sedona-schema/src/crs.rs
Outdated
```diff
  // Handle JSON strings "OGC:CRS84", "EPSG:4326", "{AUTH}:{CODE}" and "0"
- let crs = if LngLat::is_str_lnglat(crs_str) {
+ let crs = if crs_str == "OGC:CRS84" {
```
These changes should be reverted (there is >1 string that can represent lon/lat)
rust/sedona-schema/src/crs.rs
Outdated
```rust
    }
}

if let Some(number) = crs_value.as_number() {
```
This part is OK (but perhaps add a test)
It's a great idea to make the Rust internal API easier to use for different frontend bindings. WDYT:
Now the API looks like:

```rust
pub struct GeoParquetReadOptions<'a> {
    inner: ParquetReadOptions<'a>,
    table_options: Option<HashMap<String, String>>,
    // Keep it typed to make backend impl cleaner
    geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
}

impl GeoParquetReadOptions<'_> {
    // ...
    pub fn with_geometry_columns(
        mut self,
        // JSON config string like {"geo": {"encoding": "wkb"}}
        geometry_columns: String,
    ) -> Self {
        let parsed = parse_geometry_columns(geometry_columns);
        self.geometry_columns = Some(parsed);
        self
    }
}
```
Thank you for considering... sounds good to me!
This PR has been reworked; the TL;DR for the option semantics is:
Implementation/Key changes
paleolimbot
left a comment
Thank you!
This is great! My main remaining high-level comment is that I think we should just have the geometry_columns be a pure override and not attempt to merge anything. I think this is simpler but also potentially more useful (can be used as an escape hatch to read a wider variety of input that provides incomplete or incorrect information that is difficult to otherwise fix).
Note that there is some minor overlap with #561 ...I'd prefer to merge your PR first and I can handle whatever merge conflict may arise.
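To make the pure-override idea concrete, here is an illustrative sketch (column names and metadata values are invented for the example, and `apply_overrides` is a hypothetical helper): a user entry in `geometry_columns` replaces the file's column metadata wholesale, rather than being merged field by field.

```python
import json

# Column metadata as it might appear in a file's "geo" footer (invented values)
file_geo = {"geom": {"encoding": "WKB", "crs": "EPSG:3857", "geometry_types": ["Point"]}}

# User-supplied override for the same column
overrides = {"geom": {"encoding": "WKB", "crs": "OGC:CRS84"}}

def apply_overrides(file_columns, overrides):
    # Pure override: the user's entry replaces the file's entry entirely;
    # nothing from the original column metadata is merged back in.
    result = dict(file_columns)
    result.update(overrides)
    return result

merged = apply_overrides(file_geo, overrides)
print(json.dumps(merged, sort_keys=True))
```

Note how `geometry_types` from the file is dropped, which is exactly what lets the option act as an escape hatch for incorrect metadata.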
```rust
for metadata in &metadatas {
    if let Some(kv) = metadata.file_metadata().key_value_metadata() {
        for item in kv {
            if item.key != "geo" {
                continue;
            }
            if let Some(value) = &item.value {
                let this_geoparquet_metadata = GeoParquetMetadata::try_new(value)?;
```
Can we apply the column overrides here and eliminate the somewhat complicated logic below?
Note that in #561 this is simplified to just use try_from_parquet_metadata().
I think this should not be applicable after we change to the full overwrite semantics? This combining step only merges bbox and geometry types and rejects any other conflicts, and the overwriting step has to be done separately.
I can give this a try in the Parquet geometry/geography PR. I think we want to apply them before the call to try_update() so the overrides can be used to avoid an erroneous or difficult to work around conflict.
Thank you for the review — that makes sense. I've switched to pure override semantics. Just realized that for GeoParquet files with incorrect/missing metadata, overrides are applied as-is; for safety, adding validation support can be a follow-up.
Closes #530
Motivation
Today, converting legacy Parquet files that store geometry as raw WKB payloads inside BINARY columns into GeoParquet requires a full SQL rewrite pipeline. Users must explicitly parse WKB, assign a CRS, and reconstruct the geometry column before writing. This works, but an easier-to-use Python API would help:
This PR introduces a geometry_columns option on read_parquet() so legacy Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.
Proposed Python API
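A hedged sketch of what such a call could look like (the `geometry_columns` shape follows the PR writeup; the `sedonadb` call itself is left as a comment because the exact signature is what this PR is proposing). The WKB bytes for a point are hand-built so the example is self-contained:

```python
import json
import struct

# Hand-crafted WKB for POINT(1 2): little-endian byte-order flag (1),
# geometry type 1 (Point), then two f64 coordinates (21 bytes total).
wkb_point = struct.pack("<BIdd", 1, 1, 1.0, 2.0)

# The option maps a column name to GeoParquet-style column metadata,
# marking a plain BINARY column as WKB geometry with a known CRS.
geometry_columns = {"geom": {"encoding": "WKB", "crs": "EPSG:4326"}}

# Hypothetical usage, assuming a sedonadb context `sd` and a legacy file:
# df = sd.read_parquet("legacy.parquet", geometry_columns=geometry_columns)

print(len(wkb_point), json.dumps(geometry_columns))
```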
Demo
Specification
The key points:
crs, and we can use this API to provide more details
Key Changes
GeoParquetColumnMetadata struct
GeoParquet metadata as before; next, look at the options to add/override additional geometry columns