Skip to content

Commit 304084e

Browse files
committed
Fix geospatial CI regressions
1 parent 10fbafc commit 304084e

File tree

9 files changed

+572
-24
lines changed

9 files changed

+572
-24
lines changed
Lines changed: 192 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,192 @@
1+
# Geospatial Types Implementation Decisions
2+
3+
This document records design decisions made during the implementation of Iceberg v3 geospatial primitive types (`geometry` and `geography`) in PyIceberg.
4+
5+
## Decision 1: Type Parameters Storage
6+
7+
**Decision**: Store CRS (coordinate reference system) and algorithm as string fields in the type classes, with defaults matching the Iceberg specification.
8+
9+
**Alternatives Considered**:
10+
11+
1. Store as tuple in `root` field (like DecimalType does with precision/scale)
12+
2. Store as separate class attributes with ClassVar defaults
13+
14+
**Rationale**: Using explicit fields with defaults allows for proper serialization/deserialization with Pydantic while maintaining singleton behavior for default-configured types. The spec defines defaults as CRS=`"OGC:CRS84"` and algorithm=`"spherical"` for geography.
15+
16+
**Spec Citations**:
17+
18+
- "Primitive Types" section: `geometry(C)` and `geography(C, A)` type definitions
19+
- Default CRS is `"OGC:CRS84"`, default algorithm is `"spherical"`
20+
21+
**Reviewer Concerns Anticipated**: Singleton pattern may not work correctly with parameterized types - addressed by inheriting from Singleton and implementing `__getnewargs__` properly.
22+
23+
---
24+
25+
## Decision 2: Format Version Enforcement
26+
27+
**Decision**: Enforce `minimum_format_version() = 3` for both GeometryType and GeographyType.
28+
29+
**Alternatives Considered**:
30+
31+
1. Allow in format version 2 with a warning
32+
2. Allow without restriction
33+
34+
**Rationale**: Geospatial types are defined as Iceberg v3 features. Allowing them in earlier versions would break spec compliance and interoperability with other Iceberg implementations.
35+
36+
**Spec Citations**: These types are defined in the v3 specification.
37+
38+
**Reviewer Concerns Anticipated**: Users on v2 tables cannot use geospatial types - this is expected behavior per spec.
39+
40+
---
41+
42+
## Decision 3: WKB/WKT Boundary
43+
44+
**Decision**:
45+
46+
- Data files use WKB (Well-Known Binary) - stored as `bytes` at runtime
47+
- JSON single-value serialization uses WKT (Well-Known Text) strings
48+
- Currently, we do NOT implement WKB<->WKT conversion (requires external dependencies like Shapely/GEOS)
49+
50+
**Alternatives Considered**:
51+
52+
1. Add optional Shapely dependency for conversion
53+
2. Implement basic WKB<->WKT conversion ourselves
54+
3. Raise explicit errors when conversion is needed
55+
56+
**Rationale**: Adding heavy dependencies (Shapely/GEOS) contradicts the design constraint. Implementing our own converter would be complex and error-prone. We choose to:
57+
58+
- Support writing geometry/geography as WKB bytes (Avro/Parquet)
59+
- Raise `NotImplementedError` for JSON single-value serialization until a strategy for WKT conversion is established
60+
61+
**Spec Citations**:
62+
63+
- "Avro" section: geometry/geography are bytes in WKB format
64+
- "JSON Single-Value Serialization" section: geometry/geography as WKT strings
65+
66+
**Reviewer Concerns Anticipated**: Limited functionality initially - JSON value serialization will raise errors until WKB<->WKT conversion is implemented (likely via optional dependency in future).
67+
68+
---
69+
70+
## Decision 4: Avro Mapping
71+
72+
**Decision**: Map geometry and geography to Avro `"bytes"` type, consistent with BinaryType handling.
73+
74+
**Alternatives Considered**:
75+
76+
1. Use fixed-size bytes (not appropriate - geometries are variable size)
77+
2. Use logical type annotation (Avro doesn't have standard geo logical types)
78+
79+
**Rationale**: Per spec, geometry/geography values are stored as WKB bytes. The simplest and most compatible approach is to use Avro's bytes type.
80+
81+
**Spec Citations**: "Avro" section specifies bytes representation.
82+
83+
**Reviewer Concerns Anticipated**: None - this follows the established pattern for binary data.
84+
85+
---
86+
87+
## Decision 5: PyArrow/Parquet Logical Types
88+
89+
**Decision**:
90+
91+
- When `geoarrow-pyarrow` is available, use GeoArrow WKB extension types with CRS/edge metadata
92+
- Without `geoarrow-pyarrow`, fall back to binary columns
93+
- Keep this optional to avoid forcing extra runtime dependencies
94+
95+
**Alternatives Considered**:
96+
97+
1. Require GeoArrow-related dependencies for all users (too restrictive)
98+
2. Always use binary with metadata in Parquet (loses GEO logical type benefit)
99+
3. Refuse to write entirely on old versions (too restrictive)
100+
101+
**Rationale**: Optional dependency keeps base installs lightweight while enabling richer interoperability when users opt in.
102+
103+
**Spec Citations**:
104+
105+
- "Parquet" section: physical binary with logical type GEOMETRY/GEOGRAPHY (WKB)
106+
107+
**Reviewer Concerns Anticipated**: Users may not realize they're losing metadata on older PyArrow - addressed with warning logging.
108+
109+
---
110+
111+
## Decision 6: Type String Format
112+
113+
**Decision**:
114+
115+
- `geometry` with default CRS serializes as `"geometry"`
116+
- `geometry("EPSG:4326")` with non-default CRS serializes as `"geometry('EPSG:4326')"`
117+
- `geography` with default CRS/algorithm serializes as `"geography"`
118+
- `geography("EPSG:4326", "planar")` serializes as `"geography('EPSG:4326', 'planar')"`
119+
120+
**Alternatives Considered**:
121+
122+
1. Always include all parameters
123+
2. Use different delimiters
124+
125+
**Rationale**: Matches existing patterns like `decimal(p, s)` and `fixed[n]`. Omitting defaults makes common cases cleaner.
126+
127+
**Spec Citations**: Type string representations in spec examples.
128+
129+
**Reviewer Concerns Anticipated**: Parsing complexity - addressed with regex patterns similar to DecimalType.
130+
131+
---
132+
133+
## Decision 7: No Spatial Predicate Pushdown
134+
135+
**Decision**: Spatial predicate APIs (`st-contains`, `st-intersects`, `st-within`, `st-overlaps`) are supported in expression modeling and binding, but row-level execution and pushdown remain unimplemented.
136+
137+
**Alternatives Considered**:
138+
139+
1. Implement basic bounding-box based pushdown
140+
2. Full spatial predicate support
141+
142+
**Rationale**: This delivers stable expression interoperability first while avoiding incomplete execution semantics.
143+
144+
**Spec Citations**: N/A - this is a performance optimization, not a spec requirement.
145+
146+
**Reviewer Concerns Anticipated**: Users may expect spatial queries - documented as limitation.
147+
148+
---
149+
150+
## Decision 8: Geospatial Bounds Metrics
151+
152+
**Decision**: Implement geometry/geography bounds metrics by parsing WKB values and serializing Iceberg geospatial bound bytes.
153+
154+
**Alternatives Considered**:
155+
156+
1. Implement point bounds (xmin, ymin, zmin, mmin) / (xmax, ymax, zmax, mmax)
157+
2. Return `None` for bounds
158+
159+
**Rationale**: Spec-compliant bounds are required for geospatial metrics interoperability and future predicate pruning. Implementation is dependency-free at runtime and handles antimeridian wrap for geography.
160+
161+
**Spec Citations**:
162+
163+
- "Data file metrics bounds" section
164+
165+
**Reviewer Concerns Anticipated**: WKB parsing complexity and malformed-input handling - addressed by conservative fallback (skip bounds for malformed values and log warnings).
166+
167+
---
168+
169+
## Decision 9: GeoArrow Planar Ambiguity Handling
170+
171+
**Decision**: At the Arrow/Parquet schema-compatibility boundary only, treat `geometry` as compatible with `geography(..., "planar")` when CRS strings match.
172+
173+
**Alternatives Considered**:
174+
175+
1. Always decode ambiguous `geoarrow.wkb` as geometry
176+
2. Always decode ambiguous `geoarrow.wkb` as geography(planar)
177+
3. Relax schema compatibility globally
178+
179+
**Rationale**: GeoArrow WKB metadata without explicit edge semantics does not distinguish `geometry` from `geography(planar)`. Boundary-only compatibility avoids false negatives during import/add-files flows while preserving strict type checks in core schema logic elsewhere.
180+
181+
**Spec Citations**: N/A (this is interoperability behavior for an ambiguous external encoding).
182+
183+
**Reviewer Concerns Anticipated**: Potentially hides geometry-vs-planar-geography mismatch at Arrow/Parquet boundary. Guardrail: only planar equivalence is allowed; spherical geography remains strict.
184+
185+
---
186+
187+
## Summary of Limitations
188+
189+
1. **No WKB<->WKT conversion**: JSON single-value serialization raises NotImplementedError
190+
2. **No spatial predicate execution/pushdown**: API and binding exist, execution/pushdown are future enhancements
191+
3. **GeoArrow metadata optional**: Without `geoarrow-pyarrow`, Parquet uses binary representation without GeoArrow extension metadata
192+
4. **GeoArrow planar ambiguity**: Boundary compatibility treats `geometry` and `geography(planar)` as equivalent with matching CRS
Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
# PR Draft: Geospatial Compatibility, Metrics, and Spatial Expression Support
2+
3+
## Summary
4+
5+
This PR extends PyIceberg geospatial support in three areas:
6+
7+
1. Adds geospatial bounds metric computation from WKB values (geometry + geography).
8+
2. Adds spatial predicate expression/binding support (`st-contains`, `st-intersects`, `st-within`, `st-overlaps`) with conservative evaluator behavior.
9+
3. Improves Arrow/Parquet interoperability for GeoArrow WKB, including explicit handling of geometry vs planar-geography ambiguity at the schema-compatibility boundary.
10+
11+
This increment is compatibility-first and does **not** introduce new runtime dependencies.
12+
13+
## Why
14+
15+
Base `geometry`/`geography` types existed, but there were still practical gaps:
16+
17+
- Geospatial columns were not contributing spec-encoded bounds in data-file metrics.
18+
- Spatial predicates were not modeled end-to-end in expression binding/visitor plumbing.
19+
- GeoArrow metadata can be ambiguous for `geometry` vs `geography(..., "planar")`, causing false compatibility failures during import/add-files flows.
20+
21+
## What Changed
22+
23+
### 1) Geospatial bounds metrics from WKB
24+
25+
- Added pure-Python geospatial utilities in `pyiceberg/utils/geospatial.py`:
26+
- WKB envelope extraction
27+
- antimeridian-aware geography envelope merge
28+
- Iceberg geospatial bound serialization/deserialization
29+
- Added `GeospatialStatsAggregator` and geospatial aggregate helpers in `pyiceberg/io/pyarrow.py`.
30+
- Updated write/import paths to compute geospatial bounds from actual row values (not Parquet binary min/max stats):
31+
- `write_file(...)`
32+
- `parquet_file_to_data_file(...)`
33+
- Prevented incorrect partition inference from geospatial envelope bounds.
34+
35+
### 2) Spatial predicate expression support
36+
37+
- Added expression types in `pyiceberg/expressions/__init__.py`:
38+
- `STContains`, `STIntersects`, `STWithin`, `STOverlaps`
39+
- bound counterparts and JSON parsing support
40+
- Added visitor dispatch/plumbing in `pyiceberg/expressions/visitors.py`.
41+
- Behavior intentionally conservative in this increment:
42+
- row-level expression evaluator raises `NotImplementedError`
43+
- manifest/metrics evaluators return conservative might-match defaults
44+
- translation paths preserve spatial predicates where possible
45+
46+
### 3) GeoArrow/Parquet compatibility improvements
47+
48+
- Added GeoArrow WKB decoding helper in `pyiceberg/io/pyarrow.py` to map extension metadata to Iceberg geospatial types.
49+
- Added boundary-only compatibility option in `pyiceberg/schema.py`:
50+
- `_check_schema_compatible(..., allow_planar_geospatial_equivalence=False)`
51+
- Enabled that option only in `_check_pyarrow_schema_compatible(...)` to allow:
52+
- `geometry` <-> `geography(..., "planar")` when CRS strings match
53+
- while still rejecting spherical geography mismatches
54+
- Added one-time warning log when `geoarrow-pyarrow` is unavailable and code falls back to binary.
55+
56+
## Docs
57+
58+
- Updated user docs: `mkdocs/docs/geospatial.md`
59+
- Added decisions record: `mkdocs/docs/dev/geospatial-types-decisions-v1.md`
60+
61+
## Test Coverage
62+
63+
Added/updated tests across:
64+
65+
- `tests/utils/test_geospatial.py`
66+
- `tests/io/test_pyarrow_stats.py`
67+
- `tests/io/test_pyarrow.py`
68+
- `tests/expressions/test_spatial_predicates.py`
69+
- `tests/integration/test_geospatial_integration.py`
70+
71+
Coverage includes:
72+
73+
- geospatial bound encoding/decoding (XY/XYZ/XYM/XYZM)
74+
- geography antimeridian behavior
75+
- geospatial metrics generation from write/import paths
76+
- spatial predicate modeling/binding/translation behavior
77+
- planar ambiguity compatibility guardrails
78+
- warning behavior for missing `geoarrow-pyarrow`
79+
80+
## Compatibility and Scope Notes
81+
82+
- No user-facing API removals.
83+
- New compatibility relaxation is intentionally scoped to Arrow/Parquet schema-compatibility boundary only.
84+
- Core schema/type compatibility remains strict elsewhere.
85+
- No spatial pushdown/row execution implementation in this PR.
86+
87+
## Follow-ups (Out of Scope Here)
88+
89+
- Spatial predicate execution semantics.
90+
- Spatial predicate pushdown/pruning.
91+
- Runtime WKB <-> WKT conversion strategy.
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# How To Review: Geospatial Compatibility, Metrics, and Expressions PR
2+
3+
## Goal
4+
5+
This PR is large because it spans expression APIs, Arrow/Parquet conversion, metrics generation, and documentation.
6+
Recommended strategy: review in focused passes by concern, not file order.
7+
8+
## Recommended Review Order
9+
10+
1. **Decisions and intended behavior**
11+
- `mkdocs/docs/dev/geospatial-types-decisions-v1.md`
12+
- Confirm the intended policy, especially Decision 9 (planar ambiguity boundary handling).
13+
14+
2. **Core geospatial utility correctness**
15+
- `pyiceberg/utils/geospatial.py`
16+
- `tests/utils/test_geospatial.py`
17+
- Focus on envelope extraction, antimeridian behavior, and bound encoding formats.
18+
19+
3. **Metrics integration and write/import paths**
20+
- `pyiceberg/io/pyarrow.py`
21+
- `tests/io/test_pyarrow_stats.py`
22+
- Focus on:
23+
- geospatial bounds derived from row WKB values
24+
- skipping Parquet binary min/max for geospatial columns
25+
- partition inference not using geospatial envelope bounds
26+
27+
4. **GeoArrow compatibility and ambiguity boundary**
28+
- `pyiceberg/schema.py`
29+
- `pyiceberg/io/pyarrow.py`
30+
- `tests/io/test_pyarrow.py`
31+
- Confirm:
32+
- planar equivalence enabled only at Arrow/Parquet boundary
33+
- spherical mismatch still fails
34+
- fallback warning when GeoArrow dependency is absent
35+
36+
5. **Spatial expression surface area**
37+
- `pyiceberg/expressions/__init__.py`
38+
- `pyiceberg/expressions/visitors.py`
39+
- `tests/expressions/test_spatial_predicates.py`
40+
- Confirm:
41+
- bind-time type checks (geometry/geography only)
42+
- visitor plumbing is complete
43+
- conservative evaluator behavior is explicit and documented
44+
45+
6. **User-facing docs**
46+
- `mkdocs/docs/geospatial.md`
47+
- Check limitations and behavior notes match implementation.
48+
49+
## High-Risk Areas To Inspect Closely
50+
51+
1. **Boundary scope leakage**
52+
- Ensure planar-equivalence relaxation is not enabled globally.
53+
54+
2. **Envelope semantics**
55+
- Geography antimeridian cases (`xmin > xmax`) are expected and intentional.
56+
57+
3. **Metrics correctness**
58+
- Geospatial bounds are serialized envelopes, not raw value min/max.
59+
60+
4. **Conservative evaluator behavior**
61+
- Spatial predicates should not accidentally become strict in metrics/manifest evaluators.
62+
63+
## Quick Validation Commands
64+
65+
```bash
66+
uv run --extra hive --extra bigquery python -m pytest tests/utils/test_geospatial.py -q
67+
uv run --extra hive --extra bigquery python -m pytest tests/io/test_pyarrow_stats.py -k "geospatial or planar_geography_schema or partition_inference_skips_geospatial_bounds" -q
68+
uv run --extra hive --extra bigquery python -m pytest tests/io/test_pyarrow.py -k "geoarrow or planar_geography_geometry_equivalence or spherical_geography_geometry_equivalence or logs_warning_once" -q
69+
uv run --extra hive --extra bigquery python -m pytest tests/expressions/test_spatial_predicates.py tests/expressions/test_visitors.py -k "spatial or translate_column_names" -q
70+
```
71+
72+
## Review Outcome Checklist
73+
74+
1. Geometry/geography bounds are present and correctly encoded for write/import paths.
75+
2. `geometry` vs `geography(planar)` is only equivalent at Arrow/Parquet compatibility boundary with CRS equality.
76+
3. `geography(spherical)` remains incompatible with `geometry`.
77+
4. Spatial predicates are correctly modeled/bound; execution and pushdown remain intentionally unimplemented.
78+
5. Missing GeoArrow dependency degrades gracefully with explicit warning.
79+
6. Docs match implemented behavior and limitations.

0 commit comments

Comments
 (0)