|
| 1 | +# Geospatial Types Implementation Decisions |
| 2 | + |
| 3 | +This document records design decisions made during the implementation of Iceberg v3 geospatial primitive types (`geometry` and `geography`) in PyIceberg. |
| 4 | + |
| 5 | +## Decision 1: Type Parameters Storage |
| 6 | + |
| 7 | +**Decision**: Store CRS (coordinate reference system) and algorithm as string fields in the type classes, with defaults matching the Iceberg specification. |
| 8 | + |
| 9 | +**Alternatives Considered**: |
| 10 | + |
| 11 | +1. Store as tuple in `root` field (like DecimalType does with precision/scale) |
| 12 | +2. Store as separate class attributes with ClassVar defaults |
| 13 | + |
| 14 | +**Rationale**: Using explicit fields with defaults allows for proper serialization/deserialization with Pydantic while maintaining singleton behavior for default-configured types. The spec defines defaults as CRS=`"OGC:CRS84"` and algorithm=`"spherical"` for geography. |
| 15 | + |
| 16 | +**Spec Citations**: |
| 17 | + |
| 18 | +- "Primitive Types" section: `geometry(C)` and `geography(C, A)` type definitions |
| 19 | +- Default CRS is `"OGC:CRS84"`, default algorithm is `"spherical"` |
| 20 | + |
| 21 | +**Reviewer Concerns Anticipated**: Singleton pattern may not work correctly with parameterized types - addressed by inheriting from Singleton and implementing `__getnewargs__` properly. |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## Decision 2: Format Version Enforcement |
| 26 | + |
| 27 | +**Decision**: Enforce `minimum_format_version() = 3` for both GeometryType and GeographyType. |
| 28 | + |
| 29 | +**Alternatives Considered**: |
| 30 | + |
| 31 | +1. Allow in format version 2 with a warning |
| 32 | +2. Allow without restriction |
| 33 | + |
| 34 | +**Rationale**: Geospatial types are defined as Iceberg v3 features. Allowing them in earlier versions would break spec compliance and interoperability with other Iceberg implementations. |
| 35 | + |
| 36 | +**Spec Citations**: These types are defined in the v3 specification. |
| 37 | + |
| 38 | +**Reviewer Concerns Anticipated**: Users on v2 tables cannot use geospatial types - this is expected behavior per spec. |
| 39 | + |
| 40 | +--- |
| 41 | + |
| 42 | +## Decision 3: WKB/WKT Boundary |
| 43 | + |
| 44 | +**Decision**: |
| 45 | + |
| 46 | +- Data files use WKB (Well-Known Binary) - stored as `bytes` at runtime |
| 47 | +- JSON single-value serialization uses WKT (Well-Known Text) strings |
| 48 | +- Currently, we do NOT implement WKB<->WKT conversion (requires external dependencies like Shapely/GEOS) |
| 49 | + |
| 50 | +**Alternatives Considered**: |
| 51 | + |
| 52 | +1. Add optional Shapely dependency for conversion |
| 53 | +2. Implement basic WKB<->WKT conversion ourselves |
| 54 | +3. Raise explicit errors when conversion is needed |
| 55 | + |
| 56 | +**Rationale**: Adding heavy dependencies (Shapely/GEOS) contradicts the design constraint. Implementing our own converter would be complex and error-prone. We choose to: |
| 57 | + |
| 58 | +- Support writing geometry/geography as WKB bytes (Avro/Parquet) |
| 59 | +- Raise `NotImplementedError` for JSON single-value serialization until a strategy for WKT conversion is established |
| 60 | + |
| 61 | +**Spec Citations**: |
| 62 | + |
| 63 | +- "Avro" section: geometry/geography are bytes in WKB format |
| 64 | +- "JSON Single-Value Serialization" section: geometry/geography as WKT strings |
| 65 | + |
| 66 | +**Reviewer Concerns Anticipated**: Limited functionality initially - JSON value serialization will raise errors until WKB<->WKT conversion is implemented (likely via optional dependency in future). |
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +## Decision 4: Avro Mapping |
| 71 | + |
| 72 | +**Decision**: Map geometry and geography to Avro `"bytes"` type, consistent with BinaryType handling. |
| 73 | + |
| 74 | +**Alternatives Considered**: |
| 75 | + |
| 76 | +1. Use fixed-size bytes (not appropriate - geometries are variable size) |
| 77 | +2. Use logical type annotation (Avro doesn't have standard geo logical types) |
| 78 | + |
| 79 | +**Rationale**: Per spec, geometry/geography values are stored as WKB bytes. The simplest and most compatible approach is to use Avro's bytes type. |
| 80 | + |
| 81 | +**Spec Citations**: "Avro" section specifies bytes representation. |
| 82 | + |
| 83 | +**Reviewer Concerns Anticipated**: None - this follows the established pattern for binary data. |
| 84 | + |
| 85 | +--- |
| 86 | + |
| 87 | +## Decision 5: PyArrow/Parquet Logical Types |
| 88 | + |
| 89 | +**Decision**: |
| 90 | + |
| 91 | +- When `geoarrow-pyarrow` is available, use GeoArrow WKB extension types with CRS/edge metadata |
| 92 | +- Without `geoarrow-pyarrow`, fall back to binary columns |
| 93 | +- Keep this optional to avoid forcing extra runtime dependencies |
| 94 | + |
| 95 | +**Alternatives Considered**: |
| 96 | + |
| 97 | +1. Require GeoArrow-related dependencies for all users (too restrictive) |
| 98 | +2. Always use binary with metadata in Parquet (loses GEO logical type benefit) |
| 99 | +3. Refuse to write entirely on old versions (too restrictive) |
| 100 | + |
| 101 | +**Rationale**: Optional dependency keeps base installs lightweight while enabling richer interoperability when users opt in. |
| 102 | + |
| 103 | +**Spec Citations**: |
| 104 | + |
| 105 | +- "Parquet" section: physical binary with logical type GEOMETRY/GEOGRAPHY (WKB) |
| 106 | + |
| 107 | +**Reviewer Concerns Anticipated**: Users may not realize they're losing metadata on older PyArrow - addressed with warning logging. |
| 108 | + |
| 109 | +--- |
| 110 | + |
| 111 | +## Decision 6: Type String Format |
| 112 | + |
| 113 | +**Decision**: |
| 114 | + |
| 115 | +- `geometry` with default CRS serializes as `"geometry"` |
| 116 | +- `geometry("EPSG:4326")` with non-default CRS serializes as `"geometry('EPSG:4326')"` |
| 117 | +- `geography` with default CRS/algorithm serializes as `"geography"` |
| 118 | +- `geography("EPSG:4326", "planar")` serializes as `"geography('EPSG:4326', 'planar')"` |
| 119 | + |
| 120 | +**Alternatives Considered**: |
| 121 | + |
| 122 | +1. Always include all parameters |
| 123 | +2. Use different delimiters |
| 124 | + |
| 125 | +**Rationale**: Matches existing patterns like `decimal(p, s)` and `fixed[n]`. Omitting defaults makes common cases cleaner. |
| 126 | + |
| 127 | +**Spec Citations**: Type string representations in spec examples. |
| 128 | + |
| 129 | +**Reviewer Concerns Anticipated**: Parsing complexity - addressed with regex patterns similar to DecimalType. |
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## Decision 7: No Spatial Predicate Pushdown |
| 134 | + |
| 135 | +**Decision**: Spatial predicate APIs (`st-contains`, `st-intersects`, `st-within`, `st-overlaps`) are supported in expression modeling and binding, but row-level execution and pushdown remain unimplemented. |
| 136 | + |
| 137 | +**Alternatives Considered**: |
| 138 | + |
| 139 | +1. Implement basic bounding-box based pushdown |
| 140 | +2. Full spatial predicate support |
| 141 | + |
| 142 | +**Rationale**: This delivers stable expression interoperability first while avoiding incomplete execution semantics. |
| 143 | + |
| 144 | +**Spec Citations**: N/A - this is a performance optimization, not a spec requirement. |
| 145 | + |
| 146 | +**Reviewer Concerns Anticipated**: Users may expect spatial queries - documented as limitation. |
| 147 | + |
| 148 | +--- |
| 149 | + |
| 150 | +## Decision 8: Geospatial Bounds Metrics |
| 151 | + |
| 152 | +**Decision**: Implement geometry/geography bounds metrics by parsing WKB values and serializing Iceberg geospatial bound bytes. |
| 153 | + |
| 154 | +**Alternatives Considered**: |
| 155 | + |
| 156 | +1. Implement point bounds (xmin, ymin, zmin, mmin) / (xmax, ymax, zmax, mmax) |
| 157 | +2. Return `None` for bounds |
| 158 | + |
| 159 | +**Rationale**: Spec-compliant bounds are required for geospatial metrics interoperability and future predicate pruning. Implementation is dependency-free at runtime and handles antimeridian wrap for geography. |
| 160 | + |
| 161 | +**Spec Citations**: |
| 162 | + |
| 163 | +- "Data file metrics bounds" section |
| 164 | + |
| 165 | +**Reviewer Concerns Anticipated**: WKB parsing complexity and malformed-input handling - addressed by conservative fallback (skip bounds for malformed values and log warnings). |
| 166 | + |
| 167 | +--- |
| 168 | + |
| 169 | +## Decision 9: GeoArrow Planar Ambiguity Handling |
| 170 | + |
| 171 | +**Decision**: At the Arrow/Parquet schema-compatibility boundary only, treat `geometry` as compatible with `geography(..., "planar")` when CRS strings match. |
| 172 | + |
| 173 | +**Alternatives Considered**: |
| 174 | + |
| 175 | +1. Always decode ambiguous `geoarrow.wkb` as geometry |
| 176 | +2. Always decode ambiguous `geoarrow.wkb` as geography(planar) |
| 177 | +3. Relax schema compatibility globally |
| 178 | + |
| 179 | +**Rationale**: GeoArrow WKB metadata without explicit edge semantics does not distinguish `geometry` from `geography(planar)`. Boundary-only compatibility avoids false negatives during import/add-files flows while preserving strict type checks in core schema logic elsewhere. |
| 180 | + |
| 181 | +**Spec Citations**: N/A (this is interoperability behavior for an ambiguous external encoding). |
| 182 | + |
| 183 | +**Reviewer Concerns Anticipated**: Potentially hides geometry-vs-planar-geography mismatch at Arrow/Parquet boundary. Guardrail: only planar equivalence is allowed; spherical geography remains strict. |
| 184 | + |
| 185 | +--- |
| 186 | + |
| 187 | +## Summary of Limitations |
| 188 | + |
| 189 | +1. **No WKB<->WKT conversion**: JSON single-value serialization raises NotImplementedError |
| 190 | +2. **No spatial predicate execution/pushdown**: API and binding exist, execution/pushdown are future enhancements |
| 191 | +3. **GeoArrow metadata optional**: Without `geoarrow-pyarrow`, Parquet uses binary representation without GeoArrow extension metadata |
| 192 | +4. **GeoArrow planar ambiguity**: Boundary compatibility treats `geometry` and `geography(planar)` as equivalent with matching CRS |
0 commit comments