
Commit 181226a

johanl-db and tdas authored
Accept Type Widening RFC (#4094)
## Description

The type widening table feature has had a stable implementation in delta-spark since Delta 3.2 and in Kernel Java/Rust. This PR proposes to accept the RFC specification of the feature and add it to the Delta protocol.

Type Widening feature request: #2623

Related: transitioning table feature from preview to stable: #4094 (comment)

---------

Co-authored-by: Tathagata Das <[email protected]>
1 parent 8561aaf commit 181226a

3 files changed

+176
-6
lines changed

PROTOCOL.md

Lines changed: 137 additions & 1 deletion
@@ -1028,6 +1028,7 @@ When supported and active, writers must:
10281028
- Block replacing partitioned tables with a differently-named partition spec
10291029
- e.g. replacing a table partitioned by `part_a INT` with partition spec `part_b INT` must be blocked
10301030
- e.g. replacing a table partitioned by `part_a INT` with partition spec `part_a LONG` is allowed
1031+
- When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.
10311032

10321033
# Iceberg Compatibility V2
10331034

@@ -1054,6 +1055,7 @@ When this feature is supported and enabled, writers must:
10541055
- Block replacing partitioned tables with a differently-named partition spec
10551056
- e.g. replacing a table partitioned by `part_a INT` with partition spec `part_b INT` must be blocked
10561057
- e.g. replacing a table partitioned by `part_a INT` with partition spec `part_a LONG` is allowed
1058+
- When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.
10571059

10581060
### Example of storing identifiers for nested fields in ArrayType and MapType
10591061
The following is an example of storing the identifiers for nested fields in `ArrayType` and `MapType`, of a table with the following schema,
@@ -1473,6 +1475,140 @@ Furthermore, when attempting timestamp-based time travel where table state must
14731475
1. If `timestamp X` >= `delta.inCommitTimestampEnablementTimestamp`, only table versions >= `delta.inCommitTimestampEnablementVersion` should be considered for the query.
14741476
2. Otherwise, only table versions less than `delta.inCommitTimestampEnablementVersion` should be considered for the query.
14751477

1478+
# Type Widening
1479+
1480+
The Type Widening feature enables changing the type of a column or field in an existing Delta table to a wider type.
1481+
1482+
The supported type changes are:
1483+
- Integer widening:
1484+
- `Byte` -> `Short` -> `Int` -> `Long`
1485+
- Floating-point widening:
1486+
- `Float` -> `Double`
1487+
- `Byte`, `Short` or `Int` -> `Double`
1488+
- Date widening:
1489+
- `Date` -> `Timestamp without timezone`
1490+
- Decimal widening - `p` and `s` denote the decimal precision and scale respectively.
1491+
- `Decimal(p, s)` -> `Decimal(p + k1, s + k2)` where `k1 >= k2 >= 0`.
1492+
- `Byte`, `Short` or `Int` -> `Decimal(10 + k1, k2)` where `k1 >= k2 >= 0`.
1493+
- `Long` -> `Decimal(20 + k1, k2)` where `k1 >= k2 >= 0`.
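The list of supported type changes above can be sketched as a single predicate. This is a minimal illustration, not the delta-spark or Kernel implementation: the helper names are hypothetical, the type strings follow the schema examples in this section (`short`, `integer`, `decimal(6, 2)`), and the string `timestamp_ntz` for `Timestamp without timezone` is an assumption.

```python
import re

# Integer types ordered from narrowest to widest.
INTEGER_ORDER = ["byte", "short", "integer", "long"]
DECIMAL_RE = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_decimal(type_name):
    """Return (precision, scale) for a decimal type string, else None."""
    match = DECIMAL_RE.fullmatch(type_name)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_supported_widening(from_type, to_type):
    """Hypothetical helper: True if from_type -> to_type is in the list above."""
    # Integer widening: byte -> short -> integer -> long.
    if from_type in INTEGER_ORDER and to_type in INTEGER_ORDER:
        return INTEGER_ORDER.index(from_type) < INTEGER_ORDER.index(to_type)
    # Floating-point widening: float or a small integer type to double.
    if to_type == "double":
        return from_type in ("float", "byte", "short", "integer")
    # Date widening ("timestamp_ntz" is an assumed type name).
    if (from_type, to_type) == ("date", "timestamp_ntz"):
        return True
    # Everything else must target a decimal type.
    target = parse_decimal(to_type)
    if target is None:
        return False
    source = parse_decimal(from_type)
    if source is not None:
        base_p, base_s = source          # Decimal(p, s) source
    elif from_type in ("byte", "short", "integer"):
        base_p, base_s = 10, 0           # target is Decimal(10 + k1, k2)
    elif from_type == "long":
        base_p, base_s = 20, 0           # target is Decimal(20 + k1, k2)
    else:
        return False
    k1, k2 = target[0] - base_p, target[1] - base_s
    return k1 >= k2 >= 0
```

The decimal rule `k1 >= k2 >= 0` means precision must grow at least as much as scale, so the digits before the decimal point are never reduced.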
1494+
1495+
To support this feature:
1496+
- The table must be on Reader version 3 and Writer Version 7.
1497+
- The feature `typeWidening` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`, either during its creation or at a later stage.
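As a rough sketch of the two conditions above, a client could inspect a parsed `protocol` action as follows. The function name is hypothetical, and the keys `minReaderVersion` and `minWriterVersion` are assumed to carry the reader and writer versions; only `readerFeatures` and `writerFeatures` are named in this section.

```python
def supports_type_widening(protocol):
    """Hypothetical check on a `protocol` action parsed into a dict."""
    reader_features = protocol.get("readerFeatures") or []
    writer_features = protocol.get("writerFeatures") or []
    return (
        protocol.get("minReaderVersion") == 3      # assumed key name
        and protocol.get("minWriterVersion") == 7  # assumed key name
        and "typeWidening" in reader_features
        and "typeWidening" in writer_features
    )
```

Note that this only answers whether the feature is supported. Whether it is enabled depends on the table property `delta.enableTypeWidening`, as described below.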
1498+
1499+
When supported:
1500+
- A table may have a metadata property `delta.enableTypeWidening` in the Delta schema set to `true`. Writers must reject widening type changes when this property isn't set to `true`.
1501+
- The `metadata` for a column or field in the table schema may contain the key `delta.typeChanges` storing a history of type changes for that column or field.
1502+
1503+
### Type Change Metadata
1504+
1505+
Type changes applied to a table are recorded in the table schema and stored in the `metadata` of their nearest ancestor [StructField](#struct-field) using the key `delta.typeChanges`.
1506+
The value for the key `delta.typeChanges` must be a JSON list of objects, where each object contains the following fields:
1507+
Field Name | optional/required | Description
1508+
-|-|-
1509+
`fromType`| required | The type of the column or field before the type change.
1510+
`toType`| required | The type of the column or field after the type change.
1511+
`fieldPath`| optional | When updating the type of a map key/value or array element only: the path from the struct field holding the metadata to the map key/value or array element that was updated.
1512+
1513+
The `fieldPath` value is "key", "value" or "element" when updating the type of a map key, map value or array element, respectively.
1514+
The `fieldPath` values for nested maps and nested arrays are prefixed by their parents' path, separated by dots.
1515+
1516+
The following is an example for the definition of a column that went through two type changes:
1517+
```json
1518+
{
1519+
"name" : "e",
1520+
"type" : "long",
1521+
"nullable" : true,
1522+
"metadata" : {
1523+
"delta.typeChanges": [
1524+
{
1525+
"fromType": "short",
1526+
"toType": "integer"
1527+
},
1528+
{
1529+
"fromType": "integer",
1530+
"toType": "long"
1531+
}
1532+
]
1533+
}
1534+
}
1535+
```
1536+
1537+
The following is an example for the definition of a column after changing the type of a map key:
1538+
```json
1539+
{
1540+
"name" : "e",
1541+
"type" : {
1542+
"type": "map",
1543+
"keyType": "double",
1544+
"valueType": "integer",
1545+
"valueContainsNull": true
1546+
},
1547+
"nullable" : true,
1548+
"metadata" : {
1549+
"delta.typeChanges": [
1550+
{
1551+
"fromType": "float",
1552+
"toType": "double",
1553+
"fieldPath": "key"
1554+
}
1555+
]
1556+
}
1557+
}
1558+
```
1559+
1560+
The following is an example for the definition of a column after changing the type of a map value nested in an array:
1561+
```json
1562+
{
1563+
"name" : "e",
1564+
"type" : {
1565+
"type": "array",
1566+
"elementType": {
1567+
"type": "map",
1568+
"keyType": "string",
1569+
"valueType": "decimal(10, 4)",
1570+
"valueContainsNull": true
1571+
},
1572+
"containsNull": true
1573+
},
1574+
"nullable" : true,
1575+
"metadata" : {
1576+
"delta.typeChanges": [
1577+
{
1578+
"fromType": "decimal(6, 2)",
1579+
"toType": "decimal(10, 4)",
1580+
"fieldPath": "element.value"
1581+
}
1582+
]
1583+
}
1584+
}
1585+
```
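To make the `fieldPath` convention concrete, here is a small sketch that follows a dotted path from a struct field down to the map key, map value or array element it points at. Both function names are hypothetical. The consistency check at the end (the latest `toType` for each path equals the current type in the schema) is an illustrative sanity check and not a requirement stated in this specification; it compares type strings verbatim.

```python
def resolve_field_path(field, field_path=None):
    """Return the type that a `delta.typeChanges` entry refers to."""
    current = field["type"]
    for step in (field_path.split(".") if field_path else []):
        kind = current.get("type") if isinstance(current, dict) else None
        if step == "element" and kind == "array":
            current = current["elementType"]
        elif step == "key" and kind == "map":
            current = current["keyType"]
        elif step == "value" and kind == "map":
            current = current["valueType"]
        else:
            raise ValueError(f"cannot follow {step!r} in fieldPath {field_path!r}")
    return current


def changes_match_schema(field):
    """Check that the last recorded toType per path matches the schema."""
    latest = {}
    for change in field.get("metadata", {}).get("delta.typeChanges", []):
        latest[change.get("fieldPath")] = change["toType"]
    return all(resolve_field_path(field, path) == to_type
               for path, to_type in latest.items())


# The column from the last example above: a map value nested in an array.
column = {
    "name": "e",
    "type": {
        "type": "array",
        "elementType": {
            "type": "map",
            "keyType": "string",
            "valueType": "decimal(10, 4)",
            "valueContainsNull": True,
        },
        "containsNull": True,
    },
    "nullable": True,
    "metadata": {
        "delta.typeChanges": [
            {"fromType": "decimal(6, 2)", "toType": "decimal(10, 4)",
             "fieldPath": "element.value"}
        ]
    },
}
```

Calling `resolve_field_path(column, "element.value")` walks first into the array element and then into the map value, which is the field whose type was widened.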
1586+
1587+
## Writer Requirements for Type Widening
1588+
1589+
When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
1590+
- Writers must reject applying any unsupported type change.
1591+
- Writers must reject applying type changes not supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution)
1592+
when either the [Iceberg Compatibility V1](#iceberg-compatibility-v1) or [Iceberg Compatibility V2](#iceberg-compatibility-v2) table feature is supported:
1593+
- `Byte`, `Short` or `Int` -> `Double`
1594+
- `Date` -> `Timestamp without timezone`
1595+
- Decimal scale increase
1596+
- `Byte`, `Short`, `Int` or `Long` -> `Decimal`
1597+
- Writers must record type change information in the `metadata` of the nearest ancestor [StructField](#struct-field). See [Type Change Metadata](#type-change-metadata).
1598+
- Writers must preserve the `delta.typeChanges` field in the metadata fields in the schema when the table schema is updated.
1599+
- Writers may remove the `delta.typeChanges` metadata in the table schema if all data files use the same field types as the table schema.
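The four categories of type changes that must be rejected under Iceberg compatibility can be sketched as a predicate over a single `fromType`/`toType` pair. This is an illustrative sketch only: the function names are hypothetical, and `timestamp_ntz` is an assumed type string for `Timestamp without timezone`.

```python
import re

INTEGRAL = {"byte", "short", "integer", "long"}
_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def decimal_scale(type_name):
    """Return the scale of a decimal type string, or None if not a decimal."""
    match = _DECIMAL.fullmatch(type_name)
    return int(match.group(2)) if match else None


def blocked_under_iceberg_compat(from_type, to_type):
    """True if the change falls in one of the four categories listed above."""
    # Byte, Short or Int -> Double
    if from_type in ("byte", "short", "integer") and to_type == "double":
        return True
    # Date -> Timestamp without timezone (assumed type name)
    if from_type == "date" and to_type == "timestamp_ntz":
        return True
    to_scale = decimal_scale(to_type)
    if to_scale is None:
        return False
    # Byte, Short, Int or Long -> Decimal
    if from_type in INTEGRAL:
        return True
    # Decimal scale increase
    from_scale = decimal_scale(from_type)
    return from_scale is not None and to_scale > from_scale


def iceberg_violations(type_changes):
    """Filter a `delta.typeChanges` list down to the blocked entries."""
    return [c for c in type_changes
            if blocked_under_iceberg_compat(c["fromType"], c["toType"])]
```

A writer could run a filter like `iceberg_violations` over every `delta.typeChanges` list in the schema, both before applying a new type change and when enabling an Iceberg compatibility feature on a table that already has type changes recorded.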
1600+
1601+
When Type Widening is enabled (when the table property `delta.enableTypeWidening` is set to `true`), then:
1602+
- Writers should allow updating the table schema to apply a supported type change to a column, struct field, map key/value or array element.
1603+
1604+
When removing the Type Widening table feature from the table, in the version that removes `typeWidening` from the `writerFeatures` and `readerFeatures` fields of the table's `protocol` action:
1605+
- Writers must ensure no `delta.typeChanges` metadata key is present in the table schema. This may require rewriting existing data files to ensure that all data files use the same field types as the table schema in order to fulfill the requirement to remove type widening metadata.
1606+
- Writers must ensure that the table property `delta.enableTypeWidening` is not set.
1607+
1608+
## Reader Requirements for Type Widening
1609+
When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
1610+
- Readers must allow reading data files written before the table underwent any supported type change, and must convert such values to the current, wider type.
1611+
- Readers must validate that they support all type changes in the `delta.typeChanges` field in the table schema for the table version they are reading and fail when finding any unsupported type change.
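The validation requirement for readers can be sketched as a recursive walk over the table schema that collects every `delta.typeChanges` entry and fails on the first one the reader cannot handle. The function names are hypothetical, and `is_supported` stands in for whatever set of type changes a given reader implements.

```python
def iter_type_changes(data_type, path=""):
    """Yield (column_path, change) for every delta.typeChanges entry."""
    if not isinstance(data_type, dict):
        return  # primitive types are plain strings and carry no metadata
    kind = data_type.get("type")
    if kind == "struct":
        for field in data_type.get("fields", []):
            name = f"{path}.{field['name']}" if path else field["name"]
            metadata = field.get("metadata") or {}
            for change in metadata.get("delta.typeChanges", []):
                yield name, change
            yield from iter_type_changes(field["type"], name)
    elif kind == "array":
        yield from iter_type_changes(data_type["elementType"], path)
    elif kind == "map":
        yield from iter_type_changes(data_type["keyType"], path)
        yield from iter_type_changes(data_type["valueType"], path)


def validate_for_read(schema, is_supported):
    """Raise ValueError on the first type change the reader cannot handle."""
    for name, change in iter_type_changes(schema):
        if not is_supported(change["fromType"], change["toType"]):
            raise ValueError(
                f"unsupported type change on {name}: "
                f"{change['fromType']} -> {change['toType']}"
            )


# Example schema holding the column with two type changes shown earlier.
example_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "e",
            "type": "long",
            "nullable": True,
            "metadata": {
                "delta.typeChanges": [
                    {"fromType": "short", "toType": "integer"},
                    {"fromType": "integer", "toType": "long"},
                ]
            },
        }
    ],
}
```

Failing early matters here because a reader that silently ignores an unknown type change could misinterpret values in data files written before the change.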
14761612

14771613
# Requirements for Writers
14781614
This section documents additional requirements that writers must follow in order to preserve some of the higher level guarantees that Delta provides.
@@ -2068,7 +2204,7 @@ delta.columnMapping.*| These keys are used to store information about the mappin
20682204
delta.identity.*| These keys are for defining identity columns. See [Identity Columns](#identity-columns) for details.
20692205
delta.invariants| JSON string contains SQL expression information. See [Column Invariants](#column-invariants) for details.
20702206
delta.generationExpression| SQL expression string. See [Generated Columns](#generated-columns) for details.
2071-
2207+
delta.typeChanges| JSON string containing information about previous type changes applied to this column. See [Type Change Metadata](#type-change-metadata) for details.
20722208

20732209
### Example
20742210

protocol_rfcs/README.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,6 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
1818

1919
| Date proposed | RFC file | Github issue | RFC title |
2020
|:--------------|:---------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:---------------------------------------|
21-
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
2221
| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
2322
| 2023-02-26 | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
2423
| 2023-04-24 | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md) | https://github.com/delta-io/delta/issues/2864 | Variant Data Type |
@@ -30,6 +29,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
3029
|:-|:-|:-|:-|:-|
3130
| 2023-02-28 | 2023-03-26 |[vacuum-protocol-check.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/vacuum-protocol-check.md)| https://github.com/delta-io/delta/issues/2630 | Enforce Vacuum Protocol Check |
3231
| 2023-02-02 | 2023-07-24 |[in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
32+
| 2023-02-09 | 2025-01-28 |[type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
3333

3434
### Rejected RFCs
3535

protocol_rfcs/type-widening.md renamed to protocol_rfcs/accepted/type-widening.md

Lines changed: 38 additions & 4 deletions
@@ -108,7 +108,7 @@ The following is an example for the definition of a column after changing the ty
108108
{
109109
"fromType": "decimal(6, 2)",
110110
"toType": "decimal(10, 4)",
111-
"fieldPath": "element.key"
111+
"fieldPath": "element.value"
112112
}
113113
]
114114
}
@@ -117,20 +117,54 @@ The following is an example for the definition of a column after changing the ty
117117

118118
## Writer Requirements for Type Widening
119119

120-
When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `enableTypeWidening`), then:
120+
When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
121121
- Writers must reject applying any unsupported type change.
122+
- Writers must reject applying type changes not supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution)
123+
when either the [Iceberg Compatibility V1](#iceberg-compatibility-v1) or [Iceberg Compatibility V2](#iceberg-compatibility-v2) table feature is supported:
124+
- `Byte`, `Short` or `Int` -> `Double`
125+
- `Date` -> `Timestamp without timezone`
126+
- Decimal scale increase
127+
- `Byte`, `Short`, `Int` or `Long` -> `Decimal`
122128
- Writers must record type change information in the `metadata` of the nearest ancestor [StructField](#struct-field). See [Type Change Metadata](#type-change-metadata).
123129
- Writers must preserve the `delta.typeChanges` field in the metadata fields in the schema when the table schema is updated.
124-
- Writers may remove the `delta.typeChanges` metadata in the table schema if all data files use the same column and field types as the table schema.
130+
- Writers may remove the `delta.typeChanges` metadata in the table schema if all data files use the same field types as the table schema.
125131

126132
When Type Widening is enabled (when the table property `delta.enableTypeWidening` is set to `true`), then:
127133
- Writers should allow updating the table schema to apply a supported type change to a column, struct field, map key/value or array element.
128134

135+
When removing the Type Widening table feature from the table, in the version that removes `typeWidening` from the `writerFeatures` and `readerFeatures` fields of the table's `protocol` action:
136+
- Writers must ensure no `delta.typeChanges` metadata key is present in the table schema. This may require rewriting existing data files to ensure that all data files use the same field types as the table schema in order to fulfill the requirement to remove type widening metadata.
137+
- Writers must ensure that the table property `delta.enableTypeWidening` is not set.
138+
129139
## Reader Requirements for Type Widening
130-
When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `enableTypeWidening`), then:
140+
When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
131141
- Readers must allow reading data files written before the table underwent any supported type change, and must convert such values to the current, wider type.
132142
- Readers must validate that they support all type changes in the `delta.typeChanges` field in the table schema for the table version they are reading and fail when finding any unsupported type change.
133143

144+
## Writer Requirements for IcebergCompatV1
145+
> ***Change to existing section (underlined)***
146+
147+
When supported and active, writers must:
148+
- Require that Column Mapping be enabled and set to either `name` or `id` mode
149+
- Require that Deletion Vectors are not supported (and, consequently, not active, either). i.e., the `deletionVectors` table feature is not present in the table `protocol`.
150+
- Require that partition column values are materialized into any Parquet data file that is present in the table, placed *after* the data columns in the parquet schema
151+
- Require that all `AddFile`s committed to the table have the `numRecords` statistic populated in their `stats` field
152+
- <ins>When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.</ins>
153+
154+
## Writer Requirements for IcebergCompatV2
155+
> ***Change to existing section (underlined)***
156+
157+
When this feature is supported and enabled, writers must:
158+
- Require that Column Mapping be enabled and set to either `name` or `id` mode
159+
- Require that the nested `element` field of ArrayTypes and the nested `key` and `value` fields of MapTypes be assigned 32 bit integer identifiers. These identifiers must be unique and different from those used in [Column Mapping](#column-mapping), and must be stored in the metadata of their nearest ancestor [StructField](#struct-field) of the Delta table schema. Identifiers belonging to the same `StructField` must be organized as a `Map[String, Long]` and stored in metadata with key `parquet.field.nested.ids`. The keys of the map are "element", "key", or "value", prefixed by the name of the nearest ancestor StructField, separated by dots. The values are the identifiers. The keys for fields in nested arrays or nested maps are prefixed by their parents' key, separated by dots. An [example](#example-of-storing-identifiers-for-nested-fields-in-arraytype-and-maptype) is provided below to demonstrate how the identifiers are stored. These identifiers must be also written to the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift) when writing parquet files.
160+
- Require that IcebergCompatV1 is not active, which means either the `icebergCompatV1` table feature is not present in the table protocol or the table property `delta.enableIcebergCompatV1` is not set to `true`
161+
- Require that Deletion Vectors are not active, which means either the `deletionVectors` table feature is not present in the table protocol or the table property `delta.enableDeletionVectors` is not set to `true`
162+
- Require that partition column values be materialized when writing Parquet data files
163+
- Require that all new `AddFile`s committed to the table have the `numRecords` statistic populated in their `stats` field
164+
- Require writing timestamp columns as int64
165+
- Require that the table schema contains only data types in the following allow-list: [`byte`, `short`, `integer`, `long`, `float`, `double`, `decimal`, `string`, `binary`, `boolean`, `timestamp`, `timestampNTZ`, `date`, `array`, `map`, `struct`].
166+
- <ins>When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.</ins>
167+
134168
### Column Metadata
135169
> ***Change to existing section (underlined)***
136170
