## Description
The Type Widening table feature has had a stable implementation in
delta-spark since Delta 3.2 and in Kernel Java/Rust.
This PR proposes to accept the RFC specification of the feature and add
it to the Delta protocol.
Type Widening feature request:
#2623
Related: transitioning table feature from preview to stable:
#4094 (comment)
---------
Co-authored-by: Tathagata Das <[email protected]>
PROTOCOL.md
Lines changed: 137 additions & 1 deletion
@@ -1028,6 +1028,7 @@ When supported and active, writers must:
- Block replacing partitioned tables with a differently-named partition spec
  - e.g. replacing a table partitioned by `part_a INT` with partition spec `part_b INT` must be blocked
  - e.g. replacing a table partitioned by `part_a INT` with partition spec `part_a LONG` is allowed
- When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.

# Iceberg Compatibility V2

@@ -1054,6 +1055,7 @@ When this feature is supported and enabled, writers must:
- Block replacing partitioned tables with a differently-named partition spec
  - e.g. replacing a table partitioned by `part_a INT` with partition spec `part_b INT` must be blocked
  - e.g. replacing a table partitioned by `part_a INT` with partition spec `part_a LONG` is allowed
- When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.

### Example of storing identifiers for nested fields in ArrayType and MapType

The following is an example of storing the identifiers for nested fields in `ArrayType` and `MapType`, of a table with the following schema,
@@ -1473,6 +1475,140 @@ Furthermore, when attempting timestamp-based time travel where table state must
1. If `timestamp X` >= `delta.inCommitTimestampEnablementTimestamp`, only table versions >= `delta.inCommitTimestampEnablementVersion` should be considered for the query.
2. Otherwise, only table versions less than `delta.inCommitTimestampEnablementVersion` should be considered for the query.

# Type Widening

The Type Widening feature enables changing the type of a column or field in an existing Delta table to a wider type.

The supported type changes are:
- Integer widening:
  - `Byte` -> `Short` -> `Int` -> `Long`
- Floating-point widening:
  - `Float` -> `Double`
  - `Byte`, `Short` or `Int` -> `Double`
- Date widening:
  - `Date` -> `Timestamp without timezone`
- Decimal widening - `p` and `s` denote the decimal precision and scale respectively.
  - `Decimal(p, s)` -> `Decimal(p + k1, s + k2)` where `k1 >= k2 >= 0`.
  - `Byte`, `Short` or `Int` -> `Decimal(10 + k1, k2)` where `k1 >= k2 >= 0`.
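
As a rough illustration, the rules listed above can be written as a small predicate. This is a minimal Python sketch, not part of the protocol text: the function name `is_supported_widening` is hypothetical, and the lowercase type names (`byte`, `short`, `integer`, `long`, `float`, `double`, `date`, `timestamp_ntz`, `decimal(p, s)`) are assumed to match the names used in the serialized table schema. It covers only the type changes listed above.

```python
import re

# Integer types ordered from narrowest to widest (assumed schema type names).
INTEGRAL_ORDER = ["byte", "short", "integer", "long"]
DECIMAL_RE = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_decimal(type_name):
    """Return (precision, scale) for a 'decimal(p, s)' string, else None."""
    match = DECIMAL_RE.fullmatch(type_name)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_supported_widening(from_type, to_type):
    """True if from_type -> to_type is one of the type changes listed above."""
    if from_type == to_type:
        return False  # not a type change at all
    # Integer widening: Byte -> Short -> Int -> Long
    if from_type in INTEGRAL_ORDER and to_type in INTEGRAL_ORDER:
        return INTEGRAL_ORDER.index(from_type) < INTEGRAL_ORDER.index(to_type)
    # Floating-point widening: Float, Byte, Short or Int -> Double
    if to_type == "double":
        return from_type in ("float", "byte", "short", "integer")
    # Date widening: Date -> Timestamp without timezone
    if from_type == "date" and to_type == "timestamp_ntz":
        return True
    # Decimal widening
    to_decimal = parse_decimal(to_type)
    if to_decimal is None:
        return False
    to_p, to_s = to_decimal
    from_decimal = parse_decimal(from_type)
    if from_decimal is not None:
        # Decimal(p, s) -> Decimal(p + k1, s + k2) where k1 >= k2 >= 0
        k1, k2 = to_p - from_decimal[0], to_s - from_decimal[1]
        return k1 >= k2 >= 0
    if from_type in ("byte", "short", "integer"):
        # Byte, Short or Int -> Decimal(10 + k1, k2) where k1 >= k2 >= 0
        k1, k2 = to_p - 10, to_s
        return k1 >= k2 >= 0
    return False
```

For example, `is_supported_widening("decimal(6, 2)", "decimal(10, 4)")` holds because `k1 = 4` and `k2 = 2`, whereas `decimal(6, 2)` to `decimal(7, 4)` does not because the scale would grow faster than the precision.
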

- The table must be on Reader version 3 and Writer Version 7.
- The feature `typeWidening` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`, either during its creation or at a later stage.

When supported:
- A table may have a metadata property `delta.enableTypeWidening` in the Delta schema set to `true`. Writers must reject widening type changes when this property isn't set to `true`.
- The `metadata` for a column or field in the table schema may contain the key `delta.typeChanges` storing a history of type changes for that column or field.

### Type Change Metadata

Type changes applied to a table are recorded in the table schema and stored in the `metadata` of their nearest ancestor [StructField](#struct-field) using the key `delta.typeChanges`.
The value for the key `delta.typeChanges` must be a JSON list of objects, where each object contains the following fields:

Field Name | optional/required | Description
-|-|-
`fromType`| required | The type of the column or field before the type change.
`toType`| required | The type of the column or field after the type change.
`fieldPath`| optional | When updating the type of a map key/value or array element only: the path from the struct field holding the metadata to the map key/value or array element that was updated.

The `fieldPath` value is "key", "value" or "element" when updating the type of a map key, map value or array element, respectively.
The `fieldPath` values for nested maps and nested arrays are prefixed by their parents' path, separated by dots.

The following is an example for the definition of a column that went through two type changes:
```json
{
  "name" : "e",
  "type" : "long",
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "fromType": "short",
        "toType": "integer"
      },
      {
        "fromType": "integer",
        "toType": "long"
      }
    ]
  }
}
```

The following is an example for the definition of a column after changing the type of a map key:
```json
{
  "name" : "e",
  "type" : {
    "type": "map",
    "keyType": "double",
    "valueType": "integer",
    "valueContainsNull": true
  },
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "fromType": "float",
        "toType": "double",
        "fieldPath": "key"
      }
    ]
  }
}
```

The following is an example for the definition of a column after changing the type of a map value nested in an array:
```json
{
  "name" : "e",
  "type" : {
    "type": "array",
    "elementType": {
      "type": "map",
      "keyType": "string",
      "valueType": "decimal(10, 4)",
      "valueContainsNull": true
    },
    "containsNull": true
  },
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "fromType": "decimal(6, 2)",
        "toType": "decimal(10, 4)",
        "fieldPath": "element.value"
      }
    ]
  }
}
```
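
To make the `fieldPath` convention concrete, the sketch below follows a dotted path from a struct field down to the nested type it points at. It is an illustration under assumptions rather than a reference implementation: the helper name `resolve_field_path` is hypothetical, and the field is assumed to be the parsed JSON of a schema field in the form shown in the examples above.

```python
# Maps each fieldPath step to the JSON key holding the nested type.
STEP_TO_KEY = {"element": "elementType", "key": "keyType", "value": "valueType"}


def resolve_field_path(field, field_path=None):
    """Return the type reached by following field_path from the field's type.

    With no field_path, the type change applies to the field itself.
    """
    current = field["type"]
    if not field_path:
        return current
    for step in field_path.split("."):
        current = current[STEP_TO_KEY[step]]
    return current


# The column from the last example: a map value nested in an array.
column = {
    "name": "e",
    "type": {
        "type": "array",
        "elementType": {
            "type": "map",
            "keyType": "string",
            "valueType": "decimal(10, 4)",
            "valueContainsNull": True,
        },
        "containsNull": True,
    },
    "nullable": True,
    "metadata": {
        "delta.typeChanges": [
            {
                "fromType": "decimal(6, 2)",
                "toType": "decimal(10, 4)",
                "fieldPath": "element.value",
            }
        ]
    },
}

change = column["metadata"]["delta.typeChanges"][0]
resolved = resolve_field_path(column, change["fieldPath"])
```

Here `resolved` is the map value type, which in this example matches the `toType` of the recorded change.
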

## Writer Requirements for Type Widening

When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
- Writers must reject applying any unsupported type change.
- Writers must reject applying type changes not supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution)
  when either the [Iceberg Compatibility V1](#iceberg-compatibility-v1) or [Iceberg Compatibility V2](#iceberg-compatibility-v2) table feature is supported:
  - `Byte`, `Short` or `Int` -> `Double`
  - `Date` -> `Timestamp without timezone`
  - Decimal scale increase
  - `Byte`, `Short`, `Int` or `Long` -> `Decimal`
- Writers must record type change information in the `metadata` of the nearest ancestor [StructField](#struct-field). See [Type Change Metadata](#type-change-metadata).
- Writers must preserve the `delta.typeChanges` field in the metadata fields in the schema when the table schema is updated.
- Writers may remove the `delta.typeChanges` metadata in the table schema if all data files use the same field types as the table schema.
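
The four type changes that must be rejected when an Iceberg Compatibility feature is supported can be sketched as a classifier. This is a hedged example only: `blocked_under_iceberg_compat` is a hypothetical name, the lowercase type names are assumed to match the serialized schema, and the function does not check whether the change is otherwise a supported widening.

```python
import re

DECIMAL_RE = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_decimal(type_name):
    """Return (precision, scale) for a 'decimal(p, s)' string, else None."""
    match = DECIMAL_RE.fullmatch(type_name)
    return (int(match.group(1)), int(match.group(2))) if match else None


def blocked_under_iceberg_compat(from_type, to_type):
    """True if the change is one of the four listed as unsupported by Iceberg V2."""
    # Byte, Short or Int -> Double
    if from_type in ("byte", "short", "integer") and to_type == "double":
        return True
    # Date -> Timestamp without timezone
    if from_type == "date" and to_type == "timestamp_ntz":
        return True
    to_decimal = parse_decimal(to_type)
    if to_decimal is not None:
        # Byte, Short, Int or Long -> Decimal
        if from_type in ("byte", "short", "integer", "long"):
            return True
        # Decimal scale increase
        from_decimal = parse_decimal(from_type)
        if from_decimal is not None and to_decimal[1] > from_decimal[1]:
            return True
    return False
```

Under this sketch, widening `decimal(6, 2)` to `decimal(10, 2)` is not blocked because only the precision grows, while `decimal(6, 2)` to `decimal(10, 4)` is blocked because the scale increases.
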

When Type Widening is enabled (when the table property `delta.enableTypeWidening` is set to `true`), then:
- Writers should allow updating the table schema to apply a supported type change to a column, struct field, map key/value or array element.

When removing the Type Widening table feature from the table, in the version that removes `typeWidening` from the `writerFeatures` and `readerFeatures` fields of the table's `protocol` action:
- Writers must ensure no `delta.typeChanges` metadata key is present in the table schema. This may require rewriting existing data files to ensure that all data files use the same field types as the table schema in order to fulfill the requirement to remove type widening metadata.
- Writers must ensure that the table property `delta.enableTypeWidening` is not set.
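
The two removal conditions above could be checked with a recursive walk over the schema. The sketch below is illustrative only: the function names are hypothetical, and it assumes the table schema is available as parsed JSON in which struct types carry a `fields` list and arrays and maps use `elementType`, `keyType` and `valueType` as in the examples earlier in this section.

```python
def fields_with_type_changes(data_type, prefix=""):
    """Return dotted paths of all struct fields whose metadata holds delta.typeChanges."""
    found = []
    if not isinstance(data_type, dict):
        return found  # primitive type name such as "long"
    kind = data_type.get("type")
    if kind == "struct":
        for field in data_type.get("fields", []):
            path = prefix + field["name"]
            if "delta.typeChanges" in field.get("metadata", {}):
                found.append(path)
            found.extend(fields_with_type_changes(field["type"], path + "."))
    elif kind == "array":
        found.extend(fields_with_type_changes(data_type["elementType"], prefix + "element."))
    elif kind == "map":
        found.extend(fields_with_type_changes(data_type["keyType"], prefix + "key."))
        found.extend(fields_with_type_changes(data_type["valueType"], prefix + "value."))
    return found


def can_drop_type_widening(schema, table_properties):
    """True if no type change metadata remains and the table property is not set."""
    return (
        not fields_with_type_changes(schema)
        and "delta.enableTypeWidening" not in table_properties
    )
```

A writer following this sketch would rewrite data files and strip the metadata from every path returned by `fields_with_type_changes` before committing the version that drops the feature.
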

## Reader Requirements for Type Widening
When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
- Readers must allow reading data files written before the table underwent any supported type change, and must convert such values to the current, wider type.
- Readers must validate that they support all type changes in the `delta.typeChanges` field in the table schema for the table version they are reading and fail when finding any unsupported type change.
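
The validation requirement can be illustrated with a short check over a single schema field. This is a sketch under assumptions: `validate_type_changes` is a hypothetical helper, the field is assumed to be parsed schema JSON, and `is_supported` stands for whatever predicate a given reader uses to decide which type changes it can handle. A full reader would apply it to every struct field in the schema, including nested ones.

```python
def validate_type_changes(field, is_supported):
    """Raise ValueError if the field records a type change the reader cannot handle.

    Returns the number of recorded type changes that were checked.
    """
    changes = field.get("metadata", {}).get("delta.typeChanges", [])
    for change in changes:
        from_type, to_type = change["fromType"], change["toType"]
        if not is_supported(from_type, to_type):
            path = change.get("fieldPath", "<field itself>")
            raise ValueError(
                f"Unsupported type change on '{field['name']}' ({path}): "
                f"{from_type} -> {to_type}"
            )
    return len(changes)
```

Failing fast in this way keeps a reader from silently misinterpreting values in older data files whose physical type it does not know how to convert.
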

# Requirements for Writers
This section documents additional requirements that writers must follow in order to preserve some of the higher level guarantees that Delta provides.
@@ -2068,7 +2204,7 @@ delta.columnMapping.*| These keys are used to store information about the mappin
delta.identity.*| These keys are for defining identity columns. See [Identity Columns](#identity-columns) for details.
delta.invariants| JSON string contains SQL expression information. See [Column Invariants](#column-invariants) for details.
delta.generationExpression| SQL expression string. See [Generated Columns](#generated-columns) for details.
delta.typeChanges| JSON string containing information about previous type changes applied to this column. See [Type Change Metadata](#type-change-metadata) for details.
| 2023-04-24 |[variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md)|https://github.com/delta-io/delta/issues/2864| Variant Data Type |
@@ -30,6 +29,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
protocol_rfcs/accepted/type-widening.md
Lines changed: 38 additions & 4 deletions
@@ -108,7 +108,7 @@ The following is an example for the definition of a column after changing the ty
      {
        "fromType": "decimal(6, 2)",
        "toType": "decimal(10, 4)",
-       "fieldPath": "element.key"
+       "fieldPath": "element.value"
      }
    ]
  }
@@ -117,20 +117,54 @@ The following is an example for the definition of a column after changing the ty

## Writer Requirements for Type Widening

-When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `enableTypeWidening`), then:
+When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
- Writers must reject applying any unsupported type change.
- Writers must reject applying type changes not supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution)
  when either the [Iceberg Compatibility V1](#iceberg-compatibility-v1) or [Iceberg Compatibility V2](#iceberg-compatibility-v2) table feature is supported:
  - `Byte`, `Short` or `Int` -> `Double`
  - `Date` -> `Timestamp without timezone`
  - Decimal scale increase
  - `Byte`, `Short`, `Int` or `Long` -> `Decimal`
- Writers must record type change information in the `metadata` of the nearest ancestor [StructField](#struct-field). See [Type Change Metadata](#type-change-metadata).
- Writers must preserve the `delta.typeChanges` field in the metadata fields in the schema when the table schema is updated.
-- Writers may remove the `delta.typeChanges` metadata in the table schema if all data files use the same column and field types as the table schema.
+- Writers may remove the `delta.typeChanges` metadata in the table schema if all data files use the same field types as the table schema.

When Type Widening is enabled (when the table property `delta.enableTypeWidening` is set to `true`), then:
- Writers should allow updating the table schema to apply a supported type change to a column, struct field, map key/value or array element.

When removing the Type Widening table feature from the table, in the version that removes `typeWidening` from the `writerFeatures` and `readerFeatures` fields of the table's `protocol` action:
- Writers must ensure no `delta.typeChanges` metadata key is present in the table schema. This may require rewriting existing data files to ensure that all data files use the same field types as the table schema in order to fulfill the requirement to remove type widening metadata.
- Writers must ensure that the table property `delta.enableTypeWidening` is not set.

## Reader Requirements for Type Widening
-When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `enableTypeWidening`), then:
+When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
- Readers must allow reading data files written before the table underwent any supported type change, and must convert such values to the current, wider type.
- Readers must validate that they support all type changes in the `delta.typeChanges` field in the table schema for the table version they are reading and fail when finding any unsupported type change.

## Writer Requirements for IcebergCompatV1
> ***Change to existing section (underlined)***

When supported and active, writers must:
- Require that Column Mapping be enabled and set to either `name` or `id` mode
- Require that Deletion Vectors are not supported (and, consequently, not active, either). i.e., the `deletionVectors` table feature is not present in the table `protocol`.
- Require that partition column values are materialized into any Parquet data file that is present in the table, placed *after* the data columns in the parquet schema
- Require that all `AddFile`s committed to the table have the `numRecords` statistic populated in their `stats` field
- <ins>When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.</ins>

## Writer Requirements for IcebergCompatV2
> ***Change to existing section (underlined)***

When this feature is supported and enabled, writers must:
- Require that Column Mapping be enabled and set to either `name` or `id` mode
- Require that the nested `element` field of ArrayTypes and the nested `key` and `value` fields of MapTypes be assigned 32 bit integer identifiers. These identifiers must be unique and different from those used in [Column Mapping](#column-mapping), and must be stored in the metadata of their nearest ancestor [StructField](#struct-field) of the Delta table schema. Identifiers belonging to the same `StructField` must be organized as a `Map[String, Long]` and stored in metadata with key `parquet.field.nested.ids`. The keys of the map are "element", "key", or "value", prefixed by the name of the nearest ancestor StructField, separated by dots. The values are the identifiers. The keys for fields in nested arrays or nested maps are prefixed by their parents' key, separated by dots. An [example](#example-of-storing-identifiers-for-nested-fields-in-arraytype-and-maptype) is provided below to demonstrate how the identifiers are stored. These identifiers must be also written to the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift) when writing parquet files.
- Require that IcebergCompatV1 is not active, which means either the `icebergCompatV1` table feature is not present in the table protocol or the table property `delta.enableIcebergCompatV1` is not set to `true`
- Require that Deletion Vectors are not active, which means either the `deletionVectors` table feature is not present in the table protocol or the table property `delta.enableDeletionVectors` is not set to `true`
- Require that partition column values be materialized when writing Parquet data files
- Require that all new `AddFile`s committed to the table have the `numRecords` statistic populated in their `stats` field
- Require writing timestamp columns as int64
- Require that the table schema contains only data types in the following allow-list: [`byte`, `short`, `integer`, `long`, `float`, `double`, `decimal`, `string`, `binary`, `boolean`, `timestamp`, `timestampNTZ`, `date`, `array`, `map`, `struct`].
- <ins>When the [Type Widening](#type-widening) table feature is supported, require that all type changes applied on the table are supported by [Iceberg V2](https://iceberg.apache.org/spec/#schema-evolution), based on the [Type Change Metadata](#type-change-metadata) recorded in the table schema.</ins>