feat(schema): Add update_schema action to enable table schema updates #2120
tomighita wants to merge 9 commits into apache:main
I'm not sure this is the right direction; exposing add_fields alone in the transaction layer seems very limiting. Maybe UpdateSchema can help with your use case?
Hi @CTTY, this could indeed be an option. Do you know if it has any chance of being merged? It seems to have had no activity for the past 6 months.
9988555 to 10f287a
Title changed: "add_fields action to enable table schema updates" to "update_schema action to enable table schema updates"
Updated the PR to add an update_schema action. This also duplicates the work of #1172, but updates it to match the new Transaction API.
Also closes #697
    initial_default: Literal,
) -> Self {
    self.add_field(Arc::new(
        NestedField::required(0, name, field_type).with_initial_default(initial_default),
According to the Schema Evolution section in the spec:
Struct evolution requires the following rules for default values:
- The initial-default must be set when a field is added and cannot change
- The write-default must be set when a field is added and may change
- When a required field is added, both defaults must be set to a non-null value
Suggested change:
- NestedField::required(0, name, field_type).with_initial_default(initial_default),
+ NestedField::required(0, name, field_type).with_initial_default(initial_default).with_write_default(initial_default),
This is also done by the Java implementation in SchemaUpdate.internalAddColumn.
IIUC by only setting with_initial_default, readers will fill a value in on reads, but writers won't add it to new files.
Good catch! I will need to update other occurrences as well.
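The default-value rule quoted in this thread can be sketched with simplified stand-in types. `Value`, `Field`, `required_with_default`, and `validate_for_add` below are hypothetical names for illustration and are not the crate's `NestedField` API. The sketch assumes the default value type is not `Copy`, so the same value has to be cloned to fill both defaults.

```rust
/// Hypothetical stand-in for a default value literal.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
}

/// Hypothetical, simplified field definition (not the crate's NestedField).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    required: bool,
    initial_default: Option<Value>,
    write_default: Option<Value>,
}

impl Field {
    /// Builds a required field with both defaults set to the same value.
    /// The value is cloned because it is moved into two separate slots.
    fn required_with_default(name: &str, default: Value) -> Self {
        Field {
            name: name.to_string(),
            required: true,
            initial_default: Some(default.clone()),
            write_default: Some(default),
        }
    }

    /// Builds an optional field with no defaults.
    fn optional(name: &str) -> Self {
        Field {
            name: name.to_string(),
            required: false,
            initial_default: None,
            write_default: None,
        }
    }

    /// Rejects a required field that is missing either default.
    fn validate_for_add(&self) -> Result<(), String> {
        if self.required
            && (self.initial_default.is_none() || self.write_default.is_none())
        {
            return Err(format!(
                "required field '{}' needs both initial-default and write-default",
                self.name
            ));
        }
        Ok(())
    }
}
```

With only the initial default set, `validate_for_add` returns an error, which mirrors the case the review comment flags.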
755a08b to 86e5142
        write_default: field.write_default.clone(),
    })
}
Type::Map(m) => {
The Java and pyiceberg implementations explicitly forbid deletion of map keys+values. IIUC this code just silently ignores them.
TBH I don't think this is very clear from the spec... apparently "Any struct, including a top-level schema, can evolve through deleting fields [...]" means any struct but no maps. So an explicit error is probably appropriate (same with list type btw).
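A rough sketch of the explicit error suggested here, assuming a simplified type tree. `Type`, `Field`, and `check_deletes` are hypothetical and do not reflect the crate's actual types or the code in this PR. The walk rejects any delete that targets a map key, a map value, or a list element, and allows deleting ordinary struct fields.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified type tree (not the crate's Type).
#[derive(Debug)]
enum Type {
    Primitive,
    Struct(Vec<Field>),
    List { element: Box<Field> },
    Map { key: Box<Field>, value: Box<Field> },
}

/// Hypothetical, simplified field with an id, a name, and a type.
#[derive(Debug)]
struct Field {
    id: i32,
    name: String,
    ty: Type,
}

/// Returns an error if `delete_ids` targets a map key, map value, or list element.
fn check_deletes(ty: &Type, delete_ids: &HashSet<i32>) -> Result<(), String> {
    match ty {
        Type::Primitive => Ok(()),
        Type::Struct(fields) => {
            for f in fields {
                // A deleted struct field takes its whole subtree with it,
                // so only surviving fields need to be inspected further.
                if !delete_ids.contains(&f.id) {
                    check_deletes(&f.ty, delete_ids)?;
                }
            }
            Ok(())
        }
        Type::List { element } => {
            if delete_ids.contains(&element.id) {
                return Err(format!("cannot delete list element '{}'", element.name));
            }
            check_deletes(&element.ty, delete_ids)
        }
        Type::Map { key, value } => {
            for f in [key.as_ref(), value.as_ref()] {
                if delete_ids.contains(&f.id) {
                    return Err(format!("cannot delete map field '{}'", f.name));
                }
            }
            check_deletes(&key.ty, delete_ids)?;
            check_deletes(&value.ty, delete_ids)
        }
    }
}
```

Deleting the struct field that holds the map or list is still allowed in this sketch; only the synthetic key, value, and element fields are protected.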
if parent_struct
    .fields()
    .iter()
    .any(|f| f.name == pending.field.name && !delete_ids.contains(&f.id))
Maybe also add a check for !delete_ids.contains(&parent_id)?
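One possible reading of this suggestion, sketched with hypothetical names (`Field`, `check_add`) that are not taken from the PR: adding a field is rejected when the parent struct itself is scheduled for deletion, and otherwise when a surviving sibling already uses the same name. Whether a deleted parent should be an error or should simply skip the conflict check is an assumption here.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified sibling field: just an id and a name.
struct Field {
    id: i32,
    name: String,
}

/// Checks whether a field named `new_name` may be added under `parent_id`.
fn check_add(
    parent_id: i32,
    siblings: &[Field],
    new_name: &str,
    delete_ids: &HashSet<i32>,
) -> Result<(), String> {
    // Assumption: adding into a parent that is being deleted is an error.
    if delete_ids.contains(&parent_id) {
        return Err(format!(
            "cannot add '{}' to parent {} because the parent is being deleted",
            new_name, parent_id
        ));
    }
    // A sibling with the same name only conflicts if it survives the update.
    if siblings
        .iter()
        .any(|f| f.name == new_name && !delete_ids.contains(&f.id))
    {
        return Err(format!("field '{}' already exists", new_name));
    }
    Ok(())
}
```

Under this reading, deleting a sibling and re-adding a field with the same name in one update is allowed, while adding under a deleted parent is not.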
DerGut left a comment
Thanks for the work! I think this PR looks pretty solid! 👏
I left two nits on edge cases but leaving a 🟢 stamp here.
Which issue does this PR close?
What changes are included in this PR?
This PR creates an UpdateSchema action which implements TransactionAction, allowing users of the crate to add new fields to, or delete fields from, an Iceberg table by updating the table schema. It also checks that added fields are either optional or have a default value, to avoid data corruption downstream.
Are these changes tested?
UpdateSchema.