feat(schema): Add update_schema action to enable table schema updates#2120
feat(schema): Add update_schema action to enable table schema updates#2120tomighita wants to merge 15 commits intoapache:mainfrom
update_schema action to enable table schema updates#2120Conversation
|
I'm not sure if this is the right direction, exposing add_fields alone in transaction layer seems very limiting. Maybe UpdateSchema can help with your use case? |
|
Hi @CTTY, this could be an option, indeed. Do you know if this has any chance to be merged? It seems to have no activity for the past 6 months. |
9988555 to
10f287a
Compare
add_fields action to enable table schema updatesupdate_schema action to enable table schema updates
|
Updated the PR to add an This also duplicates the work of #1172, but it updates it to match the new Transaction API. |
|
Also closes #697 |
| initial_default: Literal, | ||
| ) -> Self { | ||
| self.add_field(Arc::new( | ||
| NestedField::required(0, name, field_type).with_initial_default(initial_default), |
There was a problem hiding this comment.
According to the Schema Evolution section in the spec:
Struct evolution requires the following rules for default values:
- The
initial-defaultmust be set when a field is added and cannot change- The
write-defaultmust be set when a field is added and may change- When a required field is added, both defaults must be set to a non-null value
| NestedField::required(0, name, field_type).with_initial_default(initial_default), | |
| NestedField::required(0, name, field_type).with_initial_default(initial_default).with_write_default(initial_default), |
This is also done by the Java implementation in SchemaUpdate.internalAddColumn.
There was a problem hiding this comment.
IIUC by only setting with_initial_default, readers will fill a value in on reads, but writers won't add it to new files.
There was a problem hiding this comment.
Good catch! I will need to update other occurrences as well.
755a08b to
86e5142
Compare
| write_default: field.write_default.clone(), | ||
| }) | ||
| } | ||
| Type::Map(m) => { |
There was a problem hiding this comment.
The Java and pyiceberg implementations explicitly forbid deletion of map keys+values. IIUC this code just silently ignores them.
TBH I don't think this is very clear from the spec... apparently
Any struct, including a top-level schema, can evolve through deleting fields [...]
means any struct but no maps. So an explicit error is probably appropriate (same with list type btw)
There was a problem hiding this comment.
I get your point, but this is not exactly what this method does. This method "recomputes" the parents, taking into account newly added fields and deleted fields, together. If we would add an error for map and list types here, we'd essentially block all updates for any struct which contains map or list at any level deep, which is undesired.
What we could do instead is add a separate check explicitly checking that all the deleted fields are either of type struct or primitive. Thoughts on this @DerGut?
| if parent_struct | ||
| .fields() | ||
| .iter() | ||
| .any(|f| f.name == pending.field.name && !delete_ids.contains(&f.id)) |
There was a problem hiding this comment.
Maybe also add a check for !deleted_ids.contains(&parent_id)?
DerGut
left a comment
There was a problem hiding this comment.
Thanks for the work! I think this PR looks pretty solid! 👏
I left two nits on edge cases but leaving a 🟢 stamp here.
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @tomighita for this pr, just finished first round of review.
| // --- Root-level additions --- | ||
|
|
||
| /// Add a `NestedFieldRef` column to the table root. | ||
| pub fn add_field(self, field: NestedFieldRef) -> Self { |
There was a problem hiding this comment.
We should add some comments to explain that the field id is ignored for now.
|
|
||
| /// Disable automatic field ID assignment. When disabled, the placeholder IDs | ||
| /// provided in builder methods are used as-is. | ||
| pub fn disable_id_auto_assignment(mut self) -> Self { |
There was a problem hiding this comment.
I don't think we should provide this action, it's quite dangerous.
There was a problem hiding this comment.
While I see this can be a problem, I feel like some use cases may benefit from the freedom of overriding the default id assignment. One such instance is why I made this PR in the first place.
If you still believe this should not be here, please lmk, but I would be in favour of keeping a mechanism to control id assignment
There was a problem hiding this comment.
I want to see actual use case before we add it. With actual use case, we can determine if it's the best place to do it.
There was a problem hiding this comment.
We should avoid adding it as much as possible. You could add ut using MemoryCatalog
There was a problem hiding this comment.
I think you are referring to unit tests here? Not sure I follow, should i not use a rest client in integration tests?
There was a problem hiding this comment.
I think you could do two things:
- Use MemoryCatalog for unit test.
- Add a catalog test suit in
crates/catalog/loader/testsas others.
There was a problem hiding this comment.
I see. Will try to change these tests then
|
Thanks for the review @blackmwk! I implemented the api change you suggested |
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @tomighita for this pr!
| // Default ID for a new column. This will be re-assigned to a fresh ID at commit time. | ||
| const DEFAULT_ID: i32 = 0; | ||
|
|
||
| #[derive(TypedBuilder)] |
There was a problem hiding this comment.
nit: We should put comments above derived
| } | ||
|
|
||
| /// Return a copy with an updated parent path. | ||
| pub fn with_parent(mut self, parent: impl ToString) -> Self { |
There was a problem hiding this comment.
Why we need to provide this method? I think it has been covered by generated builder? If you want with_ prefix, typed builder already provides ways to do it.
| } | ||
|
|
||
| /// Return a copy with an updated doc string. | ||
| pub fn with_doc(mut self, doc: impl ToString) -> Self { |
|
|
||
| /// Disable automatic field ID assignment. When disabled, the placeholder IDs | ||
| /// provided in builder methods are used as-is. | ||
| pub fn disable_id_auto_assignment(mut self) -> Self { |
There was a problem hiding this comment.
I want to see actual use case before we add it. With actual use case, we can determine if it's the best place to do it.
| let pending_field = add.to_nested_field(); | ||
|
|
||
| // Check that name does not contain ".". | ||
| if pending_field.name.contains('.') { |
There was a problem hiding this comment.
We should make . a constant in schema module.
| match field.field_type.as_ref() { | ||
| Type::Primitive(_) => field.clone(), | ||
| Type::Struct(s) => { | ||
| let new_fields = rebuild_fields(s.fields(), adds, delete_ids, field.id); |
There was a problem hiding this comment.
nit: We could add a flag to see if it's unchanged to save a new creation of field.
There was a problem hiding this comment.
I would say that returning a (NestedFieldRef, bool) or Option<NestedFieldRef> to denote if something changed is not such a great API. Adding a column is also not on the hot path of iceberg, so i am not sure if it is worth it to optimise for this case.
Any thoughts?
There was a problem hiding this comment.
I'm fine with leaving as it is. It's not a public api, so I'm fine with it.
| }; | ||
|
|
||
| // The new field should have ID = last_column_id + 1 = 4. | ||
| let new_field = new_schema |
There was a problem hiding this comment.
This is difficult to read, we could construct an expected schema, and check their equality.
|
Thanks @blackmwk for the second round of review. I've added your suggestions and even moved the integration tests. |
|
@tomighita, thank you for your patch! That's what I was looking for. Could you tell about the flow of renaming the column name? As far as I understand, we need to either provide a separate method for this (e.g. |
|
@LLDay I think it would be a lot cleaner to build on top of this PR and build a newly dedicated That's a good call: before refactoring this PR, you could add a FieldRef directly, which after implementing @blackmwk's suggestion I don't think is possible any longer. In this case, we can even remove the However, I would personally be in favor of keeping this and allowing again the user to specify these type of operations where they want more control of assigning ids. Anyone any thoughts? |
…t PR apache#2120) Adds Transaction::update_schema() for programmatic schema evolution. Cherry-picked onto v0.9.0 tag from tomighita's upstream PR.
…t PR apache#2120) Adds Transaction::update_schema() for programmatic schema evolution. Cherry-picked onto v0.9.0 tag from tomighita's upstream PR.
|
@tomighita, hi again! I've got a review of your code from @kryvashek. You can look at the comments to apply fixes for this PR. |
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @tomighita for this pr, and sorry for late reply. I left some comments, and I still have concerns about the auto assign id.
| /// Sentinel parent ID representing the table root (top-level columns). | ||
| const TABLE_ROOT_ID: i32 = -1; | ||
| // Default ID for a new column. This will be re-assigned to a fresh ID at commit time. | ||
| const DEFAULT_ID: i32 = 0; |
There was a problem hiding this comment.
| const DEFAULT_ID: i32 = 0; | |
| const DEFAULT_FIELD_ID: i32 = 0; |
| fields: &[NestedFieldRef], | ||
| adds: &HashMap<i32, Vec<NestedFieldRef>>, | ||
| delete_ids: &HashSet<i32>, | ||
| root_id: i32, |
There was a problem hiding this comment.
What's the problem of that?
| match field.field_type.as_ref() { | ||
| Type::Primitive(_) => field.clone(), | ||
| Type::Struct(s) => { | ||
| let new_fields = rebuild_fields(s.fields(), adds, delete_ids, field.id); |
There was a problem hiding this comment.
I'm fine with leaving as it is. It's not a public api, so I'm fine with it.
| .to_vec(); | ||
| expected_fields | ||
| .push(NestedField::optional(4, "new_col", Type::Primitive(PrimitiveType::Int)).into()); | ||
| let expected_schema = Schema::builder() |
There was a problem hiding this comment.
We have a into_builder method to create a SchemaBuilder directly from Schema
There was a problem hiding this comment.
Please follow other tests' name convention and rename it schema_update_suite. And please take other tests as examples how to do integration tests against all catalogs.
Which issue does this PR close?
What changes are included in this PR?
This PR creates an
UpdateSchemawhich implements theTransactionActionallowing users of the crate to add new fields to or delete fields from an iceberg table by updating the table schema. It also checks that the added fields are either optional or have a default value to avoid data corruption downstream.Are these changes tested?
UpdateSchema.