feat(schema): Add `update_schema` action to enable table schema updates by tomighita · Pull Request #2120 · apache/iceberg-rust

tomighita · 2026-02-06T15:32:07Z

Which issue does this PR close?

Closes #2119 (Add support for adding new fields to an iceberg table).

What changes are included in this PR?

This PR adds an UpdateSchema action that implements TransactionAction, allowing users of the crate to add fields to or delete fields from an Iceberg table by updating the table schema. It also checks that added fields are either optional or have a default value, to avoid data corruption downstream.
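As a rough illustration of the "optional or has a default" rule described above, here is a minimal sketch. `PendingColumn` and `validate_addition` are hypothetical stand-ins written for this example, not types from the iceberg crate, and the default is modeled as a plain string rather than a `Literal`.

```rust
/// Hypothetical, simplified stand-in for a requested column addition.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingColumn {
    pub name: String,
    pub required: bool,
    /// Stand-in for a default literal; `None` means no default was given.
    pub initial_default: Option<String>,
}

/// Reject required columns that carry no default value.
/// Existing data files have no value for a newly added column, so a
/// required column without a default could not be read back safely.
pub fn validate_addition(col: &PendingColumn) -> Result<(), String> {
    if col.required && col.initial_default.is_none() {
        return Err(format!(
            "cannot add required column '{}' without a default value",
            col.name
        ));
    }
    Ok(())
}
```

Optional columns pass unconditionally in this sketch; only the combination "required and no default" is rejected.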

Are these changes tested?

Unit tests for UpdateSchema.
Adds one integration test that writes data, updates the table, and verifies the data can still be read (ensuring backwards compatibility), and one integration test for deleting a schema.

CTTY · 2026-02-06T18:33:07Z

I'm not sure if this is the right direction; exposing add_fields alone in the transaction layer seems very limiting. Maybe UpdateSchema can help with your use case?

tomighita · 2026-02-09T07:53:01Z

Hi @CTTY, this could be an option, indeed. Do you know if this has any chance to be merged? It seems to have no activity for the past 6 months.
I would not mind implementing it myself using the new tx api, just wanted to make sure this would actually get through if it were in a good shape.

tomighita · 2026-02-10T12:43:34Z

Updated the PR to add an UpdateSchema instead, based on the Java implementation.
Currently, it does not support all actions but we can build on top of this PR to add them as needed.

This also duplicates the work of #1172, but it updates it to match the new Transaction API.

tomighita · 2026-02-10T16:12:29Z

Also closes #697

DerGut · 2026-03-09T11:51:04Z

+        initial_default: Literal,
+    ) -> Self {
+        self.add_field(Arc::new(
+            NestedField::required(0, name, field_type).with_initial_default(initial_default),


According to the Schema Evolution section in the spec:

Struct evolution requires the following rules for default values:

The initial-default must be set when a field is added and cannot change

The write-default must be set when a field is added and may change

When a required field is added, both defaults must be set to a non-null value

Suggested change

NestedField::required(0, name, field_type).with_initial_default(initial_default),

NestedField::required(0, name, field_type).with_initial_default(initial_default).with_write_default(initial_default),

This is also done by the Java implementation in SchemaUpdate.internalAddColumn.

IIUC by only setting with_initial_default, readers will fill a value in on reads, but writers won't add it to new files.

Good catch! I will need to update other occurrences as well.

Awesome, thanks!
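To make the spec rule quoted in this thread concrete, the sketch below sets both defaults when a required field is created and checks the rule afterwards. `FieldSketch` and its methods are hypothetical names for this example only, and the default value is modeled as a `String` instead of the crate's `Literal`.

```rust
/// Hypothetical, simplified field model used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct FieldSketch {
    pub name: String,
    pub required: bool,
    pub initial_default: Option<String>,
    pub write_default: Option<String>,
}

impl FieldSketch {
    /// Build a required field whose initial-default and write-default are
    /// both set to the same value. The value is cloned because it is
    /// stored twice.
    pub fn required_with_default(name: &str, default: String) -> Self {
        FieldSketch {
            name: name.to_string(),
            required: true,
            initial_default: Some(default.clone()),
            write_default: Some(default),
        }
    }

    /// The rule from the spec excerpt: a required field must have both
    /// defaults set; optional fields are not constrained here.
    pub fn satisfies_default_rule(&self) -> bool {
        !self.required || (self.initial_default.is_some() && self.write_default.is_some())
    }
}
```

A field built with only an initial default fails `satisfies_default_rule` in this model, which mirrors the concern raised above about writers not populating new files.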

DerGut · 2026-03-09T13:16:12Z

+                write_default: field.write_default.clone(),
+            })
+        }
+        Type::Map(m) => {


The Java and pyiceberg implementations explicitly forbid deletion of map keys+values. IIUC this code just silently ignores them.

TBH I don't think this is very clear from the spec... apparently

Any struct, including a top-level schema, can evolve through deleting fields [...]

means any struct but no maps. So an explicit error is probably appropriate (same with list type btw)

I get your point, but this is not exactly what this method does. This method "recomputes" the parents, taking into account newly added fields and deleted fields together. If we added an error for map and list types here, we'd essentially block all updates for any struct that contains a map or list at any depth, which is undesirable.

What we could do instead is add a separate check explicitly checking that all the deleted fields are either of type struct or primitive. Thoughts on this @DerGut?
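One way the separate check proposed above might look is sketched here: walk the schema once and fail only if a deleted ID belongs to a map key, map value, or list element, while still allowing deletes of struct fields nested inside those containers. `Node`, `NodeType`, and `check_deletes` are hypothetical names for this example and do not come from the PR.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified type tree (not the crate's `Type`).
#[derive(Debug, Clone)]
pub enum NodeType {
    Primitive,
    Struct(Vec<Node>),
    List(Box<Node>),
    Map(Box<Node>, Box<Node>),
}

/// Hypothetical field: an ID plus its type.
#[derive(Debug, Clone)]
pub struct Node {
    pub id: i32,
    pub ty: NodeType,
}

/// Fail if any deleted ID is a map key, map value, or list element.
pub fn check_deletes(fields: &[Node], delete_ids: &HashSet<i32>) -> Result<(), String> {
    for field in fields {
        check_type(&field.ty, delete_ids)?;
    }
    Ok(())
}

fn check_type(ty: &NodeType, delete_ids: &HashSet<i32>) -> Result<(), String> {
    match ty {
        NodeType::Primitive => Ok(()),
        // Struct fields may be deleted; only look deeper for containers.
        NodeType::Struct(children) => check_deletes(children, delete_ids),
        NodeType::List(element) => check_member(element, "list element", delete_ids),
        NodeType::Map(key, value) => {
            check_member(key, "map key", delete_ids)?;
            check_member(value, "map value", delete_ids)
        }
    }
}

fn check_member(member: &Node, role: &str, delete_ids: &HashSet<i32>) -> Result<(), String> {
    if delete_ids.contains(&member.id) {
        return Err(format!("cannot delete {} (field id {})", role, member.id));
    }
    // A struct nested inside the container can still lose its own fields.
    check_type(&member.ty, delete_ids)
}
```

Keeping this as a standalone pass leaves the parent-rebuilding logic untouched, so structs that merely contain maps or lists remain updatable.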

DerGut · 2026-03-09T13:35:49Z

+                    if parent_struct
+                        .fields()
+                        .iter()
+                        .any(|f| f.name == pending.field.name && !delete_ids.contains(&f.id))


Maybe also add a check for !delete_ids.contains(&parent_id)?

Good point, added!
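For readers following this thread, here is one possible reading of the combined check, written as a standalone sketch: an addition is rejected if its parent is itself being deleted, or if a surviving sibling already uses the name. `ExistingField` and `can_add` are hypothetical names, and the exact semantics in the PR may differ.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of a field already present in the parent.
#[derive(Debug, Clone)]
pub struct ExistingField {
    pub id: i32,
    pub name: String,
}

/// Decide whether `new_name` may be added under the parent `parent_id`.
pub fn can_add(
    parent_id: i32,
    siblings: &[ExistingField],
    new_name: &str,
    delete_ids: &HashSet<i32>,
) -> Result<(), String> {
    // Adding into a parent that is deleted in the same update makes no sense.
    if delete_ids.contains(&parent_id) {
        return Err(format!("parent {} is being deleted", parent_id));
    }
    // A name clash only matters if the clashing sibling survives the update.
    if siblings
        .iter()
        .any(|f| f.name == new_name && !delete_ids.contains(&f.id))
    {
        return Err(format!("field '{}' already exists", new_name));
    }
    Ok(())
}
```

In this model, deleting a field and re-adding one with the same name in a single update is allowed, because the clashing sibling is in `delete_ids`.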

DerGut

Thanks for the work! I think this PR looks pretty solid! 👏

I left two nits on edge cases but leaving a 🟢 stamp here.

blackmwk

Thanks @tomighita for this pr, just finished first round of review.

blackmwk · 2026-03-13T03:37:12Z

+    // --- Root-level additions ---
+
+    /// Add a `NestedFieldRef` column to the table root.
+    pub fn add_field(self, field: NestedFieldRef) -> Self {


We should add some comments to explain that the field id is ignored for now.

blackmwk · 2026-03-13T03:43:33Z

+
+    /// Disable automatic field ID assignment. When disabled, the placeholder IDs
+    /// provided in builder methods are used as-is.
+    pub fn disable_id_auto_assignment(mut self) -> Self {


I don't think we should provide this action, it's quite dangerous.

While I see this can be a problem, I feel like some use cases may benefit from the freedom of overriding the default id assignment. One such instance is why I made this PR in the first place.

If you still believe this should not be here, please lmk, but I would be in favour of keeping a mechanism to control id assignment

I want to see an actual use case before we add it. With an actual use case, we can determine whether this is the best place for it.

blackmwk · 2026-03-13T03:56:22Z

We should avoid adding it as much as possible. You could add unit tests using MemoryCatalog.

I think you are referring to unit tests here? Not sure I follow: should I not use a REST client in integration tests?

I think you could do two things:

Use MemoryCatalog for unit test.

Add a catalog test suite in crates/catalog/loader/tests, as the others do.

I see. Will try to change these tests then

tomighita · 2026-03-16T14:39:34Z

Thanks for the review @blackmwk! I implemented the api change you suggested

blackmwk

Thanks @tomighita for this pr!

blackmwk · 2026-03-17T02:04:23Z

+// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
+const DEFAULT_ID: i32 = 0;
+
+#[derive(TypedBuilder)]


nit: We should put comments above the derive attribute.
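The placeholder-ID idea in the hunk above can be illustrated as follows: new columns start with a placeholder ID and receive fresh, consecutive IDs after the table's last column ID when the update is applied. `NewColumn` and `assign_fresh_ids` are hypothetical names for this sketch; only the placeholder constant mirrors the code under review.

```rust
/// Placeholder ID for a column that has not been assigned a real ID yet.
pub const DEFAULT_FIELD_ID: i32 = 0;

/// Hypothetical, simplified pending column.
#[derive(Debug, Clone, PartialEq)]
pub struct NewColumn {
    pub id: i32,
    pub name: String,
}

/// Replace placeholder IDs with fresh ones, counting up from
/// `last_column_id`. Returns the updated columns and the new last ID.
pub fn assign_fresh_ids(last_column_id: i32, mut pending: Vec<NewColumn>) -> (Vec<NewColumn>, i32) {
    let mut next = last_column_id;
    for col in pending.iter_mut() {
        next += 1;
        col.id = next;
    }
    (pending, next)
}
```

With a last column ID of 3, a single added column receives ID 4, matching the expectation stated in the unit test discussed later in this review.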

blackmwk · 2026-03-18T01:03:06Z

+    }
+
+    /// Return a copy with an updated parent path.
+    pub fn with_parent(mut self, parent: impl ToString) -> Self {


Why do we need to provide this method? I think it is already covered by the generated builder. If you want a with_ prefix, TypedBuilder already provides ways to do that.

blackmwk · 2026-03-18T01:03:14Z

+    }
+
+    /// Return a copy with an updated doc string.
+    pub fn with_doc(mut self, doc: impl ToString) -> Self {


blackmwk · 2026-03-18T01:25:53Z

+            let pending_field = add.to_nested_field();
+
+            // Check that name does not contain ".".
+            if pending_field.name.contains('.') {


We should make . a constant in schema module.
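A minimal sketch of what this suggestion could look like is given below. The constant name `FIELD_PATH_SEPARATOR` and the helper `validate_field_name` are assumptions made for illustration; the review does not say what the constant should be called or where exactly it would live.

```rust
/// Hypothetical shared constant for the separator used in nested field paths.
pub const FIELD_PATH_SEPARATOR: char = '.';

/// Reject field names containing the path separator, since such a name
/// would be ambiguous with a nested path like `parent.child`.
pub fn validate_field_name(name: &str) -> Result<(), String> {
    if name.contains(FIELD_PATH_SEPARATOR) {
        return Err(format!(
            "field name '{}' must not contain '{}'",
            name, FIELD_PATH_SEPARATOR
        ));
    }
    Ok(())
}
```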

blackmwk · 2026-03-18T01:48:11Z

+    match field.field_type.as_ref() {
+        Type::Primitive(_) => field.clone(),
+        Type::Struct(s) => {
+            let new_fields = rebuild_fields(s.fields(), adds, delete_ids, field.id);


nit: We could add a flag indicating whether the field is unchanged, to avoid creating a new field unnecessarily.

I would say that returning a (NestedFieldRef, bool) or Option<NestedFieldRef> to denote whether something changed is not such a great API. Adding a column is also not on the hot path of iceberg, so I am not sure it is worth optimising for this case.
Any thoughts?

I'm fine with leaving as it is. It's not a public api, so I'm fine with it.

blackmwk · 2026-03-18T01:55:43Z

+        };
+
+        // The new field should have ID = last_column_id + 1 = 4.
+        let new_field = new_schema


This is difficult to read, we could construct an expected schema, and check their equality.

tomighita · 2026-03-20T13:17:46Z

Thanks @blackmwk for the second round of review. I've added your suggestions and also moved the integration tests.
As for the disable auto assign id part: we built a platform with our own schema registry. It converts data from proto to Arrow and then Parquet. Because of that, we want to control the assignment of IDs based on how the proto definition evolves.

LLDay · 2026-03-24T09:14:54Z

@tomighita, thank you for your patch! That's what I was looking for. Could you tell about the flow of renaming the column name? As far as I understand, we need to either provide a separate method for this (e.g. rename_column), or provide the field id in the AddColumn builder.

tomighita · 2026-03-25T08:02:51Z

@LLDay I think it would be a lot cleaner to build on top of this PR and add a dedicated rename_column method, similar to other implementations (like you mentioned). I did not add it in this PR because I wanted to keep things relatively small and contained to make it easier to review.

That's a good call: before refactoring this PR, you could add a FieldRef directly, which I don't think is possible any longer after implementing @blackmwk's suggestion. In this case, we can even remove the disable_id_auto_assignment method.

However, I would personally be in favor of keeping this and again allowing users to specify these types of operations where they want more control over ID assignment.

Any thoughts?

…t PR apache#2120) Adds Transaction::update_schema() for programmatic schema evolution. Cherry-picked onto v0.9.0 tag from tomighita's upstream PR.

LLDay · 2026-04-13T13:38:57Z

@tomighita, hi again! I've got a review of your code from @kryvashek. You can look at the comments to apply fixes for this PR.

blackmwk

Thanks @tomighita for this pr, and sorry for the late reply. I left some comments, and I still have concerns about the auto assign id.

blackmwk · 2026-04-02T07:36:59Z

+/// Sentinel parent ID representing the table root (top-level columns).
+const TABLE_ROOT_ID: i32 = -1;
+// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
+const DEFAULT_ID: i32 = 0;


Suggested change

const DEFAULT_ID: i32 = 0;

const DEFAULT_FIELD_ID: i32 = 0;

blackmwk · 2026-04-16T02:03:22Z

+    fields: &[NestedFieldRef],
+    adds: &HashMap<i32, Vec<NestedFieldRef>>,
+    delete_ids: &HashSet<i32>,
+    root_id: i32,


What's the problem with that?

blackmwk · 2026-04-16T09:20:31Z

+            .to_vec();
+        expected_fields
+            .push(NestedField::optional(4, "new_col", Type::Primitive(PrimitiveType::Int)).into());
+        let expected_schema = Schema::builder()


We have an into_builder method to create a SchemaBuilder directly from a Schema.

blackmwk · 2026-04-16T09:27:57Z

Please follow the other tests' naming convention and rename it schema_update_suite. Please also use the other tests as examples of how to run integration tests against all catalogs.

add update_schema

10f287a

tomighita force-pushed the tomighita/feat-add-fields-table-action branch from 9988555 to 10f287a Compare February 10, 2026 12:38

tomighita changed the title ~~feat(schema): Add add_fields action to enable table schema updates~~ feat(schema): Add update_schema action to enable table schema updates Feb 10, 2026

Cargo format update

6829c7f

tomighita and others added 2 commits February 10, 2026 14:52

Fix clippy!

0e5d3e6

Merge branch 'main' into tomighita/feat-add-fields-table-action

3176078

Merge branch 'main' into tomighita/feat-add-fields-table-action

3410293

DerGut reviewed Mar 9, 2026

Add write_default to new fields

86e5142

tomighita requested a review from DerGut March 9, 2026 12:33

tomighita force-pushed the tomighita/feat-add-fields-table-action branch from 755a08b to 86e5142 Compare March 9, 2026 12:36

tomighita and others added 3 commits March 9, 2026 14:38

Merge branch 'main' into tomighita/feat-add-fields-table-action

f05f046

Fix new FileIO api

cc92962

Add integration test storage factory

22e7bb8

DerGut reviewed Mar 9, 2026

DerGut approved these changes Mar 9, 2026

blackmwk reviewed Mar 13, 2026

tomighita added 2 commits March 16, 2026 14:09

add extra delete check for parent

35b0dfa

Refactor AddColumn API

2216f3c

blackmwk reviewed Mar 18, 2026

tomighita added 3 commits March 20, 2026 14:57

implement nits

c3aa21b

Update unit tests

d680462

Move integration tests

d134608

run formatter

7bf9dda

tomighita requested a review from blackmwk March 20, 2026 13:25

blackmwk reviewed Apr 16, 2026
