feat(schema): Add update_schema action to enable table schema updates #2120

Open
tomighita wants to merge 15 commits into apache:main from dbt-labs:tomighita/feat-add-fields-table-action
@tomighita tomighita commented Feb 6, 2026

Which issue does this PR close?

What changes are included in this PR?

This PR adds an UpdateSchema action that implements TransactionAction, allowing users of the crate to add fields to, or delete fields from, an Iceberg table by updating the table schema. It also checks that each added field is either optional or has a default value, to avoid data corruption downstream.

Are these changes tested?

  • Unit tests for UpdateSchema.
  • Two integration tests: one checks that we can add data, update the table schema, and still read the data (ensuring backwards compatibility); the other covers deleting a field from the schema.

Collaborator

CTTY commented Feb 6, 2026

I'm not sure this is the right direction; exposing add_fields alone in the transaction layer seems very limiting. Maybe UpdateSchema can help with your use case?

Author

tomighita commented Feb 9, 2026

Hi @CTTY, this could indeed be an option. Do you know if it has any chance of being merged? It seems to have had no activity for the past 6 months.
I would not mind implementing it myself using the new tx API; I just wanted to make sure it would actually get through if it were in good shape.

@tomighita tomighita force-pushed the tomighita/feat-add-fields-table-action branch from 9988555 to 10f287a February 10, 2026 12:38
@tomighita tomighita changed the title from "feat(schema): Add add_fields action to enable table schema updates" to "feat(schema): Add update_schema action to enable table schema updates" Feb 10, 2026
Author

Updated the PR to add an UpdateSchema instead, based on the Java implementation.
Currently, it does not support all actions but we can build on top of this PR to add them as needed.

This also duplicates the work of #1172, but it updates it to match the new Transaction API.

Author

Also closes #697

initial_default: Literal,
) -> Self {
self.add_field(Arc::new(
NestedField::required(0, name, field_type).with_initial_default(initial_default),
Contributor

According to the Schema Evolution section in the spec:

Struct evolution requires the following rules for default values:

  • The initial-default must be set when a field is added and cannot change
  • The write-default must be set when a field is added and may change
  • When a required field is added, both defaults must be set to a non-null value
Suggested change
- NestedField::required(0, name, field_type).with_initial_default(initial_default),
+ NestedField::required(0, name, field_type).with_initial_default(initial_default.clone()).with_write_default(initial_default),

This is also done by the Java implementation in SchemaUpdate.internalAddColumn.
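
For illustration, the spec rule quoted above can be sketched as a standalone check. This is a minimal sketch under stated assumptions: `NewColumn` and `validate_added_column` are hypothetical names, and `i64` stands in for the crate's `Literal`; it is not the iceberg-rust API.

```rust
/// Simplified stand-in for a column being added (hypothetical, not `NestedField`).
#[derive(Debug, Clone, PartialEq)]
struct NewColumn {
    name: String,
    required: bool,
    initial_default: Option<i64>,
    write_default: Option<i64>,
}

impl NewColumn {
    /// Required column: the same value seeds both defaults, as in the suggested change.
    fn required(name: &str, default: i64) -> Self {
        NewColumn {
            name: name.to_string(),
            required: true,
            initial_default: Some(default),
            write_default: Some(default),
        }
    }

    /// Optional column: no defaults are needed.
    fn optional(name: &str) -> Self {
        NewColumn {
            name: name.to_string(),
            required: false,
            initial_default: None,
            write_default: None,
        }
    }
}

/// Reject a required column unless both defaults are set to a non-null value.
fn validate_added_column(col: &NewColumn) -> Result<(), String> {
    if col.required && (col.initial_default.is_none() || col.write_default.is_none()) {
        return Err(format!(
            "required column '{}' needs both initial-default and write-default",
            col.name
        ));
    }
    Ok(())
}
```

A required column built with only an initial default would fail this check, which is the case the suggested change guards against.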

Contributor

IIUC by only setting with_initial_default, readers will fill a value in on reads, but writers won't add it to new files.

Author

Good catch! I will need to update other occurrences as well.

Author

Done ✅

Contributor

Awesome, thanks!

@tomighita tomighita requested a review from DerGut March 9, 2026 12:33
@tomighita tomighita force-pushed the tomighita/feat-add-fields-table-action branch from 755a08b to 86e5142 March 9, 2026 12:36
write_default: field.write_default.clone(),
})
}
Type::Map(m) => {
Contributor

The Java and pyiceberg implementations explicitly forbid deletion of map keys+values. IIUC this code just silently ignores them.

TBH I don't think this is very clear from the spec... apparently

Any struct, including a top-level schema, can evolve through deleting fields [...]

means any struct but no maps. So an explicit error is probably appropriate (same with list type btw)

Author

I get your point, but this is not exactly what this method does. It "recomputes" the parents, taking newly added and deleted fields into account together. If we added an error for map and list types here, we'd essentially block all updates to any struct that contains a map or list at any depth, which is undesirable.

What we could do instead is add a separate check explicitly checking that all the deleted fields are either of type struct or primitive. Thoughts on this @DerGut?
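
One possible shape for such a separate check is sketched below. It follows the reviewer's reading (reject deletes that target a map key, map value, or list element, while still allowing deletes of struct fields nested inside them). The `Type` and `Field` definitions are simplified stand-ins, and `check_deletes` is a hypothetical name rather than anything from this PR.

```rust
use std::collections::HashSet;

/// Simplified stand-in for an Iceberg type tree (not the iceberg-rust types).
#[derive(Debug)]
enum Type {
    Primitive,
    Struct(Vec<Field>),
    List(Box<Field>),
    Map(Box<Field>, Box<Field>),
}

#[derive(Debug)]
struct Field {
    id: i32,
    field_type: Type,
}

/// Walk the struct fields and fail if any deleted ID is a list element or map key/value.
fn check_deletes(fields: &[Field], delete_ids: &HashSet<i32>) -> Result<(), String> {
    for field in fields {
        check_type(&field.field_type, delete_ids)?;
    }
    Ok(())
}

fn check_type(ty: &Type, delete_ids: &HashSet<i32>) -> Result<(), String> {
    match ty {
        Type::Primitive => Ok(()),
        // Struct children may be deleted; only recurse to inspect nested types.
        Type::Struct(children) => check_deletes(children, delete_ids),
        Type::List(element) => {
            reject_if_deleted(element, "list element", delete_ids)?;
            check_type(&element.field_type, delete_ids)
        }
        Type::Map(key, value) => {
            reject_if_deleted(key, "map key", delete_ids)?;
            reject_if_deleted(value, "map value", delete_ids)?;
            check_type(&key.field_type, delete_ids)?;
            check_type(&value.field_type, delete_ids)
        }
    }
}

fn reject_if_deleted(field: &Field, role: &str, delete_ids: &HashSet<i32>) -> Result<(), String> {
    if delete_ids.contains(&field.id) {
        Err(format!("cannot delete {} (field id {})", role, field.id))
    } else {
        Ok(())
    }
}
```

Because the check runs separately from the rebuild step, structs that merely contain a map or list stay updatable; only deletes aimed at the key, value, or element itself are rejected.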

if parent_struct
.fields()
.iter()
.any(|f| f.name == pending.field.name && !delete_ids.contains(&f.id))
Contributor

Maybe also add a check for !delete_ids.contains(&parent_id)?

Author

Good point, added!

Contributor

@DerGut DerGut left a comment

Thanks for the work! I think this PR looks pretty solid! 👏

I left two nits on edge cases but leaving a 🟢 stamp here.

Contributor

@blackmwk blackmwk left a comment

Thanks @tomighita for this pr, just finished first round of review.

// --- Root-level additions ---

/// Add a `NestedFieldRef` column to the table root.
pub fn add_field(self, field: NestedFieldRef) -> Self {
Contributor

We should add some comments to explain that the field id is ignored for now.

Author

done

Comment thread crates/iceberg/src/transaction/update_schema.rs Outdated

/// Disable automatic field ID assignment. When disabled, the placeholder IDs
/// provided in builder methods are used as-is.
pub fn disable_id_auto_assignment(mut self) -> Self {
Contributor

I don't think we should provide this action, it's quite dangerous.

Author

While I see this can be a problem, I feel like some use cases may benefit from the freedom of overriding the default id assignment. One such instance is why I made this PR in the first place.

If you still believe this should not be here, please let me know, but I would be in favour of keeping a mechanism to control ID assignment.

Contributor

I want to see actual use case before we add it. With actual use case, we can determine if it's the best place to do it.

Contributor

We should avoid adding it as much as possible. You could add unit tests using MemoryCatalog.

Author

I think you are referring to unit tests here? I'm not sure I follow: should I not use a REST client in integration tests?

Contributor

I think you could do two things:

  1. Use MemoryCatalog for unit tests.
  2. Add a catalog test suite in crates/catalog/loader/tests, as others do.

Author

I see. Will try to change these tests then

Author

Thanks for the review @blackmwk! I implemented the api change you suggested

Contributor

@blackmwk blackmwk left a comment

Thanks @tomighita for this pr!

// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
const DEFAULT_ID: i32 = 0;

#[derive(TypedBuilder)]
Contributor

nit: We should put the comment above the derive attribute.

Author

Done

}

/// Return a copy with an updated parent path.
pub fn with_parent(mut self, parent: impl ToString) -> Self {
Contributor

Why do we need to provide this method? I think it is already covered by the generated builder. If you want the with_ prefix, typed builder already provides a way to do it.

Author

Removed

}

/// Return a copy with an updated doc string.
pub fn with_doc(mut self, doc: impl ToString) -> Self {
Contributor

Ditto.

Author

Removed

let pending_field = add.to_nested_field();

// Check that name does not contain ".".
if pending_field.name.contains('.') {
Contributor

We should make "." a constant in the schema module.

Author

Done
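
As a rough illustration of the shared-constant idea, the separator can be defined once and reused by both the name check and path construction. The names `NAME_SEPARATOR`, `validate_column_name`, and `full_path` are hypothetical; the actual constant added in the PR may be named and placed differently.

```rust
/// Hypothetical name for the shared separator between nested field names.
const NAME_SEPARATOR: char = '.';

/// Reject names that contain the separator, since they would be ambiguous in a path.
fn validate_column_name(name: &str) -> Result<(), String> {
    if name.contains(NAME_SEPARATOR) {
        Err(format!(
            "column name '{}' must not contain '{}'",
            name, NAME_SEPARATOR
        ))
    } else {
        Ok(())
    }
}

/// Join an optional parent path and a column name using the same constant.
fn full_path(parent: Option<&str>, name: &str) -> String {
    match parent {
        Some(p) => format!("{}{}{}", p, NAME_SEPARATOR, name),
        None => name.to_string(),
    }
}
```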

match field.field_type.as_ref() {
Type::Primitive(_) => field.clone(),
Type::Struct(s) => {
let new_fields = rebuild_fields(s.fields(), adds, delete_ids, field.id);
Contributor

nit: We could add a flag that tracks whether the field is unchanged, to avoid creating a new field unnecessarily.

Author

I would say that returning a (NestedFieldRef, bool) or Option<NestedFieldRef> to denote whether something changed is not such a great API. Adding a column is also not on the hot path of Iceberg, so I am not sure it is worth optimising for this case.
Any thoughts?

Contributor

I'm fine with leaving as it is. It's not a public api, so I'm fine with it.

Comment thread crates/iceberg/src/transaction/update_schema.rs
Comment thread crates/iceberg/src/transaction/update_schema.rs
Comment thread crates/iceberg/src/transaction/update_schema.rs
};

// The new field should have ID = last_column_id + 1 = 4.
let new_field = new_schema
Contributor

This is difficult to read; we could construct an expected schema and check for equality.

Author

Done
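
The ID expectation in the test comment (new field ID = last_column_id + 1) can be illustrated with a small standalone sketch. `DEFAULT_ID` mirrors the placeholder constant quoted in the review; `PendingColumn` and `assign_fresh_ids` are hypothetical and only handle top-level columns, so this is not the PR's implementation.

```rust
/// Placeholder ID carried by a pending column until commit time.
const DEFAULT_ID: i32 = 0;

/// Hypothetical stand-in for a column waiting to be added.
#[derive(Debug, Clone, PartialEq)]
struct PendingColumn {
    id: i32,
    name: String,
}

/// Assign consecutive fresh IDs starting at `last_column_id + 1`.
/// Returns the new last column ID after all assignments.
fn assign_fresh_ids(last_column_id: i32, pending: &mut [PendingColumn]) -> i32 {
    let mut next_id = last_column_id;
    for column in pending.iter_mut() {
        next_id += 1;
        column.id = next_id;
    }
    next_id
}
```

With a table whose last column ID is 3, the first pending column receives ID 4, matching the expectation in the test under review.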

Author

Thanks @blackmwk for the second round of review. I've applied your suggestions and also moved the integration tests.
As for disabling automatic ID assignment: we built a platform with our own schema registry. It converts data from proto to Arrow and then Parquet. Because of that, we want to control the assignment of IDs based on how the proto definition evolves.

@tomighita tomighita requested a review from blackmwk March 20, 2026 13:25
Contributor

LLDay commented Mar 24, 2026

@tomighita, thank you for your patch! That's what I was looking for. Could you describe the flow for renaming a column? As far as I understand, we need to either provide a separate method for this (e.g. rename_column) or provide the field ID in the AddColumn builder.

Author

@LLDay I think it would be a lot cleaner to build on top of this PR and add a dedicated rename_column method, similar to other implementations (as you mentioned). I did not add it in this PR because I wanted to keep things relatively small and contained, to make it easier to review.

That's a good call: before this PR was refactored, you could add a FieldRef directly, which I don't think is possible any longer after implementing @blackmwk's suggestion. In that case, we can even remove the disable_id_auto_assignment method.

However, I would personally be in favor of keeping it and again allowing users to specify these types of operations where they want more control over ID assignment.

Any thoughts, anyone?

greedAuguria pushed a commit to auguria-io/iceberg-rust that referenced this pull request Apr 5, 2026
…t PR apache#2120)

Adds Transaction::update_schema() for programmatic schema evolution.
Cherry-picked onto v0.9.0 tag from tomighita's upstream PR.
Contributor

LLDay commented Apr 13, 2026

@tomighita, hi again! I've got a review of your code from @kryvashek. You can look at the comments to apply fixes for this PR.

Contributor

@blackmwk blackmwk left a comment

Thanks @tomighita for this PR, and sorry for the late reply. I left some comments, and I still have concerns about the auto assign id.

/// Sentinel parent ID representing the table root (top-level columns).
const TABLE_ROOT_ID: i32 = -1;
// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
const DEFAULT_ID: i32 = 0;
Contributor

Suggested change
- const DEFAULT_ID: i32 = 0;
+ const DEFAULT_FIELD_ID: i32 = 0;

fields: &[NestedFieldRef],
adds: &HashMap<i32, Vec<NestedFieldRef>>,
delete_ids: &HashSet<i32>,
root_id: i32,
Contributor

What's the problem of that?

.to_vec();
expected_fields
.push(NestedField::optional(4, "new_col", Type::Primitive(PrimitiveType::Int)).into());
let expected_schema = Schema::builder()
Contributor

We have an into_builder method to create a SchemaBuilder directly from a Schema.

Contributor

Please follow the naming convention of the other tests and rename it schema_update_suite. Please also use the other tests as examples of how to run integration tests against all catalogs.

Development

Successfully merging this pull request may close these issues.

Add support for adding new fields to an iceberg table

5 participants