feat(transaction): add RowDelta transaction action for row-level modi…#2203
feat(transaction): add RowDelta transaction action for row-level modi…#2203wirybeaver wants to merge 2 commits intoapache:mainfrom
Conversation
…fications This commit implements the core transaction infrastructure for MERGE INTO, UPDATE, and DELETE operations in Apache Iceberg-Rust. Based on the official Iceberg Java implementation (RowDelta API). **New file: `crates/iceberg/src/transaction/row_delta.rs`** - RowDeltaAction: Transaction action supporting both data file additions and deletions in a single snapshot - add_data_files(): Add new data files (inserts/rewrites in COW mode) - remove_data_files(): Mark data files as deleted (COW mode) - add_delete_files(): Reserved for future Merge-on-Read (MOR) support - validate_from_snapshot(): Conflict detection for concurrent modifications - RowDeltaOperation: Implements SnapshotProduceOperation trait - Determines operation type (Append/Delete/Overwrite) based on changes - Generates DELETED manifest entries for removed files - Carries forward existing manifests for unchanged data **Modified: `crates/iceberg/src/transaction/mod.rs`** - Add row_delta() method to Transaction API - Export row_delta module **Modified: `crates/iceberg/src/transaction/snapshot.rs`** - Add write_delete_manifest() to write DELETED manifest entries - Update manifest_file() to process delete entries from SnapshotProduceOperation - Update validation to allow delete-only operations Comprehensive unit tests with ~85% coverage: - test_row_delta_add_only: Pure append operation - test_row_delta_remove_only: Delete-only operation - test_row_delta_add_and_remove: COW update (remove old, add new) - test_row_delta_with_snapshot_properties: Custom snapshot properties - test_row_delta_validate_from_snapshot: Snapshot validation logic - test_row_delta_empty_action: Empty operation error handling - test_row_delta_incompatible_partition_value: Partition validation All existing tests pass (1135 passed; 0 failed). Copy-on-Write (COW) Strategy: - For row-level modifications: read target files, apply changes, write new files, mark old files deleted - For inserts: write new data files - Merge-on-Read (MOR) with delete files is reserved for future optimization References: - Java implementation: org.apache.iceberg.RowDelta, BaseRowDelta - Based on implementation plan for MERGE INTO support
There was a problem hiding this comment.
This looks great, thanks!
The implementation refer to the official Iceberg Java implementation (RowDelta API).
I'm not sure whether you want to link to the specific commit of iceberg that you're referring to here as well. That way, anyone doing some digging around in PRs at a later date will know where this came from. Not a huge deal though 👍
The tests here are really useful to demonstrate this functionality as well.
The CI issues are for typos, but it is a false-positive because MOR is standing for Merge-On-Read. So you'll need to flag that as okay.
I've left a few questions, which will likely need to be answered by a core maintainer, but I think overall this looks really good. So it is ready for a full maintainer review 💯
| assert!(result.is_err()); | ||
|
|
||
| // Verify the error message mentions snapshot validation | ||
| if let Err(e) = result { | ||
| assert!( | ||
| e.to_string().contains("stale snapshot") || e.to_string().contains("Cannot commit") | ||
| ); |
There was a problem hiding this comment.
Is there a specific error you can assert to here with result.unwrap_err?
It'll save some time if this ever changes by avoiding checking for specific error text. I don't think it is much of an issue if not though.
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
These are some really great tests, good stuff 🎉
| if delete_entries.is_empty() { | ||
| return Err(Error::new( | ||
| ErrorKind::PreconditionFailed, | ||
| "No delete entries found when write a delete manifest file", |
There was a problem hiding this comment.
| "No delete entries found when write a delete manifest file", | |
| "No delete entries found when writing a delete manifest file", |
I believe this is the intention?
| /// Delete files to add (reserved for future MOR mode support) | ||
| added_delete_files: Vec<DataFile>, |
|
|
||
| let snapshot_producer = SnapshotProducer::new( | ||
| table, | ||
| self.commit_uuid.unwrap_or_else(Uuid::now_v7), |
There was a problem hiding this comment.
It looks like some parts of the code use Uuid::v4 and others are using v7.
Perhaps a question to core maintainers as to whether this matters much? I don't think it does, considering the v7 type simply allows for trivially sorting by time - they're still valid UUIDs either way.
Maybe something to keep in mind.
| ); | ||
|
|
||
| // Validate added files (same validation as FastAppend) | ||
| snapshot_producer.validate_added_data_files()?; |
There was a problem hiding this comment.
As far as I can tell, this is only looking at validating the self.added_data_files, but we're also including some DataFiles here for self.removed_data_files.
Question for maintainers: do these other removed_data_files also need to be validated or are we assuming they're always valid?
It seems to me like the Java impl has logic for validating/skipping delete validation
| /// | ||
| /// Logic matches Java implementation in BaseRowDelta: | ||
| /// - Only adds data files (no deletes, no removes) → Append | ||
| /// - Only adds delete files → Delete |
There was a problem hiding this comment.
This Operation::Delete variant is missing from this function.
Do I understand rightly that this is reserved for future delete files? Or was this missed out
d270bf1 to
1a3944c
Compare
Issue: #2202, a fundamental step of the epic #2201
This PR aims to implements the core transaction RowDeltaAction for MERGE INTO, UPDATE, and DELETE operations in Apache Iceberg-Rust for CoW strategy. The implementation refer to the official Iceberg Java implementation (RowDelta API).