feat: add bulk insertion to deletion vector #1578
liurenjie1024
left a comment
Thanks @dentiny for this pr, generally looks good!
crates/iceberg/src/delete_vector.rs
Outdated
    pub fn append_positions(&mut self, positions: &[u64]) -> bool {
        let expected_num = positions.len();
        let appended_num = self.inner.append(positions.iter().copied()).unwrap();
        appended_num as usize == expected_num
This definition seems odd to me: it treats an insertion that contains duplicates as a failure. I would suggest returning nothing.
Thanks for the comment!
Yeah I think it's too specific for my use case (which requires no duplicates and ordering).
In the latest commit, I updated the return value to the number of items inserted:
- It matches the interface for roaring bitmap
- For my own usage, I do want to assert no duplicate insertions
Let me know if you're fine with it, thank you for the quick review and constructive comments!
crates/iceberg/src/delete_vector.rs
Outdated
    /// Precondition: The values of the iterator must be ordered and strictly greater than the greatest value in the set.
    /// If a value in the iterator doesn’t satisfy this requirement, it is not added and the append operation is stopped.
    #[allow(dead_code)]
    pub fn insert_positions(&mut self, positions: &[u64]) -> u64 {
There is some misunderstanding here. What I mean is:
- If the inner `append` fails, we should also return this error.
- If it succeeds, we should return the actual number inserted, rather than the length of `positions`. The difference is that if `inner` already contains some of the values in `positions`, the returned count should subtract the common part.
Please add detailed docs for the behavior.
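To make the requested return value concrete, here is a minimal sketch of "count only what was newly inserted". It is not the PR's code: `DeleteVectorSketch` is a hypothetical type, and a `BTreeSet` stands in for the roaring bitmap so the example has no external dependencies.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the deletion vector. A `BTreeSet` replaces the
/// roaring bitmap purely for illustration.
#[derive(Default)]
struct DeleteVectorSketch {
    inner: BTreeSet<u64>,
}

impl DeleteVectorSketch {
    /// Inserts every position and returns how many were newly added.
    /// Positions that are already present are skipped and not counted,
    /// so the result can be smaller than `positions.len()`.
    fn insert_positions(&mut self, positions: &[u64]) -> u64 {
        positions
            .iter()
            // `BTreeSet::insert` returns `true` only for values not yet in the set.
            .filter(|&&pos| self.inner.insert(pos))
            .count() as u64
    }
}
```

With this shape, inserting `[1, 3, 5]` and then `[3, 5, 7]` reports 3 and then 1, since only `7` is new in the second call. That is the kind of intersection case a test could cover.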
Also please add some tests for the case where there are intersections.
> If the inner `append` fails, we should also return this error.
I considered it:
- I ran into problems wrapping it in an iceberg error: I don't want to add another error category for this particular error type, and wrapping may lose information (the number of successfully inserted rows)
- Returning the number of successfully inserted rows is well-defined behavior, and it's always the caller's responsibility to check whether the insertion worked
Another consideration here is whether we want users to be able to continue inserting after a failure:
- Returning an error without specifying how many rows were inserted means the deletion vector could be in a broken state, and users are not supposed to continue inserting;
- Returning the number of rows inserted indicates clearly which part was inserted and which part was not, so it's actually legal to continue inserting after updating the input arguments.
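The second option can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `AppendOnlySet` and `append_skipping_rejected` are hypothetical names, a `BTreeSet` stands in for the roaring bitmap, and the append rule follows the precondition quoted in the doc comment above (stop at the first value that is not strictly greater than the current maximum).

```rust
use std::collections::BTreeSet;

/// Hypothetical append-only set; a `BTreeSet` stands in for the roaring bitmap.
#[derive(Default)]
struct AppendOnlySet {
    inner: BTreeSet<u64>,
}

impl AppendOnlySet {
    /// Appends positions while each one is strictly greater than the current
    /// maximum. Stops at the first violation and returns how many were added.
    fn append_positions(&mut self, positions: &[u64]) -> usize {
        let mut appended = 0;
        for &pos in positions {
            if let Some(&max) = self.inner.iter().next_back() {
                if pos <= max {
                    break; // precondition violated: stop without adding `pos`
                }
            }
            self.inner.insert(pos);
            appended += 1;
        }
        appended
    }
}

/// Caller-side recovery: use the returned count to drop the accepted prefix
/// and the single rejected value, then continue with the remainder.
fn append_skipping_rejected(set: &mut AppendOnlySet, mut rest: &[u64]) -> usize {
    let mut total = 0;
    while !rest.is_empty() {
        let n = set.append_positions(rest);
        total += n;
        // Skip the `n` accepted values plus the rejected one, if any remains.
        rest = &rest[(n + 1).min(rest.len())..];
    }
    total
}
```

Because the count says exactly where the append stopped, the caller can decide whether to skip the offending value (as here), re-sort the input, or give up. An error with no count would not leave that choice open.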
> Also please add some tests for the case where there are intersections.
I actually already have certain cases:
iceberg-rust/crates/iceberg/src/delete_vector.rs
Lines 176 to 181 in b297231
Do you think that's enough?
I updated the doc and impl to return a `Result`, wrapping the returned value in `iceberg::Result`;
let me know if you think it makes sense :)
Thank you so much for the explanation and quick review!
1b1127f to aefed79
liurenjie1024
left a comment
Thanks @dentiny for this pr, LGTM! Just one nit about error handling.
Thank you so much for the careful review! I learnt a lot.
liurenjie1024
left a comment
Thanks @dentiny for this pr!
## What changes are included in this PR?

In this PR, I added a bulk insertion API to deletion vector and roaring bitmap.

Context:
- I work on iceberg-related features on a daily basis, and I'm implementing my own deletion vector and puffin blob
  + Code reference: https://github.com/Mooncake-Labs/moonlink/blob/main/src/moonlink/src/storage/iceberg/deletion_vector.rs
- I would like to leverage upstream's implementation to avoid reinventing the wheel, and then I noticed some differences
  + My impl supports bulk insertion, because `append` provides better perf
  + In my use case, all deleted rows are fetched in ascending order

## Are these changes tested?

Yes, unit tests added.