Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add is_valid and truncate methods to NullBufferBuilder #7013

Merged
merged 14 commits into from
Jan 27, 2025
71 changes: 70 additions & 1 deletion arrow-buffer/src/builder/null.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,6 @@ impl NullBufferBuilder {
/// Creates a new builder from a `MutableBuffer`.
pub fn new_from_buffer(buffer: MutableBuffer, len: usize) -> Self {
let capacity = buffer.len() * 8;

assert!(len <= capacity);

let bitmap_builder = Some(BooleanBufferBuilder::new_from_buffer(buffer, len));
Expand Down Expand Up @@ -134,6 +133,42 @@ impl NullBufferBuilder {
}
}

/// Returns the capacity of the buffer
alamb marked this conversation as resolved.
Show resolved Hide resolved
#[inline]
pub fn capacity(&self) -> usize {
if let Some(buf) = self.bitmap_builder.as_ref() {
buf.capacity()
} else {
0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think need to reference self.capacity here instead of 0?

Copy link
Contributor Author

@Chen-Yuan-Lai Chen-Yuan-Lai Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not sure when bitmap_builder hasn't been initialized, we should return the actual capacity of it (that is 0 because no builder existed) or self.capacity.

Copy link
Contributor

@Jefffrey Jefffrey Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion it would feel weird if I initialize a NullBufferBuilder with a capacity:

let nbb = NullBufferBuilder::new(10)

But then checking the capacity right after would result in it saying 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is a bit strange, but I think the PR as written makes the most sense.

Specifically, downstream in DataFusion (and other places) we use the capacity as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.

I think we can make this less confusing with some comments. I left some suggestions

Another thing that might help might be to rename it to fn allocated_capacity() to make the difference more explicit 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn allocated_capacity() would make sense, but it seems we already have that:

/// Return the allocated size of this builder, in bytes, useful for memory accounting.
pub fn allocated_size(&self) -> usize {
self.bitmap_builder
.as_ref()
.map(|b| b.capacity())
.unwrap_or(0)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rename capacity field to initial_capacity or initial_bits to clearly indicate this is for initialization only?

pub fn new(capacity: usize) -> Self {
      Self {
          bitmap_builder: None,
          len: 0,
          initial_capacity,
      }
 }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to add a capacity() method given allocated_size() seems to already do what is required

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed capacity method in 8ed7851

}
}

/// Gets a bit in the buffer at `index`
#[inline]
pub fn is_valid(&mut self, index: usize) -> bool {
Jefffrey marked this conversation as resolved.
Show resolved Hide resolved
if let Some(buf) = self.bitmap_builder.as_mut() {
buf.get_bit(index)
} else {
true
}
}

/// Truncates the builder to the given length
///
/// If `len` is greater than the buffer's current length, this has no effect
#[inline]
pub fn truncate(&mut self, len: usize) {
if len > self.len {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we materialize, is self.len accurate anymore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.

This behavior seems to be consistent with Vec::truncate so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate

Copy link
Contributor

@Jefffrey Jefffrey Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this test case:

    #[test]
    fn test123() {
        let mut builder = NullBufferBuilder::new(0);
        assert_eq!(builder.len(), 0);
        builder.append_n_nulls(2);
        assert_eq!(builder.len(), 2);
        builder.truncate(1);
        assert_eq!(builder.len(), 1); // fails here
    }

It would fail at the last assertion because it was materialized after appending two nulls, but then truncating down to 1 is a noop since the internal self.len stays 0 (not updated after materialization)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are (of course) correct. This is a great test and find

I took the liberty of pushing a commit to this branch that includes this test case (and will fail CI until it is fixed so block this PR from merging)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d0d3c62 and added some more documentation on the relationship between len and builder


if let Some(buf) = self.bitmap_builder.as_mut() {
buf.truncate(len);
} else {
self.len = len
}
}

/// Appends a boolean slice into the builder
/// to indicate the validations of these items.
pub fn append_slice(&mut self, slice: &[bool]) {
Expand Down Expand Up @@ -218,6 +253,7 @@ mod tests {
builder.append_n_nulls(2);
builder.append_n_non_nulls(2);
assert_eq!(6, builder.len());
assert_eq!(512, builder.capacity());

let buf = builder.finish().unwrap();
assert_eq!(&[0b110010_u8], buf.validity());
Expand All @@ -230,6 +266,7 @@ mod tests {
builder.append_n_nulls(2);
builder.append_slice(&[false, false, false]);
assert_eq!(6, builder.len());
assert_eq!(512, builder.capacity());

let buf = builder.finish().unwrap();
assert_eq!(&[0b0_u8], buf.validity());
Expand All @@ -242,6 +279,7 @@ mod tests {
builder.append_n_non_nulls(2);
builder.append_slice(&[true, true, true]);
assert_eq!(6, builder.len());
assert_eq!(0, builder.capacity());

let buf = builder.finish();
assert!(buf.is_none());
Expand All @@ -263,4 +301,35 @@ mod tests {
let buf = builder.finish().unwrap();
assert_eq!(&[0b1011_u8], buf.validity());
}

#[test]
fn test_null_buffer_builder_is_valid() {
let mut builder = NullBufferBuilder::new(0);
builder.append_n_non_nulls(6);
assert!(builder.is_valid(0));

builder.append_null();
assert!(!builder.is_valid(6));

builder.append_non_null();
assert!(builder.is_valid(7));
}

#[test]
fn test_null_buffer_builder_truncate() {
let mut builder = NullBufferBuilder::new(10);
builder.append_n_non_nulls(16);
assert_eq!(builder.as_slice(), None);
builder.truncate(20);
assert_eq!(builder.as_slice(), None);
assert_eq!(builder.len(), 16);
assert_eq!(builder.capacity(), 0);
builder.truncate(14);
assert_eq!(builder.as_slice(), None);
assert_eq!(builder.len(), 14);
builder.append_null();
builder.append_non_null();
assert_eq!(builder.as_slice().unwrap(), &[0xFF, 0b10111111]);
assert_eq!(builder.capacity(), 512);
}
}
Loading