
Conversation

@rluvaton (Member) commented Oct 15, 2025

Which issue does this PR close?

Rationale for this change

Allows combining BooleanBuffers in place without a lot of copies, and more (see the issue).

What changes are included in this PR?

Created mutable counterparts for most of the Buffer ops that exist in arrow-buffer/src/buffer/ops.rs, for MutableBuffer and BooleanBufferBuilder. Because we can't create a BitChunksMut (for the reasons described below), I had to port that logic into the mutable ops code.
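
For context, the immutable counterpart in ops.rs always allocates. The sketch below shows what callers have to do today, going through buffer_bin_and (if I recall the ops.rs API correctly) and paying for a new Buffer on every operation; the new mutable ops apply the same kind of operation to a MutableBuffer / BooleanBufferBuilder in place instead.

use arrow_buffer::buffer::{buffer_bin_and, BooleanBuffer};

// Today's immutable path for comparison: buffer_bin_and (from ops.rs) returns a
// freshly allocated Buffer, which is exactly the copy this PR tries to avoid.
fn and_immutable(a: &BooleanBuffer, b: &BooleanBuffer) -> BooleanBuffer {
    assert_eq!(a.len(), b.len());
    let anded = buffer_bin_and(a.inner(), a.offset(), b.inner(), b.offset(), a.len());
    BooleanBuffer::new(anded, 0, a.len())
}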

Implementation notes

Why is there a MutableOpsBufferSupportedLhs trait instead of taking a MutableBuffer directly, the way the Buffer ops take a Buffer?

Because then we couldn't run an operation (e.g. AND) on a subset (e.g. bits 10 to 100) of a BooleanBufferBuilder: BooleanBufferBuilder does not expose its MutableBuffer, and I don't want to expose it, since a user could then append values that affect the underlying buffer without updating the builder's length.

Why is there a BufferSupportedRhs trait instead of taking a Buffer directly, the way the Buffer ops do?

Because we want to support both MutableBuffer & Buffer and MutableBuffer & MutableBuffer.
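
A minimal sketch of the shape this gives (the trait names are the PR's, but the method names here are hypothetical, for illustration only; the actual lhs trait exposes inner_mutable_buffer, as discussed in the review below):

/// Sketch only, not the PR's code: the lhs trait lets BooleanBufferBuilder take
/// part in mutable ops without making its inner MutableBuffer public, and the
/// rhs trait lets the same helper read from either a Buffer or a MutableBuffer.
pub(crate) trait MutableOpsBufferSupportedLhs {
    /// Bit-packed bytes that the op may mutate in place (hypothetical method).
    fn packed_bytes_mut(&mut self) -> &mut [u8];
}

pub(crate) trait BufferSupportedRhs {
    /// Bit-packed bytes that the op only reads (hypothetical method).
    fn packed_bytes(&self) -> &[u8];
}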

Why not create a BitChunksMut for MutableBuffer and keep the code as simple as the Buffer ops?

At first I thought of implementing a BitChunksMut for MutableBuffer and implementing the ops the same way they were implemented for Buffer, but that turned out to be impossible because:

  1. The bit offset at which the op starts may fall in the middle of a u64, and I can't take a mutable reference to part of a word (see the sketch after this list).
  2. We read each u64 and convert it from little endian, because bit-packed buffers are stored starting with the least-significant byte.
  3. There is no way to get a mutable u64 for the remainder bits (len % 64).
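
For illustration, here is a std-only sketch of what the ported code has to do instead for a window starting at an arbitrary bit offset: read a little-endian word around the offset, shift, apply the op, and merge back only the bits the window owns. This is not the PR's implementation (which also supports different offsets on each side); it just demonstrates the three constraints above.

/// Apply `op` to `bit_len` bits of `dst` (against `src`) starting at `bit_offset`.
/// Sketch only; limited to a single 64-bit window so one word read suffices.
fn apply_op_at_bit_offset<F: Fn(u64, u64) -> u64>(
    dst: &mut [u8],
    src: &[u8],
    bit_offset: usize,
    bit_len: usize,
    op: F,
) {
    assert_eq!(dst.len(), src.len());
    assert!(bit_offset + bit_len <= dst.len() * 8);
    let shift = (bit_offset % 8) as u32;
    assert!(shift as usize + bit_len <= 64, "sketch handles a single 64-bit window");

    // Copy up to 8 bytes starting at the byte that contains the first bit.
    let byte_offset = bit_offset / 8;
    let n = (dst.len() - byte_offset).min(8);
    let mut d = [0u8; 8];
    let mut s = [0u8; 8];
    d[..n].copy_from_slice(&dst[byte_offset..byte_offset + n]);
    s[..n].copy_from_slice(&src[byte_offset..byte_offset + n]);

    // Bit-packed buffers store the least-significant byte first, hence from_le_bytes.
    let dw = u64::from_le_bytes(d);
    let sw = u64::from_le_bytes(s);

    // Apply the op on the shifted window, then merge back only `bit_len` bits,
    // preserving every bit outside the window.
    let mask = if bit_len == 64 { u64::MAX } else { (1u64 << bit_len) - 1 };
    let result = (op(dw >> shift, sw >> shift) & mask) << shift;
    let merged = (dw & !(mask << shift)) | result;
    dst[byte_offset..byte_offset + n].copy_from_slice(&merged.to_le_bytes()[..n]);
}

For example, AND-ing bits 10..20 of src into dst would be apply_op_at_bit_offset(dst, src, 10, 10, |a, b| a & b).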

Are these changes tested?

Yes, although I did not run them on a big-endian machine.

Are there any user-facing changes?

Yes: new functions, which are documented.


I will later change the BooleanBufferBuilder::append_packed_range function to use mutable_bitwise_bin_op_helper, as I saw the boolean_append_packed benchmark improve by more than 2x:

boolean_append_packed   time:   [2.0079 µs 2.0139 µs 2.0202 µs]
                        change: [−57.808% −57.653% −57.494%] (p = 0.00 < 0.05)
                        Performance has improved.


But I don't want to pass a slice of bytes, because then I don't know where it came from, and users would have to make sure it upholds the same invariants as Buffer/MutableBuffer.
@github-actions bot added the arrow (Changes to the arrow crate) label on Oct 15, 2025
@alamb (Contributor) commented Oct 16, 2025

I will try and review this one tomorrow

@alamb (Contributor) left a comment

Thank you @rluvaton -- I haven't made it through this PR yet but the idea of optimized bitwise operations even for offset data is very compelling. The code is also very well tested and documented in my opinion. Thank you.

My primary concern is with the complexity of this code (including the unsafe) though your tests and documentation make it much easier to contemplate. I did have a few comments so far. I think with some more study I could find

Can you please share the benchmarks you are using / any WIP? I want to confirm the performance improvements before studying this code in more detail

FYI @tustvold and @crepererum and @jhorstmann if you are interested

/// (e.g. `BooleanBufferBuilder`).
///
/// ## Why this trait is needed, can't we just use `MutableBuffer` directly?
/// Sometimes we don't want to expose the inner `MutableBuffer`
Contributor:

I don't understand this rationale. It seems to me that this code does expose the inner MutableBuffer for BooleanBufferBuilder (other code can modify the MutableBuffer); it just does so via a trait. I am not sure how that is different from just passing in the MutableBuffer directly.

I wonder why you can't just pass &mut [u8] (aka pass in the mutable slices directly) as none of the APIs seem to change the length of the underlying buffers 🤔

If it is absolutely required to use a MutableBuffer directly from BooleanBufferBuilder, perhaps we can make an unsafe API instead:

impl BooleanBufferBuilder {
    /// Returns a mutable reference to the buffer and the length. Callers must ensure
    /// that if they change the length of the buffer, they also update `len`.
    pub unsafe fn inner(&mut self) -> (&mut MutableBuffer, &mut usize) { ... }
}

🤔

@rluvaton (Member Author) commented Oct 17, 2025

Where do you see it exposing the MutableBuffer? It only exposes the slice.

And I'm not passing raw bytes so that the API stays similar to the Buffer ops and so users understand the data needs to be bit-packed, but I don't have a strong opinion about that last point.

Contributor:

Where do you see it exposing the MutableBuffer? It only exposes the slice.

I was thinking of this code in particular, which seems to pass a MutableBuffer reference directly out of the BooleanBufferBuilder

impl MutableOpsBufferSupportedLhs for BooleanBufferBuilder {
    fn inner_mutable_buffer(&mut self) -> &mut MutableBuffer {
        &mut self.buffer
    }
}

@rluvaton (Member Author) commented Oct 18, 2025

Yes, but this is pub(crate) on purpose (documented at the trait level) so it is not exposed beyond the current crate.

Contributor:

Here is a proposal which I think is simpler and easier to understand

Member Author:

I originally thought about adding an unsafe function but did not want to change it.

Anyway, I modified it to match your function.

.map(|(l, r)| expected_op(*l, *r))
.collect();

super::mutable_bitwise_bin_op_helper(
Contributor:

this is a nice test


let is_mutable_buffer_byte_aligned = left_bit_offset == 0;

if is_mutable_buffer_byte_aligned {
Contributor:

Is it worth special-casing the case where both left_offset and right_offset are zero? In that case a simple loop that compares u64 by u64 is probably fastest (maybe even u128 🤔)
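
Something like this std-only sketch (hypothetical, not code from this PR) is what that fast path could look like when both sides start at bit offset 0:

/// Byte-aligned fast path sketch: when neither side has a bit offset, the op
/// can run over whole little-endian u64 words with a short zero-padded tail.
/// This is correct for bitwise ops because each output bit depends only on the
/// input bits in the same position.
fn bin_op_byte_aligned(dst: &mut [u8], src: &[u8], mut op: impl FnMut(u64, u64) -> u64) {
    assert_eq!(dst.len(), src.len());
    let mut dst_chunks = dst.chunks_exact_mut(8);
    let mut src_chunks = src.chunks_exact(8);
    for (d, s) in (&mut dst_chunks).zip(&mut src_chunks) {
        let word = op(
            u64::from_le_bytes(d[..8].try_into().unwrap()),
            u64::from_le_bytes(s.try_into().unwrap()),
        );
        d.copy_from_slice(&word.to_le_bytes());
    }
    // Tail of fewer than 8 bytes: zero-pad into one last word, then write back
    // only the bytes that actually exist.
    let d_rem = dst_chunks.into_remainder();
    let s_rem = src_chunks.remainder();
    if !d_rem.is_empty() {
        let (mut d, mut s) = ([0u8; 8], [0u8; 8]);
        d[..d_rem.len()].copy_from_slice(d_rem);
        s[..s_rem.len()].copy_from_slice(s_rem);
        let word = op(u64::from_le_bytes(d), u64::from_le_bytes(s));
        d_rem.copy_from_slice(&word.to_le_bytes()[..d_rem.len()]);
    }
}

The same loop structure would work with u128; only the word width and the closure signature change.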

Member Author:

To work on u128 we would need to change the function to take u128, but I wanted to keep the API similar to the Buffer ops.

Do you think we should change it?

The special case can be added later.


impl BitAndAssign<&BooleanBuffer> for BooleanBufferBuilder {
    fn bitand_assign(&mut self, rhs: &BooleanBuffer) {
        assert_eq!(self.len, rhs.len());
Contributor:

It might be nice to document somewhere that using the bitwise operators on BooleanBuffers/Builders with different lengths will panic.
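
For example, the caller-facing contract could be illustrated roughly like this (sketch; it relies on the BitAndAssign impl added in this PR):

use arrow_buffer::buffer::BooleanBuffer;
use arrow_buffer::builder::BooleanBufferBuilder;

/// Narrows `selection` by AND-ing `mask` into it in place.
///
/// Panics if `selection` and `mask` do not have the same length
/// (the operator impl asserts that the lengths match).
fn narrow_selection(selection: &mut BooleanBufferBuilder, mask: &BooleanBuffer) {
    *selection &= mask;
}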

Member Author:

commented

@rluvaton (Member Author):

I will later change the BooleanBufferBuilder::append_packed_range function to use mutable_bitwise_bin_op_helper, as I saw the boolean_append_packed benchmark improve by 57%:


boolean_append_packed   time:   [2.0079 µs 2.0139 µs 2.0202 µs]

                        change: [−57.808% −57.653% −57.494%] (p = 0.00 < 0.05)

                        Performance has improved.

You can change the code that I described

@alamb (Contributor) commented Oct 17, 2025

I plan to spend more time studying this PR tomorrow morning with a fresh pair of eyes

@alamb (Contributor) commented Oct 21, 2025

I am hoping to spend more time tomorrow reviewing this one carefully (specifically I want to use the new API and see some performance improvements)

@alamb (Contributor) commented Oct 24, 2025

I will later change the BooleanBufferBuilder::append_packed_range function to use mutable_bitwise_bin_op_helper, as I saw the boolean_append_packed benchmark improve by 57%:


boolean_append_packed   time:   [2.0079 µs 2.0139 µs 2.0202 µs]

                        change: [−57.808% −57.653% −57.494%] (p = 0.00 < 0.05)

                        Performance has improved.

You can change the code that I described

I tried this and for some reason it fails the tests

@alamb (Contributor) left a comment

Thank you so much @rluvaton -- this is a feature that has been sorely missing in arrow-rs for a long time (optimized mutable bitwise operations with shifts) ❤️ ❤️ ❤️

Specifically, I think it will also assist operations where a selection mask must be continually narrowed down (aka applying AND)

I went through the low-level implementations carefully and they look pretty good to me -- the code is very nicely structured and commented -- so while it took me a while, it was a pleasant task.

One thing I would really like to update is the public-facing APIs -- namely, avoid the new traits and make the APIs safer to use:

  • Move mutable_bitwise_unary_op_helper to a method on MutableBuffer
  • Move mutable_bitwise_bin_op_helper to a method on MutableBuffer
  • Remove the pub mutable_... functions and move their implementations into BufferBuilder
  • Move the tests to be in terms of the bitop impls in MutableBuffer

I sketched how this would look in this PR:

I am happy to push commits to this PR to do so, but I wanted to check with you first.

I also think we need to show some significant performance improvements to justify adding this level of complexity / unsafe to the arrow crate. I tried to get one here

But it isn't passing tests yet 🤔

) where
    F: FnMut(u64, u64) -> u64,
{
    // Must not reach here if we are not byte aligned
Contributor:

As a follow-on PR, it may be worth adding another special case for when the right side is also byte aligned, which I think will be a common case when applying operations in place on BooleanBuffers (I am looking ahead, not just at the current use case).

///
/// This is the only place that uses unsafe code to read and write unaligned
///
struct U64UnalignedSlice<'a> {
Contributor:

Another potential (future) optimization is to handle any unaligned bytes first, then switch to aligned u64 operations, and then clean up the remaining unaligned bytes at the end.

We would have to have some benchmarks to see if it was worthwhile.
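
A rough std-only illustration of that pattern (hypothetical, not part of this PR), shown for NOT since its result does not depend on byte order:

/// Head/aligned-body/tail sketch: process misaligned leading and trailing bytes
/// individually and the aligned middle as native u64 words.
fn not_in_place(dst: &mut [u8]) {
    // SAFETY: any byte pattern is a valid u64, and `align_to_mut` only places
    // correctly aligned, fully covered elements in the middle slice.
    let (head, mid, tail) = unsafe { dst.align_to_mut::<u64>() };
    for byte in head.iter_mut().chain(tail.iter_mut()) {
        *byte = !*byte;
    }
    for word in mid {
        // Byte order is irrelevant for bit-parallel ops like NOT/AND/OR/XOR;
        // only shifts across byte boundaries would need explicit to_le/from_le.
        *word = !*word;
    }
}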

// to preserve the bits outside the remainder
let rem = {
    // 1. Read the byte that we will override
    let current = start_remainder_mut_slice
Contributor:

This only reads 8 bits, but the comments (and logic) say that the remainder could be up to 63 bits. Shouldn't this be reading multiple bytes if remainder_len is greater than 8? Perhaps via `get_remainder_bits`? 🤔

Member Author:

Added a comment: the slice always has length ceil(remainder_len / 8), so the last byte is always the boundary byte.

@rluvaton (Member Author) commented Oct 27, 2025

@alamb I've cleaned up and changed it to be similar to what you had in mind (note that only the public functions were added as part of impl MutableBuffer).

@alamb (Contributor) commented Oct 27, 2025

Thank you -- I am about out of time today for reviews, but will continue tomorrow.


Labels

arrow (Changes to the arrow crate), performance


Development

Successfully merging this pull request may close these issues.

Add bitwise ops on BooleanBufferBuilder and MutableBuffer that mutate directly the buffer
