@a10y a10y commented Oct 17, 2025

Which issue does this PR close?

Part of #5375

Vortex started encountering issues after we switched our preferred List type to ListView. The first thing we noticed was that arrow_select::filter_array would fail on ListView (and on LargeListView, though we don't use that).

This PR addresses some missing select kernel implementations for ListView and LargeListView.
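For readers unfamiliar with the layout, here is a minimal sketch of why a filter over a ListView can be cheap. It uses only the standard library; `ListViewSketch` and its methods are hypothetical stand-ins, not arrow-rs types, and this is not necessarily how the kernel in this PR is implemented. It ignores null buffers and assumes `i32` child values.

```rust
/// Simplified stand-in for a ListView array (hypothetical, not an arrow-rs type).
/// Row `i` is `values[offsets[i]..offsets[i] + sizes[i]]`.
/// Unlike List, rows may overlap or appear out of order in `values`.
#[derive(Debug, Clone, PartialEq)]
struct ListViewSketch {
    offsets: Vec<usize>,
    sizes: Vec<usize>,
    values: Vec<i32>,
}

impl ListViewSketch {
    /// Returns the logical contents of row `i`.
    fn row(&self, i: usize) -> &[i32] {
        &self.values[self.offsets[i]..self.offsets[i] + self.sizes[i]]
    }

    /// Keeps only the rows whose mask entry is `true`.
    /// The child `values` are carried over unchanged; only the
    /// per-row (offset, size) pairs are filtered.
    fn filter(&self, mask: &[bool]) -> ListViewSketch {
        assert_eq!(mask.len(), self.offsets.len());
        let mut offsets = Vec::new();
        let mut sizes = Vec::new();
        for (i, &keep) in mask.iter().enumerate() {
            if keep {
                offsets.push(self.offsets[i]);
                sizes.push(self.sizes[i]);
            }
        }
        ListViewSketch {
            offsets,
            sizes,
            values: self.values.clone(),
        }
    }
}
```

Because each row carries its own offset and size, dropping a row never requires rewriting the offsets of the rows that remain, which is not the case for the plain List layout.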

This also fixes an existing bug in the ArrayData validation for ListView arrays that would trigger an out of bounds index panic.

Are these changes tested?

  • filter_array
  • concat
  • take

Are there any user-facing changes?

ListView/LargeListView can now be used with the take, concat, and filter_array kernels.

You can now use PartialEq to compare ListView arrays.
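As an illustration of what concatenation has to do for this layout, here is a standard-library-only sketch. `ListViewSketch` and `concat` are hypothetical names for this example, not the arrow_select API, and the approach shown (append child values, shift offsets) is just one straightforward strategy, assumed here for illustration. Nulls are ignored.

```rust
/// Simplified stand-in for a ListView array (hypothetical, not an arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct ListViewSketch {
    offsets: Vec<usize>,
    sizes: Vec<usize>,
    values: Vec<i32>,
}

/// Concatenates several list views into one.
/// Each input's child values are appended to the output values, and that
/// input's offsets are shifted by the number of values already written.
/// Sizes are copied as-is because row lengths do not change.
fn concat(parts: &[ListViewSketch]) -> ListViewSketch {
    let mut out = ListViewSketch {
        offsets: Vec::new(),
        sizes: Vec::new(),
        values: Vec::new(),
    };
    for part in parts {
        let base = out.values.len();
        out.offsets.extend(part.offsets.iter().map(|&o| o + base));
        out.sizes.extend_from_slice(&part.sizes);
        out.values.extend_from_slice(&part.values);
    }
    out
}
```

Forgetting the `base` shift would leave every row of the second input pointing into the first input's values, which is the kind of mistake a round-trip test through concat would catch.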

a10y added 3 commits October 17, 2025 13:35
Added support for ListView and LargeListView for the following
operations:

* `arrow_select::concat`
* `arrow_select::filter_array`

Signed-off-by: Andrew Duffy <[email protected]>
There was a bug in ArrayData validation for ListView, which became
apparent when you tried to call
`list_view.to_data().into_builder().build().unwrap()`. The range for
checking the offsets/sizes was wrong and would trivially trigger an out
of bounds check.

Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
) -> Result<(), ArrowError> {
    let offsets: &[T] = self.typed_buffer(0, self.len)?;
    let sizes: &[T] = self.typed_buffer(1, self.len)?;
    for i in 0..values_length {
Contributor Author

This was a bug before. You can verify it by constructing a list_view_array and calling list_view_array.to_data().into_builder().build().unwrap(): it will panic, because values_length is the length of the inner values, not of the list itself.
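To make the distinction concrete, here is a minimal sketch of the bounds check with the loop running over the rows rather than over the child values. `validate_list_view` is a hypothetical helper written for this comment, not the ArrayData code itself; it uses `usize` instead of the generic offset type and returns a `String` error for simplicity.

```rust
/// Checks that every (offset, size) pair stays inside the child values.
/// `offsets` and `sizes` have one entry per list row; `values_length` is the
/// length of the child array, which can be larger or smaller than the row count.
fn validate_list_view(
    offsets: &[usize],
    sizes: &[usize],
    values_length: usize,
) -> Result<(), String> {
    if offsets.len() != sizes.len() {
        return Err(format!(
            "offsets has {} entries but sizes has {}",
            offsets.len(),
            sizes.len()
        ));
    }
    // Iterate over the rows. Looping over `0..values_length` instead would
    // index past the end of `offsets` whenever there are more child values
    // than rows.
    for i in 0..offsets.len() {
        let end = offsets[i]
            .checked_add(sizes[i])
            .ok_or_else(|| format!("row {i}: offset + size overflows"))?;
        if end > values_length {
            return Err(format!(
                "row {i}: range {}..{} exceeds values length {}",
                offsets[i], end, values_length
            ));
        }
    }
    Ok(())
}
```

For example, two rows over five child values validate cleanly here, whereas a loop bounded by `values_length` would try to read `offsets[2]` and panic.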

a10y added 3 commits October 18, 2025 14:39
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
@a10y a10y force-pushed the aduffy/list-view-select branch from 0c069d2 to 124b437 Compare October 20, 2025 17:12
Signed-off-by: Andrew Duffy <[email protected]>
Contributor

@alamb alamb left a comment

Thank you @a10y 🙏

I went through the code carefully, and it all looks good to me

BTW if anyone is interested, here is the relevant spec portion: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout

I think the PR needs a few more tests, but then it will be ready

a10y added 3 commits October 23, 2025 12:14
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Contributor Author

a10y commented Oct 23, 2025

Thanks for reviewing @alamb! I had a few questions which I left open, but otherwise your comments have been addressed.

Contributor

alamb commented Oct 23, 2025

Code looks great -- just tests for eq and I think this will be good to merge

Signed-off-by: Andrew Duffy <[email protected]>
Contributor Author

a10y commented Oct 23, 2025

@alamb just pushed a commit with some tests, let me know what you think!

Contributor

@alamb alamb left a comment

Thank you @a10y

for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets
    .iter()
    .zip(rhs_range_offsets)
    .zip(lhs_sizes)
Member

Shouldn't this be lhs_range_sizes?
The earlier iterators are the _range_ ones, i.e. they already take lhs_start into account.

Contributor Author

yes you're right, good catch
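For context, a small sketch of a range comparison in which the sizes are sliced with the same start index as the offsets. The struct and function names are hypothetical and standard-library only; this is not the arrow-data equality code, and null handling is left out.

```rust
/// Simplified stand-in for a ListView array (hypothetical, not an arrow-rs type).
struct ListViewSketch {
    offsets: Vec<usize>,
    sizes: Vec<usize>,
    values: Vec<i32>,
}

/// Compares `len` rows of `lhs` starting at `lhs_start` with `len` rows of
/// `rhs` starting at `rhs_start`.
fn list_view_range_equal(
    lhs: &ListViewSketch,
    rhs: &ListViewSketch,
    lhs_start: usize,
    rhs_start: usize,
    len: usize,
) -> bool {
    // Offsets AND sizes must both be restricted to the requested range.
    let lhs_range_offsets = &lhs.offsets[lhs_start..lhs_start + len];
    let rhs_range_offsets = &rhs.offsets[rhs_start..rhs_start + len];
    let lhs_range_sizes = &lhs.sizes[lhs_start..lhs_start + len];
    let rhs_range_sizes = &rhs.sizes[rhs_start..rhs_start + len];

    // Rows of different lengths can never be equal.
    if lhs_range_sizes != rhs_range_sizes {
        return false;
    }

    for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets
        .iter()
        .zip(rhs_range_offsets)
        .zip(lhs_range_sizes)
    {
        // Compare the child values that each row points at.
        if lhs.values[lhs_offset..lhs_offset + size]
            != rhs.values[rhs_offset..rhs_offset + size]
        {
            return false;
        }
    }
    true
}
```

Zipping the unsliced `lhs.sizes` would pair row `lhs_start + k` with the size of row `k`, which only happens to work when `lhs_start` is zero.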

for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
    .iter()
    .zip(rhs_range_offsets)
    .zip(lhs_sizes)
Member

Same here

Contributor Author

a10y commented Oct 24, 2025

Thank you for reviewing, @martin-g. I'm trying to craft a good test case for the bug you found. It seems like lhs_offset is only non-zero when the array is nested in a dict, list, or REE (run-end encoded) array.

@alamb alamb merged commit 5e32cc6 into apache:main Oct 27, 2025
26 checks passed
Contributor

alamb commented Oct 27, 2025

Thanks @a10y and @martin-g

Labels

arrow Changes to the arrow crate
