8645: Support more operations on ListView#8
Conversation
Added support for ListView and LargeListView for the following operations: * `arrow_select::concat` * `arrow_select::filter_array` Signed-off-by: Andrew Duffy <[email protected]>
There was a bug in ArrayData validation for ListView, which became apparent when you tried to call `list_view.to_data().into_builder().build().unwrap()`. The range for checking the offsets/sizes was wrong and would trivially trigger an out of bounds check. Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
WalkthroughThe pull request adds comprehensive support for ListView and LargeListView data types across Arrow's data handling pipeline. Changes include pre-loop validation for offset/size consistency, equality comparison logic, transform operations (extend and null handling), concatenation, filtering, and selection (take) kernels. New modules encapsulate ListView-specific logic with parametrized i32/i64 offset types. Extensive test coverage validates operations on list-view arrays. Changes
Sequence DiagramsequenceDiagram
participant User
participant Arrow as Arrow API
participant Valid as Validation
participant Equal as Equality
participant Trans as Transform
participant Select as Selection
rect rgb(240, 248, 255)
Note over User,Select: ListView/LargeListView Operation Flow
end
User->>Arrow: Create/Manipulate ListView Array
activate Arrow
rect rgb(255, 245, 238)
Note over Valid: 1. Validate Phase
Arrow->>Valid: validate_offsets_and_sizes()
Valid->>Valid: Check offsets.len() == sizes.len()
Valid-->>Valid: ComputeError if mismatch
Valid->>Valid: Per-element bounds check
Valid-->>Arrow: Valid or Error
end
alt Equality Check
rect rgb(240, 255, 240)
Note over Equal: 2. Equality Phase
Arrow->>Equal: equal_values (lhs, rhs)
Equal->>Equal: list_view_equal<T>()
Equal->>Equal: Compare offsets & sizes
Equal->>Equal: Compare child data
Equal-->>Arrow: bool result
end
else Transformation
rect rgb(255, 250, 240)
Note over Trans: 3. Transform Phase
Arrow->>Trans: build_extend / extend_nulls
Trans->>Trans: Copy offsets & sizes buffers
Trans->>Trans: Append or extend nulls
Trans-->>Arrow: Mutable buffers ready
end
else Selection
rect rgb(245, 245, 255)
Note over Select: 4. Selection Phase
Arrow->>Select: concat / filter / take
Select->>Select: Process offsets & sizes
Select->>Select: Rebuild child data view
Select-->>Arrow: New ListView Array
end
end
deactivate Arrow
Arrow-->>User: Result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes The changes introduce ListView/LargeListView support across nine files with consistent pattern application (generic i32/i64 offset dispatch) but heterogeneous operations (validation, equality, transform, concat, filter, take). While individual file changes are straightforward, the breadth requires tracing implementations across modules and verifying correctness of new generic functions with proper trait bounds and buffer handling. Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
| for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets | ||
| .iter() | ||
| .zip(rhs_range_offsets) | ||
| .zip(lhs_sizes) |
There was a problem hiding this comment.
In this loop you zip lhs_range_offsets with rhs_range_offsets and then with lhs_sizes, but lhs_sizes is the full buffer, not the sliced lhs_range_sizes. When lhs_start != 0 this will read sizes from the wrong indices and can produce incorrect equality results. Consider using the sliced sizes here (also applies to the similar loop in the else branch).
🤖 React with 👍 or 👎 to let us know if the comment was useful.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The AI agent is correct that the zipped iterators might start from the wrong initial element with lhs_sizes. The third iterator should be lhs_range_sizes to use the same sub-slice of the iterator as the other two.
| return false; | ||
| } | ||
|
|
||
| for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets |
There was a problem hiding this comment.
In the null-handling path, for non-null entries the per-element sizes of lhs and rhs should be equal before comparing values. Currently the loop uses only the lhs size when calling equal_values, which could mark two lists equal if rhs has extra values beyond the lhs size.
🤖 React with 👍 or 👎 to let us know if the comment was useful.
There was a problem hiding this comment.
value:annoying; category:bug; feedback:The length of the range offsets is compared earlier and it they do not match the equal_values() is not called at all. See https://github.com/martin-augment/arrow-rs/pull/8/files/d0da6c59482632690ff08dfd0c29d1f9de33c19f#diff-ba36cb5edf354d7753dceb65d0744b1bddf1ceb2ae1ef16b3417c83bc8d78aeeR92
Code Review: ListView/LargeListView Operations SupportThis PR implements support for ListView and LargeListView operations across the arrow-rs codebase, replacing unimplemented!() panics with full functionality. Critical Issue FoundLocation: arrow-data/src/equal/list_view.rs:71 and line 107 The code zips with lhs_sizes (the full buffer) instead of lhs_range_sizes (the sliced range). This will read wrong size values when arrays are sliced with an offset. Fix needed: Change .zip(lhs_sizes) to .zip(lhs_range_sizes) on both lines 71 and 107. Code QualityStrengths:
Areas for Improvement:
Performance & SecurityGood:
Minor Concerns:
Test CoverageExcellent coverage with tests for:
Suggested additions:
Overall AssessmentThis is a solid implementation following established patterns with comprehensive tests. However, the critical buffer indexing bug must be fixed before merging. Once fixed and the high-priority items addressed, this will be ready to merge. Recommendation: Approve with required changes |
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (8)
arrow-data/src/equal/list_view.rs (1)
73-76: Avoid unwrap on to_usize to prevent panic on invalid data.Return false if offsets/sizes can’t convert to usize; equality shouldn’t panic.
You can adopt the match-based conversion pattern shown in the previous diffs.
Also applies to: 117-120
arrow-select/src/concat.rs (1)
238-253: Optional: avoid concatenating unused child values for sliced inputs.Current approach concatenates entire values() of each input ListView, even parts never referenced by offsets/sizes. You can compute per-array min(offset)..max(offset+size) and slice values to reduce memory.
Happy to sketch a helper to derive minimal value ranges per input.
arrow-data/src/data.rs (2)
983-990: Remove redundant length check (always equal here).
typed_buffer(…, self.len)yields slices of identical length; this equality check is unreachable and adds noise. Drop it. Also prefer consistentInvalidArgumentErrorin this validator.- if offsets.len() != sizes.len() { - return Err(ArrowError::ComputeError(format!( - "ListView offsets len {} does not match sizes len {}", - offsets.len(), - sizes.len() - ))); - }
1004-1013: Avoid panic on overflow; return an ArrowError.
expectcan abort the process for adversarial data. Propagate a typed error instead.- if size - .checked_add(offset) - .expect("Offset and size have exceeded the usize boundary") - > values_length - { + let end = size.checked_add(offset).ok_or_else(|| { + ArrowError::InvalidArgumentError(format!( + "Offset + size overflow at index {} for {}", + i, self.data_type + )) + })?; + if end > values_length { return Err(ArrowError::InvalidArgumentError(format!( "Size {} at index {} is larger than the remaining values for {}", size, i, self.data_type ))); }arrow/tests/array_equal.rs (2)
765-782: Helper is fine; consider reserving capacity.Optionally use
GenericListViewBuilder::<O,_>::with_capacity(Int32Builder::new(), data_size_hint)to reduce reallocs when data size is known.
784-823: Solid coverage; consider renaming for clarity.Rename
test_test_list_view_arraytotest_list_view_eq_take_and_slicefor readability. No functional changes needed.arrow-data/src/transform/list_view.rs (2)
24-47: Remove unused trait bound; consider slice copy.
CheckedAddisn’t used; drop it from the bound.- For large ranges, prefer
extend_from_slice(&offsets[start..start+len])and same for sizes to reduce push overhead.-pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>( +pub(super) fn build_extend<T: ArrowNativeType + Integer>( @@ - for &offset in &offsets[start..start + len] { - offset_buffer.push(offset); - } + offset_buffer.extend_from_slice(&offsets[start..start + len]); @@ - for &size in &sizes[start..start + len] { - sizes_buffer.push(size); - } + sizes_buffer.extend_from_slice(&sizes[start..start + len]);
49-56: Nulls: push loops ok; minor perf nit.You can
reserve_exact(len)once and useextend(std::iter::repeat(T::default()).take(len))for both buffers.pub(super) fn extend_nulls<T: ArrowNativeType>(mutable: &mut _MutableArrayData, len: usize) { let offset_buffer = &mut mutable.buffer1; let sizes_buffer = &mut mutable.buffer2; - // We push 0 as a placeholder for NULL values in both the offsets and sizes - (0..len).for_each(|_| offset_buffer.push(T::default())); - (0..len).for_each(|_| sizes_buffer.push(T::default())); + offset_buffer.reserve_exact(len); + sizes_buffer.reserve_exact(len); + offset_buffer.extend(std::iter::repeat(T::default()).take(len)); + sizes_buffer.extend(std::iter::repeat(T::default()).take(len)); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
arrow-data/src/data.rs(1 hunks)arrow-data/src/equal/list_view.rs(1 hunks)arrow-data/src/equal/mod.rs(3 hunks)arrow-data/src/transform/list_view.rs(1 hunks)arrow-data/src/transform/mod.rs(6 hunks)arrow-select/src/concat.rs(11 hunks)arrow-select/src/filter.rs(3 hunks)arrow-select/src/take.rs(5 hunks)arrow/tests/array_equal.rs(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
arrow-data/src/data.rs (2)
arrow-array/src/array/list_array.rs (2)
offsets(311-313)len(549-551)arrow-array/src/array/list_view_array.rs (2)
offsets(262-264)sizes(277-279)
arrow-data/src/equal/list_view.rs (2)
arrow-data/src/data.rs (3)
count_nulls(51-62)len(402-404)len(1942-1945)arrow-data/src/equal/mod.rs (4)
equal(161-166)equal_values(62-129)list_view_equal(109-109)list_view_equal(110-110)
arrow-select/src/take.rs (4)
arrow-data/src/data.rs (5)
new(1857-1859)new(1919-1932)nulls(390-392)nulls(1948-1953)slice(550-578)arrow-array/src/cast.rs (1)
as_list_view(927-929)arrow-array/src/array/list_view_array.rs (6)
nulls(416-418)to_data(379-381)x(1010-1010)x(1045-1045)x(1080-1080)x(1119-1119)arrow-select/src/concat.rs (4)
x(106-106)x(157-157)x(220-220)x(357-357)
arrow-select/src/concat.rs (5)
arrow-array/src/array/list_array.rs (7)
x(1269-1269)nulls(569-571)new(266-273)values(322-324)offsets(311-313)offset(565-567)try_new(209-259)arrow-array/src/array/list_view_array.rs (8)
x(1010-1010)x(1045-1045)x(1080-1080)x(1119-1119)nulls(416-418)sizes(277-279)offsets(262-264)offset(412-414)arrow-buffer/src/buffer/offset.rs (3)
new(68-79)offsets(242-250)as_ref(208-210)arrow-array/src/builder/generic_list_view_builder.rs (3)
values(119-121)with_capacity(85-96)array(443-443)arrow-buffer/src/buffer/scalar.rs (1)
as_ref(144-146)
arrow-select/src/filter.rs (2)
arrow-data/src/data.rs (10)
new(1857-1859)new(1919-1932)nulls(390-392)nulls(1948-1953)null_count(420-425)null_count(1956-1959)buffer(586-588)buffer(922-922)buffer(1496-1496)new_unchecked(273-300)arrow-array/src/array/list_view_array.rs (2)
nulls(416-418)to_data(379-381)
arrow-data/src/transform/list_view.rs (2)
arrow-data/src/transform/mod.rs (5)
build_extend(238-283)array(220-220)new(401-403)len(753-755)extend_nulls(742-749)arrow-array/src/array/list_view_array.rs (1)
sizes(277-279)
arrow/tests/array_equal.rs (2)
arrow-select/src/take.rs (3)
take(86-102)data(1026-1028)indices(702-713)arrow-array/src/array/list_view_array.rs (3)
Buffer(1097-1097)test_list_view_array(586-619)test_large_list_view_array(622-655)
arrow-data/src/equal/mod.rs (2)
arrow-data/src/equal/run.rs (1)
run_equal(26-86)arrow-data/src/equal/list_view.rs (5)
list_view_equal(24-129)lhs(31-31)lhs(32-32)rhs(34-34)rhs(35-35)
arrow-data/src/transform/mod.rs (1)
arrow-data/src/transform/list_view.rs (4)
build_extend(24-47)array(27-27)array(28-28)extend_nulls(49-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Build wasm32
- GitHub Check: PySpark Integration Test (stable)
- GitHub Check: Test
- GitHub Check: Clippy
- GitHub Check: Check Compilation
- GitHub Check: Pyarrow C Data Interface (stable, 17)
- GitHub Check: Pyarrow C Data Interface (stable, 15)
- GitHub Check: Pyarrow C Data Interface (stable, 16)
- GitHub Check: MIRI
- GitHub Check: Archery test With other arrows
- GitHub Check: Clippy
- GitHub Check: Verify MSRV (Minimum Supported Rust Version)
- GitHub Check: codex
- GitHub Check: Test on Mac
- GitHub Check: Rustdocs are clean
- GitHub Check: Test
- GitHub Check: Test
- GitHub Check: Test on Windows
- GitHub Check: claude-review
🔇 Additional comments (7)
arrow-select/src/take.rs (1)
630-655: LGTM: take_list_view reuses child values and preserves semantics.Offsets/sizes are taken correctly; nulls handled; child data reused. Looks solid.
arrow-data/src/transform/mod.rs (1)
270-272: ListView transform likely drops child values (child_data not extended).build_extend delegates to list_view::build_extend but that only pushes offsets/sizes; no child_data writes occur. freeze() later uses data.child_data[0].freeze(), which will be empty if never extended. Kernels relying on MutableArrayData (e.g., filter) could produce ListView arrays with empty values.
Please verify ListView paths that use MutableArrayData (filter, fallback concat, etc.). If broken, fix either:
- Update arrow-data/src/transform/list_view.rs build_extend to also extend child_data[0] with the source values (copy approach), or
- Teach into_builder for ListView to reuse the input child ArrayData(s) when possible (like Dictionary/ByteView do), or
- Provide dedicated filter/take/concat kernels for ListView and avoid MutableArrayData for it altogether.
Run to locate potential call sites:
Also applies to: 317-319, 792-799
arrow-select/src/filter.rs (2)
349-355: ListView branches wired correctly.Type-directed
as_list_view::<i32/i64>()inference is fine and keeps the dispatch symmetric with other types.
869-895: Implementation looks good; preserve child by reference.Offsets/sizes/nulls are filtered as expected and child values are reused. No functional issues spotted.
Please run existing fuzz tests with ListView enabled in CI to exercise mixed null/run selectivity cases (already covered for other types).
arrow/tests/array_equal.rs (2)
25-31: Imports for ListView tests are appropriate.
825-833: Wrappers for i32/i64 offset variants look good.arrow-data/src/equal/mod.rs (1)
33-47: Wiring for ListView equality added correctly.
| for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets | ||
| .iter() | ||
| .zip(rhs_range_offsets) | ||
| .zip(lhs_sizes) | ||
| { | ||
| let lhs_offset = lhs_offset.to_usize().unwrap(); | ||
| let rhs_offset = rhs_offset.to_usize().unwrap(); | ||
| let size = size.to_usize().unwrap(); | ||
|
|
||
| // Check if offsets are valid for the given range | ||
| if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Fix wrong zip and use per-range sizes to avoid mismatches on slices.
You’re zipping offsets with the full lhs_sizes buffer, not the sliced range. This misaligns sizes when lhs_start > 0. Replace the loop with index-based access over the sliced ranges.
Apply:
- for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets
- .iter()
- .zip(rhs_range_offsets)
- .zip(lhs_sizes)
- {
- let lhs_offset = lhs_offset.to_usize().unwrap();
- let rhs_offset = rhs_offset.to_usize().unwrap();
- let size = size.to_usize().unwrap();
-
- // Check if offsets are valid for the given range
- if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
- return false;
- }
- }
+ for i in 0..len {
+ let lhs_offset = match lhs_range_offsets[i].to_usize() {
+ Some(v) => v,
+ None => return false,
+ };
+ let rhs_offset = match rhs_range_offsets[i].to_usize() {
+ Some(v) => v,
+ None => return false,
+ };
+ let size = match lhs_range_sizes[i].to_usize() {
+ Some(v) => v,
+ None => return false,
+ };
+ if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+ return false;
+ }
+ }Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In arrow-data/src/equal/list_view.rs around lines 68 to 81, the loop zips
lhs_range_offsets and rhs_range_offsets with the entire lhs_sizes buffer,
causing size mismatches when lhs_start > 0; replace the zipped iteration with an
index-based loop over the sliced ranges (use the length of the range_offsets
slice), and for each index read lhs_range_offsets[i], rhs_range_offsets[i], and
the corresponding per-range size from the sliced lhs_sizes (or the correct
offset into lhs_sizes) converting each to usize and then call equal_values with
those values; ensure you iterate exactly over the number of ranges so offsets
and sizes stay aligned.
| for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets | ||
| .iter() | ||
| .zip(rhs_range_offsets) | ||
| .zip(lhs_sizes) | ||
| .enumerate() | ||
| { | ||
| let lhs_is_null = lhs_nulls.is_null(index); | ||
| let rhs_is_null = rhs_nulls.is_null(index); | ||
|
|
||
| if lhs_is_null != rhs_is_null { | ||
| return false; | ||
| } | ||
|
|
||
| let lhs_offset = lhs_offset.to_usize().unwrap(); | ||
| let rhs_offset = rhs_offset.to_usize().unwrap(); | ||
| let size = size.to_usize().unwrap(); | ||
|
|
||
| // Check if values match in the range | ||
| if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Nullable path: enforce per-index size equality and fix the same zip bug.
When both sides are non-null you must first check sizes[i] equality; otherwise differing lengths can compare equal by truncation. Also switch to the sliced size range.
Apply:
- for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
- .iter()
- .zip(rhs_range_offsets)
- .zip(lhs_sizes)
- .enumerate()
- {
- let lhs_is_null = lhs_nulls.is_null(index);
- let rhs_is_null = rhs_nulls.is_null(index);
-
- if lhs_is_null != rhs_is_null {
- return false;
- }
-
- let lhs_offset = lhs_offset.to_usize().unwrap();
- let rhs_offset = rhs_offset.to_usize().unwrap();
- let size = size.to_usize().unwrap();
-
- // Check if values match in the range
- if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
- return false;
- }
- }
+ for index in 0..len {
+ let lhs_is_null = lhs_nulls.is_null(index);
+ let rhs_is_null = rhs_nulls.is_null(index);
+ if lhs_is_null != rhs_is_null {
+ return false;
+ }
+ if !lhs_is_null {
+ if lhs_range_sizes[index] != rhs_range_sizes[index] {
+ return false;
+ }
+ let lhs_offset = match lhs_range_offsets[index].to_usize() {
+ Some(v) => v,
+ None => return false,
+ };
+ let rhs_offset = match rhs_range_offsets[index].to_usize() {
+ Some(v) => v,
+ None => return false,
+ };
+ let size = match lhs_range_sizes[index].to_usize() {
+ Some(v) => v,
+ None => return false,
+ };
+ if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+ return false;
+ }
+ }
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets | |
| .iter() | |
| .zip(rhs_range_offsets) | |
| .zip(lhs_sizes) | |
| .enumerate() | |
| { | |
| let lhs_is_null = lhs_nulls.is_null(index); | |
| let rhs_is_null = rhs_nulls.is_null(index); | |
| if lhs_is_null != rhs_is_null { | |
| return false; | |
| } | |
| let lhs_offset = lhs_offset.to_usize().unwrap(); | |
| let rhs_offset = rhs_offset.to_usize().unwrap(); | |
| let size = size.to_usize().unwrap(); | |
| // Check if values match in the range | |
| if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) { | |
| return false; | |
| } | |
| } | |
| for index in 0..len { | |
| let lhs_is_null = lhs_nulls.is_null(index); | |
| let rhs_is_null = rhs_nulls.is_null(index); | |
| if lhs_is_null != rhs_is_null { | |
| return false; | |
| } | |
| if !lhs_is_null { | |
| if lhs_range_sizes[index] != rhs_range_sizes[index] { | |
| return false; | |
| } | |
| let lhs_offset = match lhs_range_offsets[index].to_usize() { | |
| Some(v) => v, | |
| None => return false, | |
| }; | |
| let rhs_offset = match rhs_range_offsets[index].to_usize() { | |
| Some(v) => v, | |
| None => return false, | |
| }; | |
| let size = match lhs_range_sizes[index].to_usize() { | |
| Some(v) => v, | |
| None => return false, | |
| }; | |
| if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) { | |
| return false; | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
In arrow-data/src/equal/list_view.rs around lines 104 to 125, the nullable
branch currently can compare lists of different lengths as equal and also uses
the wrong zip (causing misaligned size indexing); fix by iterating over the
three aligned iterators (lhs_range_offsets, rhs_range_offsets, and the sliced
sizes iterator) so each index gets its corresponding size, then for each index
when both sides are non-null first compare sizes[i] for equality and return
false if they differ, and only if sizes match call equal_values with the
per-index offsets and that size; ensure you use the correctly sliced sizes
iterator (not the whole lhs_sizes misaligned with zips) so indices line up.
| DataType::ListView(_) => list_view_equal::<i32>(lhs, rhs, lhs_start, rhs_start, len), | ||
| DataType::LargeListView(_) => list_view_equal::<i64>(lhs, rhs, lhs_start, rhs_start, len), | ||
| DataType::FixedSizeList(_, _) => fixed_list_equal(lhs, rhs, lhs_start, rhs_start, len), |
There was a problem hiding this comment.
🧩 Analysis chain
ListView/LargeListView equality dispatch is correct; verify inner fn slices sizes correctly.
list_view_equal should iterate the sliced sizes for the range, not the full slice. Please confirm it zips with lhs_range_sizes/rhs_range_sizes, not the entire lhs_sizes.
If found, change:
zip(lhs_sizes)->zip(lhs_range_sizes.iter())(and analogously in the branch withenumerate()).
🏁 Script executed:
#!/bin/bash
# Locate potential misuse of full sizes slice inside list_view_equal
rg -nP 'list_view_equal.*\{(?s).*zip\(\s*lhs_sizes\s*\)' arrow-data/src/equal/list_view.rs -C2
rg -nP 'zip\(\s*lhs_sizes\s*\)' arrow-data/src/equal/list_view.rs -n -C3Length of output: 589
This script will help retrieve the complete function implementation.
🏁 Script executed:
#!/bin/bash
# Extract the full list_view_equal function definition
rg -n 'fn list_view_equal' arrow-data/src/equal/list_view.rs -A 60Length of output: 73
🏁 Script executed:
#!/bin/bash
# Extract the full list_view_equal function definition
rg -n 'fn list_view_equal' arrow-data/src/equal/list_view.rs -A 60Length of output: 2225
🏁 Script executed:
#!/bin/bash
# Extract the rest of the function to see the null handling branch
rg -n 'fn list_view_equal' arrow-data/src/equal/list_view.rs -A 100 | tail -50Length of output: 2071
Fix two instances of incorrect buffer slicing in list_view_equal.
The function correctly creates lhs_range_sizes and rhs_range_sizes slices at lines 49/85, but then incorrectly zips with the full lhs_sizes buffer at lines 71 and 107. This causes the loop to iterate over wrong indices when lhs_start > 0.
Required changes:
- Line 71:
.zip(lhs_sizes)→.zip(lhs_range_sizes.iter()) - Line 107:
.zip(lhs_sizes)→.zip(lhs_range_sizes.iter())
🤖 Prompt for AI Agents
In arrow-data/src/equal/mod.rs around lines 49–111, list_view_equal creates
lhs_range_sizes and rhs_range_sizes slices but later zips with the full
lhs_sizes buffer causing incorrect iteration when lhs_start > 0; change the two
zips at line 71 and line 107 to zip with lhs_range_sizes.iter() instead of
lhs_sizes so the loop iterates over the sliced ranges (i.e., replace
.zip(lhs_sizes) with .zip(lhs_range_sizes.iter()) at both places).
| let mut offsets = MutableBuffer::with_capacity(lists.iter().map(|l| l.offsets().len()).sum()); | ||
| let mut global_offset = OffsetSize::zero(); | ||
| for l in lists.iter() { | ||
| for &offset in l.offsets() { | ||
| offsets.push(offset + global_offset); | ||
| } | ||
|
|
||
| // advance the offsets | ||
| global_offset += OffsetSize::from_usize(l.values().len()).unwrap(); | ||
| } |
There was a problem hiding this comment.
Guard against offset overflow when adjusting by global_offset.
For i32 offsets, adding global_offset can overflow. Use checked_add and return OffsetOverflowError.
Apply:
- for l in lists.iter() {
- for &offset in l.offsets() {
- offsets.push(offset + global_offset);
- }
-
- // advance the offsets
- global_offset += OffsetSize::from_usize(l.values().len()).unwrap();
- }
+ for l in lists.iter() {
+ for &offset in l.offsets() {
+ let adj = offset
+ .checked_add(&global_offset)
+ .ok_or_else(|| ArrowError::OffsetOverflowError(
+ (offset.as_usize() as usize)
+ ))?;
+ offsets.push(adj);
+ }
+ // advance the offsets
+ let incr = OffsetSize::from_usize(l.values().len())
+ .ok_or_else(|| ArrowError::OffsetOverflowError(l.values().len()))?;
+ global_offset = global_offset
+ .checked_add(&incr)
+ .ok_or_else(|| ArrowError::OffsetOverflowError(l.values().len()))?;
+ }Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In arrow-select/src/concat.rs around lines 244 to 253, the code adds each list
offset to a running global_offset using plain addition which can overflow for
i32 offsets; replace the direct addition with a checked_add on the OffsetSize
(or its underlying integer) and if None return/propagate an OffsetOverflowError;
likewise when advancing global_offset, use conversions that check for overflow
(e.g., OffsetSize::from_usize(...)? or checked_add) so any overflow returns
OffsetOverflowError instead of wrapping. Ensure the function signature
propagates the error (or maps it into the crate's OffsetOverflowError) and
update callers if necessary.
8645: To review by AI
Summary by CodeRabbit
Release Notes
New Features
Tests