Skip to content

8645: Support more operations on ListView#8

Open
martin-augment wants to merge 12 commits intomainfrom
pr-8645-2025-10-24-13-02-56
Open

8645: Support more operations on ListView#8
martin-augment wants to merge 12 commits intomainfrom
pr-8645-2025-10-24-13-02-56

Conversation

@martin-augment
Copy link
Owner

@martin-augment martin-augment commented Oct 24, 2025

8645: To review by AI

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive support for ListView and LargeListView array types across concatenation, filtering, element selection, and equality operations.
    • Implemented validation checks for offsets and sizes buffer consistency during array processing.
  • Tests

    • Added extensive test coverage for ListView array operations including slicing, reordering, and null handling.

a10y added 12 commits October 17, 2025 13:35
Added support for ListView and LargeListView for the following
operations:

* `arrow_select::concat`
* `arrow_select::filter_array`

Signed-off-by: Andrew Duffy <[email protected]>
There was a bug in ArrayData validation for ListView, which became
apparent when you tried to call
`list_view.to_data().into_builder().build().unwrap()`. The range for
checking the offsets/sizes was wrong and would trivially trigger an out
of bounds check.

Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
@github-actions github-actions bot added the arrow label Oct 24, 2025
@coderabbitai
Copy link

coderabbitai bot commented Oct 24, 2025

Walkthrough

The pull request adds comprehensive support for ListView and LargeListView data types across Arrow's data handling pipeline. Changes include pre-loop validation for offset/size consistency, equality comparison logic, transform operations (extend and null handling), concatenation, filtering, and selection (take) kernels. New modules encapsulate ListView-specific logic with parametrized i32/i64 offset types. Extensive test coverage validates operations on list-view arrays.

Changes

Cohort / File(s) Summary
Data validation
arrow-data/src/data.rs
Enhanced validate_offsets_and_sizes with pre-loop validation to detect mismatched offsets and sizes buffer lengths, routing to early ComputeError before per-element bounds checking.
Equality support
arrow-data/src/equal/list_view.rs, arrow-data/src/equal/mod.rs
New list_view.rs module introduces generic list_view_equal<T> function for comparing ListView/LargeListView arrays considering nulls and offsets. Module wired into equal/mod.rs dispatch with i32/i64 variants replacing prior unimplemented stubs.
Transform operations
arrow-data/src/transform/list_view.rs, arrow-data/src/transform/mod.rs
New transform/list_view.rs module provides generic build_extend and extend_nulls functions for ListView/LargeListView offset/size buffer manipulation. Integrated into transform/mod.rs for capacity allocation and buffer handling with i32/i64 offset type dispatch.
Selection operations
arrow-select/src/concat.rs, arrow-select/src/filter.rs, arrow-select/src/take.rs
Added concat_list_view, filter_list_view, and take_list_view helpers to support ListView/LargeListView concatenation, filtering, and selection. Each module routes i32/i64 offset types appropriately; includes comprehensive test coverage for standard and sliced variants.
Test coverage
arrow/tests/array_equal.rs
New test helpers (create_list_view_array) and test functions (test_list_view_array, test_large_list_view_array) validate equality semantics, reordering, and slicing for ListView-based arrays.

Sequence Diagram

sequenceDiagram
    participant User
    participant Arrow as Arrow API
    participant Valid as Validation
    participant Equal as Equality
    participant Trans as Transform
    participant Select as Selection
    
    rect rgb(240, 248, 255)
    Note over User,Select: ListView/LargeListView Operation Flow
    end
    
    User->>Arrow: Create/Manipulate ListView Array
    activate Arrow
    
    rect rgb(255, 245, 238)
    Note over Valid: 1. Validate Phase
    Arrow->>Valid: validate_offsets_and_sizes()
    Valid->>Valid: Check offsets.len() == sizes.len()
    Valid-->>Valid: ComputeError if mismatch
    Valid->>Valid: Per-element bounds check
    Valid-->>Arrow: Valid or Error
    end
    
    alt Equality Check
        rect rgb(240, 255, 240)
        Note over Equal: 2. Equality Phase
        Arrow->>Equal: equal_values (lhs, rhs)
        Equal->>Equal: list_view_equal<T>()
        Equal->>Equal: Compare offsets & sizes
        Equal->>Equal: Compare child data
        Equal-->>Arrow: bool result
        end
    else Transformation
        rect rgb(255, 250, 240)
        Note over Trans: 3. Transform Phase
        Arrow->>Trans: build_extend / extend_nulls
        Trans->>Trans: Copy offsets & sizes buffers
        Trans->>Trans: Append or extend nulls
        Trans-->>Arrow: Mutable buffers ready
        end
    else Selection
        rect rgb(245, 245, 255)
        Note over Select: 4. Selection Phase
        Arrow->>Select: concat / filter / take
        Select->>Select: Process offsets & sizes
        Select->>Select: Rebuild child data view
        Select-->>Arrow: New ListView Array
        end
    end
    
    deactivate Arrow
    Arrow-->>User: Result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes introduce ListView/LargeListView support across nine files with consistent pattern application (generic i32/i64 offset dispatch) but heterogeneous operations (validation, equality, transform, concat, filter, take). While individual file changes are straightforward, the breadth requires tracing implementations across modules and verifying correctness of new generic functions with proper trait bounds and buffer handling.

Poem

🐰 Lists now have views, both large and small,
Offsets and sizes aligned through it all,
From concat to take, through filter and equal,
ListView support makes the Arrow wheel roll!
With i32 and i64 holding the way,
Views of our lists leap and play! 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description is significantly incomplete compared to the required template. The provided description "8645: To review by AI" appears to be a placeholder and lacks all required sections: it does not specify which issue this closes, provide rationale for the changes, summarize what changes are included, indicate whether tests are present, or address any user-facing changes. The description provides no meaningful information beyond a vague reference to issue 8645, failing to meet the repository's documentation standards. The description must be substantially revised to follow the template structure. It should include a "Closes #" statement linking to the relevant issue, explain why these ListView operations are being added, provide a summary of the changes across the multiple files (equality, filtering, concatenation, taking, and transformation support), confirm that comprehensive tests have been added (as evidenced by the test changes), and clarify any user-facing API additions or breaking changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "8645: Support more operations on ListView" clearly and accurately reflects the primary focus of this pull request. The changeset spans multiple files across arrow-data, arrow-select, and arrow modules to enable operations like equality comparison, filtering, concatenation, taking, and transformation on ListView and LargeListView types. The title is concise, specific, and would be readily understood by reviewers scanning the commit history without being overly vague or containing distracting elements.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-8645-2025-10-24-13-02-56

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets
.iter()
.zip(rhs_range_offsets)
.zip(lhs_sizes)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this loop you zip lhs_range_offsets with rhs_range_offsets and then with lhs_sizes, but lhs_sizes is the full buffer, not the sliced lhs_range_sizes. When lhs_start != 0 this will read sizes from the wrong indices and can produce incorrect equality results. Consider using the sliced sizes here (also applies to the similar loop in the else branch).

🤖 React with 👍 or 👎 to let us know if the comment was useful.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The AI agent is correct that the zipped iterators might start from the wrong initial element with lhs_sizes. The third iterator should be lhs_range_sizes to use the same sub-slice of the iterator as the other two.

return false;
}

for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the null-handling path, for non-null entries the per-element sizes of lhs and rhs should be equal before comparing values. Currently the loop uses only the lhs size when calling equal_values, which could mark two lists equal if rhs has extra values beyond the lhs size.

🤖 React with 👍 or 👎 to let us know if the comment was useful.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback:The length of the range offsets is compared earlier and it they do not match the equal_values() is not called at all. See https://github.com/martin-augment/arrow-rs/pull/8/files/d0da6c59482632690ff08dfd0c29d1f9de33c19f#diff-ba36cb5edf354d7753dceb65d0744b1bddf1ceb2ae1ef16b3417c83bc8d78aeeR92

@claude
Copy link

claude bot commented Oct 24, 2025

Code Review: ListView/LargeListView Operations Support

This PR implements support for ListView and LargeListView operations across the arrow-rs codebase, replacing unimplemented!() panics with full functionality.

Critical Issue Found

Location: arrow-data/src/equal/list_view.rs:71 and line 107

The code zips with lhs_sizes (the full buffer) instead of lhs_range_sizes (the sliced range). This will read wrong size values when arrays are sliced with an offset.

Fix needed: Change .zip(lhs_sizes) to .zip(lhs_range_sizes) on both lines 71 and 107.

Code Quality

Strengths:

  • Consistent with existing List/LargeList patterns
  • Proper code organization and licensing
  • Good use of Rust generics
  • Excellent test coverage including concat, filter, take, and equality operations
  • Tests cover both i32 and i64 offset types, slicing, and null handling

Areas for Improvement:

  • Code duplication between null and non-null pathways in list_view_equal
  • Misleading comment at line 84 (says non-null pathway but is nullable pathway)
  • Multiple uses of unwrap() without error messages - consider using expect() with descriptive messages

Performance & Security

Good:

  • Zero-copy design properly exploited
  • Early returns for null count mismatches
  • No major security issues
  • Rust memory safety guarantees apply

Minor Concerns:

  • Potential integer overflow in concat operations on 32-bit systems
  • Some redundant length checks could be removed or documented

Test Coverage

Excellent coverage with tests for:

  • Basic and sliced concatenation
  • Filtering with various predicates
  • Take operations with reversed, null, and sliced inputs
  • Equality checks with reordering

Suggested additions:

  • Empty ListView arrays
  • All-null ListView arrays
  • Edge cases for overflow

Overall Assessment

This is a solid implementation following established patterns with comprehensive tests. However, the critical buffer indexing bug must be fixed before merging. Once fixed and the high-priority items addressed, this will be ready to merge.

Recommendation: Approve with required changes

@github-actions
Copy link

  • arrow-data/src/equal/list_view.rs:68 – The inner loop zips the sliced offsets with lhs_sizes, which is the entire buffer, instead of the sliced range. When lhs_start (or rhs_start) is non-zero—e.g. comparing slices of a ListView—this pulls the wrong lengths (often the size of index 0) and can make equal_values run with size == 0, returning true even when the sliced lists differ. The same issue is present in the null-aware branch at line 104. Please zip against lhs_range_sizes (and rhs_range_sizes if needed) so the per-row lengths correspond to the rows being compared.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (8)
arrow-data/src/equal/list_view.rs (1)

73-76: Avoid unwrap on to_usize to prevent panic on invalid data.

Return false if offsets/sizes can’t convert to usize; equality shouldn’t panic.

You can adopt the match-based conversion pattern shown in the previous diffs.

Also applies to: 117-120

arrow-select/src/concat.rs (1)

238-253: Optional: avoid concatenating unused child values for sliced inputs.

Current approach concatenates entire values() of each input ListView, even parts never referenced by offsets/sizes. You can compute per-array min(offset)..max(offset+size) and slice values to reduce memory.

Happy to sketch a helper to derive minimal value ranges per input.

arrow-data/src/data.rs (2)

983-990: Remove redundant length check (always equal here).

typed_buffer(…, self.len) yields slices of identical length; this equality check is unreachable and adds noise. Drop it. Also prefer consistent InvalidArgumentError in this validator.

-        if offsets.len() != sizes.len() {
-            return Err(ArrowError::ComputeError(format!(
-                "ListView offsets len {} does not match sizes len {}",
-                offsets.len(),
-                sizes.len()
-            )));
-        }

1004-1013: Avoid panic on overflow; return an ArrowError.

expect can abort the process for adversarial data. Propagate a typed error instead.

-            if size
-                .checked_add(offset)
-                .expect("Offset and size have exceeded the usize boundary")
-                > values_length
-            {
+            let end = size.checked_add(offset).ok_or_else(|| {
+                ArrowError::InvalidArgumentError(format!(
+                    "Offset + size overflow at index {} for {}",
+                    i, self.data_type
+                ))
+            })?;
+            if end > values_length {
                 return Err(ArrowError::InvalidArgumentError(format!(
                     "Size {} at index {} is larger than the remaining values for {}",
                     size, i, self.data_type
                 )));
             }
arrow/tests/array_equal.rs (2)

765-782: Helper is fine; consider reserving capacity.

Optionally use GenericListViewBuilder::<O,_>::with_capacity(Int32Builder::new(), data_size_hint) to reduce reallocs when data size is known.


784-823: Solid coverage; consider renaming for clarity.

Rename test_test_list_view_array to test_list_view_eq_take_and_slice for readability. No functional changes needed.

arrow-data/src/transform/list_view.rs (2)

24-47: Remove unused trait bound; consider slice copy.

  • CheckedAdd isn’t used; drop it from the bound.
  • For large ranges, prefer extend_from_slice(&offsets[start..start+len]) and same for sizes to reduce push overhead.
-pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
+pub(super) fn build_extend<T: ArrowNativeType + Integer>(
@@
-            for &offset in &offsets[start..start + len] {
-                offset_buffer.push(offset);
-            }
+            offset_buffer.extend_from_slice(&offsets[start..start + len]);
@@
-            for &size in &sizes[start..start + len] {
-                sizes_buffer.push(size);
-            }
+            sizes_buffer.extend_from_slice(&sizes[start..start + len]);

49-56: Nulls: push loops ok; minor perf nit.

You can reserve_exact(len) once and use extend(std::iter::repeat(T::default()).take(len)) for both buffers.

 pub(super) fn extend_nulls<T: ArrowNativeType>(mutable: &mut _MutableArrayData, len: usize) {
     let offset_buffer = &mut mutable.buffer1;
     let sizes_buffer = &mut mutable.buffer2;
-    // We push 0 as a placeholder for NULL values in both the offsets and sizes
-    (0..len).for_each(|_| offset_buffer.push(T::default()));
-    (0..len).for_each(|_| sizes_buffer.push(T::default()));
+    offset_buffer.reserve_exact(len);
+    sizes_buffer.reserve_exact(len);
+    offset_buffer.extend(std::iter::repeat(T::default()).take(len));
+    sizes_buffer.extend(std::iter::repeat(T::default()).take(len));
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b3b262 and d0da6c5.

📒 Files selected for processing (9)
  • arrow-data/src/data.rs (1 hunks)
  • arrow-data/src/equal/list_view.rs (1 hunks)
  • arrow-data/src/equal/mod.rs (3 hunks)
  • arrow-data/src/transform/list_view.rs (1 hunks)
  • arrow-data/src/transform/mod.rs (6 hunks)
  • arrow-select/src/concat.rs (11 hunks)
  • arrow-select/src/filter.rs (3 hunks)
  • arrow-select/src/take.rs (5 hunks)
  • arrow/tests/array_equal.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
arrow-data/src/data.rs (2)
arrow-array/src/array/list_array.rs (2)
  • offsets (311-313)
  • len (549-551)
arrow-array/src/array/list_view_array.rs (2)
  • offsets (262-264)
  • sizes (277-279)
arrow-data/src/equal/list_view.rs (2)
arrow-data/src/data.rs (3)
  • count_nulls (51-62)
  • len (402-404)
  • len (1942-1945)
arrow-data/src/equal/mod.rs (4)
  • equal (161-166)
  • equal_values (62-129)
  • list_view_equal (109-109)
  • list_view_equal (110-110)
arrow-select/src/take.rs (4)
arrow-data/src/data.rs (5)
  • new (1857-1859)
  • new (1919-1932)
  • nulls (390-392)
  • nulls (1948-1953)
  • slice (550-578)
arrow-array/src/cast.rs (1)
  • as_list_view (927-929)
arrow-array/src/array/list_view_array.rs (6)
  • nulls (416-418)
  • to_data (379-381)
  • x (1010-1010)
  • x (1045-1045)
  • x (1080-1080)
  • x (1119-1119)
arrow-select/src/concat.rs (4)
  • x (106-106)
  • x (157-157)
  • x (220-220)
  • x (357-357)
arrow-select/src/concat.rs (5)
arrow-array/src/array/list_array.rs (7)
  • x (1269-1269)
  • nulls (569-571)
  • new (266-273)
  • values (322-324)
  • offsets (311-313)
  • offset (565-567)
  • try_new (209-259)
arrow-array/src/array/list_view_array.rs (8)
  • x (1010-1010)
  • x (1045-1045)
  • x (1080-1080)
  • x (1119-1119)
  • nulls (416-418)
  • sizes (277-279)
  • offsets (262-264)
  • offset (412-414)
arrow-buffer/src/buffer/offset.rs (3)
  • new (68-79)
  • offsets (242-250)
  • as_ref (208-210)
arrow-array/src/builder/generic_list_view_builder.rs (3)
  • values (119-121)
  • with_capacity (85-96)
  • array (443-443)
arrow-buffer/src/buffer/scalar.rs (1)
  • as_ref (144-146)
arrow-select/src/filter.rs (2)
arrow-data/src/data.rs (10)
  • new (1857-1859)
  • new (1919-1932)
  • nulls (390-392)
  • nulls (1948-1953)
  • null_count (420-425)
  • null_count (1956-1959)
  • buffer (586-588)
  • buffer (922-922)
  • buffer (1496-1496)
  • new_unchecked (273-300)
arrow-array/src/array/list_view_array.rs (2)
  • nulls (416-418)
  • to_data (379-381)
arrow-data/src/transform/list_view.rs (2)
arrow-data/src/transform/mod.rs (5)
  • build_extend (238-283)
  • array (220-220)
  • new (401-403)
  • len (753-755)
  • extend_nulls (742-749)
arrow-array/src/array/list_view_array.rs (1)
  • sizes (277-279)
arrow/tests/array_equal.rs (2)
arrow-select/src/take.rs (3)
  • take (86-102)
  • data (1026-1028)
  • indices (702-713)
arrow-array/src/array/list_view_array.rs (3)
  • Buffer (1097-1097)
  • test_list_view_array (586-619)
  • test_large_list_view_array (622-655)
arrow-data/src/equal/mod.rs (2)
arrow-data/src/equal/run.rs (1)
  • run_equal (26-86)
arrow-data/src/equal/list_view.rs (5)
  • list_view_equal (24-129)
  • lhs (31-31)
  • lhs (32-32)
  • rhs (34-34)
  • rhs (35-35)
arrow-data/src/transform/mod.rs (1)
arrow-data/src/transform/list_view.rs (4)
  • build_extend (24-47)
  • array (27-27)
  • array (28-28)
  • extend_nulls (49-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Build wasm32
  • GitHub Check: PySpark Integration Test (stable)
  • GitHub Check: Test
  • GitHub Check: Clippy
  • GitHub Check: Check Compilation
  • GitHub Check: Pyarrow C Data Interface (stable, 17)
  • GitHub Check: Pyarrow C Data Interface (stable, 15)
  • GitHub Check: Pyarrow C Data Interface (stable, 16)
  • GitHub Check: MIRI
  • GitHub Check: Archery test With other arrows
  • GitHub Check: Clippy
  • GitHub Check: Verify MSRV (Minimum Supported Rust Version)
  • GitHub Check: codex
  • GitHub Check: Test on Mac
  • GitHub Check: Rustdocs are clean
  • GitHub Check: Test
  • GitHub Check: Test
  • GitHub Check: Test on Windows
  • GitHub Check: claude-review
🔇 Additional comments (7)
arrow-select/src/take.rs (1)

630-655: LGTM: take_list_view reuses child values and preserves semantics.

Offsets/sizes are taken correctly; nulls handled; child data reused. Looks solid.

arrow-data/src/transform/mod.rs (1)

270-272: ListView transform likely drops child values (child_data not extended).

build_extend delegates to list_view::build_extend but that only pushes offsets/sizes; no child_data writes occur. freeze() later uses data.child_data[0].freeze(), which will be empty if never extended. Kernels relying on MutableArrayData (e.g., filter) could produce ListView arrays with empty values.

Please verify ListView paths that use MutableArrayData (filter, fallback concat, etc.). If broken, fix either:

  • Update arrow-data/src/transform/list_view.rs build_extend to also extend child_data[0] with the source values (copy approach), or
  • Teach into_builder for ListView to reuse the input child ArrayData(s) when possible (like Dictionary/ByteView do), or
  • Provide dedicated filter/take/concat kernels for ListView and avoid MutableArrayData for it altogether.

Run to locate potential call sites:

Also applies to: 317-319, 792-799

arrow-select/src/filter.rs (2)

349-355: ListView branches wired correctly.

Type-directed as_list_view::<i32/i64>() inference is fine and keeps the dispatch symmetric with other types.


869-895: Implementation looks good; preserve child by reference.

Offsets/sizes/nulls are filtered as expected and child values are reused. No functional issues spotted.

Please run existing fuzz tests with ListView enabled in CI to exercise mixed null/run selectivity cases (already covered for other types).

arrow/tests/array_equal.rs (2)

25-31: Imports for ListView tests are appropriate.


825-833: Wrappers for i32/i64 offset variants look good.

arrow-data/src/equal/mod.rs (1)

33-47: Wiring for ListView equality added correctly.

Comment on lines +68 to +81
for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets
.iter()
.zip(rhs_range_offsets)
.zip(lhs_sizes)
{
let lhs_offset = lhs_offset.to_usize().unwrap();
let rhs_offset = rhs_offset.to_usize().unwrap();
let size = size.to_usize().unwrap();

// Check if offsets are valid for the given range
if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
return false;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix wrong zip and use per-range sizes to avoid mismatches on slices.

You’re zipping offsets with the full lhs_sizes buffer, not the sliced range. This misaligns sizes when lhs_start > 0. Replace the loop with index-based access over the sliced ranges.

Apply:

-        for ((&lhs_offset, &rhs_offset), &size) in lhs_range_offsets
-            .iter()
-            .zip(rhs_range_offsets)
-            .zip(lhs_sizes)
-        {
-            let lhs_offset = lhs_offset.to_usize().unwrap();
-            let rhs_offset = rhs_offset.to_usize().unwrap();
-            let size = size.to_usize().unwrap();
-
-            // Check if offsets are valid for the given range
-            if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
-                return false;
-            }
-        }
+        for i in 0..len {
+            let lhs_offset = match lhs_range_offsets[i].to_usize() {
+                Some(v) => v,
+                None => return false,
+            };
+            let rhs_offset = match rhs_range_offsets[i].to_usize() {
+                Some(v) => v,
+                None => return false,
+            };
+            let size = match lhs_range_sizes[i].to_usize() {
+                Some(v) => v,
+                None => return false,
+            };
+            if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+                return false;
+            }
+        }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In arrow-data/src/equal/list_view.rs around lines 68 to 81, the loop zips
lhs_range_offsets and rhs_range_offsets with the entire lhs_sizes buffer,
causing size mismatches when lhs_start > 0; replace the zipped iteration with an
index-based loop over the sliced ranges (use the length of the range_offsets
slice), and for each index read lhs_range_offsets[i], rhs_range_offsets[i], and
the corresponding per-range size from the sliced lhs_sizes (or the correct
offset into lhs_sizes) converting each to usize and then call equal_values with
those values; ensure you iterate exactly over the number of ranges so offsets
and sizes stay aligned.

Comment on lines +104 to +125
for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
.iter()
.zip(rhs_range_offsets)
.zip(lhs_sizes)
.enumerate()
{
let lhs_is_null = lhs_nulls.is_null(index);
let rhs_is_null = rhs_nulls.is_null(index);

if lhs_is_null != rhs_is_null {
return false;
}

let lhs_offset = lhs_offset.to_usize().unwrap();
let rhs_offset = rhs_offset.to_usize().unwrap();
let size = size.to_usize().unwrap();

// Check if values match in the range
if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
return false;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Nullable path: enforce per-index size equality and fix the same zip bug.

When both sides are non-null you must first check sizes[i] equality; otherwise differing lengths can compare equal by truncation. Also switch to the sliced size range.

Apply:

-        for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
-            .iter()
-            .zip(rhs_range_offsets)
-            .zip(lhs_sizes)
-            .enumerate()
-        {
-            let lhs_is_null = lhs_nulls.is_null(index);
-            let rhs_is_null = rhs_nulls.is_null(index);
-
-            if lhs_is_null != rhs_is_null {
-                return false;
-            }
-
-            let lhs_offset = lhs_offset.to_usize().unwrap();
-            let rhs_offset = rhs_offset.to_usize().unwrap();
-            let size = size.to_usize().unwrap();
-
-            // Check if values match in the range
-            if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
-                return false;
-            }
-        }
+        for index in 0..len {
+            let lhs_is_null = lhs_nulls.is_null(index);
+            let rhs_is_null = rhs_nulls.is_null(index);
+            if lhs_is_null != rhs_is_null {
+                return false;
+            }
+            if !lhs_is_null {
+                if lhs_range_sizes[index] != rhs_range_sizes[index] {
+                    return false;
+                }
+                let lhs_offset = match lhs_range_offsets[index].to_usize() {
+                    Some(v) => v,
+                    None => return false,
+                };
+                let rhs_offset = match rhs_range_offsets[index].to_usize() {
+                    Some(v) => v,
+                    None => return false,
+                };
+                let size = match lhs_range_sizes[index].to_usize() {
+                    Some(v) => v,
+                    None => return false,
+                };
+                if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+                    return false;
+                }
+            }
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
.iter()
.zip(rhs_range_offsets)
.zip(lhs_sizes)
.enumerate()
{
let lhs_is_null = lhs_nulls.is_null(index);
let rhs_is_null = rhs_nulls.is_null(index);
if lhs_is_null != rhs_is_null {
return false;
}
let lhs_offset = lhs_offset.to_usize().unwrap();
let rhs_offset = rhs_offset.to_usize().unwrap();
let size = size.to_usize().unwrap();
// Check if values match in the range
if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
return false;
}
}
for index in 0..len {
let lhs_is_null = lhs_nulls.is_null(index);
let rhs_is_null = rhs_nulls.is_null(index);
if lhs_is_null != rhs_is_null {
return false;
}
if !lhs_is_null {
if lhs_range_sizes[index] != rhs_range_sizes[index] {
return false;
}
let lhs_offset = match lhs_range_offsets[index].to_usize() {
Some(v) => v,
None => return false,
};
let rhs_offset = match rhs_range_offsets[index].to_usize() {
Some(v) => v,
None => return false,
};
let size = match lhs_range_sizes[index].to_usize() {
Some(v) => v,
None => return false,
};
if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
return false;
}
}
}
🤖 Prompt for AI Agents
In arrow-data/src/equal/list_view.rs around lines 104 to 125, the nullable
branch currently can compare lists of different lengths as equal and also uses
the wrong zip (causing misaligned size indexing); fix by iterating over the
three aligned iterators (lhs_range_offsets, rhs_range_offsets, and the sliced
sizes iterator) so each index gets its corresponding size, then for each index
when both sides are non-null first compare sizes[i] for equality and return
false if they differ, and only if sizes match call equal_values with the
per-index offsets and that size; ensure you use the correctly sliced sizes
iterator (not the whole lhs_sizes misaligned with zips) so indices line up.

Comment on lines +109 to 111
DataType::ListView(_) => list_view_equal::<i32>(lhs, rhs, lhs_start, rhs_start, len),
DataType::LargeListView(_) => list_view_equal::<i64>(lhs, rhs, lhs_start, rhs_start, len),
DataType::FixedSizeList(_, _) => fixed_list_equal(lhs, rhs, lhs_start, rhs_start, len),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

ListView/LargeListView equality dispatch is correct; verify inner fn slices sizes correctly.

list_view_equal should iterate the sliced sizes for the range, not the full slice. Please confirm it zips with lhs_range_sizes/rhs_range_sizes, not the entire lhs_sizes.

If found, change:

  • zip(lhs_sizes) -> zip(lhs_range_sizes.iter()) (and analogously in the branch with enumerate()).

🏁 Script executed:

#!/bin/bash
# Locate potential misuse of full sizes slice inside list_view_equal
rg -nP 'list_view_equal.*\{(?s).*zip\(\s*lhs_sizes\s*\)' arrow-data/src/equal/list_view.rs -C2
rg -nP 'zip\(\s*lhs_sizes\s*\)' arrow-data/src/equal/list_view.rs -n -C3

Length of output: 589


This script will help retrieve the complete function implementation.


🏁 Script executed:

#!/bin/bash
# Extract the full list_view_equal function definition
rg -n 'fn list_view_equal' arrow-data/src/equal/list_view.rs -A 60

Length of output: 73


🏁 Script executed:

#!/bin/bash
# Extract the full list_view_equal function definition
rg -n 'fn list_view_equal' arrow-data/src/equal/list_view.rs -A 60

Length of output: 2225


🏁 Script executed:

#!/bin/bash
# Extract the rest of the function to see the null handling branch
rg -n 'fn list_view_equal' arrow-data/src/equal/list_view.rs -A 100 | tail -50

Length of output: 2071


Fix two instances of incorrect buffer slicing in list_view_equal.

The function correctly creates lhs_range_sizes and rhs_range_sizes slices at lines 49/85, but then incorrectly zips with the full lhs_sizes buffer at lines 71 and 107. This causes the loop to iterate over wrong indices when lhs_start > 0.

Required changes:

  • Line 71: .zip(lhs_sizes).zip(lhs_range_sizes.iter())
  • Line 107: .zip(lhs_sizes).zip(lhs_range_sizes.iter())
🤖 Prompt for AI Agents
In arrow-data/src/equal/mod.rs around lines 49–111, list_view_equal creates
lhs_range_sizes and rhs_range_sizes slices but later zips with the full
lhs_sizes buffer causing incorrect iteration when lhs_start > 0; change the two
zips at line 71 and line 107 to zip with lhs_range_sizes.iter() instead of
lhs_sizes so the loop iterates over the sliced ranges (i.e., replace
.zip(lhs_sizes) with .zip(lhs_range_sizes.iter()) at both places).

Comment on lines +244 to +253
let mut offsets = MutableBuffer::with_capacity(lists.iter().map(|l| l.offsets().len()).sum());
let mut global_offset = OffsetSize::zero();
for l in lists.iter() {
for &offset in l.offsets() {
offsets.push(offset + global_offset);
}

// advance the offsets
global_offset += OffsetSize::from_usize(l.values().len()).unwrap();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against offset overflow when adjusting by global_offset.

For i32 offsets, adding global_offset can overflow. Use checked_add and return OffsetOverflowError.

Apply:

-    for l in lists.iter() {
-        for &offset in l.offsets() {
-            offsets.push(offset + global_offset);
-        }
-
-        // advance the offsets
-        global_offset += OffsetSize::from_usize(l.values().len()).unwrap();
-    }
+    for l in lists.iter() {
+        for &offset in l.offsets() {
+            let adj = offset
+                .checked_add(&global_offset)
+                .ok_or_else(|| ArrowError::OffsetOverflowError(
+                    (offset.as_usize() as usize)
+                ))?;
+            offsets.push(adj);
+        }
+        // advance the offsets
+        let incr = OffsetSize::from_usize(l.values().len())
+            .ok_or_else(|| ArrowError::OffsetOverflowError(l.values().len()))?;
+        global_offset = global_offset
+            .checked_add(&incr)
+            .ok_or_else(|| ArrowError::OffsetOverflowError(l.values().len()))?;
+    }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In arrow-select/src/concat.rs around lines 244 to 253, the code adds each list
offset to a running global_offset using plain addition which can overflow for
i32 offsets; replace the direct addition with a checked_add on the OffsetSize
(or its underlying integer) and if None return/propagate an OffsetOverflowError;
likewise when advancing global_offset, use conversions that check for overflow
(e.g., OffsetSize::from_usize(...)? or checked_add) so any overflow returns
OffsetOverflowError instead of wrapping. Ensure the function signature
propagates the error (or maps it into the crate's OffsetOverflowError) and
update callers if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants