
8785: perf: override ArrayIter default impl for nth, nth_back, last and count (#13)

Open

martin-augment wants to merge 3 commits into main from pr-8785-2025-11-05-09-07-33

Conversation

@martin-augment (Owner) commented Nov 5, 2025

8785: To review by AI


Note

Optimizes ArrayIter with custom implementations of nth, nth_back, last, and count, and avoids retaining the NullBuffer when the array contains no nulls. A sketch of the pattern appears after this note.

  • Iterator performance (arrow-array/src/iterator.rs):
    • Only set logical_nulls when null_count() > 0 to skip zero-null buffers.
    • Override Iterator methods: nth, last, count with bound-checked, fast paths.
    • Override DoubleEndedIterator::nth_back with symmetric fast path.

Written by Cursor Bugbot for commit b736a6e. This will update automatically on new commits.
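
To make the listed overrides concrete, here is a minimal, self-contained sketch of the same index-bookkeeping pattern on a toy iterator. ToyIter is purely illustrative (the real ArrayIter in arrow-array/src/iterator.rs works over Arrow arrays and null buffers); it shows how advancing current/current_end with checked arithmetic yields O(1) nth, nth_back, last, and count.

struct ToyIter<'a> {
    values: &'a [Option<i32>],
    current: usize,
    current_end: usize,
}

impl<'a> Iterator for ToyIter<'a> {
    type Item = Option<i32>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.current == self.current_end {
            return None;
        }
        let item = self.values[self.current];
        self.current += 1;
        Some(item)
    }

    #[inline]
    fn nth(&mut self, n: usize) -> Option<Self::Item> {
        // Jump forward by n (checked), then delegate to next() for the (n+1)-th element.
        match self.current.checked_add(n) {
            Some(new_current) if new_current < self.current_end => {
                self.current = new_current;
                self.next()
            }
            _ => {
                // Overflow or past the end: exhaust the iterator.
                self.current = self.current_end;
                None
            }
        }
    }

    #[inline]
    fn last(mut self) -> Option<Self::Item> {
        // O(1): the last remaining element is simply the next one from the back.
        self.next_back()
    }

    #[inline]
    fn count(self) -> usize {
        // O(1): the remaining length is known exactly.
        self.current_end - self.current
    }
}

impl<'a> DoubleEndedIterator for ToyIter<'a> {
    fn next_back(&mut self) -> Option<Self::Item> {
        if self.current == self.current_end {
            return None;
        }
        self.current_end -= 1;
        Some(self.values[self.current_end])
    }

    #[inline]
    fn nth_back(&mut self, n: usize) -> Option<Self::Item> {
        // Symmetric fast path: pull current_end down by n (checked), then next_back().
        match self.current_end.checked_sub(n) {
            Some(new_end) if self.current < new_end => {
                self.current_end = new_end;
                self.next_back()
            }
            _ => {
                self.current = self.current_end;
                None
            }
        }
    }
}

fn main() {
    let data = [Some(1), None, Some(3), Some(4)];
    let mut it = ToyIter { values: &data, current: 0, current_end: data.len() };
    assert_eq!(it.nth(2), Some(Some(3))); // skipped Some(1) and None
    assert_eq!(it.count(), 1);            // only Some(4) remains
}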

@coderabbitai bot commented Nov 5, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces comprehensive error handling improvements throughout the Parquet module, refactors in-memory row group handling with new push-based decoder infrastructure, enhances Arrow iterators with Clone derivation and advanced methods, improves schema parsing for List variants, adds scalar-based zipping support, and updates documentation across multiple README files.

Changes

Cohort / File(s) Summary
Workflow and Documentation Updates
.github/workflows/docs.yml, arrow/README.md, parquet/README.md, arrow/src/lib.rs, parquet/src/lib.rs
Updated GitHub Actions download-artifact from v5 to v6; added Arrow and Parquet implementation status documentation sections; removed integration-test references and updated API docs.
Iterator Enhancements
arrow-array/src/iterator.rs
Added Clone to ArrayIter derive, narrowed bounds in size_hint using logical_nulls, implemented nth/last/count methods, added nth_back to DoubleEndedIterator, and expanded test coverage with extensive iteration pattern validation.
Schema Type System Improvements
arrow-schema/src/datatype_parse.rs, arrow-schema/src/datatype_display.rs
Extended parser/tokenizer to support ListView and LargeListView variants with field names and nullable flags; introduced FormatMetadata adapter for consistent sorted metadata rendering.
Zipping and Selection Operations
arrow-select/src/zip.rs
Introduced ScalarZipper struct and ZipImpl trait for comprehensive scalar-based zipping; added specialized implementations for primitives, UTF8/bytes, and large types with null semantics preservation.
Parquet Error Handling Infrastructure
parquet/src/encodings/rle.rs, parquet/src/column/reader/decoder.rs, parquet/src/column/reader.rs, parquet/src/encodings/decoding.rs
Propagated errors from set_data and reload methods by changing signatures to return Result<()> and Result respectively; updated all call sites to use ? operator for error propagation.
Parquet Dictionary Decoder Updates
parquet/src/arrow/decoder/dictionary_index.rs, parquet/src/arrow/array_reader/byte_array.rs, parquet/src/arrow/array_reader/byte_array_dictionary.rs, parquet/src/arrow/array_reader/byte_view_array.rs, parquet/src/arrow/array_reader/fixed_len_byte_array.rs
Updated DictIndexDecoder::new to return Result with error propagation; refactored ByteArrayDecoderDictionary and ByteViewArrayDecoderDictionary constructors to handle fallible initialization.
In-Memory Row Group Infrastructure
parquet/src/arrow/in_memory_row_group.rs
Introduced InMemoryRowGroup, FetchRanges, ColumnChunkData structures; added fetch_ranges computation for projection-aware data fetching and fill_column_chunks for sparse/dense column population; implemented RowGroups trait.
Push-Based Decoder Architecture
parquet/src/arrow/push_decoder/mod.rs, parquet/src/arrow/push_decoder/reader_builder/mod.rs, parquet/src/arrow/push_decoder/reader_builder/data.rs, parquet/src/arrow/push_decoder/reader_builder/filter.rs, parquet/src/arrow/push_decoder/remaining.rs
Implemented ParquetPushDecoder with state machine for range-driven decoding; added RowGroupReaderBuilder, RemainingRowGroups, DataRequest/DataRequestBuilder, and FilterInfo for multi-predicate filtering with caching.
Async Reader Refactoring
parquet/src/arrow/async_reader/mod.rs
Replaced embedded InMemoryRowGroup and ColumnChunkData structures with streamlined fetch_ranges and fill_column_chunks approach; removed test_cache_projection_excludes_nested_columns.
Parquet Reader and Metadata Updates
parquet/src/arrow/arrow_reader/mod.rs, parquet/src/arrow/arrow_reader/read_plan.rs, parquet/src/arrow/arrow_reader/selection.rs, parquet/src/file/metadata/reader.rs, parquet/src/file/serialized_reader.rs
Implemented Debug trait for ParquetRecordBatchReader; added Clone and Debug derives to ReadPlanBuilder/ReadPlan; removed async feature gates from selection/read_plan accessors; added metadata_size getter; optimized page decoding buffer handling.
Parquet Module Organization
parquet/src/arrow/mod.rs
Added pub mod push_decoder and mod in_memory_row_group declarations.
Variant Compute Error Handling
parquet-variant-compute/src/type_conversion.rs, parquet-variant-compute/src/variant_array.rs, parquet-variant-compute/src/variant_get.rs
Modified non_generic_conversion_single_value to return Result; introduced generic_conversion_single_value_with_result macro; updated VariantType/VariantArray methods to use Result-based error handling; modified variant_get logic to handle cast failures with null substitution in safe mode.
Parquet Thrift and Utilities
parquet/src/parquet_thrift.rs, parquet/src/util/push_buffers.rs, parquet/src/record_reader/definition_levels.rs
Optimized VLQ reading with fast path for single-byte values; added buffered_bytes and clear_ranges methods to PushBuffers; updated DefinitionLevelBufferDecoder::set_data to return Result<()>.
Column and Layout Utilities
parquet/src/column/reader.rs, parquet/src/bin/parquet-layout.rs, parquet/src/file/metadata/mod.rs
Propagated set_data errors at repetition/definition level call sites; introduced Index and Footer structures in parquet-layout; removed extraneous doc comment.
Test Infrastructure
arrow-flight/tests/flight_sql_client_cli.rs, parquet/tests/arrow_reader/predicate_cache.rs
Extracted flight_sql_client_cmd helper function; added nested struct testing for predicate cache with NESTED_TEST_FILE_DATA support.
Memory Optimizations
arrow-cast/src/cast/run_array.rs
Pre-allocated vector capacity using Vec::with_capacity(partitions.len()) instead of empty Vec for two vectors in conversion logic.

@claude bot commented Nov 5, 2025

Pull Request Review - PR #13

Summary

This PR implements performance optimizations for ArrayIter by overriding default iterator trait methods (nth, nth_back, last, and count), along with substantial additions to the parquet crate including a new push decoder API and various other improvements.

Code Quality & Best Practices

✅ Strengths

  1. Iterator Optimizations (arrow-array/src/iterator.rs)

    • The overrides of nth, nth_back, last, and count follow Rust best practices for iterator specialization
    • Implementation correctly uses checked_add/checked_sub to prevent overflow
    • Added Clone derive to ArrayIter which is appropriate for the struct
    • The optimization filtering out null buffers with zero null_count (line 59) is a smart micro-optimization
  2. Comprehensive Test Coverage

    • Extensive test suite (~875 new lines of tests) with excellent coverage of edge cases
    • Tests cover multiple scenarios: no nulls, all nulls, partial nulls, random values, duplicates
    • Tests validate behavior after various consumption patterns (forward, backward, partial)
    • Excellent use of property-based testing patterns to verify equivalence with slice iterators
  3. Documentation

    • Good inline comments explaining the logic in nth and nth_back
    • Clear safety comments for unsafe operations

⚠️ Issues & Concerns

1. Critical Bug in nth Implementation (arrow-array/src/iterator.rs:107-123)

The current implementation needs verification for correctness with Rust iterator contract.

The code advances by n positions and then calls next(). Per Rust docs, nth(0) should equal next(), and nth(n) consumes n+1 elements total. The implementation appears correct, but verification against the extensive tests would confirm this.
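
For reference, the contract being described can be checked against a plain slice iterator (standard-library behavior, independent of this PR): nth(0) is equivalent to next(), and nth(n) consumes n+1 elements in total.

fn main() {
    let mut a = [1, 2, 3, 4].iter();
    assert_eq!(a.nth(0), Some(&1)); // same as next()
    assert_eq!(a.next(), Some(&2));

    let mut b = [1, 2, 3, 4].iter();
    assert_eq!(b.nth(2), Some(&3)); // skipped 1 and 2: three elements consumed
    assert_eq!(b.next(), Some(&4));
}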

2. Potential Performance Regression in Parquet Changes

The parquet changes are extensive (3000+ lines added), including:

  • New push_decoder module with significant complexity
  • New in_memory_row_group module
  • Changes to async reader removing 302 lines and adding only 19

Concerns:

  • The async reader changes show a net deletion of 283 lines - please ensure benchmarks confirm no performance regression
  • The new push decoder API adds significant surface area - ensure it is well-tested with integration tests

3. Missing Inline Attributes

While nth, nth_back, last, and count have #[inline] attributes, consider adding inline hints to hot path methods like is_null if profiling shows benefit.

Performance Considerations

✅ Positive

  • Direct implementation of nth, nth_back, last, and count avoids default trait implementations which may be less efficient
  • The count() implementation using self.len() is optimal O(1) vs iterating
  • The last() implementation as self.next_back() is optimal O(1) vs consuming the entire iterator
  • Filtering null buffers with zero nulls avoids unnecessary null checks

⚠️ Considerations

  • Large PR scope: This PR mixes iterator optimizations with extensive parquet changes. Consider splitting for easier review and rollback if needed.
  • Benchmark coverage: Ensure benchmarks demonstrate the performance improvement from these optimizations.

Security Concerns

✅ Good

  • Proper use of checked_add and checked_sub prevents integer overflow
  • Unsafe code in next() and next_back() has safety comments and appears correct
  • No new unsafe code introduced in the iterator implementations

Test Coverage

✅ Excellent

The test coverage for the iterator changes is exceptional:

  • 13 different test case categories
  • Tests for position, rposition, nth, nth_back, last, for_each, fold, rfold, count, any, all, find, rfind
  • Comprehensive edge case coverage including empty iterators, single elements, all nulls, etc.

⚠️ Parquet Tests

  • Only 84 new lines of tests added for parquet changes
  • The 3000+ lines of new parquet code need proportional test coverage
  • Verify integration tests exist for the new push decoder API

Recommendations

High Priority

  1. Split the PR: Consider separating iterator optimizations from parquet changes for easier review
  2. Add benchmarks: Include before/after performance numbers for the iterator optimizations
  3. Review parquet changes: The parquet additions need a focused security and correctness review

Medium Priority

  1. Test the new parquet APIs: Ensure comprehensive integration tests for push decoder
  2. Review schema parsing changes: The datatype_parse.rs changes need validation

Overall Assessment

The iterator optimizations appear sound with excellent test coverage. However, this PR is attempting to do too much at once by including extensive parquet changes (push decoder, in-memory row groups, zip function, schema parsing) which makes thorough review difficult.

Recommendation: Consider splitting this PR into separate, focused changes for easier review and rollback if needed.

Overall: Strong work on the iterator optimizations and testing, but the PR scope is too broad for effective review.

@augmentcode bot left a comment

Review completed. 1 suggestion posted.


let no_nulls_and_no_duplicates = (0..10).map(Some).collect::<Vec<Option<i32>>>();
let no_nulls_random_values = (0..10)
.map(|_| rng.random::<i32>())

The rand::Rng trait doesn’t provide random::<T>() or random_bool; this will fail to compile with rand 0.8+ (also applies to other usages below). Consider using the appropriate Rng APIs instead.


martin-augment (Owner, Author) replied:

value:annoying; category:bug; feedback:The AI reviewer is wrong! The rand crate does provide these functions in version 0.9.2 (the latest available at the moment). All CI checks pass for this PR, so there is no problem with the compilation.
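
For context, a small sketch of the rand 0.9 calls being discussed, assuming the version the owner cites (in rand 0.9, Rng::gen was renamed to Rng::random, thread_rng() to rng(), and a free random_bool function was added); this is illustrative, not the test code from the PR:

use rand::Rng;

fn main() {
    let mut rng = rand::rng();          // rand 0.9 name for the former thread_rng()
    let x: i32 = rng.random::<i32>();   // rand 0.9 name for the former rng.gen::<i32>()
    let flip = rand::random_bool(0.5);  // free function taking a probability, added in rand 0.9
    println!("{x} {flip}");
}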

@github-actions bot commented Nov 5, 2025

Findings

  • arrow-array/src/iterator.rs:1186 – rand::random_bool isn’t a function in the rand crate, so this won’t compile. Use an RNG instance (rng.random_bool(0.5)) or rand::random::<bool>() instead.

Next Steps

  • Fix the call to rand::random_bool and rerun the iterator tests.

@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
arrow-schema/src/datatype_parse.rs (1)

646-662: Restore escape-state reset in parse_quoted_string

After reading a backslash we never clear is_escaped unless the next character is the quote itself. That means any field name containing a bare \ (for example the text List(nullable Int64, field: 'foo\bar'), which DataType::to_string() emits for a list item named foo\bar) causes us to treat the closing ' as escaped and run off the end of the token stream. Please reset the escape flag after consuming a non-delimiter and toggle it for consecutive backslashes so these inputs parse correctly.

                 Some(c) => match c {
                     '\\' => {
-                        is_escaped = true;
-                        self.word.push(c);
+                        self.word.push(c);
+                        if is_escaped {
+                            is_escaped = false;
+                        } else {
+                            is_escaped = true;
+                        }
                     }
                     c if c == quote => {
                         if is_escaped {
                             self.word.push(c);
                             is_escaped = false;
                         } else {
                             break;
                         }
                     }
                     c => {
-                        self.word.push(c);
+                        if is_escaped {
+                            is_escaped = false;
+                        }
+                        self.word.push(c);
                     }
                 },
🧹 Nitpick comments (1)
parquet/README.md (1)

76-79: Minor: Standardize reference link casing for consistency.

The reference link on line 79 uses lowercase [implementation status page] while the link text on line 76 uses [Implementation Status Page] (mixed case). Although Markdown treats reference links as case-insensitive, for consistency and maintainability—especially since arrow/README.md maintains matching case—please standardize the casing here.

Apply this diff to match the case:

- [implementation status page]: https://parquet.apache.org/docs/file-format/implementationstatus/
+ [Implementation Status Page]: https://parquet.apache.org/docs/file-format/implementationstatus/
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 477bfe4 and 565fd1f.

📒 Files selected for processing (42)
  • .github/workflows/docs.yml (1 hunks)
  • arrow-array/src/iterator.rs (6 hunks)
  • arrow-cast/src/cast/run_array.rs (1 hunks)
  • arrow-flight/tests/flight_sql_client_cli.rs (7 hunks)
  • arrow-schema/src/datatype_display.rs (3 hunks)
  • arrow-schema/src/datatype_parse.rs (16 hunks)
  • arrow-select/src/zip.rs (10 hunks)
  • arrow/README.md (1 hunks)
  • arrow/src/lib.rs (0 hunks)
  • parquet-variant-compute/src/type_conversion.rs (2 hunks)
  • parquet-variant-compute/src/variant_array.rs (20 hunks)
  • parquet-variant-compute/src/variant_get.rs (2 hunks)
  • parquet/README.md (1 hunks)
  • parquet/src/arrow/array_reader/byte_array.rs (2 hunks)
  • parquet/src/arrow/array_reader/byte_array_dictionary.rs (1 hunks)
  • parquet/src/arrow/array_reader/byte_view_array.rs (2 hunks)
  • parquet/src/arrow/array_reader/fixed_len_byte_array.rs (1 hunks)
  • parquet/src/arrow/arrow_reader/mod.rs (3 hunks)
  • parquet/src/arrow/arrow_reader/read_plan.rs (2 hunks)
  • parquet/src/arrow/arrow_reader/selection.rs (0 hunks)
  • parquet/src/arrow/async_reader/mod.rs (4 hunks)
  • parquet/src/arrow/decoder/dictionary_index.rs (1 hunks)
  • parquet/src/arrow/in_memory_row_group.rs (1 hunks)
  • parquet/src/arrow/mod.rs (1 hunks)
  • parquet/src/arrow/push_decoder/mod.rs (1 hunks)
  • parquet/src/arrow/push_decoder/reader_builder/data.rs (1 hunks)
  • parquet/src/arrow/push_decoder/reader_builder/filter.rs (1 hunks)
  • parquet/src/arrow/push_decoder/reader_builder/mod.rs (1 hunks)
  • parquet/src/arrow/push_decoder/remaining.rs (1 hunks)
  • parquet/src/arrow/record_reader/definition_levels.rs (1 hunks)
  • parquet/src/bin/parquet-layout.rs (5 hunks)
  • parquet/src/column/reader.rs (4 hunks)
  • parquet/src/column/reader/decoder.rs (6 hunks)
  • parquet/src/encodings/decoding.rs (2 hunks)
  • parquet/src/encodings/rle.rs (29 hunks)
  • parquet/src/file/metadata/mod.rs (0 hunks)
  • parquet/src/file/metadata/reader.rs (1 hunks)
  • parquet/src/file/serialized_reader.rs (2 hunks)
  • parquet/src/lib.rs (2 hunks)
  • parquet/src/parquet_thrift.rs (1 hunks)
  • parquet/src/util/push_buffers.rs (1 hunks)
  • parquet/tests/arrow_reader/predicate_cache.rs (5 hunks)
💤 Files with no reviewable changes (3)
  • parquet/src/arrow/arrow_reader/selection.rs
  • parquet/src/file/metadata/mod.rs
  • arrow/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (25)
parquet/src/arrow/record_reader/definition_levels.rs (1)
parquet/src/column/reader/decoder.rs (5)
  • set_data (35-35)
  • set_data (115-121)
  • set_data (199-229)
  • set_data (313-316)
  • set_data (417-422)
parquet/src/arrow/arrow_reader/mod.rs (3)
parquet/src/arrow/async_reader/mod.rs (2)
  • fmt (774-781)
  • fmt (821-829)
parquet/src/file/writer.rs (1)
  • fmt (173-181)
parquet/src/arrow/arrow_writer/mod.rs (3)
  • fmt (193-202)
  • fmt (687-691)
  • fmt (807-809)
parquet/src/encodings/rle.rs (4)
parquet/src/arrow/record_reader/definition_levels.rs (2)
  • set_data (134-140)
  • set_data (274-286)
parquet/src/column/reader/decoder.rs (5)
  • set_data (35-35)
  • set_data (115-121)
  • set_data (199-229)
  • set_data (313-316)
  • set_data (417-422)
parquet/src/encodings/decoding.rs (13)
  • set_data (175-175)
  • set_data (306-309)
  • set_data (382-400)
  • set_data (460-479)
  • set_data (665-728)
  • set_data (907-926)
  • set_data (1037-1059)
  • new (290-301)
  • new (362-369)
  • new (449-455)
  • new (561-577)
  • new (894-903)
  • bit_util (469-469)
parquet/src/util/bit_util.rs (9)
  • bit_reader (742-742)
  • bit_reader (744-744)
  • bit_reader (746-746)
  • bit_reader (748-748)
  • new (190-196)
  • new (384-391)
  • ceil (153-155)
  • buffer (268-270)
  • from (709-711)
parquet/src/arrow/decoder/dictionary_index.rs (1)
parquet/src/arrow/array_reader/byte_array.rs (7)
  • new (87-99)
  • new (179-186)
  • new (262-294)
  • new (348-360)
  • new (442-474)
  • new (530-535)
  • new (566-570)
parquet/src/arrow/array_reader/byte_array.rs (1)
parquet/src/arrow/array_reader/byte_view_array.rs (7)
  • new (76-88)
  • new (139-146)
  • new (222-256)
  • new (305-317)
  • new (429-433)
  • new (518-550)
  • new (605-610)
parquet/src/bin/parquet-layout.rs (3)
parquet/src/file/serialized_reader.rs (7)
  • new (117-119)
  • new (181-189)
  • new (282-304)
  • new (532-546)
  • metadata (245-247)
  • metadata (289-293)
  • metadata (308-310)
parquet/src/file/metadata/reader.rs (2)
  • new (101-103)
  • metadata_size (566-568)
parquet/src/file/metadata/mod.rs (9)
  • new (204-213)
  • new (354-356)
  • new (451-459)
  • new (493-513)
  • new (715-725)
  • new (1163-1193)
  • new (1386-1398)
  • new (1552-1560)
  • column (645-647)
parquet/src/arrow/array_reader/fixed_len_byte_array.rs (1)
parquet/src/arrow/array_reader/byte_array.rs (7)
  • new (87-99)
  • new (179-186)
  • new (262-294)
  • new (348-360)
  • new (442-474)
  • new (530-535)
  • new (566-570)
parquet/src/arrow/push_decoder/remaining.rs (6)
parquet/src/arrow/arrow_reader/selection.rs (1)
  • row_count (438-440)
parquet/src/file/metadata/mod.rs (6)
  • new (204-213)
  • new (354-356)
  • new (451-459)
  • new (493-513)
  • row_groups (250-252)
  • row_groups (385-387)
parquet/src/arrow/push_decoder/reader_builder/data.rs (1)
  • new (138-155)
parquet/src/arrow/push_decoder/reader_builder/mod.rs (3)
  • new (171-196)
  • push_data (199-201)
  • buffered_bytes (204-206)
parquet/src/util/push_buffers.rs (2)
  • new (79-86)
  • buffered_bytes (135-137)
parquet/src/arrow/push_decoder/mod.rs (5)
  • push_data (410-440)
  • ranges (1112-1115)
  • ranges (1120-1123)
  • buffered_bytes (312-314)
  • buffered_bytes (443-454)
parquet/src/arrow/array_reader/byte_view_array.rs (1)
parquet/src/arrow/array_reader/byte_array.rs (7)
  • new (87-99)
  • new (179-186)
  • new (262-294)
  • new (348-360)
  • new (442-474)
  • new (530-535)
  • new (566-570)
arrow-select/src/zip.rs (4)
arrow-data/src/transform/mod.rs (1)
  • new (401-403)
arrow-data/src/data.rs (8)
  • new (1857-1859)
  • new (1919-1932)
  • from (2128-2141)
  • data_type (362-364)
  • data_type (1935-1937)
  • nulls (390-392)
  • nulls (1948-1953)
  • new_null (591-707)
arrow-select/src/filter.rs (10)
  • new (60-62)
  • new (89-93)
  • new (217-231)
  • new (636-653)
  • vec (1837-1839)
  • from (66-68)
  • iter (595-595)
  • iter (1517-1517)
  • iter (1530-1530)
  • iter (1543-1543)
arrow-buffer/src/buffer/offset.rs (2)
  • new_zeroed (98-105)
  • from_repeated_length (153-174)
parquet/src/util/push_buffers.rs (3)
parquet/src/arrow/push_decoder/mod.rs (2)
  • buffered_bytes (312-314)
  • buffered_bytes (443-454)
parquet/src/arrow/push_decoder/reader_builder/mod.rs (1)
  • buffered_bytes (204-206)
parquet/src/arrow/push_decoder/remaining.rs (1)
  • buffered_bytes (69-71)
arrow-array/src/iterator.rs (2)
arrow-array/src/array/null_array.rs (3)
  • logical_nulls (116-118)
  • new (56-58)
  • len (100-102)
arrow-array/src/array/mod.rs (22)
  • logical_nulls (225-227)
  • logical_nulls (391-393)
  • logical_nulls (461-463)
  • array (944-944)
  • array (953-953)
  • array (962-962)
  • array (970-970)
  • array (980-980)
  • array (994-994)
  • array (1008-1008)
  • array (1020-1020)
  • array (1042-1042)
  • array (1132-1135)
  • iter (601-601)
  • iter (609-611)
  • iter (618-620)
  • iter (632-632)
  • iter (636-638)
  • iter (641-643)
  • len (155-155)
  • len (366-368)
  • len (445-447)
parquet/src/arrow/push_decoder/reader_builder/mod.rs (7)
parquet/src/file/metadata/mod.rs (6)
  • schema (563-565)
  • new (204-213)
  • new (354-356)
  • new (451-459)
  • new (493-513)
  • row_group (245-247)
parquet/src/arrow/arrow_reader/read_plan.rs (4)
  • new (40-45)
  • new (160-167)
  • batch_size (247-249)
  • selection (54-56)
parquet/src/arrow/push_decoder/reader_builder/data.rs (2)
  • new (138-155)
  • needed_ranges (49-55)
parquet/src/arrow/push_decoder/reader_builder/filter.rs (3)
  • new (58-66)
  • new (82-88)
  • cache_projection (130-132)
parquet/src/util/push_buffers.rs (2)
  • new (79-86)
  • buffered_bytes (135-137)
parquet/src/arrow/in_memory_row_group.rs (1)
  • column_chunks (196-228)
parquet/src/arrow/mod.rs (2)
  • none (284-288)
  • leaves (295-301)
parquet/src/arrow/push_decoder/reader_builder/data.rs (3)
parquet/src/file/metadata/mod.rs (8)
  • offset_index (270-272)
  • offset_index (417-419)
  • new (204-213)
  • new (354-356)
  • new (451-459)
  • new (493-513)
  • len (889-891)
  • row_group (245-247)
parquet/src/arrow/in_memory_row_group.rs (2)
  • column_chunks (196-228)
  • len (274-279)
parquet/src/util/push_buffers.rs (2)
  • new (79-86)
  • len (160-162)
parquet-variant-compute/src/variant_get.rs (2)
parquet-variant-compute/src/variant_array.rs (2)
  • value (339-341)
  • v (1426-1426)
arrow-cast/src/cast/mod.rs (1)
  • default (84-89)
parquet/src/arrow/push_decoder/mod.rs (6)
parquet/src/arrow/arrow_reader/mod.rs (29)
  • ParquetRecordBatchReader (2028-2030)
  • ParquetRecordBatchReader (2061-2063)
  • ParquetRecordBatchReader (2096-2098)
  • ParquetRecordBatchReader (2371-2373)
  • ParquetRecordBatchReader (2428-2430)
  • ParquetRecordBatchReader (4404-4406)
  • ParquetRecordBatchReader (4449-4451)
  • ParquetRecordBatchReader (4481-4483)
  • ParquetRecordBatchReader (4552-4554)
  • metadata (168-170)
  • metadata (683-685)
  • try_new (590-612)
  • try_new (744-746)
  • try_new (1108-1112)
  • build (858-920)
  • new (397-399)
  • new (1143-1154)
  • new (2518-2525)
  • batch_size (1157-1159)
  • batch (2123-2126)
  • batch (2151-2154)
  • batch (2183-2186)
  • batch (2209-2209)
  • batch (2210-2210)
  • batch (2240-2240)
  • batch (2241-2241)
  • batch (2249-2249)
  • batch (2250-2250)
  • batch (2258-2258)
parquet/src/util/push_buffers.rs (5)
  • file_len (123-125)
  • new (79-86)
  • push_range (101-109)
  • push_ranges (89-98)
  • buffered_bytes (135-137)
parquet/src/arrow/push_decoder/reader_builder/data.rs (2)
  • build (181-221)
  • new (138-155)
parquet/src/arrow/push_decoder/reader_builder/filter.rs (3)
  • new (58-66)
  • new (82-88)
  • builder (68-70)
parquet/src/arrow/push_decoder/reader_builder/mod.rs (3)
  • new (171-196)
  • buffered_bytes (204-206)
  • push_data (199-201)
parquet/src/arrow/push_decoder/remaining.rs (3)
  • new (49-61)
  • buffered_bytes (69-71)
  • push_data (64-66)
arrow-schema/src/datatype_display.rs (2)
arrow-schema/src/field.rs (2)
  • metadata (373-375)
  • metadata (963-967)
arrow-schema/src/schema.rs (2)
  • metadata (81-83)
  • metadata (410-412)
parquet-variant-compute/src/type_conversion.rs (1)
parquet-variant/src/variant.rs (3)
  • from (75-77)
  • from (1486-1488)
  • from (1492-1497)
parquet/src/column/reader/decoder.rs (1)
parquet/src/encodings/rle.rs (12)
  • set_data (324-336)
  • decoder (564-564)
  • decoder (581-582)
  • decoder (626-626)
  • decoder (640-640)
  • decoder (665-666)
  • decoder (683-684)
  • decoder (700-700)
  • decoder (716-716)
  • decoder (733-734)
  • decoder (750-751)
  • decoder (1015-1016)
parquet/src/arrow/async_reader/mod.rs (3)
parquet/src/file/metadata/reader.rs (2)
  • fetch (1021-1023)
  • fetch (1034-1036)
parquet/src/arrow/async_reader/metadata.rs (2)
  • fetch (69-69)
  • fetch (73-75)
parquet/tests/arrow_reader/io/async_reader.rs (2)
  • ranges (394-401)
  • ranges (403-409)
parquet-variant-compute/src/variant_array.rs (2)
parquet-variant/src/variant/decimal.rs (1)
  • try_new (91-91)
parquet-variant/src/variant.rs (24)
  • try_new (58-66)
  • try_new (294-297)
  • new (316-321)
  • try_from (83-85)
  • try_from (1661-1665)
  • try_from (1671-1675)
  • try_from (1681-1685)
  • from (75-77)
  • from (1486-1488)
  • from (1492-1497)
  • from (1501-1503)
  • from (1507-1509)
  • from (1513-1515)
  • from (1519-1521)
  • from (1525-1532)
  • from (1536-1543)
  • from (1546-1553)
  • from (1557-1565)
  • from (1569-1571)
  • from (1575-1577)
  • from (1581-1583)
  • from (1587-1589)
  • from (1593-1595)
  • metadata (1436-1442)
parquet/tests/arrow_reader/predicate_cache.rs (1)
parquet/src/arrow/mod.rs (1)
  • leaves (295-301)
parquet/src/arrow/push_decoder/reader_builder/filter.rs (2)
parquet/src/arrow/push_decoder/reader_builder/data.rs (1)
  • new (138-155)
parquet/src/arrow/push_decoder/reader_builder/mod.rs (1)
  • new (171-196)
arrow-schema/src/datatype_parse.rs (2)
arrow-schema/src/field.rs (5)
  • data_type (411-413)
  • new (192-202)
  • fmt (64-100)
  • fmt (972-1002)
  • new_list_field (218-220)
arrow-schema/src/datatype_display.rs (2)
  • fmt (23-178)
  • fmt (185-195)
parquet/src/arrow/in_memory_row_group.rs (2)
parquet/src/file/metadata/mod.rs (3)
  • column (645-647)
  • offset_index (270-272)
  • offset_index (417-419)
parquet/src/file/page_index/offset_index.rs (1)
  • page_locations (66-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: codex
  • GitHub Check: claude-review
🔇 Additional comments (44)
arrow-flight/tests/flight_sql_client_cli.rs (2)

49-52: LGTM! Clean DRY refactoring.

The helper function centralizes the command construction pattern, eliminating duplication across multiple test functions. The implementation correctly uses the cargo_bin! macro and is well-documented.


61-61: LGTM! Consistent application of the helper.

All test functions have been consistently updated to use the flight_sql_client_cmd() helper, maintaining the same test behavior while improving code maintainability.

Also applies to: 101-101, 139-139, 178-178, 216-216, 256-256, 297-297

arrow-schema/src/datatype_display.rs (4)

24-26: LGTM! Clean refactor to delegate formatting.

The function now properly delegates to the FormatMetadata adapter, centralizing the formatting logic.


181-196: LGTM! Well-designed formatting adapter.

The FormatMetadata adapter correctly:

  • Omits output for empty metadata
  • Sorts entries by key for deterministic output
  • Uses debug_map() for consistent formatting

The Vec allocation and sort on each display is acceptable given that metadata maps are typically small.
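
A minimal sketch of that adapter pattern, assuming a HashMap<String, String> metadata map (illustrative only; the actual FormatMetadata lives in arrow-schema/src/datatype_display.rs and may differ in detail):

use std::collections::HashMap;
use std::fmt;

struct FormatMetadata<'a>(&'a HashMap<String, String>);

impl fmt::Display for FormatMetadata<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Omit output entirely for empty metadata.
        if self.0.is_empty() {
            return Ok(());
        }
        // Sort by key so rendering is deterministic regardless of HashMap order.
        let mut entries: Vec<_> = self.0.iter().collect();
        entries.sort_by(|a, b| a.0.cmp(b.0));
        f.debug_map().entries(entries).finish()
    }
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("b".to_string(), "2".to_string());
    meta.insert("a".to_string(), "1".to_string());
    println!("{}", FormatMetadata(&meta)); // {"a": "1", "b": "2"}
}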


377-380: LGTM! Test correctly validates sorted multi-entry metadata.

The test properly verifies that multiple metadata entries are formatted with keys in sorted order.

Also applies to: 386-386


1-196: Note: PR title/description inconsistency.

The PR title mentions "override ArrayIter default impl for nth, nth_back, last and count", but this file contains display formatting refactors. This suggests either:

  • The PR includes unrelated changes, or
  • The PR description doesn't match the actual changes

Please verify the PR scope and update the title/description accordingly.

parquet/src/lib.rs (4)

76-90: Good documentation clarity improvement.

The clarification that async APIs require both arrow and async features is helpful. The inline code formatting for RecordBatch improves readability.


110-110: Reference link is appropriate.

Adding the explicit RecordBatch reference link enables proper cross-referencing in generated documentation. The use of "latest" in the docs.rs URL is standard practice.


1-221: Discrepancy between PR title and file changes.

The PR title indicates this PR implements "perf: override ArrayIter default impl for nth, nth_back, last and count", but the provided file contains only documentation updates. The AI summary also describes broader changes (error handling, push-decoder infrastructure, Clone derivation) not visible in this file.

Confirm whether:

  • This is the only file changed in this PR
  • The other files implementing the ArrayIter performance improvements are also included in this review

55-74: No issues found with documentation links.

All five types referenced in the documentation are properly exported and match their documented module paths:

  • ArrowWriter, ArrowColumnWriter via arrow::arrow_writer
  • ParquetRecordBatchReaderBuilder, RowFilter via arrow::arrow_reader (RowFilter is re-exported from filter submodule)
  • ParquetPushDecoder via arrow::push_decoder

The documentation is accurate and follows Rust doc conventions.

arrow/README.md (1)

32-37: Documentation enhancement looks good.

The new "Arrow Implementation Status" section is well-placed and follows proper Markdown reference link syntax with consistent casing. The link provides helpful navigation for users to check implementation status.

.github/workflows/docs.yml (1)

74-74: Verify breaking changes in actions/download-artifact@v6 before upgrading.

v6 requires Node.js v24.x (marked as BREAKING CHANGE), and @actions/artifact was bumped to v4.0.0, which may affect behavior/compatibility. The README was updated with artifact extraction details and clarifications about v5 extraction behavior—review these changes to ensure the workflow's artifact handling (lines 75-77) and deployment steps remain compatible with v6's new extraction behavior.

parquet/src/parquet_thrift.rs (1)

278-294: LGTM! Clean fast-path optimization for VLQ decoding.

The optimization correctly handles the common case of single-byte VLQ values (0-127) by checking the high bit and returning immediately. For multi-byte values, the first byte is processed before entering the loop, which is logically equivalent to the previous implementation but avoids unnecessary loop overhead for the fast path.

The logic is correct:

  • Single byte: byte & 0x80 == 0 means the high bit is clear, so the full byte value is the result
  • Multi-byte: First byte's lower 7 bits initialize in_progress, then the loop processes remaining bytes starting at bit position 7
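
A small, standalone sketch of that fast-path idea (not the actual parquet_thrift.rs code): single-byte ULEB128 values have the high bit clear and can be returned immediately, while multi-byte values seed the accumulator with the first byte's low 7 bits before looping.

fn read_vlq(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    if first & 0x80 == 0 {
        // Fast path: values 0..=127 fit in a single byte.
        return Some((first as u64, 1));
    }
    // Slow path: accumulate 7 bits per byte, continuing at bit position 7.
    let mut value = (first & 0x7f) as u64;
    let mut shift = 7;
    for (i, &byte) in buf[1..].iter().enumerate() {
        value |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 2)); // total bytes consumed
        }
        shift += 7;
    }
    None // input ended mid-value
}

fn main() {
    assert_eq!(read_vlq(&[0x05]), Some((5, 1)));         // single-byte fast path
    assert_eq!(read_vlq(&[0x80, 0x01]), Some((128, 2))); // two-byte value
}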
arrow-array/src/iterator.rs (8)

47-47: LGTM: Clone derivation enables flexible iterator usage.

The Clone derivation is a valuable addition that allows ArrayIter to be used in patterns requiring iterator duplication (e.g., lookahead, backtracking).


59-59: LGTM: Filtering empty null buffers improves memory efficiency.

Storing only non-empty null buffers reduces memory overhead when arrays have no nulls, while the is_null method correctly handles the None case.


101-103: LGTM: size_hint now correctly uses iterator bounds.

Using current_end - current instead of array.len() - current properly accounts for consumption from both ends of the iterator.


106-123: LGTM: nth implementation is correct and safe.

The implementation properly handles overflow, bounds checking, and delegates to next() for consistent null handling. This will improve performance over the default iterator implementation which would call next() n+1 times.


125-128: LGTM: last delegates efficiently to next_back.

This is the optimal implementation for a DoubleEndedIterator with ExactSizeIterator bounds.


130-136: LGTM: count leverages ExactSizeIterator for O(1) performance.

This avoids unnecessary iteration and provides constant-time counting.


158-175: LGTM: nth_back implementation is correct and safe.

The implementation properly handles underflow, bounds checking, and delegates to next_back() for consistent null handling. Performance improvement over the default implementation is significant.


199-1192: Excellent test coverage validates iterator semantics.

The test infrastructure is comprehensive and well-designed:

  • Validates ArrayIter behavior matches standard slice iterators across diverse scenarios
  • Tests various null distributions, value patterns, and consumption patterns
  • Verifies state consistency after operations
  • Provides strong confidence in the correctness of the new implementations
arrow-cast/src/cast/run_array.rs (1)

139-141: LGTM: Preallocating capacity improves performance.

Reserving capacity upfront avoids multiple reallocations since exactly partitions.len() elements will be pushed to each vector.
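
A tiny illustration of the pattern, with a hypothetical partitions slice standing in for the real one in run_array.rs:

fn collect_keys(partitions: &[i32]) -> Vec<i32> {
    // One up-front allocation sized to the known final length,
    // instead of repeated growth while pushing.
    let mut keys = Vec::with_capacity(partitions.len());
    for &p in partitions {
        keys.push(p);
    }
    keys
}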

parquet/src/encodings/decoding.rs (1)

396-396: LGTM! Proper error propagation.

The addition of the ? operator correctly propagates errors from RleDecoder::set_data, ensuring that initialization failures bubble up to callers rather than being silently ignored.

Also applies to: 476-476

parquet/src/arrow/mod.rs (1)

193-200: LGTM! Module declarations for push decoder infrastructure.

The new push_decoder public module, in_memory_row_group internal module, and Debug import are properly declared and align with the push-based decoder infrastructure described in the PR summary.

parquet/src/arrow/array_reader/byte_array_dictionary.rs (1)

296-296: LGTM! Consistent error propagation.

The addition of the ? operator correctly propagates errors from RleDecoder::set_data in the dictionary decoding path, consistent with the error handling improvements across the codebase.

parquet/src/arrow/decoder/dictionary_index.rs (1)

45-57: Thanks for surfacing decoder initialisation errors

Line 45: Converting DictIndexDecoder::new into a fallible constructor and propagating the set_data failure path keeps us from panicking on truncated dictionary index buffers. Looks great.

parquet/src/arrow/arrow_reader/read_plan.rs (1)

31-80: Great to have ReadPlan(Builder) implement Debug/Clone

Line 31: Deriving Clone/Debug and ungating selection()/num_rows_selected() outside cfg(async) gives the new ParquetRecordBatchReader Debug impl everything it needs. 👍

parquet/src/bin/parquet-layout.rs (1)

107-204: Footer metadata visibility is a nice addition

Line 107: Capturing metadata_size before finish() and emitting it through the new Footer struct makes the CLI output more self-contained without re-reading the file. Implementation is tidy.

parquet/src/util/push_buffers.rs (1)

134-156: Appreciate the buffer accounting helpers

Line 135: buffered_bytes() and clear_ranges() round out the state needed by the push decoder, and the implementation keeps the buffers/ranges in sync. 👍

parquet/src/arrow/array_reader/fixed_len_byte_array.rs (1)

384-386: Callers now respect the fallible dict index constructor

Line 384: Thanks for switching the dictionary path over to DictIndexDecoder::new(...)?; fixed-length byte arrays now pick up the improved error reporting as well.

parquet/src/column/reader.rs (2)

439-470: LGTM: Proper error propagation for level decoders.

The addition of error propagation via ? on set_data calls for repetition and definition level decoders improves error handling for DataPage v1. This ensures that any errors during level decoder initialization are properly bubbled up rather than silently failing.


505-528: LGTM: Consistent error handling for DataPage v2.

Error propagation is consistently applied to both repetition and definition level decoders in the DataPage v2 path, matching the improvements made to the v1 path.

parquet/src/arrow/array_reader/byte_array.rs (2)

276-278: LGTM: Error propagation from dictionary decoder construction.

The change to propagate errors from ByteArrayDecoderDictionary::new aligns with the broader error handling improvements, ensuring that failures during dictionary index decoder initialization are properly surfaced.


565-570: LGTM: Updated constructor signature.

The constructor now returns Result<Self> to propagate errors from DictIndexDecoder::new, which is consistent with the error handling pattern used in the parallel ByteViewArrayDecoderDictionary::new implementation.

parquet/tests/arrow_reader/predicate_cache.rs (3)

84-95: LGTM: Comprehensive test for nested column handling.

The test verifies that the predicate cache behaves correctly with nested columns, covering both sync and async readers. The expectation of 0 records read from cache is appropriate for nested columns.


121-131: LGTM: Well-structured test data setup.

The new_nested() constructor provides a clean way to create test data with nested struct columns (a string column and a struct with two string fields), which is essential for testing nested column scenarios.


289-323: LGTM: Appropriate test data structure.

The nested test data with a top-level string column a and a struct column b { aa, bb } provides good coverage for testing predicate filtering on nested leaves.

parquet/src/arrow/async_reader/mod.rs (1)

977-996: LGTM: Cleaner fetch implementation.

The refactored fetch method delegates to fetch_ranges and fill_column_chunks, replacing the previous scattered per-chunk page planning logic. This streamlined approach improves maintainability by centralizing the fetch-and-fill workflow in InMemoryRowGroup.

parquet/src/arrow/push_decoder/remaining.rs (2)

49-61: LGTM: Clean state machine initialization.

The constructor properly initializes all state including the row group queue, selection, and reader builder. Using VecDeque for row groups is appropriate for efficient front-popping during iteration.


76-117: LGTM: Well-structured state machine logic.

The try_next_reader method implements a clear loop-based advancement through row groups:

  1. Attempts to build a reader for the current row group
  2. Returns data requests when needed
  3. Advances to the next row group when current is finished
  4. Properly handles row count overflow and selection splitting

The error handling at lines 105-110 for row count conversion is particularly good, preventing potential overflow issues.

parquet/src/arrow/array_reader/byte_view_array.rs (1)

236-240: Nice job propagating dictionary init errors

Line 237 now returns the DictIndexDecoder::new failure up to callers, keeping the byte-view path consistent with the other decoders and preventing silent corruption when the index stream is invalid.

parquet/src/arrow/push_decoder/reader_builder/data.rs (1)

108-110: Appreciate the proactive buffer cleanup

Line 109 clearing the fulfilled ranges keeps PushBuffers bounded for long-lived decodes and matches the cache eviction expectations elsewhere.

parquet/src/arrow/push_decoder/mod.rs (1)

555-594: Test coverage for partial pushes is excellent

Lines 555-594 simulate staggered range delivery and assert the decoder resumes correctly, giving strong confidence in the new state machine.

parquet/src/arrow/push_decoder/reader_builder/filter.rs (1)

90-138: The predicate state progression is tidy

Lines 90-138 centralize predicate iteration and cache wiring, keeping the RowGroup reader logic much easier to follow.

Comment on lines 145 to 156
} else if !cast_options.safe {
let value = target.try_value(i)?;
builder.append_value(value)?;
} else {
builder.append_value(target.value(i))?;
let _ = match target.try_value(i) {
Ok(v) => builder.append_value(v)?,
Err(_) => {
builder.append_null()?;
false // add this to make match arms have the same return type
}
};
}

⚠️ Potential issue | 🔴 Critical

Compile error in safe null-substitution branch

When cast_options.safe is true, the match expression mixes () from the success arm with bool from the error arm. This does not type-check, so the crate won't compile.

Please restructure the branch so both arms return (). For example:

-                let _ = match target.try_value(i) {
-                    Ok(v) => builder.append_value(v)?,
-                    Err(_) => {
-                        builder.append_null()?;
-                        false // add this to make match arms have the same return type
-                    }
-                };
+                match target.try_value(i) {
+                    Ok(v) => builder.append_value(v)?,
+                    Err(_) => builder.append_null()?,
+                }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else if !cast_options.safe {
let value = target.try_value(i)?;
builder.append_value(value)?;
} else {
builder.append_value(target.value(i))?;
let _ = match target.try_value(i) {
Ok(v) => builder.append_value(v)?,
Err(_) => {
builder.append_null()?;
false // add this to make match arms have the same return type
}
};
}
} else if !cast_options.safe {
let value = target.try_value(i)?;
builder.append_value(value)?;
} else {
match target.try_value(i) {
Ok(v) => builder.append_value(v)?,
Err(_) => builder.append_null()?,
}
}
🤖 Prompt for AI Agents
In parquet-variant-compute/src/variant_get.rs around lines 145-156 the safe
null-substitution branch uses a match whose Ok arm yields () while the Err arm
yields a bool, causing a type error; change the control flow so both arms return
unit: call target.try_value(i) once with match or if let, in the Ok arm call
builder.append_value(v)?; in the Err arm call builder.append_null()?; and do not
return any boolean value (ensure both arms end with unit and propagate errors
with ?).

Comment on lines 61 to 62
/// * decoder API: [`ParquetDecoderBuilder::new`]
///

⚠️ Potential issue | 🟡 Minor

Doc link points to a non-existent builder name

Line 61 still advertises ParquetDecoderBuilder::new, but I can't find that symbol exported anywhere—the builder lives at ParquetPushDecoderBuilder::new. Without an alias this will send readers hunting for a non-existent type. Could we rename the bullet/link text (or add an alias) so the docs match the actual API surface?

🤖 Prompt for AI Agents
In parquet/src/arrow/arrow_reader/mod.rs around lines 61 to 62, the docs
reference ParquetDecoderBuilder::new which does not exist; update the
documentation to point to the real builder name ParquetPushDecoderBuilder::new
(or add a public alias named ParquetDecoderBuilder that re-exports
ParquetPushDecoderBuilder) so the doc link resolves to an actual exported
symbol; choose either (1) change the bullet/link text and link target to
ParquetPushDecoderBuilder::new, or (2) add pub use
crate::path::ParquetPushDecoderBuilder as ParquetDecoderBuilder; re-run docs
build to confirm the link now resolves.

Comment on lines 164 to 183
length: metadata.column(idx).byte_range().1 as usize,
data: offsets
.into_iter()
.map(|x| x as usize)
.zip(chunks.into_iter())
.collect(),
}))
}
}
} else {
for (idx, chunk) in self.column_chunks.iter_mut().enumerate() {
if chunk.is_some() || !projection.leaf_included(idx) {
continue;
}

if let Some(data) = chunk_data.next() {
*chunk = Some(Arc::new(ColumnChunkData::Dense {
offset: metadata.column(idx).byte_range().0 as usize,
data,
}));

⚠️ Potential issue | 🟠 Major

Avoid truncating offsets on 32-bit builds.

Casting u64 offsets and lengths to usize silently drops the high bits on 32-bit targets (e.g., WASM). That reintroduces the >4 GiB failure modes the rest of this reader carefully avoids: sparse lookups start failing with “Invalid offset …” once the file crosses that boundary. Please keep offsets/lengths in u64 storage and only down-cast when slicing the Bytes, with checked conversion.

@@
-use bytes::{Buf, Bytes};
+use bytes::{Buf, Bytes};
+use std::convert::TryFrom;
@@
-                    *chunk = Some(Arc::new(ColumnChunkData::Sparse {
-                        length: metadata.column(idx).byte_range().1 as usize,
-                        data: offsets
-                            .into_iter()
-                            .map(|x| x as usize)
-                            .zip(chunks.into_iter())
-                            .collect(),
-                    }))
+                    *chunk = Some(Arc::new(ColumnChunkData::Sparse {
+                        length: metadata.column(idx).byte_range().1,
+                        data: offsets.zip(chunks.into_iter()).collect(),
+                    }))
@@
-                    *chunk = Some(Arc::new(ColumnChunkData::Dense {
-                        offset: metadata.column(idx).byte_range().0 as usize,
+                    *chunk = Some(Arc::new(ColumnChunkData::Dense {
+                        offset: metadata.column(idx).byte_range().0,
                         data,
                     }));
@@
-pub(crate) enum ColumnChunkData {
+pub(crate) enum ColumnChunkData {
@@
-        length: usize,
-        data: Vec<(usize, Bytes)>,
+        length: u64,
+        data: Vec<(u64, Bytes)>,
@@
-    Dense { offset: usize, data: Bytes },
+    Dense { offset: u64, data: Bytes },
@@
-            ColumnChunkData::Sparse { data, .. } => data
-                .binary_search_by_key(&start, |(offset, _)| *offset as u64)
+            ColumnChunkData::Sparse { data, .. } => data
+                .binary_search_by_key(&start, |(offset, _)| *offset)
@@
-            ColumnChunkData::Dense { offset, data } => {
-                let start = start as usize - *offset;
-                Ok(data.slice(start..))
+            ColumnChunkData::Dense { offset, data } => {
+                let relative = start.checked_sub(*offset).ok_or_else(|| {
+                    ParquetError::General(format!(
+                        "Invalid offset in dense column chunk data: {start}"
+                    ))
+                })?;
+                let relative = usize::try_from(relative).map_err(|_| {
+                    ParquetError::General(format!(
+                        "Offset {start} exceeds platform address space"
+                    ))
+                })?;
+                Ok(data.slice(relative..))
             }
@@
-            ColumnChunkData::Sparse { length, .. } => *length as u64,
-            ColumnChunkData::Dense { data, .. } => data.len() as u64,
+            ColumnChunkData::Sparse { length, .. } => *length,
+            ColumnChunkData::Dense { data, .. } => data.len() as u64,

Also applies to: 235-252

Comment on lines 255 to 281
pub(crate) fn try_build(
&mut self,
) -> Result<DecodeResult<ParquetRecordBatchReader>, ParquetError> {
loop {
let current_state = self.take_state()?;
// Try to transition the decoder.
match self.try_transition(current_state)? {
// Either produced a batch reader, needed input, or finished
NextState {
next_state,
result: Some(result),
} => {
// put back the next state
self.state = Some(next_state);
return Ok(result);
}
// completed one internal state, maybe can proceed further
NextState {
next_state,
result: None,
} => {
// continue processing
self.state = Some(next_state);
}
}
}
}

⚠️ Potential issue | 🟠 Major

Restore builder state before propagating transition errors

Line 259 consumes self.state via take_state(), but the ? on Line 261 returns immediately when try_transition fails, so nothing ever writes a value back. The next caller then hits the guard inside take_state and receives the synthetic “RowGroupReader in invalid state” instead of the real error, and the builder stays poisoned thereafter. Please make sure we always park a state before leaving this method—for example, stash current_state in an Option, reinsert it in the Err branch, and only consume it once the transition succeeds.

-            match self.try_transition(current_state)? {
+            let mut current_state = Some(current_state);
+            let transition = self.try_transition(current_state.take().unwrap());
+            let next = match transition {
+                Ok(next) => next,
+                Err(err) => {
+                    self.state = current_state;
+                    return Err(err);
+                }
+            };
+            match next {
                 // existing arms…
             }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 324 to 336
pub fn set_data(&mut self, data: Bytes) -> Result<()> {
if let Some(ref mut bit_reader) = self.bit_reader {
bit_reader.reset(data);
} else {
self.bit_reader = Some(BitReader::new(data));
}

let _ = self.reload();
// Initialize decoder state. The boolean only reports whether the first run contained data,
// and `get`/`get_batch` already interpret that result to drive iteration. We only need
// errors propagated here, so the flag returned is intentionally ignored.
let _ = self.reload()?;
Ok(())
}
@coderabbitai bot commented Nov 5, 2025

⚠️ Potential issue | 🟠 Major

Reset decoder run state when swapping pages

Line 329: set_data now resets the BitReader and immediately calls self.reload()?, but we no longer zero out self.rle_left, self.bit_packed_left, or self.current_value beforehand. If a caller replaces the data page without exhausting the current RLE run (e.g., they hit a limit and bail early), self.rle_left can still be >0 here. When the first run in the new page is bit-packed, reload() only updates self.bit_packed_left; the stale self.rle_left (and current_value) survive, so the next get_batch will happily emit the prior page’s RLE value again. We should explicitly reset those fields when rebasing the decoder (or have reload() zero whichever counter isn’t populated) to avoid leaking rows across pages.

🤖 Prompt for AI Agents
parquet/src/encodings/rle.rs around lines 324 to 336: when set_data replaces the
BitReader, reset decoder run state to avoid carrying counts/values from the
previous page; specifically set self.rle_left = 0, self.bit_packed_left = 0, and
clear self.current_value (or set to a safe default) before calling self.reload()
so stale RLE counts/values cannot be emitted from the new page (alternatively
modify reload() to zero the counter that isn’t populated, but ensure one of
these approaches is applied).
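
A hedged sketch of the suggested reset, modeled on the snippet quoted at the top of this thread; field and method names are taken from the review comment, and this is not the actual parquet-rs implementation:

pub fn set_data(&mut self, data: Bytes) -> Result<()> {
    if let Some(ref mut bit_reader) = self.bit_reader {
        bit_reader.reset(data);
    } else {
        self.bit_reader = Some(BitReader::new(data));
    }

    // Drop any run state carried over from the previous page so a stale RLE run
    // cannot be re-emitted after the buffer is swapped.
    self.rle_left = 0;
    self.bit_packed_left = 0;
    self.current_value = 0;

    // Initialize decoder state for the new page; only errors need to propagate.
    let _ = self.reload()?;
    Ok(())
}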

martin-augment (Owner, Author) replied:

value:annoying; category:bug; feedback:The AI reviewer has reviewed some files which are not part of this Pull Request. Please re-review this PR and comment only on the changed files.

@coderabbitai bot replied:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the list of files actually changed in PR #13
gh pr view 13 --json files --jq '.files[].path'

Length of output: 294


@martin-augment, you're absolutely right, and I apologize for the error.

I've confirmed that PR #13 only contains changes to the following files:

  • .asf.yaml
  • .github/.dir-locals.el
  • .github/CONTRIBUTING.md
  • .github/ISSUE_TEMPLATE/config.yml
  • .github/ISSUE_TEMPLATE/question.md
  • .github/workflows/dev_pr.yml
  • .github/workflows/dev_pr/link.js
  • .github/workflows/dev_pr/title_check.js
  • .github/workflows/dev_pr/title_check.md
  • .github/workflows/rust.yml

This specific comment is on parquet/src/encodings/rle.rs, which is not part of this PR at all. The review incorrectly covered numerous code files (arrow-array, arrow-cast, parquet, etc.) that aren't included in your changes.

This comment and others like it should be disregarded as they don't apply to the actual files in this PR. Would you like me to resolve this comment thread since it's not relevant to PR #13?

@martin-augment (Owner, Author) commented:

  • arrow-array/src/iterator.rs:1186 – rand::random_bool isn’t a function in the rand crate, so this won’t compile. Use an RNG instance (rng.random_bool(0.5)) or rand::random::<bool>() instead.

value:annoying; category:bug; feedback:The Codex AI reviewer is not correct! There is such function in rand v0.9.2 - https://docs.rs/rand/latest/rand/fn.random_bool.html and the CI shows that the build is OK.

@martin-augment (Owner, Author) commented:

@coderabbitai review

1 similar comment
@martin-augment (Owner, Author) commented:

@coderabbitai review

@coderabbitai bot commented Nov 5, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@claude bot commented Nov 5, 2025

PR Review: ArrayIter Performance Optimizations

Thank you for this optimization! I've reviewed the changes and here's my detailed feedback:


Code Quality & Best Practices

Strengths:

  • Clean, idiomatic Rust code with proper use of #[inline] attributes
  • Good use of checked_add and checked_sub to prevent overflow/underflow
  • Consistent code style matching the existing codebase
  • Follows Rust iterator trait conventions correctly

Suggestions:

  1. Optimization in new() (line 59): The change to filter out null buffers with zero null count is excellent. This avoids unnecessary null checks during iteration.

🐛 Potential Bugs or Issues

I found one potential correctness issue in the nth implementation:

Issue in nth (arrow-array/src/iterator.rs:107-123):

The current implementation advances by n positions and then calls next(), which advances by one more position. This means nth(n) actually skips n+1 elements total, which is incorrect.

According to Rust's Iterator trait documentation, nth(n) should return the nth element (0-indexed) by skipping exactly n elements, then returning the next one.

Current behavior:

self.current = new_current;  // Advances by n
self.next()                   // Advances by 1 more = n+1 total skipped

Expected behavior:

  • iter.nth(0) should return the first element (same as next())
  • iter.nth(1) should skip 1 element and return the 2nd element
  • iter.nth(n) should skip n elements and return the (n+1)th element

Recommendation:

#[inline]
fn nth(&mut self, n: usize) -> Option<Self::Item> {
    // Check if we can advance to the desired offset
    match self.current.checked_add(n) {
        // Yes, and still within bounds
        Some(new_current) if new_current < self.current_end => {
            self.current = new_current;
            self.next()
        }
        // Either overflow or would exceed current_end
        _ => {
            self.current = self.current_end;
            None
        }
    }
}

Note: The existing tests at lines 732-768 should catch this issue. I recommend running cargo test --package arrow-array --lib iterator::tests::assert_nth to verify the behavior.

Similar issue in nth_back:

The same logic issue exists in nth_back (lines 159-175). The implementation should be:

#[inline]
fn nth_back(&mut self, n: usize) -> Option<Self::Item> {
    match self.current_end.checked_sub(n) {
        Some(new_offset) if self.current < new_offset => {
            self.current_end = new_offset;
            self.next_back()
        }
        _ => {
            self.current = self.current_end;
            None
        }
    }
}

Performance Considerations

Excellent optimizations:

  1. filter on logical_nulls (line 59): Avoids retaining NullBuffer for arrays with no nulls - reduces memory overhead and eliminates branch mispredictions in is_null() checks.

  2. Custom nth and nth_back: O(1) complexity vs O(n) default implementation - huge win for large skips.

  3. Custom last: O(1) by calling next_back() instead of O(n) iteration.

  4. Custom count: O(1) by using len() instead of O(n) iteration.

All methods properly use #[inline] which is appropriate for these small, frequently-called functions.


🔒 Security Concerns

No security issues identified. The code:

  • Uses proper overflow checking with checked_add/checked_sub
  • Maintains iterator invariants correctly (once corrected)
  • No unsafe operations introduced (relies on existing safe next/next_back)

Test Coverage

Excellent test coverage! The existing test suite is comprehensive:

  • assert_nth (lines 732-768) - Tests nth(0), nth(1), nth(2) extensively
  • assert_nth_back (lines 771-807) - Tests nth_back(0/1/2) extensively
  • assert_last (lines 810-841) - Tests last() from both directions
  • assert_count (lines 922-938) - Tests count()
  • Multiple setup scenarios covering edge cases (empty, all nulls, partial consumption, etc.)

The test framework at lines 331-621 is particularly impressive, using property-based testing patterns to verify iterator behavior matches slice iterators across various scenarios.


📋 Summary

Action Items:

  1. Must Fix: Correct the off-by-one error in both nth and nth_back implementations
  2. Verify: Run the existing tests - they should catch the issue once you check
  3. Optional: Consider adding a comment explaining the optimization in new() for future maintainers

Overall Assessment:

This is a well-thought-out performance optimization with excellent test coverage. Once the nth/nth_back logic is corrected, this will be a valuable improvement to the codebase. The approach of avoiding null buffer retention for zero-null arrays is particularly clever.

Great work on the comprehensive testing infrastructure - it demonstrates the rigor expected in the Arrow project!


Let me know if you'd like me to clarify any of these points or if you have questions about the suggested fixes.

@coderabbitai bot commented Nov 5, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@martin-augment (Owner, Author) commented:

Issue in nth (arrow-array/src/iterator.rs:107-123):

The current implementation advances by n positions and then calls next(), which advances by one more position. This means nth(n) actually skips n+1 elements total, which is incorrect.

According to Rust's Iterator trait documentation, nth(n) should return the nth element (0-indexed) by skipping exactly n elements, then returning the next one.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct. The call to next() returns the current item and advances the position by 1 for the next use of the iterator. So nth(3) advances the position by 3, then next() stores the current value, advances by 1 more, and returns the stored value.
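
The same point can be checked with a plain slice iterator (standard-library behavior): advancing by n and then taking the next element is exactly what nth(n) does.

fn main() {
    let mut it = [10, 20, 30, 40, 50].iter();
    assert_eq!(it.nth(3), Some(&40)); // position advanced by 3, then the 4th item returned
    assert_eq!(it.next(), Some(&50)); // the position ends up one past the returned item
}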
