perf: override ArrayIter default impl for nth, nth_back, last and count #13

martin-augment wants to merge 3 commits into main

Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI. You can disable this status message via the CodeRabbit configuration.

Walkthrough

This pull request introduces comprehensive error handling improvements throughout the Parquet module, refactors in-memory row group handling with new push-based decoder infrastructure, enhances Arrow iterators with Clone derivation and advanced methods, improves schema parsing for List variants, adds scalar-based zipping support, and updates documentation across multiple README files.

Changes
Pull Request Review - PR #13

Summary

This PR implements performance optimizations for `ArrayIter`.

Code Quality & Best Practices

✅ Strengths
```rust
let no_nulls_and_no_duplicates = (0..10).map(Some).collect::<Vec<Option<i32>>>();
let no_nulls_random_values = (0..10)
    .map(|_| rng.random::<i32>())
```
The `rand::Rng` trait doesn't provide `random::<T>()` or `random_bool`; this will fail to compile with rand 0.8+ (also applies to other usages below). Consider using the appropriate `Rng` APIs instead.
value:annoying; category:bug; feedback:The AI reviewer is wrong! The rand crate does provide these functions in version 0.9.2 (the latest available at the moment). All CI checks pass for this PR, so there is no problem with the compilation.
Actionable comments posted: 5
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
arrow-schema/src/datatype_parse.rs (1)
646-662: Restore escape-state reset in `parse_quoted_string`

After reading a backslash we never clear `is_escaped` unless the next character is the quote itself. That means any field name containing a bare `\` (for example the text `List(nullable Int64, field: 'foo\bar')`, which `DataType::to_string()` emits for a list item named `foo\bar`) causes us to treat the closing `'` as escaped and run off the end of the token stream. Please reset the escape flag after consuming a non-delimiter and toggle it for consecutive backslashes so these inputs parse correctly.

```diff
 Some(c) => match c {
     '\\' => {
-        is_escaped = true;
-        self.word.push(c);
+        self.word.push(c);
+        if is_escaped {
+            is_escaped = false;
+        } else {
+            is_escaped = true;
+        }
     }
     c if c == quote => {
         if is_escaped {
             self.word.push(c);
             is_escaped = false;
         } else {
             break;
         }
     }
     c => {
-        self.word.push(c);
+        if is_escaped {
+            is_escaped = false;
+        }
+        self.word.push(c);
     }
 },
```
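For illustration, the corrected escape handling can be modeled as a standalone function (a hypothetical simplification, not the actual `parse_quoted_string` code):

```rust
/// Read characters until an unescaped `quote`, keeping backslashes in the
/// output the way the parser under review does. Returns None if the input
/// ends before the closing quote.
fn parse_quoted(input: &str, quote: char) -> Option<String> {
    let mut word = String::new();
    let mut is_escaped = false;
    for c in input.chars() {
        if c == '\\' {
            word.push(c);
            is_escaped = !is_escaped; // consecutive backslashes toggle the flag
        } else if c == quote && !is_escaped {
            return Some(word); // unescaped closing quote ends the token
        } else {
            word.push(c);
            is_escaped = false; // any other character clears the escape state
        }
    }
    None
}
```

With this shape, an input like `foo\bar'` parses to `foo\bar` instead of running off the end of the token stream.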
🧹 Nitpick comments (1)
parquet/README.md (1)
76-79: Minor: Standardize reference link casing for consistency.

The reference link on line 79 uses lowercase `[implementation status page]` while the link text on line 76 uses `[Implementation Status Page]` (mixed case). Although Markdown treats reference links as case-insensitive, for consistency and maintainability—especially since `arrow/README.md` maintains matching case—please standardize the casing here.

Apply this diff to match the case:

```diff
- [implementation status page]: https://parquet.apache.org/docs/file-format/implementationstatus/
+ [Implementation Status Page]: https://parquet.apache.org/docs/file-format/implementationstatus/
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (42)
- .github/workflows/docs.yml (1 hunks)
- arrow-array/src/iterator.rs (6 hunks)
- arrow-cast/src/cast/run_array.rs (1 hunks)
- arrow-flight/tests/flight_sql_client_cli.rs (7 hunks)
- arrow-schema/src/datatype_display.rs (3 hunks)
- arrow-schema/src/datatype_parse.rs (16 hunks)
- arrow-select/src/zip.rs (10 hunks)
- arrow/README.md (1 hunks)
- arrow/src/lib.rs (0 hunks)
- parquet-variant-compute/src/type_conversion.rs (2 hunks)
- parquet-variant-compute/src/variant_array.rs (20 hunks)
- parquet-variant-compute/src/variant_get.rs (2 hunks)
- parquet/README.md (1 hunks)
- parquet/src/arrow/array_reader/byte_array.rs (2 hunks)
- parquet/src/arrow/array_reader/byte_array_dictionary.rs (1 hunks)
- parquet/src/arrow/array_reader/byte_view_array.rs (2 hunks)
- parquet/src/arrow/array_reader/fixed_len_byte_array.rs (1 hunks)
- parquet/src/arrow/arrow_reader/mod.rs (3 hunks)
- parquet/src/arrow/arrow_reader/read_plan.rs (2 hunks)
- parquet/src/arrow/arrow_reader/selection.rs (0 hunks)
- parquet/src/arrow/async_reader/mod.rs (4 hunks)
- parquet/src/arrow/decoder/dictionary_index.rs (1 hunks)
- parquet/src/arrow/in_memory_row_group.rs (1 hunks)
- parquet/src/arrow/mod.rs (1 hunks)
- parquet/src/arrow/push_decoder/mod.rs (1 hunks)
- parquet/src/arrow/push_decoder/reader_builder/data.rs (1 hunks)
- parquet/src/arrow/push_decoder/reader_builder/filter.rs (1 hunks)
- parquet/src/arrow/push_decoder/reader_builder/mod.rs (1 hunks)
- parquet/src/arrow/push_decoder/remaining.rs (1 hunks)
- parquet/src/arrow/record_reader/definition_levels.rs (1 hunks)
- parquet/src/bin/parquet-layout.rs (5 hunks)
- parquet/src/column/reader.rs (4 hunks)
- parquet/src/column/reader/decoder.rs (6 hunks)
- parquet/src/encodings/decoding.rs (2 hunks)
- parquet/src/encodings/rle.rs (29 hunks)
- parquet/src/file/metadata/mod.rs (0 hunks)
- parquet/src/file/metadata/reader.rs (1 hunks)
- parquet/src/file/serialized_reader.rs (2 hunks)
- parquet/src/lib.rs (2 hunks)
- parquet/src/parquet_thrift.rs (1 hunks)
- parquet/src/util/push_buffers.rs (1 hunks)
- parquet/tests/arrow_reader/predicate_cache.rs (5 hunks)
💤 Files with no reviewable changes (3)
- parquet/src/arrow/arrow_reader/selection.rs
- parquet/src/file/metadata/mod.rs
- arrow/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (25)
parquet/src/arrow/record_reader/definition_levels.rs (1)

- parquet/src/column/reader/decoder.rs (5): set_data (35-35), set_data (115-121), set_data (199-229), set_data (313-316), set_data (417-422)

parquet/src/arrow/arrow_reader/mod.rs (3)

- parquet/src/arrow/async_reader/mod.rs (2): fmt (774-781), fmt (821-829)
- parquet/src/file/writer.rs (1): fmt (173-181)
- parquet/src/arrow/arrow_writer/mod.rs (3): fmt (193-202), fmt (687-691), fmt (807-809)

parquet/src/encodings/rle.rs (4)

- parquet/src/arrow/record_reader/definition_levels.rs (2): set_data (134-140), set_data (274-286)
- parquet/src/column/reader/decoder.rs (5): set_data (35-35), set_data (115-121), set_data (199-229), set_data (313-316), set_data (417-422)
- parquet/src/encodings/decoding.rs (13): set_data (175-175), set_data (306-309), set_data (382-400), set_data (460-479), set_data (665-728), set_data (907-926), set_data (1037-1059), new (290-301), new (362-369), new (449-455), new (561-577), new (894-903), bit_util (469-469)
- parquet/src/util/bit_util.rs (9): bit_reader (742-742), bit_reader (744-744), bit_reader (746-746), bit_reader (748-748), new (190-196), new (384-391), ceil (153-155), buffer (268-270), from (709-711)

parquet/src/arrow/decoder/dictionary_index.rs (1)

- parquet/src/arrow/array_reader/byte_array.rs (7): new (87-99), new (179-186), new (262-294), new (348-360), new (442-474), new (530-535), new (566-570)

parquet/src/arrow/array_reader/byte_array.rs (1)

- parquet/src/arrow/array_reader/byte_view_array.rs (7): new (76-88), new (139-146), new (222-256), new (305-317), new (429-433), new (518-550), new (605-610)

parquet/src/bin/parquet-layout.rs (3)

- parquet/src/file/serialized_reader.rs (7): new (117-119), new (181-189), new (282-304), new (532-546), metadata (245-247), metadata (289-293), metadata (308-310)
- parquet/src/file/metadata/reader.rs (2): new (101-103), metadata_size (566-568)
- parquet/src/file/metadata/mod.rs (9): new (204-213), new (354-356), new (451-459), new (493-513), new (715-725), new (1163-1193), new (1386-1398), new (1552-1560), column (645-647)

parquet/src/arrow/array_reader/fixed_len_byte_array.rs (1)

- parquet/src/arrow/array_reader/byte_array.rs (7): new (87-99), new (179-186), new (262-294), new (348-360), new (442-474), new (530-535), new (566-570)

parquet/src/arrow/push_decoder/remaining.rs (6)

- parquet/src/arrow/arrow_reader/selection.rs (1): row_count (438-440)
- parquet/src/file/metadata/mod.rs (6): new (204-213), new (354-356), new (451-459), new (493-513), row_groups (250-252), row_groups (385-387)
- parquet/src/arrow/push_decoder/reader_builder/data.rs (1): new (138-155)
- parquet/src/arrow/push_decoder/reader_builder/mod.rs (3): new (171-196), push_data (199-201), buffered_bytes (204-206)
- parquet/src/util/push_buffers.rs (2): new (79-86), buffered_bytes (135-137)
- parquet/src/arrow/push_decoder/mod.rs (5): push_data (410-440), ranges (1112-1115), ranges (1120-1123), buffered_bytes (312-314), buffered_bytes (443-454)

parquet/src/arrow/array_reader/byte_view_array.rs (1)

- parquet/src/arrow/array_reader/byte_array.rs (7): new (87-99), new (179-186), new (262-294), new (348-360), new (442-474), new (530-535), new (566-570)

arrow-select/src/zip.rs (4)

- arrow-data/src/transform/mod.rs (1): new (401-403)
- arrow-data/src/data.rs (8): new (1857-1859), new (1919-1932), from (2128-2141), data_type (362-364), data_type (1935-1937), nulls (390-392), nulls (1948-1953), new_null (591-707)
- arrow-select/src/filter.rs (10): new (60-62), new (89-93), new (217-231), new (636-653), vec (1837-1839), from (66-68), iter (595-595), iter (1517-1517), iter (1530-1530), iter (1543-1543)
- arrow-buffer/src/buffer/offset.rs (2): new_zeroed (98-105), from_repeated_length (153-174)

parquet/src/util/push_buffers.rs (3)

- parquet/src/arrow/push_decoder/mod.rs (2): buffered_bytes (312-314), buffered_bytes (443-454)
- parquet/src/arrow/push_decoder/reader_builder/mod.rs (1): buffered_bytes (204-206)
- parquet/src/arrow/push_decoder/remaining.rs (1): buffered_bytes (69-71)

arrow-array/src/iterator.rs (2)

- arrow-array/src/array/null_array.rs (3): logical_nulls (116-118), new (56-58), len (100-102)
- arrow-array/src/array/mod.rs (22): logical_nulls (225-227), logical_nulls (391-393), logical_nulls (461-463), array (944-944), array (953-953), array (962-962), array (970-970), array (980-980), array (994-994), array (1008-1008), array (1020-1020), array (1042-1042), array (1132-1135), iter (601-601), iter (609-611), iter (618-620), iter (632-632), iter (636-638), iter (641-643), len (155-155), len (366-368), len (445-447)

parquet/src/arrow/push_decoder/reader_builder/mod.rs (7)

- parquet/src/file/metadata/mod.rs (6): schema (563-565), new (204-213), new (354-356), new (451-459), new (493-513), row_group (245-247)
- parquet/src/arrow/arrow_reader/read_plan.rs (4): new (40-45), new (160-167), batch_size (247-249), selection (54-56)
- parquet/src/arrow/push_decoder/reader_builder/data.rs (2): new (138-155), needed_ranges (49-55)
- parquet/src/arrow/push_decoder/reader_builder/filter.rs (3): new (58-66), new (82-88), cache_projection (130-132)
- parquet/src/util/push_buffers.rs (2): new (79-86), buffered_bytes (135-137)
- parquet/src/arrow/in_memory_row_group.rs (1): column_chunks (196-228)
- parquet/src/arrow/mod.rs (2): none (284-288), leaves (295-301)

parquet/src/arrow/push_decoder/reader_builder/data.rs (3)

- parquet/src/file/metadata/mod.rs (8): offset_index (270-272), offset_index (417-419), new (204-213), new (354-356), new (451-459), new (493-513), len (889-891), row_group (245-247)
- parquet/src/arrow/in_memory_row_group.rs (2): column_chunks (196-228), len (274-279)
- parquet/src/util/push_buffers.rs (2): new (79-86), len (160-162)

parquet-variant-compute/src/variant_get.rs (2)

- parquet-variant-compute/src/variant_array.rs (2): value (339-341), v (1426-1426)
- arrow-cast/src/cast/mod.rs (1): default (84-89)

parquet/src/arrow/push_decoder/mod.rs (6)

- parquet/src/arrow/arrow_reader/mod.rs (29): ParquetRecordBatchReader (2028-2030), ParquetRecordBatchReader (2061-2063), ParquetRecordBatchReader (2096-2098), ParquetRecordBatchReader (2371-2373), ParquetRecordBatchReader (2428-2430), ParquetRecordBatchReader (4404-4406), ParquetRecordBatchReader (4449-4451), ParquetRecordBatchReader (4481-4483), ParquetRecordBatchReader (4552-4554), metadata (168-170), metadata (683-685), try_new (590-612), try_new (744-746), try_new (1108-1112), build (858-920), new (397-399), new (1143-1154), new (2518-2525), batch_size (1157-1159), batch (2123-2126), batch (2151-2154), batch (2183-2186), batch (2209-2209), batch (2210-2210), batch (2240-2240), batch (2241-2241), batch (2249-2249), batch (2250-2250), batch (2258-2258)
- parquet/src/util/push_buffers.rs (5): file_len (123-125), new (79-86), push_range (101-109), push_ranges (89-98), buffered_bytes (135-137)
- parquet/src/arrow/push_decoder/reader_builder/data.rs (2): build (181-221), new (138-155)
- parquet/src/arrow/push_decoder/reader_builder/filter.rs (3): new (58-66), new (82-88), builder (68-70)
- parquet/src/arrow/push_decoder/reader_builder/mod.rs (3): new (171-196), buffered_bytes (204-206), push_data (199-201)
- parquet/src/arrow/push_decoder/remaining.rs (3): new (49-61), buffered_bytes (69-71), push_data (64-66)

arrow-schema/src/datatype_display.rs (2)

- arrow-schema/src/field.rs (2): metadata (373-375), metadata (963-967)
- arrow-schema/src/schema.rs (2): metadata (81-83), metadata (410-412)

parquet-variant-compute/src/type_conversion.rs (1)

- parquet-variant/src/variant.rs (3): from (75-77), from (1486-1488), from (1492-1497)

parquet/src/column/reader/decoder.rs (1)

- parquet/src/encodings/rle.rs (12): set_data (324-336), decoder (564-564), decoder (581-582), decoder (626-626), decoder (640-640), decoder (665-666), decoder (683-684), decoder (700-700), decoder (716-716), decoder (733-734), decoder (750-751), decoder (1015-1016)

parquet/src/arrow/async_reader/mod.rs (3)

- parquet/src/file/metadata/reader.rs (2): fetch (1021-1023), fetch (1034-1036)
- parquet/src/arrow/async_reader/metadata.rs (2): fetch (69-69), fetch (73-75)
- parquet/tests/arrow_reader/io/async_reader.rs (2): ranges (394-401), ranges (403-409)

parquet-variant-compute/src/variant_array.rs (2)

- parquet-variant/src/variant/decimal.rs (1): try_new (91-91)
- parquet-variant/src/variant.rs (24): try_new (58-66), try_new (294-297), new (316-321), try_from (83-85), try_from (1661-1665), try_from (1671-1675), try_from (1681-1685), from (75-77), from (1486-1488), from (1492-1497), from (1501-1503), from (1507-1509), from (1513-1515), from (1519-1521), from (1525-1532), from (1536-1543), from (1546-1553), from (1557-1565), from (1569-1571), from (1575-1577), from (1581-1583), from (1587-1589), from (1593-1595), metadata (1436-1442)

parquet/tests/arrow_reader/predicate_cache.rs (1)

- parquet/src/arrow/mod.rs (1): leaves (295-301)

parquet/src/arrow/push_decoder/reader_builder/filter.rs (2)

- parquet/src/arrow/push_decoder/reader_builder/data.rs (1): new (138-155)
- parquet/src/arrow/push_decoder/reader_builder/mod.rs (1): new (171-196)

arrow-schema/src/datatype_parse.rs (2)

- arrow-schema/src/field.rs (5): data_type (411-413), new (192-202), fmt (64-100), fmt (972-1002), new_list_field (218-220)
- arrow-schema/src/datatype_display.rs (2): fmt (23-178), fmt (185-195)

parquet/src/arrow/in_memory_row_group.rs (2)

- parquet/src/file/metadata/mod.rs (3): column (645-647), offset_index (270-272), offset_index (417-419)
- parquet/src/file/page_index/offset_index.rs (1): page_locations (66-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: codex
- GitHub Check: claude-review
🔇 Additional comments (44)
arrow-flight/tests/flight_sql_client_cli.rs (2)
49-52: LGTM! Clean DRY refactoring.

The helper function centralizes the command construction pattern, eliminating duplication across multiple test functions. The implementation correctly uses the `cargo_bin!` macro and is well-documented.

61-61: LGTM! Consistent application of the helper.

All test functions have been consistently updated to use the `flight_sql_client_cmd()` helper, maintaining the same test behavior while improving code maintainability.

Also applies to: 101-101, 139-139, 178-178, 216-216, 256-256, 297-297
arrow-schema/src/datatype_display.rs (4)
24-26: LGTM! Clean refactor to delegate formatting.

The function now properly delegates to the `FormatMetadata` adapter, centralizing the formatting logic.
181-196: LGTM! Well-designed formatting adapter.

The `FormatMetadata` adapter correctly:

- Omits output for empty metadata
- Sorts entries by key for deterministic output
- Uses `debug_map()` for consistent formatting

The Vec allocation and sort on each display is acceptable given that metadata maps are typically small.
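As a standalone illustration of the adapter pattern described above (a hedged sketch; the real `FormatMetadata` lives in `arrow-schema` and may differ in detail):

```rust
use std::collections::HashMap;
use std::fmt;

/// Display adapter: empty metadata prints nothing; non-empty metadata prints
/// as a debug map with keys in sorted (deterministic) order.
struct FormatMetadata<'a>(&'a HashMap<String, String>);

impl fmt::Display for FormatMetadata<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.0.is_empty() {
            return Ok(()); // omit output entirely for empty metadata
        }
        let mut entries: Vec<_> = self.0.iter().collect();
        entries.sort_by(|a, b| a.0.cmp(b.0)); // sort by key for stable output
        f.debug_map().entries(entries).finish()
    }
}
```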
377-380: LGTM! Test correctly validates sorted multi-entry metadata.

The test properly verifies that multiple metadata entries are formatted with keys in sorted order.
Also applies to: 386-386
1-196: Note: PR title/description inconsistency.

The PR title mentions "override `ArrayIter` default impl for `nth`, `nth_back`, `last` and `count`", but this file contains display formatting refactors. This suggests either:
- The PR includes unrelated changes, or
- The PR description doesn't match the actual changes
Please verify the PR scope and update the title/description accordingly.
parquet/src/lib.rs (4)
76-90: Good documentation clarity improvement.

The clarification that async APIs require both `arrow` and `async` features is helpful. The inline code formatting for `RecordBatch` improves readability.
110-110: Reference link is appropriate.

Adding the explicit `RecordBatch` reference link enables proper cross-referencing in generated documentation. The use of "latest" in the docs.rs URL is standard practice.
1-221: Discrepancy between PR title and file changes.

The PR title indicates this PR implements "perf: override `ArrayIter` default impl for `nth`, `nth_back`, `last` and `count`", but the provided file contains only documentation updates. The AI summary also describes broader changes (error handling, push-decoder infrastructure, Clone derivation) not visible in this file.

Confirm whether:
- This is the only file changed in this PR
- The other files implementing the ArrayIter performance improvements are also included in this review
55-74: No issues found with documentation links.

All five types referenced in the documentation are properly exported and match their documented module paths:

- `ArrowWriter`, `ArrowColumnWriter` via `arrow::arrow_writer`
- `ParquetRecordBatchReaderBuilder`, `RowFilter` via `arrow::arrow_reader` (`RowFilter` is re-exported from the `filter` submodule)
- `ParquetPushDecoder` via `arrow::push_decoder`

The documentation is accurate and follows Rust doc conventions.
arrow/README.md (1)
32-37: Documentation enhancement looks good.

The new "Arrow Implementation Status" section is well-placed and follows proper Markdown reference link syntax with consistent casing. The link provides helpful navigation for users to check implementation status.
.github/workflows/docs.yml (1)
74-74: Verify breaking changes in `actions/download-artifact@v6` before upgrading.

v6 requires Node.js v24.x (marked as BREAKING CHANGE), and `@actions/artifact` was bumped to v4.0.0, which may affect behavior/compatibility. The README was updated with artifact extraction details and clarifications about v5 extraction behavior—review these changes to ensure the workflow's artifact handling (lines 75-77) and deployment steps remain compatible with v6's new extraction behavior.
parquet/src/parquet_thrift.rs (1)
278-294: LGTM! Clean fast-path optimization for VLQ decoding.

The optimization correctly handles the common case of single-byte VLQ values (0-127) by checking the high bit and returning immediately. For multi-byte values, the first byte is processed before entering the loop, which is logically equivalent to the previous implementation but avoids unnecessary loop overhead for the fast path.

The logic is correct:

- Single byte: `byte & 0x80 == 0` means the high bit is clear, so the full byte value is the result
- Multi-byte: the first byte's lower 7 bits initialize `in_progress`, then the loop processes remaining bytes starting at bit position 7
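For reference, the shape of that fast path is roughly the following (a sketch only; `read_vlq` and its signature are assumptions, not the actual `parquet_thrift.rs` internals):

```rust
/// Decode one unsigned VLQ/ULEB128 value, returning (value, bytes consumed).
fn read_vlq(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    // Fast path: high bit clear, so the whole value fits in this byte.
    if first & 0x80 == 0 {
        return Some((u64::from(first), 1));
    }
    // Slow path: the first byte's lower 7 bits seed the accumulator,
    // and each following byte contributes 7 more bits starting at bit 7.
    let mut in_progress = u64::from(first & 0x7F);
    let mut shift = 7;
    for (i, &byte) in buf[1..].iter().enumerate() {
        if shift >= 64 {
            return None; // malformed: value does not fit in a u64
        }
        in_progress |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some((in_progress, i + 2));
        }
        shift += 7;
    }
    None // input ended mid-value
}
```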
arrow-array/src/iterator.rs (8)

47-47: LGTM: Clone derivation enables flexible iterator usage.

The `Clone` derivation is a valuable addition that allows `ArrayIter` to be used in patterns requiring iterator duplication (e.g., lookahead, backtracking).
59-59: LGTM: Filtering empty null buffers improves memory efficiency.

Storing only non-empty null buffers reduces memory overhead when arrays have no nulls, while the `is_null` method correctly handles the `None` case.
101-103: LGTM: size_hint now correctly uses iterator bounds.

Using `current_end - current` instead of `array.len() - current` properly accounts for consumption from both ends of the iterator.
106-123: LGTM: nth implementation is correct and safe.

The implementation properly handles overflow, bounds checking, and delegates to `next()` for consistent null handling. This will improve performance over the default iterator implementation which would call `next()` n+1 times.
125-128: LGTM: last delegates efficiently to next_back.

This is the optimal implementation for a `DoubleEndedIterator` with `ExactSizeIterator` bounds.
130-136: LGTM: count leverages ExactSizeIterator for O(1) performance.

This avoids unnecessary iteration and provides constant-time counting.
158-175: LGTM: nth_back implementation is correct and safe.

The implementation properly handles underflow, bounds checking, and delegates to `next_back()` for consistent null handling. Performance improvement over the default implementation is significant.
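Taken together, the overrides reviewed above reduce to O(1) cursor arithmetic. A standalone sketch of the idea (a simplified model over a slice, not the actual `arrow-array` implementation, which also tracks a null buffer):

```rust
/// A two-cursor iterator mirroring the `current`/`current_end` bookkeeping
/// the review comments describe.
struct CursorIter<'a, T> {
    values: &'a [T],
    current: usize,
    current_end: usize,
}

impl<'a, T> Iterator for CursorIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<Self::Item> {
        if self.current == self.current_end {
            return None;
        }
        let item = &self.values[self.current];
        self.current += 1;
        Some(item)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let len = self.current_end - self.current;
        (len, Some(len))
    }

    fn nth(&mut self, n: usize) -> Option<Self::Item> {
        // Jump the front cursor forward by n, then yield one element, so
        // n + 1 items are consumed in total, as the Iterator contract requires.
        match self.current.checked_add(n) {
            Some(new_current) if new_current < self.current_end => {
                self.current = new_current;
                self.next()
            }
            _ => {
                self.current = self.current_end; // exhaust on out-of-bounds
                None
            }
        }
    }

    fn last(mut self) -> Option<Self::Item> {
        self.next_back() // O(1) via the back cursor
    }

    fn count(self) -> usize {
        self.current_end - self.current // O(1) from the exact bounds
    }
}

impl<'a, T> DoubleEndedIterator for CursorIter<'a, T> {
    fn next_back(&mut self) -> Option<Self::Item> {
        if self.current == self.current_end {
            return None;
        }
        self.current_end -= 1;
        Some(&self.values[self.current_end])
    }
}
```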
199-1192: Excellent test coverage validates iterator semantics.

The test infrastructure is comprehensive and well-designed:
- Validates ArrayIter behavior matches standard slice iterators across diverse scenarios
- Tests various null distributions, value patterns, and consumption patterns
- Verifies state consistency after operations
- Provides strong confidence in the correctness of the new implementations
arrow-cast/src/cast/run_array.rs (1)
139-141: LGTM: Preallocating capacity improves performance.

Reserving capacity upfront avoids multiple reallocations since exactly `partitions.len()` elements will be pushed to each vector.

parquet/src/encodings/decoding.rs (1)
396-396: LGTM! Proper error propagation.

The addition of the `?` operator correctly propagates errors from `RleDecoder::set_data`, ensuring that initialization failures bubble up to callers rather than being silently ignored.

Also applies to: 476-476
parquet/src/arrow/mod.rs (1)
193-200: LGTM! Module declarations for push decoder infrastructure.

The new `push_decoder` public module, `in_memory_row_group` internal module, and `Debug` import are properly declared and align with the push-based decoder infrastructure described in the PR summary.

parquet/src/arrow/array_reader/byte_array_dictionary.rs (1)
296-296: LGTM! Consistent error propagation.

The addition of the `?` operator correctly propagates errors from `RleDecoder::set_data` in the dictionary decoding path, consistent with the error handling improvements across the codebase.

parquet/src/arrow/decoder/dictionary_index.rs (1)
45-57: Thanks for surfacing decoder initialisation errors

Line 45: Converting `DictIndexDecoder::new` into a fallible constructor and propagating the `set_data` failure path keeps us from panicking on truncated dictionary index buffers. Looks great.

parquet/src/arrow/arrow_reader/read_plan.rs (1)
31-80: Great to have ReadPlan(Builder) implement Debug/Clone

Line 31: Deriving `Clone`/`Debug` and ungating `selection()`/`num_rows_selected()` outside `cfg(async)` gives the new `ParquetRecordBatchReader` `Debug` impl everything it needs. 👍

parquet/src/bin/parquet-layout.rs (1)
107-204: Footer metadata visibility is a nice addition

Line 107: Capturing `metadata_size` before `finish()` and emitting it through the new `Footer` struct makes the CLI output more self-contained without re-reading the file. Implementation is tidy.

parquet/src/util/push_buffers.rs (1)
134-156: Appreciate the buffer accounting helpers

Line 135: `buffered_bytes()` and `clear_ranges()` round out the state needed by the push decoder, and the implementation keeps the buffers/ranges in sync. 👍

parquet/src/arrow/array_reader/fixed_len_byte_array.rs (1)
384-386: Callers now respect the fallible dict index constructor

Line 384: Thanks for switching the dictionary path over to `DictIndexDecoder::new(...)?`; fixed-length byte arrays now pick up the improved error reporting as well.

parquet/src/column/reader.rs (2)
439-470: LGTM: Proper error propagation for level decoders.

The addition of error propagation via `?` on `set_data` calls for repetition and definition level decoders improves error handling for DataPage v1. This ensures that any errors during level decoder initialization are properly bubbled up rather than silently failing.
505-528: LGTM: Consistent error handling for DataPage v2.

Error propagation is consistently applied to both repetition and definition level decoders in the DataPage v2 path, matching the improvements made to the v1 path.
parquet/src/arrow/array_reader/byte_array.rs (2)
276-278: LGTM: Error propagation from dictionary decoder construction.

The change to propagate errors from `ByteArrayDecoderDictionary::new` aligns with the broader error handling improvements, ensuring that failures during dictionary index decoder initialization are properly surfaced.
565-570: LGTM: Updated constructor signature.

The constructor now returns `Result<Self>` to propagate errors from `DictIndexDecoder::new`, which is consistent with the error handling pattern used in the parallel `ByteViewArrayDecoderDictionary::new` implementation.

parquet/tests/arrow_reader/predicate_cache.rs (3)
84-95: LGTM: Comprehensive test for nested column handling.

The test verifies that the predicate cache behaves correctly with nested columns, covering both sync and async readers. The expectation of 0 records read from cache is appropriate for nested columns.
121-131: LGTM: Well-structured test data setup.

The `new_nested()` constructor provides a clean way to create test data with nested struct columns (a string column and a struct with two string fields), which is essential for testing nested column scenarios.
289-323: LGTM: Appropriate test data structure.

The nested test data with a top-level string column `a` and a struct column `b { aa, bb }` provides good coverage for testing predicate filtering on nested leaves.

parquet/src/arrow/async_reader/mod.rs (1)
977-996: LGTM: Cleaner fetch implementation.

The refactored `fetch` method delegates to `fetch_ranges` and `fill_column_chunks`, replacing the previous scattered per-chunk page planning logic. This streamlined approach improves maintainability by centralizing the fetch-and-fill workflow in `InMemoryRowGroup`.

parquet/src/arrow/push_decoder/remaining.rs (2)
49-61: LGTM: Clean state machine initialization.

The constructor properly initializes all state including the row group queue, selection, and reader builder. Using `VecDeque` for row groups is appropriate for efficient front-popping during iteration.
76-117: LGTM: Well-structured state machine logic.

The `try_next_reader` method implements a clear loop-based advancement through row groups:
- Attempts to build a reader for the current row group
- Returns data requests when needed
- Advances to the next row group when current is finished
- Properly handles row count overflow and selection splitting
The error handling at lines 105-110 for row count conversion is particularly good, preventing potential overflow issues.
parquet/src/arrow/array_reader/byte_view_array.rs (1)
236-240: Nice job propagating dictionary init errors

Line 237 now returns the `DictIndexDecoder::new` failure up to callers, keeping the byte-view path consistent with the other decoders and preventing silent corruption when the index stream is invalid.

parquet/src/arrow/push_decoder/reader_builder/data.rs (1)
108-110: Appreciate the proactive buffer cleanup

Line 109 clearing the fulfilled ranges keeps `PushBuffers` bounded for long-lived decodes and matches the cache eviction expectations elsewhere.

parquet/src/arrow/push_decoder/mod.rs (1)
555-594: Test coverage for partial pushes is excellent

Lines 555-594 simulate staggered range delivery and assert the decoder resumes correctly, giving strong confidence in the new state machine.
parquet/src/arrow/push_decoder/reader_builder/filter.rs (1)
90-138: The predicate state progression is tidy

Lines 90-138 centralize predicate iteration and cache wiring, keeping the RowGroup reader logic much easier to follow.
```diff
 } else if !cast_options.safe {
     let value = target.try_value(i)?;
     builder.append_value(value)?;
 } else {
-    builder.append_value(target.value(i))?;
+    let _ = match target.try_value(i) {
+        Ok(v) => builder.append_value(v)?,
+        Err(_) => {
+            builder.append_null()?;
+            false // add this to make match arms have the same return type
+        }
+    };
 }
```
Compile error in safe null-substitution branch
When `cast_options.safe` is true, the match expression mixes `()` from the success arm with `bool` from the error arm. This does not type-check, so the crate won't compile.

Please restructure the branch so both arms return `()`. For example:
```diff
- let _ = match target.try_value(i) {
-     Ok(v) => builder.append_value(v)?,
-     Err(_) => {
-         builder.append_null()?;
-         false // add this to make match arms have the same return type
-     }
- };
+ match target.try_value(i) {
+     Ok(v) => builder.append_value(v)?,
+     Err(_) => builder.append_null()?,
+ }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
} else if !cast_options.safe {
    let value = target.try_value(i)?;
    builder.append_value(value)?;
} else {
    match target.try_value(i) {
        Ok(v) => builder.append_value(v)?,
        Err(_) => builder.append_null()?,
    }
}
```
🤖 Prompt for AI Agents

```
In parquet-variant-compute/src/variant_get.rs around lines 145-156 the safe
null-substitution branch uses a match whose Ok arm yields () while the Err arm
yields a bool, causing a type error; change the control flow so both arms return
unit: call target.try_value(i) once with match or if let, in the Ok arm call
builder.append_value(v)?; in the Err arm call builder.append_null()?; and do not
return any boolean value (ensure both arms end with unit and propagate errors
with ?).
```
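The pattern the suggestion converges on (append on success, substitute null on failure, with both arms returning `()`) can be shown standalone, using `Vec<Option<i32>>` as a stand-in for the real array builder:

```rust
/// Safe-cast loop: on conversion failure, push null instead of erroring.
/// Both match arms evaluate to (), so the match type-checks.
fn safe_append(values: &[Result<i32, String>], out: &mut Vec<Option<i32>>) {
    for v in values {
        match v {
            Ok(x) => out.push(Some(*x)),
            Err(_) => out.push(None),
        }
    }
}
```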
```rust
/// * decoder API: [`ParquetDecoderBuilder::new`]
///
```
Doc link points to a non-existent builder name
Line 61 still advertises `ParquetDecoderBuilder::new`, but I can't find that symbol exported anywhere—the builder lives at `ParquetPushDecoderBuilder::new`. Without an alias this will send readers hunting for a non-existent type. Could we rename the bullet/link text (or add an alias) so the docs match the actual API surface?
🤖 Prompt for AI Agents

```
In parquet/src/arrow/arrow_reader/mod.rs around lines 61 to 62, the docs
reference ParquetDecoderBuilder::new which does not exist; update the
documentation to point to the real builder name ParquetPushDecoderBuilder::new
(or add a public alias named ParquetDecoderBuilder that re-exports
ParquetPushDecoderBuilder) so the doc link resolves to an actual exported
symbol; choose either (1) change the bullet/link text and link target to
ParquetPushDecoderBuilder::new, or (2) add pub use
crate::path::ParquetPushDecoderBuilder as ParquetDecoderBuilder; re-run docs
build to confirm the link now resolves.
```
```rust
            length: metadata.column(idx).byte_range().1 as usize,
            data: offsets
                .into_iter()
                .map(|x| x as usize)
                .zip(chunks.into_iter())
                .collect(),
        }))
    }
}
} else {
    for (idx, chunk) in self.column_chunks.iter_mut().enumerate() {
        if chunk.is_some() || !projection.leaf_included(idx) {
            continue;
        }

        if let Some(data) = chunk_data.next() {
            *chunk = Some(Arc::new(ColumnChunkData::Dense {
                offset: metadata.column(idx).byte_range().0 as usize,
                data,
            }));
```
Avoid truncating offsets on 32-bit builds.
Casting `u64` offsets and lengths to `usize` silently drops the high bits on 32-bit targets (e.g., WASM). That reintroduces the >4 GiB failure modes the rest of this reader carefully avoids: sparse lookups start failing with "Invalid offset …" once the file crosses that boundary. Please keep offsets/lengths in `u64` storage and only down-cast when slicing the `Bytes`, with checked conversion.
```diff
@@
-use bytes::{Buf, Bytes};
+use bytes::{Buf, Bytes};
+use std::convert::TryFrom;
@@
-            *chunk = Some(Arc::new(ColumnChunkData::Sparse {
-                length: metadata.column(idx).byte_range().1 as usize,
-                data: offsets
-                    .into_iter()
-                    .map(|x| x as usize)
-                    .zip(chunks.into_iter())
-                    .collect(),
-            }))
+            *chunk = Some(Arc::new(ColumnChunkData::Sparse {
+                length: metadata.column(idx).byte_range().1,
+                data: offsets.zip(chunks.into_iter()).collect(),
+            }))
@@
-            *chunk = Some(Arc::new(ColumnChunkData::Dense {
-                offset: metadata.column(idx).byte_range().0 as usize,
+            *chunk = Some(Arc::new(ColumnChunkData::Dense {
+                offset: metadata.column(idx).byte_range().0,
                 data,
             }));
@@
-pub(crate) enum ColumnChunkData {
+pub(crate) enum ColumnChunkData {
@@
-        length: usize,
-        data: Vec<(usize, Bytes)>,
+        length: u64,
+        data: Vec<(u64, Bytes)>,
@@
-    Dense { offset: usize, data: Bytes },
+    Dense { offset: u64, data: Bytes },
@@
-            ColumnChunkData::Sparse { data, .. } => data
-                .binary_search_by_key(&start, |(offset, _)| *offset as u64)
+            ColumnChunkData::Sparse { data, .. } => data
+                .binary_search_by_key(&start, |(offset, _)| *offset)
@@
-            ColumnChunkData::Dense { offset, data } => {
-                let start = start as usize - *offset;
-                Ok(data.slice(start..))
+            ColumnChunkData::Dense { offset, data } => {
+                let relative = start.checked_sub(*offset).ok_or_else(|| {
+                    ParquetError::General(format!(
+                        "Invalid offset in dense column chunk data: {start}"
+                    ))
+                })?;
+                let relative = usize::try_from(relative).map_err(|_| {
+                    ParquetError::General(format!(
+                        "Offset {start} exceeds platform address space"
+                    ))
+                })?;
+                Ok(data.slice(relative..))
             }
@@
-            ColumnChunkData::Sparse { length, .. } => *length as u64,
-            ColumnChunkData::Dense { data, .. } => data.len() as u64,
+            ColumnChunkData::Sparse { length, .. } => *length,
+            ColumnChunkData::Dense { data, .. } => data.len() as u64,
```

Also applies to: 235-252
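The checked-narrowing idea behind this diff can be demonstrated in isolation (an illustrative sketch, not the actual `ColumnChunkData` code; the error type is simplified to `String`):

```rust
/// Keep file offsets as u64 and only narrow to usize at the final slice,
/// failing loudly instead of silently truncating on 32-bit targets.
fn slice_at(data: &[u8], start: u64, chunk_offset: u64) -> Result<&[u8], String> {
    let relative = start
        .checked_sub(chunk_offset)
        .ok_or_else(|| format!("invalid offset {start}"))?;
    let relative = usize::try_from(relative)
        .map_err(|_| format!("offset {start} exceeds platform address space"))?;
    data.get(relative..)
        .ok_or_else(|| format!("offset {start} past end of chunk"))
}
```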
```rust
pub(crate) fn try_build(
    &mut self,
) -> Result<DecodeResult<ParquetRecordBatchReader>, ParquetError> {
    loop {
        let current_state = self.take_state()?;
        // Try to transition the decoder.
        match self.try_transition(current_state)? {
            // Either produced a batch reader, needed input, or finished
            NextState {
                next_state,
                result: Some(result),
            } => {
                // put back the next state
                self.state = Some(next_state);
                return Ok(result);
            }
            // completed one internal state, maybe can proceed further
            NextState {
                next_state,
                result: None,
            } => {
                // continue processing
                self.state = Some(next_state);
            }
        }
    }
}
```
Restore builder state before propagating transition errors
Line 259 consumes `self.state` via `take_state()`, but the `?` on Line 261 returns immediately when `try_transition` fails, so nothing ever writes a value back. The next caller then hits the guard inside `take_state` and receives the synthetic "RowGroupReader in invalid state" instead of the real error, and the builder stays poisoned thereafter. Please make sure we always park a state before leaving this method—for example, stash `current_state` in an `Option`, reinsert it in the `Err` branch, and only consume it once the transition succeeds.
```diff
-            match self.try_transition(current_state)? {
+            let mut current_state = Some(current_state);
+            let transition = self.try_transition(current_state.take().unwrap());
+            let next = match transition {
+                Ok(next) => next,
+                Err(err) => {
+                    self.state = current_state;
+                    return Err(err);
+                }
+            };
+            match next {
                 // existing arms…
             }
```

Committable suggestion skipped: line range outside the PR's diff.
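The take-then-restore pattern the comment asks for looks like this on a toy state machine (names are illustrative, not the actual parquet-rs builder):

```rust
struct Machine {
    state: Option<u32>,
}

impl Machine {
    fn step(&mut self) -> Result<u32, String> {
        // Take the state out, as try_build does...
        let current = self.state.take().ok_or("machine in invalid state")?;
        match Self::transition(current) {
            Ok(next) => {
                self.state = Some(next); // park the new state before returning
                Ok(next)
            }
            Err(e) => {
                // ...but restore it on error, so the next call sees the
                // real failure instead of a synthetic "invalid state".
                self.state = Some(current);
                Err(e)
            }
        }
    }

    fn transition(current: u32) -> Result<u32, String> {
        current.checked_add(1).ok_or_else(|| "overflow".to_string())
    }
}
```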
```diff
 pub fn set_data(&mut self, data: Bytes) -> Result<()> {
     if let Some(ref mut bit_reader) = self.bit_reader {
         bit_reader.reset(data);
     } else {
         self.bit_reader = Some(BitReader::new(data));
     }

-    let _ = self.reload();
+    // Initialize decoder state. The boolean only reports whether the first run contained data,
+    // and `get`/`get_batch` already interpret that result to drive iteration. We only need
+    // errors propagated here, so the flag returned is intentionally ignored.
+    let _ = self.reload()?;
     Ok(())
 }
```
Reset decoder run state when swapping pages
Line 329: `set_data` now resets the `BitReader` and immediately calls `self.reload()?`, but we no longer zero out `self.rle_left`, `self.bit_packed_left`, or `self.current_value` beforehand. If a caller replaces the data page without exhausting the current RLE run (e.g., they hit a limit and bail early), `self.rle_left` can still be >0 here. When the first run in the new page is bit-packed, `reload()` only updates `self.bit_packed_left`; the stale `self.rle_left` (and `current_value`) survive, so the next `get_batch` will happily emit the prior page's RLE value again. We should explicitly reset those fields when rebasing the decoder (or have `reload()` zero whichever counter isn't populated) to avoid leaking rows across pages.
🤖 Prompt for AI Agents

```
parquet/src/encodings/rle.rs around lines 324 to 336: when set_data replaces the
BitReader, reset decoder run state to avoid carrying counts/values from the
previous page; specifically set self.rle_left = 0, self.bit_packed_left = 0, and
clear self.current_value (or set to a safe default) before calling self.reload()
so stale RLE counts/values cannot be emitted from the new page (alternatively
modify reload() to zero the counter that isn't populated, but ensure one of
these approaches is applied).
```
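A toy model of the suggested reset (field names follow the review text, not necessarily the real `RleDecoder`):

```rust
struct ToyRleDecoder {
    rle_left: u32,
    bit_packed_left: u32,
    current_value: Option<u64>,
    data: Vec<u8>,
}

impl ToyRleDecoder {
    /// Swap in a new page and zero out run state so an unfinished run from
    /// the previous page cannot leak values into the new one.
    fn set_data(&mut self, data: Vec<u8>) {
        self.data = data;
        self.rle_left = 0;
        self.bit_packed_left = 0;
        self.current_value = None;
    }
}
```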
value:annoying; category:bug; feedback:The AI reviewer has reviewed some files which are not part of this Pull Request. Please re-review this PR and comment only on the changed files.
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Get the list of files actually changed in PR #13
gh pr view 13 --json files --jq '.files[].path'
```
Length of output: 294
@martin-augment, you're absolutely right, and I apologize for the error.
I've confirmed that PR #13 only contains changes to the following files:
- .asf.yaml
- .github/.dir-locals.el
- .github/CONTRIBUTING.md
- .github/ISSUE_TEMPLATE/config.yml
- .github/ISSUE_TEMPLATE/question.md
- .github/workflows/dev_pr.yml
- .github/workflows/dev_pr/link.js
- .github/workflows/dev_pr/title_check.js
- .github/workflows/dev_pr/title_check.md
- .github/workflows/rust.yml
This specific comment is on `parquet/src/encodings/rle.rs`, which is not part of this PR at all. The review incorrectly covered numerous code files (arrow-array, arrow-cast, parquet, etc.) that aren't included in your changes.
This comment and others like it should be disregarded as they don't apply to the actual files in this PR. Would you like me to resolve this comment thread since it's not relevant to PR #13?
value:annoying; category:bug; feedback:The Codex AI reviewer is not correct! There is such a function in rand v0.9.2 - https://docs.rs/rand/latest/rand/fn.random_bool.html and the CI shows that the build is OK.
@coderabbitai review
1 similar comment
@coderabbitai review
✅ Actions performed

Review triggered.
PR Review: ArrayIter Performance Optimizations

Thank you for this optimization! I've reviewed the changes and here's my detailed feedback:

✅ Code Quality & Best Practices

Strengths:
Suggestions:
🐛 Potential Bugs or Issues

I found one potential correctness issue in the `nth` implementation.

Issue in `nth`: The current implementation advances by `n` and then calls `next()`. According to Rust's Iterator trait documentation, `nth(n)` should skip `n` elements.

Current behavior:

```rust
self.current = new_current; // Advances by n
self.next() // Advances by 1 more = n+1 total skipped
```

Expected behavior:
Recommendation:

```rust
#[inline]
fn nth(&mut self, n: usize) -> Option<Self::Item> {
    // Check if we can advance to the desired offset
    match self.current.checked_add(n) {
        // Yes, and still within bounds
        Some(new_current) if new_current < self.current_end => {
            self.current = new_current;
            self.next()
        }
        // Either overflow or would exceed current_end
        _ => {
            self.current = self.current_end;
            None
        }
    }
}
```

Note: The existing tests at lines 732-768 should catch this issue. I recommend running them to verify.

Similar issue in `nth_back`: The same logic issue exists in the `nth_back` implementation:

```rust
#[inline]
fn nth_back(&mut self, n: usize) -> Option<Self::Item> {
    match self.current_end.checked_sub(n) {
        Some(new_offset) if self.current < new_offset => {
            self.current_end = new_offset;
            self.next_back()
        }
        _ => {
            self.current = self.current_end;
            None
        }
    }
}
```

⚡ Performance Considerations

Excellent optimizations:
All methods properly use `#[inline]` annotations.

🔒 Security Concerns

No security issues identified. The code:
✅ Test Coverage

Excellent test coverage! The existing test suite is comprehensive:
The test framework at lines 331-621 is particularly impressive, using property-based testing patterns to verify iterator behavior matches slice iterators across various scenarios.

📋 Summary

Action Items:
Overall Assessment: This is a well-thought-out performance optimization with excellent test coverage. Once the `nth`/`nth_back` issue is addressed, this will be ready to merge.

Great work on the comprehensive testing infrastructure - it demonstrates the rigor expected in the Arrow project! Let me know if you'd like me to clarify any of these points or if you have questions about the suggested fixes.
✅ Actions performed

Review triggered.
value:annoying; category:bug; feedback:The Claude AI reviewer is not correct. The call to next() after advancing by n means n + 1 elements are consumed in total, which is exactly what the Iterator contract for nth requires.
To review by AI
Note

Optimizes `ArrayIter` via custom implementations of `nth`, `nth_back`, `last`, and `count`, and avoids retaining a `NullBuffer` when there are no nulls.

- Uses `logical_nulls` only when `null_count() > 0`, skipping zero-null buffers.
- Overrides the `Iterator` methods `nth`, `last`, and `count` with bound-checked fast paths.
- Overrides `DoubleEndedIterator::nth_back` with a symmetric fast path.

Written by Cursor Bugbot for commit b736a6e. This will update automatically on new commits.