8790: Implement a Vec<RecordBatch> wrapper for pyarrow.Table convenience #14
martin-augment wants to merge 3 commits into main
Conversation
Commits:
- CQ fixes
- CQ fix
- CQ fix
- Let `Table` be a combination of `Vec<RecordBatch>` and `SchemaRef` instead
- `cargo fmt`
- Overhauled `Table` definition, Added tests
- Add empty `Table` integration test
- Update `arrow-pyarrow`'s crate documentation
- Overhaul documentation even more
- Typo fix
```rust
        .data_type()
        .equals_datatype(right_field.data_type())
    })
}
```
Bug: Incomplete Field Validation
The schema_equals function incorrectly returns true when comparing schemas with different field counts if all fields in the shorter schema match. The zip iterator stops at the shorter of the two iterators, so comparing a 3-field schema against a 2-field schema only checks the first 2 fields. Schemas with different field counts are not equal and should fail validation.
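To make the failure mode concrete, here is a tiny standalone illustration (not from the PR) of how `Iterator::zip` silently ignores the extra elements of the longer side:

```rust
fn main() {
    let left = ["a", "b", "c"]; // three fields
    let right = ["a", "b"];     // two fields
    // zip stops as soon as the shorter iterator is exhausted,
    // so only two pairs are ever compared and the third field is never checked.
    let compared = left.iter().zip(right.iter()).count();
    assert_eq!(compared, 2);
}
```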
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
```rust
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    left.fields
        .iter()
        .zip(right.fields.iter())
```
The schema_equals helper zips field iterators without checking field counts, so schemas with differing numbers of fields could incorrectly be treated as equal; consider verifying left.fields.len() == right.fields.len() before zipping. This would prevent accepting mismatched schemas in Table::try_new.
value:useful; category:bug; feedback:The Augment AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
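For reference, a minimal sketch of the fix both reviewers describe (the same change appears as a suggested diff further down). It rejects differing field counts before the per-field comparison, though it still ignores nullability and metadata:

```rust
use arrow_schema::SchemaRef;

// Sketch of the corrected helper: schemas with different field counts can
// never be equal, so bail out before zipping the field iterators.
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    if left.fields().len() != right.fields().len() {
        return false;
    }
    left.fields()
        .iter()
        .zip(right.fields().iter())
        .all(|(left_field, right_field)| {
            left_field.name() == right_field.name()
                && left_field
                    .data_type()
                    .equals_datatype(right_field.data_type())
        })
}
```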
```rust
//! have `Vec<RecordBatch>` on the Rust side and want to export that in bulk as a `pyarrow.Table`.
//! In general, it is recommended to use streaming approaches instead of dealing with data in bulk.
//! For example, a `pyarrow.Table` (or any other object that implements the ArrayStream PyCapsule
//! interface) can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>>` instead of
```
Minor doc typo: the generic in PyArrowType<ArrowArrayStreamReader>> has an extra closing >; should be PyArrowType<ArrowArrayStreamReader>.
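For context on why the docs steer users toward streaming, here is a minimal sketch of importing a `pyarrow.Table` through the stream interface instead of as a bulk `Vec<RecordBatch>`. The module paths (`arrow_pyarrow`, `arrow_array::ffi_stream`) and the error mapping are assumptions, not code from this PR:

```rust
use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Counts rows of any Python object exposing the Arrow stream interface
/// (e.g. a pyarrow.Table) without materializing a Vec<RecordBatch> in Rust.
#[pyfunction]
fn count_rows(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    let mut rows = 0;
    for batch in reader.0 {
        // Each item is a Result<RecordBatch, ArrowError>; batches arrive one at a time.
        rows += batch
            .map_err(|e| PyValueError::new_err(e.to_string()))?
            .num_rows();
    }
    Ok(rows)
}
```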
Code Review - PR #14: Implement Vec<RecordBatch> wrapper for pyarrow.Table convenience

Summary
This PR introduces a `Table` wrapper type that simplifies converting between Rust's `Vec<RecordBatch>` and PyArrow's `Table` type. The implementation is generally solid, but there are several areas that need attention before merging.

Critical Issues
1. Commented-out error message (arrow-pyarrow/src/lib.rs:537-538): There's a commented-out line that should be removed at arrow-pyarrow/src/lib.rs:537.
2. TODO comment about the schema_equals function (arrow-pyarrow/src/lib.rs:517-521): The PR includes unresolved TODOs and duplicated code from pyo3_arrow/utils.rs. Recommendation:

Major Concerns
3. Schema validation uses a custom equality check: The `schema_equals` function only compares field names and data types, but ignores field metadata, field nullability, and schema-level metadata. Code location: arrow-pyarrow/src/lib.rs:522-531. Issue: this could allow creating `Table` instances whose batches have different nullability or metadata, which might cause issues downstream. Recommendation:
4. Missing public API documentation: The `Table` struct and its methods lack doc comments explaining when to use `Table` vs `ArrowArrayStreamReader`, thread safety guarantees, and the memory implications of holding a `Vec<RecordBatch>` in memory.
5. `Arc::unwrap_or_clone` behavior (arrow-pyarrow/src/lib.rs:597-599): `Arc::unwrap_or_clone` will clone the schema if there are multiple references, which could be unexpected. Consider documenting this behavior (see the sketch after this review).

Minor Issues / Suggestions

Positive Aspects

Recommendations Summary
Before merging:

Nice to have:

Overall, this is a solid contribution that adds useful functionality. With the critical issues addressed, this will be a valuable addition to the arrow-pyarrow crate. Great work on the comprehensive testing!
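As a side note on point 5 above: a minimal standalone sketch (not from the PR, standard library only) of how `Arc::unwrap_or_clone` behaves — it moves the value out when the `Arc` is the sole owner and only deep-clones when other references exist:

```rust
use std::sync::Arc;

fn main() {
    let a = Arc::new(String::from("schema"));
    // Sole owner: unwrap_or_clone moves the inner value out without copying.
    let moved = Arc::unwrap_or_clone(a);

    let b = Arc::new(String::from("schema"));
    let _extra_ref = Arc::clone(&b);
    // Another reference exists, so the inner value is cloned instead.
    let cloned = Arc::unwrap_or_clone(b);

    assert_eq!(moved, cloned);
}
```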
Actionable comments posted: 1
📜 Review details
📒 Files selected for processing (3)
- arrow-pyarrow-integration-testing/src/lib.rs (3 hunks)
- arrow-pyarrow-integration-testing/tests/test_sql.py (1 hunk)
- arrow-pyarrow/src/lib.rs (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
arrow-pyarrow-integration-testing/src/lib.rs (1)
- arrow-pyarrow/src/lib.rs (3): record_batches (554-556), schema (558-560), try_new (514-552)

arrow-pyarrow-integration-testing/tests/test_sql.py (3)
- arrow-pyarrow/src/lib.rs (1): schema (558-560)
- arrow-schema/src/ffi.rs (1): metadata (335-399)
- arrow-pyarrow-integration-testing/src/lib.rs (2): build_table (149-160), round_trip_table (144-146)
🪛 Ruff (0.14.3)
arrow-pyarrow-integration-testing/tests/test_sql.py
677-677: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: claude-review
- GitHub Check: codex
```rust
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    left.fields
        .iter()
        .zip(right.fields.iter())
        .all(|(left_field, right_field)| {
            left_field.name() == right_field.name()
                && left_field
                    .data_type()
                    .equals_datatype(right_field.data_type())
        })
}

// ...

    for record_batch in &record_batches {
        if !schema_equals(&schema, &record_batch.schema()) {
            return Err(ArrowError::SchemaError(
                //"All record batches must have the same schema.".to_owned(),
                format!(
                    "All record batches must have the same schema. \
                     Expected schema: {:?}, got schema: {:?}",
                    schema,
                    record_batch.schema()
```
Add field-count equality check in schema_equals
schema_equals zips the field iterators but never verifies they have the same length, so a batch with fewer columns than the provided schema still returns true and Table::try_new accepts it. That yields a Table whose stored schema disagrees with the batches and will fail later when exporting to PyArrow. Please reject mismatched field counts before the zip comparison.
```diff
 fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
+    if left.fields.len() != right.fields.len() {
+        return false;
+    }
     left.fields
         .iter()
         .zip(right.fields.iter())
```

🤖 Prompt for AI Agents
In arrow-pyarrow/src/lib.rs around lines 523 to 543, schema_equals currently
zips the two field iterators but doesn't verify they have the same length,
allowing schemas with mismatched field counts to be considered equal; modify
schema_equals to first compare left.fields.len() and right.fields.len() and
return false if they differ, then proceed with the existing zip/all comparison
so only schemas with identical field counts and matching names/types are
considered equal.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
@martin-augment Thank you for confirming the issue! I'm glad the review caught this bug before it could cause problems downstream when exporting to PyArrow. The field-count check will ensure that only truly equivalent schemas are accepted by Table::try_new. 🎯
Happy to help! 🎉
value:useful; category:bug; feedback:The Codex AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
value:annoying; category:bug; feedback:The Codex AI reviewer is commenting on a file that is not part of this Pull Request.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that there is a commented-out line that could be removed, but this is not a critical issue. The commented-out line does not affect the behavior of the code in any way.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that this TODO should be addressed soon, because the current custom implementation is not correct: it uses .zip(), which does not compare the iterators' lengths.
8790: To review by AI
Note
Introduce a `Table` wrapper over `Vec<RecordBatch>` with PyArrow conversions, plus Python bindings and tests for table roundtrips and schema validation.

- `Table` wrapper: encapsulates `Vec<RecordBatch>` with `try_new` (schema consistency check), accessors, and `into_inner` (see the usage sketch below).
- `FromPyArrow`/`IntoPyArrow` for `Table`; support `TryFrom<Box<dyn RecordBatchReader>>`; export to `pyarrow.Table` via `Table.from_batches` with `schema`.
- Crate docs: document the `pyarrow.Table` conversion; add guidance favoring streams over bulk.
- Python bindings (`arrow-pyarrow-integration-testing/src/lib.rs`): `round_trip_table(table)` and `build_table(record_batches, schema)`; register in module exports.
- Tests (`tests/test_sql.py`):

Written by Cursor Bugbot for commit af2156d. This will update automatically on new commits.
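A hypothetical usage sketch based only on the API summary above; the module path (`arrow_pyarrow::Table`), the exact signatures, and the return types are assumptions and may not match the merged code:

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // try_new is described as validating that every batch matches the schema.
    let table = arrow_pyarrow::Table::try_new(vec![batch], schema)?;
    assert_eq!(table.record_batches().len(), 1);

    // Hand the batches (and schema) back, e.g. before exporting to Python
    // as a pyarrow.Table via IntoPyArrow.
    let _inner = table.into_inner();
    Ok(())
}
```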