8790: Implement a Vec<RecordBatch> wrapper for pyarrow.Table convenience #14
martin-augment wants to merge 3 commits into main
Conversation
Commits:
- CQ fixes
- CQ fix
- CQ fix
- Let `Table` be a combination of `Vec<RecordBatch>` and `SchemaRef` instead
- `cargo fmt`
- Overhauled `Table` definition, Added tests
- Add empty `Table` integration test
- Update `arrow-pyarrow`'s crate documentation
- Overhaul documentation even more
- Typo fix
```rust
        .data_type()
        .equals_datatype(right_field.data_type())
    })
}
```
Bug: Incomplete Field Validation
The schema_equals function incorrectly returns true when comparing schemas with different field counts if all fields in the shorter schema match. The zip iterator stops at the shorter of the two iterators, so comparing a 3-field schema against a 2-field schema only checks the first 2 fields. Schemas with different field counts are not equal and should fail validation.
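To make the failure mode concrete, here is a tiny standalone illustration (not from the PR) of how `Iterator::zip` silently ignores the extra elements of the longer side:

```rust
fn main() {
    let left = ["a", "b", "c"]; // three fields
    let right = ["a", "b"];     // two fields
    // zip stops as soon as the shorter iterator is exhausted,
    // so only two pairs are ever compared and the third field is never checked.
    let compared = left.iter().zip(right.iter()).count();
    assert_eq!(compared, 2);
}
```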
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
```rust
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    left.fields
        .iter()
        .zip(right.fields.iter())
```
The schema_equals helper zips field iterators without checking field counts, so schemas with differing numbers of fields could incorrectly be treated as equal; consider verifying left.fields.len() == right.fields.len() before zipping. This would prevent accepting mismatched schemas in Table::try_new.
value:useful; category:bug; feedback:The Augment AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
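For reference, a minimal sketch of the fix both reviewers describe (the same change appears as a suggested diff further down). It rejects differing field counts before the per-field comparison, though it still ignores nullability and metadata:

```rust
use arrow_schema::SchemaRef;

// Sketch of the corrected helper: schemas with different field counts can
// never be equal, so bail out before zipping the field iterators.
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    if left.fields().len() != right.fields().len() {
        return false;
    }
    left.fields()
        .iter()
        .zip(right.fields().iter())
        .all(|(left_field, right_field)| {
            left_field.name() == right_field.name()
                && left_field
                    .data_type()
                    .equals_datatype(right_field.data_type())
        })
}
```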
```rust
//! have `Vec<RecordBatch>` on the Rust side and want to export that in bulk as a `pyarrow.Table`.
//! In general, it is recommended to use streaming approaches instead of dealing with data in bulk.
//! For example, a `pyarrow.Table` (or any other object that implements the ArrayStream PyCapsule
//! interface) can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>>` instead of
```
Minor doc typo: the generic in PyArrowType<ArrowArrayStreamReader>> has an extra closing >; should be PyArrowType<ArrowArrayStreamReader>.
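For context on why the docs steer users toward streaming, here is a minimal sketch of importing a `pyarrow.Table` through the stream interface instead of as a bulk `Vec<RecordBatch>`. The module paths (`arrow_pyarrow`, `arrow_array::ffi_stream`) and the error mapping are assumptions, not code from this PR:

```rust
use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Counts rows of any Python object exposing the Arrow stream interface
/// (e.g. a pyarrow.Table) without materializing a Vec<RecordBatch> in Rust.
#[pyfunction]
fn count_rows(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    let mut rows = 0;
    for batch in reader.0 {
        // Each item is a Result<RecordBatch, ArrowError>; batches arrive one at a time.
        rows += batch
            .map_err(|e| PyValueError::new_err(e.to_string()))?
            .num_rows();
    }
    Ok(rows)
}
```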
Code Review - PR #14: Implement Vec<RecordBatch> wrapper for pyarrow.Table convenience

Summary
This PR introduces a `Table` wrapper type that simplifies converting between Rust's `Vec<RecordBatch>` and PyArrow's `Table` type. The implementation is generally solid, but there are several areas that need attention before merging.

Critical Issues
1. Commented-out error message (arrow-pyarrow/src/lib.rs:537-538): There's a commented-out line that should be removed at arrow-pyarrow/src/lib.rs:537.
2. TODO comment about the schema_equals function (arrow-pyarrow/src/lib.rs:517-521): The PR includes unresolved TODOs and duplicated code from pyo3_arrow/utils.rs. Recommendation:

Major Concerns
3. Schema validation uses a custom equality check: The `schema_equals` function only compares field names and data types, but ignores field metadata, field nullability, and schema-level metadata. Code location: arrow-pyarrow/src/lib.rs:522-531. Issue: this could allow creating `Table` instances whose batches have different nullability or metadata, which might cause issues downstream. Recommendation:
4. Missing public API documentation: The `Table` struct and its methods lack doc comments explaining when to use `Table` vs `ArrowArrayStreamReader`, thread safety guarantees, and the memory implications of holding a `Vec<RecordBatch>` in memory.
5. `Arc::unwrap_or_clone` behavior (arrow-pyarrow/src/lib.rs:597-599): `Arc::unwrap_or_clone` will clone the schema if there are multiple references, which could be unexpected. Consider documenting this behavior (see the sketch after this review).

Minor Issues / Suggestions

Positive Aspects

Recommendations Summary
Before merging:

Nice to have:

Overall, this is a solid contribution that adds useful functionality. With the critical issues addressed, this will be a valuable addition to the arrow-pyarrow crate. Great work on the comprehensive testing!
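As a side note on point 5 above: a minimal standalone sketch (not from the PR, standard library only) of how `Arc::unwrap_or_clone` behaves — it moves the value out when the `Arc` is the sole owner and only deep-clones when other references exist:

```rust
use std::sync::Arc;

fn main() {
    let a = Arc::new(String::from("schema"));
    // Sole owner: unwrap_or_clone moves the inner value out without copying.
    let moved = Arc::unwrap_or_clone(a);

    let b = Arc::new(String::from("schema"));
    let _extra_ref = Arc::clone(&b);
    // Another reference exists, so the inner value is cloned instead.
    let cloned = Arc::unwrap_or_clone(b);

    assert_eq!(moved, cloned);
}
```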
Actionable comments posted: 1
📜 Review details
📒 Files selected for processing (3)
- arrow-pyarrow-integration-testing/src/lib.rs (3 hunks)
- arrow-pyarrow-integration-testing/tests/test_sql.py (1 hunk)
- arrow-pyarrow/src/lib.rs (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
arrow-pyarrow-integration-testing/src/lib.rs (1)
- arrow-pyarrow/src/lib.rs (3): record_batches (554-556), schema (558-560), try_new (514-552)

arrow-pyarrow-integration-testing/tests/test_sql.py (3)
- arrow-pyarrow/src/lib.rs (1): schema (558-560)
- arrow-schema/src/ffi.rs (1): metadata (335-399)
- arrow-pyarrow-integration-testing/src/lib.rs (2): build_table (149-160), round_trip_table (144-146)
🪛 Ruff (0.14.3)
arrow-pyarrow-integration-testing/tests/test_sql.py
677-677: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: claude-review
- GitHub Check: codex
```rust
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    left.fields
        .iter()
        .zip(right.fields.iter())
        .all(|(left_field, right_field)| {
            left_field.name() == right_field.name()
                && left_field
                    .data_type()
                    .equals_datatype(right_field.data_type())
        })
}

// ...

    for record_batch in &record_batches {
        if !schema_equals(&schema, &record_batch.schema()) {
            return Err(ArrowError::SchemaError(
                //"All record batches must have the same schema.".to_owned(),
                format!(
                    "All record batches must have the same schema. \
                     Expected schema: {:?}, got schema: {:?}",
                    schema,
                    record_batch.schema()
```
Add field-count equality check in schema_equals
schema_equals zips the field iterators but never verifies they have the same length, so a batch with fewer columns than the provided schema still returns true and Table::try_new accepts it. That yields a Table whose stored schema disagrees with the batches and will fail later when exporting to PyArrow. Please reject mismatched field counts before the zip comparison.
```diff
 fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
+    if left.fields.len() != right.fields.len() {
+        return false;
+    }
     left.fields
         .iter()
         .zip(right.fields.iter())
```

🤖 Prompt for AI Agents
In arrow-pyarrow/src/lib.rs around lines 523 to 543, schema_equals currently
zips the two field iterators but doesn't verify they have the same length,
allowing schemas with mismatched field counts to be considered equal; modify
schema_equals to first compare left.fields.len() and right.fields.len() and
return false if they differ, then proceed with the existing zip/all comparison
so only schemas with identical field counts and matching names/types are
considered equal.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
@martin-augment Thank you for confirming the issue! I'm glad the review caught this bug before it could cause problems downstream when exporting to PyArrow. The field-count check will ensure that only truly equivalent schemas are accepted by Table::try_new. 🎯
Happy to help! 🎉
value:useful; category:bug; feedback:The Codex AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.
value:annoying; category:bug; feedback:The Codex AI reviewer is commenting on a file that is not part of this Pull Request.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that there is a commented-out line that could be removed, but this is not a critical issue. The commented-out line does not affect the behavior of the code in any way.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that this TODO should be addressed soon, because the current custom implementation is not correct: it uses .zip(), which does not compare the iterators' lengths.
8790: To review by AI
Note
Introduce a `Table` wrapper over `Vec<RecordBatch>` with PyArrow conversions, plus Python bindings and tests for table roundtrips and schema validation.

- `Table` wrapper: encapsulates `Vec<RecordBatch>` with `try_new` (schema consistency check), accessors, and `into_inner` (see the usage sketch below).
- `FromPyArrow`/`IntoPyArrow` for `Table`; support `TryFrom<Box<dyn RecordBatchReader>>`; export to `pyarrow.Table` via `Table.from_batches` with `schema`.
- Crate docs: document the `pyarrow.Table` conversion; add guidance favoring streams over bulk.
- Python bindings (`arrow-pyarrow-integration-testing/src/lib.rs`): `round_trip_table(table)` and `build_table(record_batches, schema)`; register in module exports.
- Tests (`tests/test_sql.py`):

Written by Cursor Bugbot for commit af2156d. This will update automatically on new commits.
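A hypothetical usage sketch based only on the API summary above; the module path (`arrow_pyarrow::Table`), the exact signatures, and the return types are assumptions and may not match the merged code:

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // try_new is described as validating that every batch matches the schema.
    let table = arrow_pyarrow::Table::try_new(vec![batch], schema)?;
    assert_eq!(table.record_batches().len(), 1);

    // Hand the batches (and schema) back, e.g. before exporting to Python
    // as a pyarrow.Table via IntoPyArrow.
    let _inner = table.into_inner();
    Ok(())
}
```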