
8790: Implement a Vec&lt;RecordBatch&gt; wrapper for pyarrow.Table convenience #14

Open
martin-augment wants to merge 3 commits into main from pr-8790-2025-11-10-12-35-47

Conversation


@martin-augment martin-augment commented Nov 10, 2025

8790: To review by AI


Note

Introduce a Table wrapper over Vec<RecordBatch> with PyArrow conversions, plus Python bindings and tests for table roundtrips and schema validation.

  • arrow-pyarrow (core):
    • New Table wrapper: Encapsulates Vec<RecordBatch> with try_new (schema consistency check), accessors, and into_inner (a usage sketch follows this list).
    • Conversions: Implement FromPyArrow/IntoPyArrow for Table; support TryFrom<Box<dyn RecordBatchReader>>; export to pyarrow.Table via Table.from_batches with schema.
    • Docs: Extend PyArrow↔arrow-rs type mapping to include pyarrow.Table; add guidance favoring streams over bulk.
  • Integration bindings (arrow-pyarrow-integration-testing/src/lib.rs):
    • Add round_trip_table(table) and build_table(record_batches, schema); register in module exports.
  • Tests (tests/test_sql.py):
    • Add table tests: empty, roundtrip, from_batches, and inconsistent schema error; ensure schema/metadata and batch counts match.
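
A minimal usage sketch from the Rust side, assuming a `Table::try_new(record_batches, schema)` argument order, the accessors listed above, and an `arrow_pyarrow::Table` export path; the signatures are inferred from this summary, not copied from the PR:

    use std::sync::Arc;

    use arrow_array::{ArrayRef, Int32Array, RecordBatch};
    use arrow_pyarrow::Table; // assumed export path for the new wrapper
    use arrow_schema::{ArrowError, DataType, Field, Schema};

    fn build_example_table() -> Result<Table, ArrowError> {
        let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
        let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
        let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

        // try_new validates that every batch's schema matches the provided schema.
        let table = Table::try_new(vec![batch], schema)?;
        assert_eq!(table.record_batches().len(), 1);

        // into_inner hands the batches back for further processing, while the
        // IntoPyArrow impl exports the whole table as a pyarrow.Table.
        Ok(table)
    }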

Written by Cursor Bugbot for commit af2156d. This will update automatically on new commits.

CQ fixes

CQ fix

CQ fix

Let `Table` be a combination of `Vec<RecordBatch>` and `SchemaRef` instead

`cargo fmt`

Overhauled `Table` definition, Added tests

Add empty `Table` integration test

Update `arrow-pyarrow`'s crate documentation

Overhaul documentation even more

Typo fix

coderabbitai bot commented Nov 10, 2025

Walkthrough

This change adds a new Table struct to the arrow-pyarrow crate with schema validation and Python interoperability, exposes Rust functions for table round-trips and construction in the integration testing module, and adds Python test coverage for table operations across the boundary.

Changes

  • Arrow PyArrow Type Infrastructure (arrow-pyarrow/src/lib.rs): Introduces the public Table struct with fields for record batches and schema, adds a try_new() constructor with schema validation, implements accessors (record_batches(), schema(), into_inner()), and adds trait implementations (TryFrom<Box<dyn RecordBatchReader>>, FromPyArrow, IntoPyArrow) for Python conversion. Also adds a PyArrowException alias.
  • Integration Testing Module Functions (arrow-pyarrow-integration-testing/src/lib.rs): Adds a round_trip_table() function that returns the same Table object and a build_table() function that constructs a Table from record batches and a schema. Updates module initialization to register both functions (a rough sketch of both follows this list).
  • Python Test Coverage (arrow-pyarrow-integration-testing/tests/test_sql.py): Adds four test functions, test_table_empty(), test_table_roundtrip(), test_table_from_batches(), and test_table_error_inconsistent_schema(), that validate Python-to-Rust-to-Python round-trips and schema consistency. Note: each test appears to be defined twice in the file.
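
A rough pyo3 sketch of what the two integration functions might look like, assuming PyArrowType from arrow-pyarrow wraps the new Table and that Table::try_new takes the batches plus a schema; the import paths, bodies, and error handling here are guesses rather than the PR's actual code:

    use std::sync::Arc;

    use arrow_array::RecordBatch;
    use arrow_pyarrow::{PyArrowType, Table}; // paths assumed
    use arrow_schema::Schema;
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;

    // Accepts a pyarrow.Table, converts it to the Rust wrapper, and hands it straight back.
    #[pyfunction]
    fn round_trip_table(table: PyArrowType<Table>) -> PyResult<PyArrowType<Table>> {
        Ok(table)
    }

    // Builds a Table from individual record batches plus an explicit schema.
    #[pyfunction]
    fn build_table(
        record_batches: Vec<PyArrowType<RecordBatch>>,
        schema: PyArrowType<Schema>,
    ) -> PyResult<PyArrowType<Table>> {
        let batches: Vec<RecordBatch> = record_batches.into_iter().map(|b| b.0).collect();
        let table = Table::try_new(batches, Arc::new(schema.0))
            .map_err(|e| PyValueError::new_err(e.to_string()))?;
        Ok(PyArrowType(table))
    }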

                    .data_type()
                    .equals_datatype(right_field.data_type())
        })
}

Bug: Incomplete Field Validation

The schema_equals function incorrectly returns true when comparing schemas with different field counts if all fields in the shorter schema match. The zip iterator stops at the shorter of the two iterators, so comparing a 3-field schema against a 2-field schema only checks the first 2 fields. Schemas with different field counts are not equal and should fail validation.
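
A tiny, self-contained illustration of the pitfall (not code from the PR): zip truncates to the shorter iterator, so an all-pairs check can pass even when one side has extra elements.

    fn main() {
        let left = ["a", "b", "c"];
        let right = ["a", "b"];

        // zip produces only two pairs, so the extra "c" on the left is never compared.
        let naive_equal = left.iter().zip(right.iter()).all(|(l, r)| l == r);
        assert!(naive_equal); // passes despite the length mismatch

        // Comparing lengths first gives the intended result.
        let strict_equal =
            left.len() == right.len() && left.iter().zip(right.iter()).all(|(l, r)| l == r);
        assert!(!strict_equal);
    }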


Owner Author

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    left.fields
        .iter()
        .zip(right.fields.iter())

The schema_equals helper zips field iterators without checking field counts, so schemas with differing numbers of fields could incorrectly be treated as equal; consider verifying left.fields.len() == right.fields.len() before zipping. This would prevent accepting mismatched schemas in Table::try_new.


Owner Author

value:useful; category:bug; feedback:The Augment AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.

//! have `Vec<RecordBatch>` on the Rust side and want to export that in bulk as a `pyarrow.Table`.
//! In general, it is recommended to use streaming approaches instead of dealing with data in bulk.
//! For example, a `pyarrow.Table` (or any other object that implements the ArrayStream PyCapsule
//! interface) can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>>` instead of

Minor doc typo: the generic in PyArrowType<ArrowArrayStreamReader>> has an extra closing >; should be PyArrowType<ArrowArrayStreamReader>.



claude bot commented Nov 10, 2025

Code Review - PR #14: Implement Vec&lt;RecordBatch&gt; wrapper for pyarrow.Table convenience

Summary

This PR introduces a Table wrapper type that simplifies converting between Rust's Vec&lt;RecordBatch&gt; and PyArrow's Table type. The implementation is generally solid, but there are several areas that need attention before merging.


Critical Issues

1. Commented-out error message (arrow-pyarrow/src/lib.rs:537-538)

There's a commented-out line that should be removed at arrow-pyarrow/src/lib.rs:537.

2. TODO comment about schema_equals function (arrow-pyarrow/src/lib.rs:517-521)

The PR includes unresolved TODOs and duplicated code from pyo3_arrow/utils.rs.

Recommendation:

  • Investigate whether Schema::eq or SchemaRef::eq can be used directly
  • If the custom comparison is truly needed, document why
  • Consider if this belongs in a shared utility module
  • Resolve the TODO before merging or create a follow-up issue

Major Concerns

3. Schema validation uses custom equality check

The schema_equals function only compares field names and data types, but ignores field metadata, field nullability, and schema-level metadata.

Code location: arrow-pyarrow/src/lib.rs:522-531

Issue: This could allow creating Table instances where batches have different nullability or metadata, which might cause issues downstream.

Recommendation:

  • If there's a valid reason to ignore metadata/nullability, document it clearly
  • Otherwise, use a stricter comparison or the standard Schema::eq (a sketch follows this list)
  • Add tests for edge cases (different nullability, different metadata)
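
A hedged sketch of what a stricter comparison could look like (field count, nullability, and schema-level metadata included). Note that Schema's derived equality (*left == *right) should already cover all of these, so a custom helper is only worth keeping if the looser semantics are intentional:

    use arrow_schema::SchemaRef;

    // Stricter variant: equal field count, names, types, nullability, and schema metadata.
    fn schema_equals_strict(left: &SchemaRef, right: &SchemaRef) -> bool {
        left.fields().len() == right.fields().len()
            && left.metadata() == right.metadata()
            && left
                .fields()
                .iter()
                .zip(right.fields().iter())
                .all(|(l, r)| {
                    l.name() == r.name()
                        && l.is_nullable() == r.is_nullable()
                        && l.data_type().equals_datatype(r.data_type())
                })
    }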

4. Missing public API documentation

The Table struct and its methods lack doc comments explaining when to use Table versus ArrowArrayStreamReader, what the thread-safety guarantees are, and the memory implications of holding a full Vec&lt;RecordBatch&gt; in memory.

5. Arc::unwrap_or_clone behavior (arrow-pyarrow/src/lib.rs:597-599)

The Arc::unwrap_or_clone will clone the schema if there are multiple references, which could be unexpected. Consider documenting this behavior.
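
For readers unfamiliar with the call, Arc::unwrap_or_clone moves the inner value out when the reference count is exactly one and clones it otherwise; a minimal standard-library illustration (not code from the PR):

    use std::sync::Arc;

    fn main() {
        // Sole owner: the String is moved out, no clone happens.
        let sole = Arc::new(String::from("schema"));
        let moved: String = Arc::unwrap_or_clone(sole);
        assert_eq!(moved, "schema");

        // Shared: another Arc still points at the value, so it gets cloned.
        let shared = Arc::new(String::from("schema"));
        let _other = Arc::clone(&shared);
        let cloned: String = Arc::unwrap_or_clone(shared);
        assert_eq!(cloned, "schema");
    }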


Minor Issues / Suggestions

  • Test TODO comment suggests known metadata handling issue - create tracking issue
  • Error message uses debug formatting which can be verbose
  • Table derives Clone (potentially expensive) - document why this is needed
  • Consider API symmetry for method naming

Positive Aspects

  1. Good test coverage: Comprehensive tests covering empty tables, roundtrip conversion, schema validation, and error cases
  2. Documentation updates: Crate-level documentation properly updated with appropriate caveats
  3. Proper error handling: Schema validation performed upfront in try_new
  4. Follows existing patterns: FromPyArrow/IntoPyArrow implementations follow established patterns
  5. Integration testing: Both unit tests and integration tests with actual PyArrow objects

Recommendations Summary

Before merging:

  1. Remove commented-out code (line 537)
  2. Resolve or document the schema_equals TODO
  3. Add comprehensive doc comments to public API
  4. Clarify the metadata/nullability behavior in schema comparison
  5. Address or create tracking issue for metadata TODO in tests

Nice to have:

  • Additional tests for edge cases (metadata, nullability)
  • Consider API symmetry for accessing table components
  • Document Clone implications

Overall, this is a solid contribution that adds useful functionality. With the critical issues addressed, this will be a valuable addition to the arrow-pyarrow crate. Great work on the comprehensive testing!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd40a52 and af2156d.

📒 Files selected for processing (3)
  • arrow-pyarrow-integration-testing/src/lib.rs (3 hunks)
  • arrow-pyarrow-integration-testing/tests/test_sql.py (1 hunks)
  • arrow-pyarrow/src/lib.rs (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
arrow-pyarrow-integration-testing/src/lib.rs (1)
arrow-pyarrow/src/lib.rs (3)
  • record_batches (554-556)
  • schema (558-560)
  • try_new (514-552)
arrow-pyarrow-integration-testing/tests/test_sql.py (3)
arrow-pyarrow/src/lib.rs (1)
  • schema (558-560)
arrow-schema/src/ffi.rs (1)
  • metadata (335-399)
arrow-pyarrow-integration-testing/src/lib.rs (2)
  • build_table (149-160)
  • round_trip_table (144-146)
🪛 Ruff (0.14.3)
arrow-pyarrow-integration-testing/tests/test_sql.py

677-677: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: codex

Comment on lines +523 to +543
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
    left.fields
        .iter()
        .zip(right.fields.iter())
        .all(|(left_field, right_field)| {
            left_field.name() == right_field.name()
                && left_field
                    .data_type()
                    .equals_datatype(right_field.data_type())
        })
}

for record_batch in &record_batches {
    if !schema_equals(&schema, &record_batch.schema()) {
        return Err(ArrowError::SchemaError(
            //"All record batches must have the same schema.".to_owned(),
            format!(
                "All record batches must have the same schema. \
                 Expected schema: {:?}, got schema: {:?}",
                schema,
                record_batch.schema()
@coderabbitai coderabbitai bot Nov 10, 2025

⚠️ Potential issue | 🟠 Major

Add field-count equality check in schema_equals

schema_equals zips the field iterators but never verifies they have the same length, so a batch with fewer columns than the provided schema still returns true and Table::try_new accepts it. That yields a Table whose stored schema disagrees with the batches and will fail later when exporting to PyArrow. Please reject mismatched field counts before the zip comparison.

         fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
+            if left.fields.len() != right.fields.len() {
+                return false;
+            }
             left.fields
                 .iter()
                 .zip(right.fields.iter())
🤖 Prompt for AI Agents
In arrow-pyarrow/src/lib.rs around lines 523 to 543, schema_equals currently
zips the two field iterators but doesn't verify they have the same length,
allowing schemas with mismatched field counts to be considered equal; modify
schema_equals to first compare left.fields.len() and right.fields.len() and
return false if they differ, then proceed with the existing zip/all comparison
so only schemas with identical field counts and matching names/types are
considered equal.

Owner Author

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.


@martin-augment Thank you for confirming the issue! I'm glad the review caught this bug before it could cause problems downstream when exporting to PyArrow. The field-count check will ensure that only truly equivalent schemas are accepted by Table::try_new. 🎯

    (\(\
    ( -.-)
    o_(")(")
    
    Happy to help! 🎉

@github-actions

  • Blocker arrow-buffer/src/util/bit_util.rs:1 – This branch is missing the entire set of bitwise helpers (read_up_to_byte_from_offset, apply_bitwise_binary_op, etc.) and associated tests that landed on main in bd40a528…. Likewise, arrow-json/src/reader/binary_array.rs#L1 is deleted along with the new binary JSON documentation and tests. Merging this PR would revert those freshly added features; it needs to be rebased onto the current base before the new Table work can be reviewed safely.
  • Bug arrow-pyarrow/src/lib.rs:523 – schema_equals zips the field iterators without first checking that their lengths match. If the provided schema has extra fields, the zip stops at the shorter length and the function still returns true, letting a Table accept mismatched schemas. Downstream this will panic when exporting to PyArrow (the schema column count won’t match the batches). Please compare the lengths up front (and ideally reuse the existing schema equality helper) before accepting the batches.

@martin-augment
Owner Author

  • Bug arrow-pyarrow/src/lib.rs:523 – schema_equals zips the field iterators without first checking that their lengths match. If the provided schema has extra fields, the zip stops at the shorter length and the function still returns true, letting a Table accept mismatched schemas. Downstream this will panic when exporting to PyArrow (the schema column count won’t match the batches). Please compare the lengths up front (and ideally reuse the existing schema equality helper) before accepting the batches.

value:useful; category:bug; feedback:The Codex AI reviewer is correct that the implementation of the equals function is not correct! The finding prevents a wrong result when one of the schemas has more fields than the other.

@martin-augment
Owner Author

  • Blocker arrow-buffer/src/util/bit_util.rs:1 – This branch is missing the entire set of bitwise helpers (read_up_to_byte_from_offset, apply_bitwise_binary_op, etc.) and associated tests that landed on main in bd40a528…. Likewise, arrow-json/src/reader/binary_array.rs#L1 is deleted along with the new binary JSON documentation and tests. Merging this PR would revert those freshly added features; it needs to be rebased onto the current base before the new Table work can be reviewed safely.

value:annoying; category:bug; feedback:The Codex AI reviewer is commenting on files that are not part of this Pull Request.

@martin-augment
Owner Author

martin-augment commented Nov 10, 2025

1. Commented-out error message (arrow-pyarrow/src/lib.rs:537-538)

There's a commented-out line that should be removed at arrow-pyarrow/src/lib.rs:537.

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that there is a commented-out line that could be removed, but this is not a critical issue. The commented-out line does not affect the behavior of the code in any way.

@martin-augment
Copy link
Owner Author

2. TODO comment about schema_equals function (arrow-pyarrow/src/lib.rs:517-521)

The PR includes unresolved TODOs and duplicated code from pyo3_arrow/utils.rs.

Recommendation:

  • Investigate whether Schema::eq or SchemaRef::eq can be used directly
  • If the custom comparison is truly needed, document why
  • Consider if this belongs in a shared utility module
  • Resolve the TODO before merging or create a follow-up issue

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that this TODO should be addressed soon because the current custom implementation is not correct: it uses .zip(), which does not compare the iterators' lengths.
