8829: Allow Users to Provide Custom `ArrayFormatter`s when Pretty-Printing Record Batches by martin-augment · Pull Request #22 · martin-augment/arrow-rs

martin-augment · 2025-11-14T11:28:08Z

8829: To review by AI

…ture flags

coderabbitai · 2025-11-14T11:28:27Z

Walkthrough

This change introduces a customizable formatting mechanism for array display in the pretty-printing functionality. A new formatter_factory field is added to FormatOptions<'a>, enabling user-provided custom formatters for specific array types. The formatting infrastructure (DisplayIndex trait, FormatError enum, FormatResult type) is made public. The pretty-printing logic is updated to validate schema field counts, align columns with schema fields, and apply custom formatters when available. Comprehensive tests demonstrate custom formatting of batches and columns with schema overrides. No public API signatures are modified in pretty.rs.


cursor · 2025-11-14T11:29:20Z

arrow-cast/src/display.rs

+                (None, None) => true,
+                _ => false,
+            }
+    }


Bug: Broken Hash/Equality Contract

The PartialEq implementation for FormatOptions does not compare the types_info field, but the Hash implementation includes it. This violates Rust's hash/equality contract: if two values are equal according to PartialEq, they must produce the same hash. FormatOptions values differing only in types_info will be equal but have different hashes, causing issues in hash-based collections.

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The implementations of Hash and PartialEq need to be kept in sync. This prevents bugs when instances of such structs are used in HashMap/HashSet collections, which locate items by hash bucket before comparing them for equality.
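The contract discussed above can be illustrated with a minimal sketch. `Opts` is a hypothetical two-field stand-in, not the real `FormatOptions`, and `hash_of` is a helper invented for the illustration. The point is only that `eq` compares every field that `hash` feeds to the hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `FormatOptions`; only two fields are modelled.
#[derive(Debug, Clone, Copy)]
pub struct Opts {
    pub safe: bool,
    pub types_info: bool,
}

// `eq` compares every field that `hash` uses, so equal values hash equally.
impl PartialEq for Opts {
    fn eq(&self, other: &Self) -> bool {
        self.safe == other.safe && self.types_info == other.types_info
    }
}
impl Eq for Opts {}

impl Hash for Opts {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.safe.hash(state);
        self.types_info.hash(state);
    }
}

/// Illustrative helper: hashes a value with the standard library's default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

If `eq` dropped the `types_info` comparison while `hash` kept it, two values differing only in that field would compare equal yet could land in different buckets of a `HashMap`.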

claude · 2025-11-14T11:30:00Z

Code Review

Thank you for this PR! This is a well-designed feature that adds extensibility to the pretty-printing functionality. Here's my detailed review:

✅ Code Quality & Best Practices

Strengths:

Clean API design: The ArrayFormatterFactory trait provides a clear extension point for custom formatters
Excellent documentation: The example in the trait documentation (display.rs:289-370) is comprehensive and demonstrates real-world usage
Backward compatibility: The changes are additive and don't break existing APIs
Consistent naming: Follows Arrow-rs conventions well
Good test coverage: Three new tests cover the main use cases including edge cases

Suggestions:

Import organization (pretty.rs:25-28): The imports were reordered. While functional, ensure this follows the project's style guide (typically: std, external crates, internal crates)
Missing types_info in PartialEq impl (display.rs:107-128): The Hash implementation includes types_info but the PartialEq implementation does not. This violates Rust's hash/equality contract:
```
// Current PartialEq - missing types_info check
impl PartialEq for FormatOptions<'_> {
    fn eq(&self, other: &Self) -> bool {
        // ... missing self.types_info == other.types_info
    }
}
```
Impact: Two values that compare equal can hash differently, so lookups in hash-based collections (HashMap, HashSet) may miss existing entries or store duplicates.

🐛 Potential Bugs

Critical: Hash/Equality Mismatch (display.rs:94-128)
- The PartialEq impl doesn't compare types_info, but Hash does hash it
- Fix: Add && self.types_info == other.types_info to the eq method
Pointer comparison for trait objects (display.rs:107):
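- Using std::ptr::eq for trait object comparison is valid, but it compares identity rather than behavior, and for dyn references it compares the vtable metadata as well as the data address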
- Two functionally identical factories will compare as unequal if they're different instances
- Documentation note: Consider documenting this behavior more explicitly in the struct docs
Error message clarity (pretty.rs:212):
```
"Expected the same number of columns in a record batch as the number of fields in the schema"
```
Consider including the actual counts for better debugging
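A minimal sketch of the error-message suggestion follows. `check_column_count` is a hypothetical helper, and it returns a plain `String` error to stay dependency-free, whereas the real check in pretty.rs returns an `ArrowError`.

```rust
/// Hypothetical helper: verifies that a batch's column count matches the
/// schema's field count and reports both numbers on mismatch.
pub fn check_column_count(num_columns: usize, num_fields: usize) -> Result<(), String> {
    if num_columns != num_fields {
        return Err(format!(
            "Expected the same number of columns in a record batch ({num_columns}) \
             as the number of fields in the schema ({num_fields})"
        ));
    }
    Ok(())
}
```

With the counts inline, a caller can tell at a glance whether the schema or the batch is the wider side.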

⚡ Performance Considerations

Good: The factory method returns Option<ArrayFormatter>, allowing efficient fallback to default formatters without overhead
Minor concern (pretty.rs:216-226): The formatter creation happens inside the batch iteration loop:
```
for batch in results {
    let formatters = batch.columns().iter().zip(schema.fields().iter())
        .map(|(c, field)| match options.formatter_factory() { ... })
```
This is fine for typical use cases, but for very wide tables with many batches, consider if factory lookups could be cached
Trait object overhead: Using &dyn ArrayFormatterFactory introduces dynamic dispatch, but this is acceptable for formatting operations which are not performance-critical

🔒 Security Concerns

No significant security issues identified:

✅ No unsafe code introduced
✅ No unbounded allocations
✅ Input validation is present (schema field count check)
✅ Error handling is appropriate

🧪 Test Coverage

Good coverage provided:

test_format_batches_with_custom_formatters - Basic custom formatter usage
test_format_batches_with_custom_formatters_custom_schema_overrules_batch_schema - Schema override behavior
test_format_column_with_custom_formatters - Column formatting
test_pretty_format_batches_with_schema_with_wrong_number_of_fields - Error case

Suggestions for additional tests:

Test with nested types (structs, lists) to ensure custom formatters work recursively
Test behavior when factory returns Err(ArrowError) - currently not covered
Test with empty batches using custom formatters
Test thread safety if formatters are intended to be used concurrently

📝 Documentation

Excellent documentation overall, especially the trait example. Minor suggestions:

The FormatOptions equality documentation (display.rs:57-61) could mention the types_info field behavior once the bug is fixed
Consider adding a note about the lifetime requirements for the factory - users need to ensure the factory outlives the FormatOptions

🔧 Required Changes

Fix the PartialEq implementation to include types_info comparison

💡 Optional Improvements

Consider adding the actual field/column counts to the error message at pretty.rs:212
Consider documenting the pointer-equality semantics for formatter_factory more explicitly
Add a few more edge case tests as suggested above

Summary

This is a well-implemented feature with good design and documentation. The main issue is the hash/equality contract violation which should be fixed before merging. Once that's addressed, this will be a solid addition to Arrow-rs.

Recommendation: Request changes for the PartialEq/Hash bug, then approve after fix.

augmentcode

Review completed. 1 suggestion posted.


augmentcode · 2025-11-14T11:30:57Z

arrow-cast/src/pretty.rs

    }

    for batch in results {
+        let schema = schema_opt.as_ref().unwrap_or(batch.schema_ref());


unwrap_or(batch.schema_ref()) here returns a reference that may point to a temporary SchemaRef created by batch.schema_ref(), which doesn’t live long enough and is likely a compile-time borrow/lifetime error. Consider avoiding borrowing a temporary by using an owned SchemaRef instead of Option<&SchemaRef> at this point.


value:annoying; category:bug; feedback:The Augment AI reviewer is not correct! The schema is locally scoped in the loop, so its lifetime is just enough for the purpose of the code.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

arrow-cast/src/display.rs (1)

535-544: Public FormatError / FormatResult / DisplayIndex are useful; consider deriving Debug

Making FormatError, FormatResult, and DisplayIndex public is consistent with exposing ArrayFormatterFactory and allowing users to implement their own formatters:

FormatError clearly distinguishes between std::fmt::Error and ArrowError, and the From impls integrate it smoothly with existing code.

FormatResult = Result<(), FormatError> and the DisplayIndex trait give external implementations the same error semantics as the built‑in formatters.

One small improvement to consider: derive Debug for FormatError (and possibly mark it #[non_exhaustive] if you expect to extend it later). That would make it easier to log or surface detailed errors from custom formatters without hand‑rolling formatting.

Also applies to: 557-561
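The suggestion above can be sketched as follows. Everything here is an assumption made for illustration: `ArrowLikeError` stands in for `ArrowError`, the variant names `Format` and `Arrow` are guesses, and `write_positive` is an invented function.

```rust
use std::fmt::{self, Write as _};

/// Hypothetical stand-in for `ArrowError`.
#[derive(Debug)]
pub struct ArrowLikeError(pub String);

/// Sketch of a formatting error with `Debug` derived and room to grow.
#[derive(Debug)]
#[non_exhaustive]
pub enum FormatError {
    Format(fmt::Error),
    Arrow(ArrowLikeError),
}

impl From<fmt::Error> for FormatError {
    fn from(e: fmt::Error) -> Self {
        FormatError::Format(e)
    }
}

impl From<ArrowLikeError> for FormatError {
    fn from(e: ArrowLikeError) -> Self {
        FormatError::Arrow(e)
    }
}

pub type FormatResult = Result<(), FormatError>;

/// Invented example of a custom formatter body using both error sources.
pub fn write_positive(out: &mut String, v: i32) -> FormatResult {
    if v < 0 {
        // Domain error, converted through the `From<ArrowLikeError>` impl.
        return Err(ArrowLikeError(format!("negative value: {v}")).into());
    }
    // `write!` yields `fmt::Error`; `?` converts it through `From<fmt::Error>`.
    write!(out, "{v}")?;
    Ok(())
}
```

With `Debug` derived, a caller can log the failure via `{err:?}` without writing a formatter by hand.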

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f421d1 and d186741.

📒 Files selected for processing (2)

arrow-cast/src/display.rs (9 hunks)
arrow-cast/src/pretty.rs (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

arrow-cast/src/pretty.rs (1)

arrow-cast/src/display.rs (23)

try_new (481-483)

array_value_to_string (1287-1291)

create_display_index (368-373)

new (133-146)

new (474-476)

write (387-393)

write (560-560)

write (569-569)

write (579-581)

write (606-614)

write (618-621)

write (631-634)

write (673-676)

write (904-911)

write (915-934)

write (938-961)

write (1056-1059)

write (1063-1066)

write (1070-1076)

write (1080-1086)

write (1090-1096)

value (487-492)

with_formatter_factory (220-228)

arrow-cast/src/display.rs (1)

arrow-cast/src/pretty.rs (3)

Some (782-784)

Some (820-822)

create_display_index (1318-1342)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: claude-review

🔇 Additional comments (7)

arrow-cast/src/pretty.rs (4)

27-28: New imports align with later usage

Bringing Cell, Table, and Display into scope matches the use in create_table / create_column and the public pretty_* APIs; no issues here.

257-263: Column-level custom formatter wiring looks correct

The create_column logic mirrors the batch path:

If no formatter_factory is present, you delegate to ArrayFormatter::try_new as before.

If a factory is present, you call create_display_index(..., None) and:

Use the returned ArrayFormatter when it’s Ok(Some(_)).

Fall back to ArrayFormatter::try_new when it’s Ok(None).

Propagate any Err from the factory, which is reasonable for a user-provided component.

This is consistent with the behavior in create_table and should work well for custom per-column formatting.
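The three-way resolution described above can be sketched with simplified stand-ins. `Formatter`, `Factory`, `resolve`, and `MoneyFactory` are hypothetical names, a formatter is reduced to a boxed closure over `i32`, and errors are plain strings rather than `ArrowError`.

```rust
/// Hypothetical, simplified formatter: renders a single value.
pub type Formatter = Box<dyn Fn(i32) -> String>;

/// Hypothetical stand-in for `ArrayFormatterFactory`.
pub trait Factory {
    /// `Ok(Some(_))`: custom formatter. `Ok(None)`: use the default. `Err(_)`: hard error.
    fn create(&self, type_name: &str) -> Result<Option<Formatter>, String>;
}

fn default_formatter() -> Formatter {
    Box::new(|v: i32| v.to_string())
}

/// Mirrors the described logic: no factory or `Ok(None)` falls back to the default.
pub fn resolve(factory: Option<&dyn Factory>, type_name: &str) -> Result<Formatter, String> {
    match factory {
        None => Ok(default_formatter()),
        Some(f) => match f.create(type_name)? {
            Some(custom) => Ok(custom),
            None => Ok(default_formatter()),
        },
    }
}

/// Example factory that only customizes the hypothetical `my_money` type.
pub struct MoneyFactory;

impl Factory for MoneyFactory {
    fn create(&self, type_name: &str) -> Result<Option<Formatter>, String> {
        match type_name {
            "my_money" => {
                let f: Formatter = Box::new(|v: i32| format!("{v} EUR"));
                Ok(Some(f))
            }
            "broken" => Err("unsupported type".to_string()),
            _ => Ok(None),
        }
    }
}
```

Keeping `Ok(None)` distinct from `Err(_)` lets a factory decline a column without turning that into a failure.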

1521-1546: Good regression test for schema/column count mismatch

test_pretty_format_batches_with_schema_with_wrong_number_of_fields nicely exercises the new guard in create_table:

You pass a schema with two fields while the batch has only one column.

The test asserts an ArrowError::InvalidArgumentError, matching the new check on columns().len() != fields().len().

This should prevent silent misalignment between schemas and batches when pretty-printing.

188-227: This review comment is incorrect – the code type-checks correctly.

The reviewer misidentified the return type of batch.schema_ref(). It returns &SchemaRef (a reference), not SchemaRef (an owned value).

Type verification:

Line 209: schema_opt.as_ref().unwrap_or(batch.schema_ref())

schema_opt is Option<SchemaRef>

.as_ref() produces Option<&SchemaRef>

.unwrap_or() expects a default of type &SchemaRef

batch.schema_ref() (from arrow-array/src/record_batch.rs line 406) returns &SchemaRef ✓

The types align perfectly. The code compiles without issues. No fix is needed.

Likely an incorrect or invalid review comment.
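The type reasoning above can be reproduced in a self-contained sketch. `Schema`, `Batch`, and `schema_names` are hypothetical stand-ins. The only detail taken from the real crates is that `SchemaRef` is an `Arc<Schema>` and that `schema_ref()` returns a borrow of it.

```rust
use std::sync::Arc;

/// Hypothetical stand-in: a schema reduced to a name.
pub struct Schema {
    pub name: &'static str,
}
pub type SchemaRef = Arc<Schema>;

pub struct Batch {
    schema: SchemaRef,
}

impl Batch {
    pub fn new(schema: SchemaRef) -> Self {
        Batch { schema }
    }

    /// Returns a borrow of the stored `Arc`, not a freshly created value.
    pub fn schema_ref(&self) -> &SchemaRef {
        &self.schema
    }
}

/// Uses the override schema when present, otherwise each batch's own schema.
pub fn schema_names(schema_opt: Option<SchemaRef>, batches: &[Batch]) -> Vec<&'static str> {
    let mut names = Vec::new();
    for batch in batches {
        // Both operands are `&SchemaRef`; nothing temporary is borrowed here.
        let schema: &SchemaRef = schema_opt.as_ref().unwrap_or(batch.schema_ref());
        names.push(schema.name);
    }
    names
}
```

Because `schema_ref()` borrows from `batch`, the reference stays valid for the whole loop iteration.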

arrow-cast/src/display.rs (3)

57-62: FormatOptions formatter_factory integration and Eq/Hash semantics look consistent

The additions around FormatOptions are cohesive:

The new formatter_factory: Option<&'a dyn ArrayFormatterFactory> field is clearly documented as being compared by pointer identity.

PartialEq/Eq manually compare all existing fields by value and then:

Treat Some(f1) and Some(f2) as equal only if std::ptr::eq(f1, f2).

Distinguish Some(_) from None, which is what you want.

Hash mirrors this by hashing the raw pointer (*const dyn ArrayFormatterFactory) for the factory field, preserving the Eq/Hash invariant.

with_formatter_factory, without_formatter_factory, and formatter_factory() provide a straightforward API for configuring and querying the factory.

The new const getters (safe(), null(), *_format(), duration_format(), types_info()) are a nice improvement and keep the type const‑friendly.

Net effect: FormatOptions remains usable in maps/sets, with equality reflecting both formatting parameters and the specific factory instance, which is reasonable given the trait object. Just be aware that two factories with identical behavior but different identities will not compare equal, as the docs already call out.

Also applies to: 83-86, 94-129, 131-147, 219-288
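The identity-based Eq/Hash pairing described above can be sketched as follows. This is an illustrative variant, not the PR's code: `Factory`, `Holder`, `Money`, and `hash_of` are hypothetical, and the sketch compares only the data address (ignoring vtable metadata), whereas the PR is described as using `std::ptr::eq`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `ArrayFormatterFactory`.
pub trait Factory {}

/// Hypothetical stand-in for `FormatOptions<'a>`; only the factory field is modelled.
pub struct Holder<'a> {
    pub factory: Option<&'a dyn Factory>,
}

/// Data address of the factory, ignoring vtable metadata.
fn addr(f: &dyn Factory) -> *const () {
    (f as *const dyn Factory).cast::<()>()
}

impl PartialEq for Holder<'_> {
    fn eq(&self, other: &Self) -> bool {
        match (self.factory, other.factory) {
            (Some(a), Some(b)) => addr(a) == addr(b),
            (None, None) => true,
            _ => false,
        }
    }
}
impl Eq for Holder<'_> {}

impl Hash for Holder<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly what `eq` compares, so equal values hash equally.
        self.factory.map(addr).hash(state);
    }
}

/// Non-zero-sized example factory, so distinct instances have distinct addresses.
pub struct Money(pub u8);
impl Factory for Money {}

/// Illustrative helper: hashes a value with the default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Two `Holder`s pointing at the same `Money` instance compare equal and hash equally. Two holders pointing at separate instances with identical contents do not, which is the behavior the review asks to have documented.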

291-374: ArrayFormatterFactory trait shape is appropriate for extensible formatting

The new ArrayFormatterFactory trait and its documentation are well-aligned with the custom-formatting use cases:

The signature create_display_index<'formatter>(&self, array, options, field) gives implementors:

The raw array,

The active FormatOptions (including duration/time formats), and

Optional Field metadata (for extension types, logical names, etc.).

Returning Result<Option<ArrayFormatter<'formatter>>, ArrowError> cleanly encodes:

Ok(Some(formatter)) → use custom formatter.

Ok(None) → fall back to the default formatter.

Err(e) → propagate a hard error, which is appropriate for user-provided factories.

The example in the doc comment (custom my_money formatter) matches how the trait is used from pretty.rs and is a good guide for users.

This provides a clear, minimal surface for plugging in custom display logic without disrupting the existing formatting pipeline.

472-483: ArrayFormatter::new constructor cleanly centralizes formatter wiring

Introducing ArrayFormatter::new and having try_new delegate to it is a nice cleanup:

new encapsulates the (format, safe) pairing, which is exactly what callers like custom factories need when they already have a Box<dyn DisplayIndex>.

try_new now simply calls make_formatter(array, options)? and forwards options.safe to new, preserving the previous behavior for default formatting.

This keeps the public API coherent while enabling external factories to build ArrayFormatter instances in a first-class way.
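The `new` / `try_new` split described above can be sketched with simplified stand-ins. The `DisplayIndex` trait here is reduced to a string-based signature, `IntDisplay` and `Formatter` are hypothetical, and the way `safe` is interpreted (render the error inline instead of returning it) is an assumption of the sketch.

```rust
/// Hypothetical, simplified stand-in for the `DisplayIndex` trait.
pub trait DisplayIndex {
    fn write(&self, idx: usize, out: &mut String) -> Result<(), String>;
}

/// Default display logic for a slice of integers.
struct IntDisplay<'a>(&'a [i32]);

impl DisplayIndex for IntDisplay<'_> {
    fn write(&self, idx: usize, out: &mut String) -> Result<(), String> {
        let v = self
            .0
            .get(idx)
            .ok_or_else(|| format!("index {idx} out of bounds"))?;
        out.push_str(&v.to_string());
        Ok(())
    }
}

pub struct Formatter<'a> {
    format: Box<dyn DisplayIndex + 'a>,
    safe: bool,
}

impl<'a> Formatter<'a> {
    /// Single place where the (format, safe) pair is wired together;
    /// custom factories that already hold a boxed `DisplayIndex` call this.
    pub fn new(format: Box<dyn DisplayIndex + 'a>, safe: bool) -> Self {
        Self { format, safe }
    }

    /// Default path: build the built-in display logic, then delegate to `new`.
    /// Infallible in this sketch; the `Result` only mirrors the described shape.
    pub fn try_new(values: &'a [i32], safe: bool) -> Result<Self, String> {
        let format: Box<dyn DisplayIndex + 'a> = Box::new(IntDisplay(values));
        Ok(Self::new(format, safe))
    }

    /// In this sketch, `safe` renders errors inline instead of returning them.
    pub fn value(&self, idx: usize) -> Result<String, String> {
        let mut out = String::new();
        match self.format.write(idx, &mut out) {
            Ok(()) => Ok(out),
            Err(e) if self.safe => Ok(format!("ERROR: {e}")),
            Err(e) => Err(e),
        }
    }
}
```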

coderabbitai · 2025-11-14T11:37:33Z

arrow-cast/src/pretty.rs

+    use std::collections::HashMap;
    use std::fmt::Write;
    use std::sync::Arc;

    use half::f16;

    use arrow_array::builder::*;
+    use arrow_array::cast::AsArray;
    use arrow_array::types::*;
    use arrow_array::*;
    use arrow_buffer::{IntervalDayTime, IntervalMonthDayNano, ScalarBuffer};
    use arrow_schema::*;

-    use crate::display::{DurationFormat, array_value_to_string};
+    use crate::display::{
+        ArrayFormatterFactory, DisplayIndex, DurationFormat, array_value_to_string,
+    };

    use super::*;


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Custom formatter tests are solid; avoid &TestFormatters {} temporaries

The custom-formatting scaffolding (TestFormatters, MyMoneyFormatter, MyInt32Formatter) and the associated tests cover both:

Extension-type–driven formatting (my_money via field metadata).

Type-driven fallback formatting for plain Int32 columns.

That’s great coverage. One concern is how the formatter factory is wired into FormatOptions in several tests:

```
let options = FormatOptions::new()
    .with_null("<NULL>")
    .with_formatter_factory(&TestFormatters {});
// ...
let options = FormatOptions::new().with_formatter_factory(&TestFormatters {});
// ...
FormatOptions::default().with_formatter_factory(&TestFormatters {})
```

These calls take a reference to a temporary TestFormatters value (&TestFormatters {}) and store it inside FormatOptions<'a>. This pattern is easy to run afoul of the borrow checker because the temporary would normally be dropped at the end of the statement, while options is used afterwards. Even if lifetime extension happens to make this compile in some cases, it’s non-obvious and fragile.

Safer, more idiomatic options:

Bind a named value per test:

```
let test_formatters = TestFormatters {};
let options = FormatOptions::new()
    .with_null("<NULL>")
    .with_formatter_factory(&test_formatters);
```

Or define a shared static for all tests:

```
#[derive(Debug)]
struct TestFormatters;

static TEST_FORMATTERS: TestFormatters = TestFormatters {};

let options = FormatOptions::new()
    .with_null("<NULL>")
    .with_formatter_factory(&TEST_FORMATTERS);
```

I recommend refactoring the tests to use one of these patterns so the lifetime of the factory is clearly tied to the lifetime of FormatOptions.

Also applies to: 1313-1519

🏁 Script executed:

```
#!/bin/bash
# Find TestFormatters struct and verify usage pattern
echo "=== Searching for TestFormatters struct ==="
rg -n "struct TestFormatters" arrow-cast/src/pretty.rs -A 5

echo -e "\n=== Searching for with_formatter_factory calls ==="
rg -n "with_formatter_factory.*TestFormatters" arrow-cast/src/pretty.rs -B 2 -A 2

echo -e "\n=== Inspect lines 1313-1519 region (test section) ==="
sed -n '1300,1330p' arrow-cast/src/pretty.rs | head -50

echo -e "\n=== Check FormatOptions struct definition ==="
rg -n "struct FormatOptions" arrow-cast/src/pretty.rs -A 3
```

Length of output: 2268

🏁 Script executed:

```
#!/bin/bash
# Search across the codebase for FormatOptions definition and lifetime
echo "=== Finding FormatOptions in arrow-cast crate ==="
fd -e rs . arrow-cast/src | xargs rg "struct FormatOptions" -A 3
```

Length of output: 339

🏁 Script executed:

```
#!/bin/bash
# Get full FormatOptions struct definition to see how formatter_factory is stored
echo "=== Full FormatOptions struct definition ==="
rg -n "pub struct FormatOptions" arrow-cast/src/display.rs -A 20

echo -e "\n=== with_formatter_factory method implementation ==="
rg -n "fn with_formatter_factory" arrow-cast/src/display.rs -A 5

echo -e "\n=== Check if ArrayFormatterFactory is stored by reference ==="
rg -n "ArrayFormatterFactory" arrow-cast/src/display.rs | head -10
```

Length of output: 2107

🏁 Script executed:

```
#!/bin/bash
# Verify if the code actually compiles - check for build artifacts or test outcomes
echo "=== Looking for formatter_factory field in FormatOptions ==="
rg -n "formatter_factory" arrow-cast/src/display.rs -B 2 -A 2
```

Length of output: 2354

🏁 Script executed:

```
#!/bin/bash
# Check how options is used after being assigned in the tests
echo "=== Test 1: lines 1380-1395 (first test with formatter) ==="
sed -n '1380,1395p' arrow-cast/src/pretty.rs

echo -e "\n=== Test 2: lines 1428-1445 (second test) ==="
sed -n '1428,1445p' arrow-cast/src/pretty.rs

echo -e "\n=== Test 3: lines 1497-1510 (third test) ==="
sed -n '1497,1510p' arrow-cast/src/pretty.rs
```

Length of output: 1877

Refactor temporary references to named bindings or statics in custom formatter tests

The review correctly identifies a fragile lifetime pattern. FormatOptions<'a> stores borrowed references (formatter_factory: Option<&'a dyn ArrayFormatterFactory>), and the tests pass temporary &TestFormatters {} values that should be dropped after the statement. While the code likely compiles due to Rust's temporary lifetime extension, this pattern is non-idiomatic and should be refactored.

Three locations require fixes:

Line 1384: .with_formatter_factory(&TestFormatters {})

Line 1430: .with_formatter_factory(&TestFormatters {})

Line 1501: .with_formatter_factory(&TestFormatters {})

Use either named bindings:

```
let test_formatters = TestFormatters {};
let options = FormatOptions::new().with_formatter_factory(&test_formatters);
```

Or a static:

```
static TEST_FORMATTERS: TestFormatters = TestFormatters {};
let options = FormatOptions::new().with_formatter_factory(&TEST_FORMATTERS);
```

🤖 Prompt for AI Agents

arrow-cast/src/pretty.rs lines 275-292; in tests around lines 1384, 1430, and 1501 replace temporary references like .with_formatter_factory(&TestFormatters {}) with a long-lived binding: either create a local named binding (let test_formatters = TestFormatters {}; then pass &test_formatters to with_formatter_factory) or declare a static instance (static TEST_FORMATTERS: TestFormatters = TestFormatters {}; then pass &TEST_FORMATTERS) so the FormatOptions<'a> does not hold a reference to a dropped temporary.

value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! The formatter factory is supposed to live at least the lifetime of the FormatOptions, so using the 'a lifetime should work.
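One likely reason the `&TestFormatters {}` pattern compiles is Rust's constant promotion: a borrow of a constant expression with no destructor and no interior mutability is given a `'static` lifetime. The sketch below assumes `TestFormatters` is a field-less struct without a `Drop` impl. `Factory`, `Options`, and `promoted_options` are hypothetical stand-ins for the real types.

```rust
/// Hypothetical stand-in for `ArrayFormatterFactory`.
pub trait Factory {
    fn name(&self) -> &'static str;
}

/// Assumed shape of the test factory: no fields, no `Drop` impl.
pub struct TestFormatters {}

impl Factory for TestFormatters {
    fn name(&self) -> &'static str {
        "test"
    }
}

/// Hypothetical stand-in for `FormatOptions<'a>`.
pub struct Options<'a> {
    pub factory: Option<&'a dyn Factory>,
}

impl<'a> Options<'a> {
    pub fn new() -> Self {
        Options { factory: None }
    }

    pub fn with_factory(mut self, factory: &'a dyn Factory) -> Self {
        self.factory = Some(factory);
        self
    }
}

/// Returning `Options<'static>` only type-checks because `&TestFormatters {}`
/// is promoted to a `'static` borrow rather than pointing at a dropped temporary.
pub fn promoted_options() -> Options<'static> {
    Options::new().with_factory(&TestFormatters {})
}
```

If the factory held runtime state or implemented `Drop`, promotion would not apply and a named binding would be required, which is the case where the reviewer's refactoring advice becomes relevant.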

martin-augment · 2025-11-14T14:23:32Z

2. Missing types_info in PartialEq impl (display.rs:107-128): The Hash implementation includes types_info but the PartialEq implementation does not. This violates Rust's hash/equality contract:
```
// Current PartialEq - missing types_info check
impl PartialEq for FormatOptions<'_> {
    fn eq(&self, other: &Self) -> bool {
        // ... missing self.types_info == other.types_info
    }
}
```
Impact: Two values that compare equal can hash differently, so lookups in hash-based collections (HashMap, HashSet) may miss existing entries or store duplicates.

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The implementations of Hash and PartialEq need to be kept in sync. This prevents bugs when instances of such structs are used in HashMap/HashSet collections, which locate items by hash bucket before comparing them for equality.

tobixdev added 8 commits November 12, 2025 23:22

Draft for implementing custom ArrayFormatters

4ff1f3f

Improve custom pretty printing

47f00a7

Add sanity check for number of columns

37bac6f

Formatting

1e8f101

Minor fixes

fcc6478

Use accessors in FormatOptions

0599d17

Move ArrayFormatterFactory into FormatOptions

6882394

Move ArrayFormatterFactory to display module to avoid issues with fea…

d186741

…ture flags

martin-augment wants to merge 8 commits into main from pr-8829-2025-11-14-11-28-04
