
8829: Allow Users to Provide Custom ArrayFormatters when Pretty-Printing Record Batches #22

Open
martin-augment wants to merge 8 commits into main from pr-8829-2025-11-14-11-28-04

Conversation

@martin-augment
Owner

8829: To review by AI

@coderabbitai

coderabbitai bot commented Nov 14, 2025

Walkthrough

This change introduces a customizable formatting mechanism for array display in the pretty-printing functionality. A new formatter_factory field is added to FormatOptions<'a>, enabling user-provided custom formatters for specific array types. The formatting infrastructure (DisplayIndex trait, FormatError enum, FormatResult type) is made public. The pretty-printing logic is updated to validate schema field counts, align columns with schema fields, and apply custom formatters when available. Comprehensive tests demonstrate custom formatting of batches and columns with schema overrides. No public API signatures are modified in pretty.rs.


(None, None) => true,
_ => false,
}
}

Bug: Broken Hash/Equality Contract

The PartialEq implementation for FormatOptions does not compare the types_info field, but the Hash implementation includes it. This violates Rust's hash/equality contract: if two values are equal according to PartialEq, they must produce the same hash. FormatOptions values differing only in types_info will be equal but have different hashes, causing issues in hash-based collections.
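To make the contract violation concrete, here is a minimal sketch. It uses a hypothetical two-field `Options` struct as a stand-in for `FormatOptions` (the real struct has more fields), with equality ignoring `types_info` while the hash includes it, as the comment above describes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `FormatOptions`, reduced to two fields.
#[derive(Debug, Clone, Copy)]
struct Options {
    safe: bool,
    types_info: bool,
}

// Buggy equality: `types_info` is not compared...
impl PartialEq for Options {
    fn eq(&self, other: &Self) -> bool {
        self.safe == other.safe
    }
}
impl Eq for Options {}

// ...but the hash includes it, so equal values can hash differently.
impl Hash for Options {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.safe.hash(state);
        self.types_info.hash(state);
    }
}

/// Hashes a value with a freshly created `DefaultHasher`.
fn hash_of(value: &Options) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = Options { safe: true, types_info: false };
    let b = Options { safe: true, types_info: true };

    // The two values compare equal, yet their hashes disagree.
    assert_eq!(a, b);
    assert_ne!(hash_of(&a), hash_of(&b));

    // A set that contains `a` may therefore fail to find the "equal" `b`.
    let mut set = HashSet::new();
    set.insert(a);
    println!("set contains b: {}", set.contains(&b));
}
```

Adding `&& self.types_info == other.types_info` to `eq` (or removing the field from `hash`) restores the invariant in this sketch.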

Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The implementations of Hash and PartialEq need to be kept in sync. This prevents bugs when instances of such structs are stored in HashMap/HashSet collections, which group items with colliding hashes into buckets.

@claude

claude bot commented Nov 14, 2025

Code Review

Thank you for this PR! This is a well-designed feature that adds extensibility to the pretty-printing functionality. Here's my detailed review:


✅ Code Quality & Best Practices

Strengths:

  • Clean API design: The ArrayFormatterFactory trait provides a clear extension point for custom formatters
  • Excellent documentation: The example in the trait documentation (display.rs:289-370) is comprehensive and demonstrates real-world usage
  • Backward compatibility: The changes are additive and don't break existing APIs
  • Consistent naming: Follows Arrow-rs conventions well
  • Good test coverage: Four new tests cover the main use cases, including an error case

Suggestions:

  1. Import organization (pretty.rs:25-28): The imports were reordered. While functional, ensure this follows the project's style guide (typically: std, external crates, internal crates)

  2. Missing types_info in PartialEq impl (display.rs:107-128): The Hash implementation includes types_info but the PartialEq implementation does not. This violates Rust's hash/equality contract:

    // Current PartialEq - missing types_info check
    impl PartialEq for FormatOptions<'_> {
        fn eq(&self, other: &Self) -> bool {
            // ... missing self.types_info == other.types_info
        }
    }

    Impact: Two FormatOptions values that compare equal can produce different hashes, so lookups in hash-based collections (HashMap, HashSet) can miss entries that should match.


🐛 Potential Bugs

  1. Critical: Hash/Equality Mismatch (display.rs:94-128)

    • The PartialEq impl doesn't compare types_info, but Hash does hash it
    • Fix: Add && self.types_info == other.types_info to the eq method
  2. Pointer comparison for trait objects (display.rs:107):

    • Using std::ptr::eq for trait object comparison is valid but may behave unexpectedly
    • Two functionally identical factories will compare as unequal if they're different instances
    • Documentation note: Consider documenting this behavior more explicitly in the struct docs
  3. Error message clarity (pretty.rs:212):

    "Expected the same number of columns in a record batch as the number of fields in the schema"

    Consider including the actual counts for better debugging
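As an illustration of the error-message suggestion, here is a minimal sketch of a check that reports both counts. The helper name `check_column_count` and the `String` error type are assumptions made to keep the example self-contained; the PR's actual check lives in `create_table` and returns an `ArrowError`.

```rust
/// Hypothetical helper: validates that a batch has as many columns as the
/// schema has fields, and reports both numbers when they differ.
fn check_column_count(num_columns: usize, num_fields: usize) -> Result<(), String> {
    if num_columns != num_fields {
        return Err(format!(
            "Expected the same number of columns in a record batch ({num_columns}) \
             as the number of fields in the schema ({num_fields})"
        ));
    }
    Ok(())
}

fn main() {
    // Matching counts pass the check.
    assert!(check_column_count(2, 2).is_ok());

    // A mismatch yields a message that names both counts.
    let err = check_column_count(1, 2).unwrap_err();
    println!("{err}");
}
```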


⚡ Performance Considerations

  1. Good: The factory method returns Option<ArrayFormatter>, allowing efficient fallback to default formatters without overhead

  2. Minor concern (pretty.rs:216-226): The formatter creation happens inside the batch iteration loop:

    for batch in results {
        let formatters = batch.columns().iter().zip(schema.fields().iter())
            .map(|(c, field)| match options.formatter_factory() { ... })

    This is fine for typical use cases, but for very wide tables with many batches, consider if factory lookups could be cached

  3. Trait object overhead: Using &dyn ArrayFormatterFactory introduces dynamic dispatch, but this is acceptable for formatting operations which are not performance-critical


🔒 Security Concerns

No significant security issues identified:

  • ✅ No unsafe code introduced
  • ✅ No unbounded allocations
  • ✅ Input validation is present (schema field count check)
  • ✅ Error handling is appropriate

🧪 Test Coverage

Good coverage provided:

  1. test_format_batches_with_custom_formatters - Basic custom formatter usage
  2. test_format_batches_with_custom_formatters_custom_schema_overrules_batch_schema - Schema override behavior
  3. test_format_column_with_custom_formatters - Column formatting
  4. test_pretty_format_batches_with_schema_with_wrong_number_of_fields - Error case

Suggestions for additional tests:

  1. Test with nested types (structs, lists) to ensure custom formatters work recursively
  2. Test behavior when factory returns Err(ArrowError) - currently not covered
  3. Test with empty batches using custom formatters
  4. Test thread safety if formatters are intended to be used concurrently

📝 Documentation

Excellent documentation overall, especially the trait example. Minor suggestions:

  1. The FormatOptions equality documentation (display.rs:57-61) could mention the types_info field behavior once the bug is fixed

  2. Consider adding a note about the lifetime requirements for the factory - users need to ensure the factory outlives the FormatOptions


🔧 Required Changes

  1. Fix the PartialEq implementation to include types_info comparison

💡 Optional Improvements

  1. Consider adding the actual field/column counts to the error message at pretty.rs:212
  2. Consider documenting the pointer-equality semantics for formatter_factory more explicitly
  3. Add a few more edge case tests as suggested above

Summary

This is a well-implemented feature with good design and documentation. The main issue is the hash/equality contract violation which should be fixed before merging. Once that's addressed, this will be a solid addition to Arrow-rs.

Recommendation: Request changes for the PartialEq/Hash bug, then approve after fix.


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


}

for batch in results {
    let schema = schema_opt.as_ref().unwrap_or(batch.schema_ref());

unwrap_or(batch.schema_ref()) here returns a reference that may point to a temporary SchemaRef created by batch.schema_ref(), which doesn’t live long enough and is likely a compile-time borrow/lifetime error. Consider avoiding borrowing a temporary by using an owned SchemaRef instead of Option<&SchemaRef> at this point.


Owner Author


value:annoying; category:bug; feedback:The Augment AI reviewer is not correct! The schema is locally scoped in the loop, so its lifetime is just enough for the purpose of the code.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
arrow-cast/src/display.rs (1)

535-544: Public FormatError / FormatResult / DisplayIndex are useful; consider deriving Debug

Making FormatError, FormatResult, and DisplayIndex public is consistent with exposing ArrayFormatterFactory and allowing users to implement their own formatters:

  • FormatError clearly distinguishes between std::fmt::Error and ArrowError, and the From impls integrate it smoothly with existing code.
  • FormatResult = Result<(), FormatError> and the DisplayIndex trait give external implementations the same error semantics as the built‑in formatters.

One small improvement to consider: derive Debug for FormatError (and possibly mark it #[non_exhaustive] if you expect to extend it later). That would make it easier to log or surface detailed errors from custom formatters without hand‑rolling formatting.
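A minimal sketch of what the suggested derive could look like. `ArrowError` here is a one-variant stand-in for the real `arrow_schema::ArrowError`, and `write_value` is a hypothetical function used only to exercise both `From` conversions; neither is taken from the PR.

```rust
use std::fmt::{self, Write};

/// Stand-in for `arrow_schema::ArrowError`, reduced to a single variant.
#[derive(Debug)]
enum ArrowError {
    InvalidArgumentError(String),
}

/// Error type for formatters, with `Debug` derived and marked
/// `#[non_exhaustive]` so that variants can be added later.
#[derive(Debug)]
#[non_exhaustive]
enum FormatError {
    Format(fmt::Error),
    Arrow(ArrowError),
}

impl From<fmt::Error> for FormatError {
    fn from(e: fmt::Error) -> Self {
        FormatError::Format(e)
    }
}

impl From<ArrowError> for FormatError {
    fn from(e: ArrowError) -> Self {
        FormatError::Arrow(e)
    }
}

type FormatResult = Result<(), FormatError>;

/// Hypothetical formatter step: rejects negative values, writes the rest.
fn write_value(out: &mut dyn Write, value: i64) -> FormatResult {
    if value < 0 {
        let err = ArrowError::InvalidArgumentError(format!("negative value: {value}"));
        return Err(err.into());
    }
    // `?` converts a `fmt::Error` into `FormatError::Format`.
    write!(out, "{value}")?;
    Ok(())
}

fn main() {
    let mut out = String::new();
    write_value(&mut out, 42).unwrap();
    println!("{out}");

    // With `Debug` derived, the error can be logged directly.
    if let Err(e) = write_value(&mut out, -1) {
        println!("error: {e:?}");
    }
}
```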

Also applies to: 557-561

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4f421d1 and d186741.

📒 Files selected for processing (2)
  • arrow-cast/src/display.rs (9 hunks)
  • arrow-cast/src/pretty.rs (6 hunks)
🔇 Additional comments (7)
arrow-cast/src/pretty.rs (4)

27-28: New imports align with later usage

Bringing Cell, Table, and Display into scope matches the use in create_table / create_column and the public pretty_* APIs; no issues here.


257-263: Column-level custom formatter wiring looks correct

The create_column logic mirrors the batch path:

  • If no formatter_factory is present, you delegate to ArrayFormatter::try_new as before.
  • If a factory is present, you call create_display_index(..., None) and:
    • Use the returned ArrayFormatter when it’s Ok(Some(_)).
    • Fall back to ArrayFormatter::try_new when it’s Ok(None).
    • Propagate any Err from the factory, which is reasonable for a user-provided component.

This is consistent with the behavior in create_table and should work well for custom per-column formatting.
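The three-way handling described above can be sketched with simplified stand-ins. `Formatter`, `FormatterFactory`, `MoneyFactory`, and `resolve` are hypothetical names invented for this example; they only mirror the shape of `ArrayFormatter`, `ArrayFormatterFactory`, and the logic in `create_column`, not their real signatures.

```rust
/// Stand-in for `ArrayFormatter`: renders a single value.
type Formatter = Box<dyn Fn(i32) -> String>;

/// Stand-in for `ArrayFormatterFactory`, keyed on a type name.
trait FormatterFactory {
    fn create(&self, type_name: &str) -> Result<Option<Formatter>, String>;
}

struct MoneyFactory;

impl FormatterFactory for MoneyFactory {
    fn create(&self, type_name: &str) -> Result<Option<Formatter>, String> {
        match type_name {
            // Custom formatter for a recognised type.
            "my_money" => {
                let custom: Formatter = Box::new(|v: i32| format!("{v} USD"));
                Ok(Some(custom))
            }
            // A hard failure inside the factory.
            "broken" => Err("factory failed".to_string()),
            // Anything else: let the caller use the default.
            _ => Ok(None),
        }
    }
}

/// Stand-in for the default `ArrayFormatter::try_new` path.
fn default_formatter() -> Formatter {
    Box::new(|v: i32| v.to_string())
}

/// Picks a formatter: no factory or `Ok(None)` falls back to the default,
/// `Ok(Some(_))` uses the custom one, and `Err` is propagated.
fn resolve(factory: Option<&dyn FormatterFactory>, type_name: &str) -> Result<Formatter, String> {
    match factory {
        None => Ok(default_formatter()),
        Some(f) => match f.create(type_name)? {
            Some(custom) => Ok(custom),
            None => Ok(default_formatter()),
        },
    }
}

fn main() {
    let factory = MoneyFactory;
    for type_name in ["my_money", "int32", "broken"] {
        match resolve(Some(&factory), type_name) {
            Ok(fmt) => println!("{type_name}: {}", fmt(5)),
            Err(e) => println!("{type_name}: error: {e}"),
        }
    }
}
```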


1521-1546: Good regression test for schema/column count mismatch

test_pretty_format_batches_with_schema_with_wrong_number_of_fields nicely exercises the new guard in create_table:

  • You pass a schema with two fields while the batch has only one column.
  • The test asserts an ArrowError::InvalidArgumentError, matching the new check on columns().len() != fields().len().

This should prevent silent misalignment between schemas and batches when pretty-printing.


188-227: This review comment is incorrect – the code type-checks correctly.

The reviewer misidentified the return type of batch.schema_ref(). It returns &SchemaRef (a reference), not SchemaRef (an owned value).

Type verification:

  • Line 209: schema_opt.as_ref().unwrap_or(batch.schema_ref())
  • schema_opt is Option<SchemaRef>
  • .as_ref() produces Option<&SchemaRef>
  • .unwrap_or() expects a default of type &SchemaRef
  • batch.schema_ref() (from arrow-array/src/record_batch.rs line 406) returns &SchemaRef

The types align perfectly. The code compiles without issues. No fix is needed.
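The type reasoning above can be reproduced with stand-ins. In this sketch `SchemaRef` is an `Arc<Vec<String>>` and `Batch` is a hypothetical one-field struct; only the shape of `schema_ref()` (returning `&SchemaRef`) is taken from the analysis.

```rust
use std::sync::Arc;

/// Stand-in for arrow's `SchemaRef` (an `Arc<Schema>` in the real crate).
type SchemaRef = Arc<Vec<String>>;

/// Hypothetical stand-in for `RecordBatch`.
struct Batch {
    schema: SchemaRef,
}

impl Batch {
    /// Returns a reference to the stored schema, not an owned clone.
    fn schema_ref(&self) -> &SchemaRef {
        &self.schema
    }
}

/// `Option<SchemaRef>::as_ref()` yields `Option<&SchemaRef>`, so the default
/// passed to `unwrap_or` must be a `&SchemaRef`, which `schema_ref()` provides.
fn effective_schema<'a>(schema_opt: &'a Option<SchemaRef>, batch: &'a Batch) -> &'a SchemaRef {
    schema_opt.as_ref().unwrap_or(batch.schema_ref())
}

fn main() {
    let batch = Batch { schema: Arc::new(vec!["a".to_string()]) };
    let override_schema: Option<SchemaRef> =
        Some(Arc::new(vec!["x".to_string(), "y".to_string()]));

    // With an override, the override wins; without one, the batch schema is used.
    println!("{:?}", effective_schema(&override_schema, &batch));
    println!("{:?}", effective_schema(&None, &batch));
}
```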

Likely an incorrect or invalid review comment.

arrow-cast/src/display.rs (3)

57-62: FormatOptions formatter_factory integration and Eq/Hash semantics look consistent

The additions around FormatOptions are cohesive:

  • The new formatter_factory: Option<&'a dyn ArrayFormatterFactory> field is clearly documented as being compared by pointer identity.
  • PartialEq/Eq manually compare all existing fields by value and then:
    • Treat Some(f1) and Some(f2) as equal only if std::ptr::eq(f1, f2).
    • Distinguish Some(_) from None, which is what you want.
  • Hash mirrors this by hashing the raw pointer (*const dyn ArrayFormatterFactory) for the factory field, preserving the Eq/Hash invariant.
  • with_formatter_factory, without_formatter_factory, and formatter_factory() provide a straightforward API for configuring and querying the factory.
  • The new const getters (safe(), null(), *_format(), duration_format(), types_info()) are a nice improvement and keep the type const‑friendly.

Net effect: FormatOptions remains usable in maps/sets, with equality reflecting both formatting parameters and the specific factory instance, which is reasonable given the trait object. Just be aware that two factories with identical behavior but different identities will not compare equal, as the docs already call out.
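A small sketch of the pointer-identity semantics. `Factory`, `Named`, and `same_factory` are hypothetical; the comparison casts to thin pointers so that only the data address is compared, which is an assumption of this sketch and may differ from the exact `std::ptr::eq` call in the PR.

```rust
/// Hypothetical stand-in for `ArrayFormatterFactory`.
trait Factory {
    fn name(&self) -> &str;
}

/// Non-zero-sized on purpose, so distinct instances have distinct addresses.
struct Named {
    label: String,
}

impl Factory for Named {
    fn name(&self) -> &str {
        &self.label
    }
}

/// Compares two optional factories by identity, not by behaviour.
fn same_factory(a: Option<&dyn Factory>, b: Option<&dyn Factory>) -> bool {
    match (a, b) {
        (Some(f1), Some(f2)) => std::ptr::eq(
            f1 as *const dyn Factory as *const (),
            f2 as *const dyn Factory as *const (),
        ),
        (None, None) => true,
        _ => false,
    }
}

fn main() {
    let first = Named { label: "money".to_string() };
    let second = Named { label: "money".to_string() };

    // Same instance: equal. Identical behaviour but another instance: not equal.
    println!("same instance: {}", same_factory(Some(&first), Some(&first)));
    println!("{} vs {}: {}", first.name(), second.name(), same_factory(Some(&first), Some(&second)));
}
```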

Also applies to: 83-86, 94-129, 131-147, 219-288


291-374: ArrayFormatterFactory trait shape is appropriate for extensible formatting

The new ArrayFormatterFactory trait and its documentation are well-aligned with the custom-formatting use cases:

  • The signature create_display_index<'formatter>(&self, array, options, field) gives implementors:
    • The raw array,
    • The active FormatOptions (including duration/time formats), and
    • Optional Field metadata (for extension types, logical names, etc.).
  • Returning Result<Option<ArrayFormatter<'formatter>>, ArrowError> cleanly encodes:
    • Ok(Some(formatter)) → use custom formatter.
    • Ok(None) → fall back to the default formatter.
    • Err(e) → propagate a hard error, which is appropriate for user-provided factories.
  • The example in the doc comment (custom my_money formatter) matches how the trait is used from pretty.rs and is a good guide for users.

This provides a clear, minimal surface for plugging in custom display logic without disrupting the existing formatting pipeline.


472-483: ArrayFormatter::new constructor cleanly centralizes formatter wiring

Introducing ArrayFormatter::new and having try_new delegate to it is a nice cleanup:

  • new encapsulates the (format, safe) pairing, which is exactly what callers like custom factories need when they already have a Box<dyn DisplayIndex>.
  • try_new now simply calls make_formatter(array, options)? and forwards options.safe to new, preserving the previous behavior for default formatting.

This keeps the public API coherent while enabling external factories to build ArrayFormatter instances in a first-class way.

Comment on lines +275 to 292
use std::collections::HashMap;
use std::fmt::Write;
use std::sync::Arc;

use half::f16;

use arrow_array::builder::*;
use arrow_array::cast::AsArray;
use arrow_array::types::*;
use arrow_array::*;
use arrow_buffer::{IntervalDayTime, IntervalMonthDayNano, ScalarBuffer};
use arrow_schema::*;

-use crate::display::{DurationFormat, array_value_to_string};
+use crate::display::{
+    ArrayFormatterFactory, DisplayIndex, DurationFormat, array_value_to_string,
+};

use super::*;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Custom formatter tests are solid; avoid &TestFormatters {} temporaries

The custom-formatting scaffolding (TestFormatters, MyMoneyFormatter, MyInt32Formatter) and the associated tests cover both:

  • Extension-type–driven formatting (my_money via field metadata).
  • Type-driven fallback formatting for plain Int32 columns.

That’s great coverage. One concern is how the formatter factory is wired into FormatOptions in several tests:

let options = FormatOptions::new()
    .with_null("<NULL>")
    .with_formatter_factory(&TestFormatters {});
// ...
let options = FormatOptions::new().with_formatter_factory(&TestFormatters {});
// ...
FormatOptions::default().with_formatter_factory(&TestFormatters {})

These calls take a reference to a temporary TestFormatters value (&TestFormatters {}) and store it inside FormatOptions<'a>. This pattern is easy to run afoul of the borrow checker because the temporary would normally be dropped at the end of the statement, while options is used afterwards. Even if lifetime extension happens to make this compile in some cases, it’s non-obvious and fragile.

Safer, more idiomatic options:

  • Bind a named value per test:
let test_formatters = TestFormatters {};
let options = FormatOptions::new()
    .with_null("<NULL>")
    .with_formatter_factory(&test_formatters);
  • Or define a shared static for all tests:
#[derive(Debug)]
struct TestFormatters;

static TEST_FORMATTERS: TestFormatters = TestFormatters {};

let options = FormatOptions::new()
    .with_null("<NULL>")
    .with_formatter_factory(&TEST_FORMATTERS);

I recommend refactoring the tests to use one of these patterns so the lifetime of the factory is clearly tied to the lifetime of FormatOptions.

Also applies to: 1313-1519


🏁 Script executed:

#!/bin/bash
# Find TestFormatters struct and verify usage pattern
echo "=== Searching for TestFormatters struct ==="
rg -n "struct TestFormatters" arrow-cast/src/pretty.rs -A 5

echo -e "\n=== Searching for with_formatter_factory calls ==="
rg -n "with_formatter_factory.*TestFormatters" arrow-cast/src/pretty.rs -B 2 -A 2

echo -e "\n=== Inspect lines 1313-1519 region (test section) ==="
sed -n '1300,1330p' arrow-cast/src/pretty.rs | head -50

echo -e "\n=== Check FormatOptions struct definition ==="
rg -n "struct FormatOptions" arrow-cast/src/pretty.rs -A 3

Length of output: 2268


🏁 Script executed:

#!/bin/bash
# Search across the codebase for FormatOptions definition and lifetime
echo "=== Finding FormatOptions in arrow-cast crate ==="
fd -e rs . arrow-cast/src | xargs rg "struct FormatOptions" -A 3

Length of output: 339


🏁 Script executed:

#!/bin/bash
# Get full FormatOptions struct definition to see how formatter_factory is stored
echo "=== Full FormatOptions struct definition ==="
rg -n "pub struct FormatOptions" arrow-cast/src/display.rs -A 20

echo -e "\n=== with_formatter_factory method implementation ==="
rg -n "fn with_formatter_factory" arrow-cast/src/display.rs -A 5

echo -e "\n=== Check if ArrayFormatterFactory is stored by reference ==="
rg -n "ArrayFormatterFactory" arrow-cast/src/display.rs | head -10

Length of output: 2107


🏁 Script executed:

#!/bin/bash
# Verify if the code actually compiles - check for build artifacts or test outcomes
echo "=== Looking for formatter_factory field in FormatOptions ==="
rg -n "formatter_factory" arrow-cast/src/display.rs -B 2 -A 2

Length of output: 2354


🏁 Script executed:

#!/bin/bash
# Check how options is used after being assigned in the tests
echo "=== Test 1: lines 1380-1395 (first test with formatter) ==="
sed -n '1380,1395p' arrow-cast/src/pretty.rs

echo -e "\n=== Test 2: lines 1428-1445 (second test) ==="
sed -n '1428,1445p' arrow-cast/src/pretty.rs

echo -e "\n=== Test 3: lines 1497-1510 (third test) ==="
sed -n '1497,1510p' arrow-cast/src/pretty.rs

Length of output: 1877


Refactor temporary references to named bindings or statics in custom formatter tests

The review correctly identifies a fragile lifetime pattern. FormatOptions<'a> stores borrowed references (formatter_factory: Option<&'a dyn ArrayFormatterFactory>), and the tests pass &TestFormatters {} expressions that look like temporaries dropped at the end of the statement. The code most likely compiles because the field-less TestFormatters {} expression is promoted to a 'static constant, not because of temporary lifetime extension. The pattern would stop compiling if TestFormatters gained a Drop impl or non-constant fields, so it is still worth refactoring.

Three locations require fixes:

  • Line 1384: .with_formatter_factory(&TestFormatters {})
  • Line 1430: .with_formatter_factory(&TestFormatters {})
  • Line 1501: .with_formatter_factory(&TestFormatters {})

Use either named bindings:

let test_formatters = TestFormatters {};
let options = FormatOptions::new().with_formatter_factory(&test_formatters);

Or a static:

static TEST_FORMATTERS: TestFormatters = TestFormatters {};
let options = FormatOptions::new().with_formatter_factory(&TEST_FORMATTERS);
🤖 Prompt for AI Agents
arrow-cast/src/pretty.rs lines 275-292; in tests around lines 1384, 1430, and
1501 replace temporary references like .with_formatter_factory(&TestFormatters
{}) with a long-lived binding: either create a local named binding (let
test_formatters = TestFormatters {}; then pass &test_formatters to
with_formatter_factory) or declare a static instance (static TEST_FORMATTERS:
TestFormatters = TestFormatters {}; then pass &TEST_FORMATTERS) so the
FormatOptions<'a> does not hold a reference to a dropped temporary.
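For context on why the `&TestFormatters {}` calls compile, here is a minimal sketch with hypothetical stand-ins (`Factory`, `Options`, `with_factory`) for the real types. It relies on Rust's constant promotion: a borrow of a constant expression whose type has no `Drop` impl can be given the `'static` lifetime. The named-binding variant at the end is the more explicit form suggested in the review.

```rust
/// Hypothetical stand-in for `ArrayFormatterFactory`.
trait Factory {
    fn label(&self) -> &'static str;
}

#[derive(Debug)]
struct TestFormatters {}

impl Factory for TestFormatters {
    fn label(&self) -> &'static str {
        "test"
    }
}

/// Hypothetical stand-in for `FormatOptions<'a>`: it only borrows the factory.
struct Options<'a> {
    factory: Option<&'a dyn Factory>,
}

impl<'a> Options<'a> {
    fn new() -> Self {
        Options { factory: None }
    }

    fn with_factory(mut self, factory: &'a dyn Factory) -> Self {
        self.factory = Some(factory);
        self
    }

    fn label(&self) -> &'static str {
        self.factory.map(|f| f.label()).unwrap_or("default")
    }
}

fn main() {
    // `TestFormatters {}` is a constant expression without `Drop`, so the
    // borrow is promoted to `'static` and `options` stays valid afterwards.
    let options = Options::new().with_factory(&TestFormatters {});
    println!("{}", options.label());

    // The promotion can be observed directly with an explicit annotation.
    let promoted: &'static TestFormatters = &TestFormatters {};
    println!("{promoted:?}");

    // The explicit alternative: a named binding that outlives `options`.
    let named = TestFormatters {};
    let options = Options::new().with_factory(&named);
    println!("{}", options.label());
}
```

The named binding keeps working even if the factory later gains fields computed at runtime or a `Drop` impl, cases in which promotion no longer applies.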

Owner Author


value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! The formatter factory is supposed to live at least the lifetime of the FormatOptions, so using the 'a lifetime should work.

@martin-augment
Owner Author

2. Missing types_info in Hash impl (display.rs:107-128): The Hash implementation includes types_info but the PartialEq implementation does not. This violates Rust's hash/equality contract:

    // Current PartialEq - missing types_info check
    impl PartialEq for FormatOptions<'_> {
        fn eq(&self, other: &Self) -> bool {
            // ... missing self.types_info == other.types_info
        }
    }

Impact: This can cause hash collisions and incorrect behavior when using FormatOptions in hash-based collections.

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The implementations of Hash and PartialEq need to be kept in sync. This prevents bugs when instances of such structs are stored in HashMap/HashSet collections, which group items with colliding hashes into buckets.
