8829: Allow Users to Provide Custom ArrayFormatters when Pretty-Printing Record Batches#22
8829: Allow Users to Provide Custom ArrayFormatters when Pretty-Printing Record Batches#22martin-augment wants to merge 8 commits intomainfrom
ArrayFormatters when Pretty-Printing Record Batches#22Conversation
WalkthroughThis change introduces a customizable formatting mechanism for array display in the pretty-printing functionality. A new ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| (None, None) => true, | ||
| _ => false, | ||
| } | ||
| } |
There was a problem hiding this comment.
Bug: Broken Hash/Equality Contract
The PartialEq implementation for FormatOptions does not compare the types_info field, but the Hash implementation includes it. This violates Rust's hash/equality contract: if two values are equal according to PartialEq, they must produce the same hash. FormatOptions values differing only in types_info will be equal but have different hashes, causing issues in hash-based collections.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The implementations of Hash and PartialEq need to be synchronized. Prevents bugs when instances of such structs are used HashMap/HashSet collections which use buckets for the items with collisions.
Code ReviewThank you for this PR! This is a well-designed feature that adds extensibility to the pretty-printing functionality. Here's my detailed review: ✅ Code Quality & Best PracticesStrengths:
Suggestions:
🐛 Potential Bugs
⚡ Performance Considerations
🔒 Security ConcernsNo significant security issues identified:
🧪 Test CoverageGood coverage provided:
Suggestions for additional tests:
📝 DocumentationExcellent documentation overall, especially the trait example. Minor suggestions:
🔧 Required Changes
💡 Optional Improvements
SummaryThis is a well-implemented feature with good design and documentation. The main issue is the hash/equality contract violation which should be fixed before merging. Once that's addressed, this will be a solid addition to Arrow-rs. Recommendation: Request changes for the |
| } | ||
|
|
||
| for batch in results { | ||
| let schema = schema_opt.as_ref().unwrap_or(batch.schema_ref()); |
There was a problem hiding this comment.
unwrap_or(batch.schema_ref()) here returns a reference that may point to a temporary SchemaRef created by batch.schema_ref(), which doesn’t live long enough and is likely a compile-time borrow/lifetime error. Consider avoiding borrowing a temporary by using an owned SchemaRef instead of Option<&SchemaRef> at this point.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:annoying; category:bug; feedback:The Augment AI reviewer is not correct! The schema is locally scoped in the loop, so its lifetime is just enough for the purpose of the code.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
arrow-cast/src/display.rs (1)
535-544: PublicFormatError/FormatResult/DisplayIndexare useful; consider derivingDebugMaking
FormatError,FormatResult, andDisplayIndexpublic is consistent with exposingArrayFormatterFactoryand allowing users to implement their own formatters:
FormatErrorclearly distinguishes betweenstd::fmt::ErrorandArrowError, and theFromimpls integrate it smoothly with existing code.FormatResult = Result<(), FormatError>and theDisplayIndextrait give external implementations the same error semantics as the built‑in formatters.One small improvement to consider: derive
DebugforFormatError(and possibly mark it#[non_exhaustive]if you expect to extend it later). That would make it easier to log or surface detailed errors from custom formatters without hand‑rolling formatting.Also applies to: 557-561
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
arrow-cast/src/display.rs(9 hunks)arrow-cast/src/pretty.rs(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
arrow-cast/src/pretty.rs (1)
arrow-cast/src/display.rs (23)
try_new(481-483)array_value_to_string(1287-1291)create_display_index(368-373)new(133-146)new(474-476)write(387-393)write(560-560)write(569-569)write(579-581)write(606-614)write(618-621)write(631-634)write(673-676)write(904-911)write(915-934)write(938-961)write(1056-1059)write(1063-1066)write(1070-1076)write(1080-1086)write(1090-1096)value(487-492)with_formatter_factory(220-228)
arrow-cast/src/display.rs (1)
arrow-cast/src/pretty.rs (3)
Some(782-784)Some(820-822)create_display_index(1318-1342)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: claude-review
🔇 Additional comments (7)
arrow-cast/src/pretty.rs (4)
27-28: New imports align with later usageBringing
Cell,Table, andDisplayinto scope matches the use increate_table/create_columnand the publicpretty_*APIs; no issues here.
257-263: Column-level custom formatter wiring looks correctThe
create_columnlogic mirrors the batch path:
- If no
formatter_factoryis present, you delegate toArrayFormatter::try_newas before.- If a factory is present, you call
create_display_index(..., None)and:
- Use the returned
ArrayFormatterwhen it’sOk(Some(_)).- Fall back to
ArrayFormatter::try_newwhen it’sOk(None).- Propagate any
Errfrom the factory, which is reasonable for a user-provided component.This is consistent with the behavior in
create_tableand should work well for custom per-column formatting.
1521-1546: Good regression test for schema/column count mismatch
test_pretty_format_batches_with_schema_with_wrong_number_of_fieldsnicely exercises the new guard increate_table:
- You pass a schema with two fields while the batch has only one column.
- The test asserts an
ArrowError::InvalidArgumentError, matching the new check oncolumns().len() != fields().len().This should prevent silent misalignment between schemas and batches when pretty-printing.
188-227: This review comment is incorrect – the code type-checks correctly.The reviewer misidentified the return type of
batch.schema_ref(). It returns&SchemaRef(a reference), notSchemaRef(an owned value).Type verification:
- Line 209:
schema_opt.as_ref().unwrap_or(batch.schema_ref())schema_optisOption<SchemaRef>.as_ref()producesOption<&SchemaRef>.unwrap_or()expects a default of type&SchemaRefbatch.schema_ref()(from arrow-array/src/record_batch.rs line 406) returns&SchemaRef✓The types align perfectly. The code compiles without issues. No fix is needed.
Likely an incorrect or invalid review comment.
arrow-cast/src/display.rs (3)
57-62:FormatOptionsformatter_factory integration and Eq/Hash semantics look consistentThe additions around
FormatOptionsare cohesive:
- The new
formatter_factory: Option<&'a dyn ArrayFormatterFactory>field is clearly documented as being compared by pointer identity.PartialEq/Eqmanually compare all existing fields by value and then:
- Treat
Some(f1)andSome(f2)as equal only ifstd::ptr::eq(f1, f2).- Distinguish
Some(_)fromNone, which is what you want.Hashmirrors this by hashing the raw pointer (*const dyn ArrayFormatterFactory) for the factory field, preserving the Eq/Hash invariant.with_formatter_factory,without_formatter_factory, andformatter_factory()provide a straightforward API for configuring and querying the factory.- The new const getters (
safe(),null(),*_format(),duration_format(),types_info()) are a nice improvement and keep the type const‑friendly.Net effect:
FormatOptionsremains usable in maps/sets, with equality reflecting both formatting parameters and the specific factory instance, which is reasonable given the trait object. Just be aware that two factories with identical behavior but different identities will not compare equal, as the docs already call out.Also applies to: 83-86, 94-129, 131-147, 219-288
291-374:ArrayFormatterFactorytrait shape is appropriate for extensible formattingThe new
ArrayFormatterFactorytrait and its documentation are well-aligned with the custom-formatting use cases:
- The signature
create_display_index<'formatter>(&self, array, options, field)gives implementors:
- The raw array,
- The active
FormatOptions(including duration/time formats), and- Optional
Fieldmetadata (for extension types, logical names, etc.).- Returning
Result<Option<ArrayFormatter<'formatter>>, ArrowError>cleanly encodes:
Ok(Some(formatter))→ use custom formatter.Ok(None)→ fall back to the default formatter.Err(e)→ propagate a hard error, which is appropriate for user-provided factories.- The example in the doc comment (custom
my_moneyformatter) matches how the trait is used frompretty.rsand is a good guide for users.This provides a clear, minimal surface for plugging in custom display logic without disrupting the existing formatting pipeline.
472-483:ArrayFormatter::newconstructor cleanly centralizes formatter wiringIntroducing
ArrayFormatter::newand havingtry_newdelegate to it is a nice cleanup:
newencapsulates the(format, safe)pairing, which is exactly what callers like custom factories need when they already have aBox<dyn DisplayIndex>.try_newnow simply callsmake_formatter(array, options)?and forwardsoptions.safetonew, preserving the previous behavior for default formatting.This keeps the public API coherent while enabling external factories to build
ArrayFormatterinstances in a first-class way.
| use std::collections::HashMap; | ||
| use std::fmt::Write; | ||
| use std::sync::Arc; | ||
|
|
||
| use half::f16; | ||
|
|
||
| use arrow_array::builder::*; | ||
| use arrow_array::cast::AsArray; | ||
| use arrow_array::types::*; | ||
| use arrow_array::*; | ||
| use arrow_buffer::{IntervalDayTime, IntervalMonthDayNano, ScalarBuffer}; | ||
| use arrow_schema::*; | ||
|
|
||
| use crate::display::{DurationFormat, array_value_to_string}; | ||
| use crate::display::{ | ||
| ArrayFormatterFactory, DisplayIndex, DurationFormat, array_value_to_string, | ||
| }; | ||
|
|
||
| use super::*; |
There was a problem hiding this comment.
🧩 Analysis chain
Custom formatter tests are solid; avoid &TestFormatters {} temporaries
The custom-formatting scaffolding (TestFormatters, MyMoneyFormatter, MyInt32Formatter) and the associated tests cover both:
- Extension-type–driven formatting (
my_moneyvia field metadata). - Type-driven fallback formatting for plain
Int32columns.
That’s great coverage. One concern is how the formatter factory is wired into FormatOptions in several tests:
let options = FormatOptions::new()
.with_null("<NULL>")
.with_formatter_factory(&TestFormatters {});
// ...
let options = FormatOptions::new().with_formatter_factory(&TestFormatters {});
// ...
FormatOptions::default().with_formatter_factory(&TestFormatters {})These calls take a reference to a temporary TestFormatters value (&TestFormatters {}) and store it inside FormatOptions<'a>. This pattern is easy to run afoul of the borrow checker because the temporary would normally be dropped at the end of the statement, while options is used afterwards. Even if lifetime extension happens to make this compile in some cases, it’s non-obvious and fragile.
Safer, more idiomatic options:
- Bind a named value per test:
let test_formatters = TestFormatters {};
let options = FormatOptions::new()
.with_null("<NULL>")
.with_formatter_factory(&test_formatters);- Or define a shared static for all tests:
#[derive(Debug)]
struct TestFormatters;
static TEST_FORMATTERS: TestFormatters = TestFormatters {};
let options = FormatOptions::new()
.with_null("<NULL>")
.with_formatter_factory(&TEST_FORMATTERS);I recommend refactoring the tests to use one of these patterns so the lifetime of the factory is clearly tied to the lifetime of FormatOptions.
Also applies to: 1313-1519
🏁 Script executed:
#!/bin/bash
# Find TestFormatters struct and verify usage pattern
echo "=== Searching for TestFormatters struct ==="
rg -n "struct TestFormatters" arrow-cast/src/pretty.rs -A 5
echo -e "\n=== Searching for with_formatter_factory calls ==="
rg -n "with_formatter_factory.*TestFormatters" arrow-cast/src/pretty.rs -B 2 -A 2
echo -e "\n=== Inspect lines 1313-1519 region (test section) ==="
sed -n '1300,1330p' arrow-cast/src/pretty.rs | head -50
echo -e "\n=== Check FormatOptions struct definition ==="
rg -n "struct FormatOptions" arrow-cast/src/pretty.rs -A 3Length of output: 2268
🏁 Script executed:
#!/bin/bash
# Search across the codebase for FormatOptions definition and lifetime
echo "=== Finding FormatOptions in arrow-cast crate ==="
fd -e rs . arrow-cast/src | xargs rg "struct FormatOptions" -A 3Length of output: 339
🏁 Script executed:
#!/bin/bash
# Get full FormatOptions struct definition to see how formatter_factory is stored
echo "=== Full FormatOptions struct definition ==="
rg -n "pub struct FormatOptions" arrow-cast/src/display.rs -A 20
echo -e "\n=== with_formatter_factory method implementation ==="
rg -n "fn with_formatter_factory" arrow-cast/src/display.rs -A 5
echo -e "\n=== Check if ArrayFormatterFactory is stored by reference ==="
rg -n "ArrayFormatterFactory" arrow-cast/src/display.rs | head -10Length of output: 2107
🏁 Script executed:
#!/bin/bash
# Verify if the code actually compiles - check for build artifacts or test outcomes
echo "=== Looking for formatter_factory field in FormatOptions ==="
rg -n "formatter_factory" arrow-cast/src/display.rs -B 2 -A 2Length of output: 2354
🏁 Script executed:
#!/bin/bash
# Check how options is used after being assigned in the tests
echo "=== Test 1: lines 1380-1395 (first test with formatter) ==="
sed -n '1380,1395p' arrow-cast/src/pretty.rs
echo -e "\n=== Test 2: lines 1428-1445 (second test) ==="
sed -n '1428,1445p' arrow-cast/src/pretty.rs
echo -e "\n=== Test 3: lines 1497-1510 (third test) ==="
sed -n '1497,1510p' arrow-cast/src/pretty.rsLength of output: 1877
Refactor temporary references to named bindings or statics in custom formatter tests
The review correctly identifies a fragile lifetime pattern. FormatOptions<'a> stores borrowed references (formatter_factory: Option<&'a dyn ArrayFormatterFactory>), and the tests pass temporary &TestFormatters {} values that should be dropped after the statement. While the code likely compiles due to Rust's temporary lifetime extension, this pattern is non-idiomatic and should be refactored.
Three locations require fixes:
- Line 1384:
.with_formatter_factory(&TestFormatters {}) - Line 1430:
.with_formatter_factory(&TestFormatters {}) - Line 1501:
.with_formatter_factory(&TestFormatters {})
Use either named bindings:
let test_formatters = TestFormatters {};
let options = FormatOptions::new().with_formatter_factory(&test_formatters);Or a static:
static TEST_FORMATTERS: TestFormatters = TestFormatters {};
let options = FormatOptions::new().with_formatter_factory(&TEST_FORMATTERS);🤖 Prompt for AI Agents
arrow-cast/src/pretty.rs lines 275-292; in tests around lines 1384, 1430, and
1501 replace temporary references like .with_formatter_factory(&TestFormatters
{}) with a long-lived binding: either create a local named binding (let
test_formatters = TestFormatters {}; then pass &test_formatters to
with_formatter_factory) or declare a static instance (static TEST_FORMATTERS:
TestFormatters = TestFormatters {}; then pass &TEST_FORMATTERS) so the
FormatOptions<'a> does not hold a reference to a dropped temporary.
There was a problem hiding this comment.
value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! The formatter factory is supposed to live at least the lifetime of the FormatOptions, so using the 'a lifetime should work.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The implementations of Hash and PartialEq need to be synchronized. Prevents bugs when instances of such structs are used HashMap/HashSet collections which use buckets for the items with collisions. |
8829: To review by AI