8759: AvroError enum for arrow-avro crate#16
Conversation
WalkthroughThis PR introduces centralized error handling for the arrow-avro crate by creating a new AvroError enum and Result type alias in errors.rs. Multiple reader and writer modules are refactored to replace ArrowError with AvroError in public and internal signatures throughout the crate, enabling consistent error propagation and handling. Changes
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Pull Request Review: AvroError Enum for arrow-avro CrateSummaryThis PR introduces a dedicated Code Quality & Best Practices✅ Strengths
|
There was a problem hiding this comment.
Bug: Wrong Index Corrupts Sync Marker Data
The sync marker write index calculation is incorrect. The code uses 16 - to_decode instead of 16 - self.bytes_remaining to determine where to write bytes in the 16-byte sync marker array. This causes bytes to be written to the wrong positions. For example, when bytes_remaining = 16 and to_decode = 5, bytes are written to positions 11-15 instead of 0-4, corrupting the sync marker data.
arrow-avro/src/reader/block.rs#L114-L116
arrow-rs/arrow-avro/src/reader/block.rs
Lines 114 to 116 in b40bb9c
| fn from(e: AvroError) -> Self { | ||
| match e { | ||
| AvroError::External(inner) => ArrowError::from_external_error(inner), | ||
| AvroError::ArrowError(inner) => ArrowError::from_external_error(inner), |
There was a problem hiding this comment.
Bug: Avro Error Conversion: Prevent Redundant Wrapping
When converting AvroError::ArrowError(inner) to ArrowError, the code calls ArrowError::from_external_error(inner) which wraps the boxed ArrowError as an external error. This creates unnecessary error nesting and loses the original error type. The correct approach is to unbox and return the ArrowError directly with *inner instead of wrapping it as an external error.
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Bugbot AI reviewer is correct that there is no need to wrap an ArrowError in another ArrowError. Prevents a needless allocation and duplication.
| err, | ||
| ArrowError::ParseError(msg) | ||
| if msg.contains("Unexpected EOF") | ||
| AvroError::EOF(_) | AvroError::NeedMoreData(_) | AvroError::NeedMoreDataRange(_) |
There was a problem hiding this comment.
The new is_incomplete_data only matches EOF and NeedMoreData*, but some sites still return AvroError::ParseError for incomplete input (e.g., read_header returns ParseError("Unexpected EOF while reading Avro header")), which will no longer be treated as incomplete. Consider aligning those call sites to use EOF or broadening this check.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct that this should be an AvroError::EOF error instead of an AvroError::ParseError. This will help is_incomplete_data() to detect it!
| let writer_schema = hdr | ||
| .schema() | ||
| .map_err(|e| ArrowError::ExternalError(Box::new(e)))? | ||
| .map_err(|e| AvroError::External(Box::new(e)))? |
There was a problem hiding this comment.
hdr.schema() already returns AvroError; wrapping it in AvroError::External(Box::new(e)) loses the original variant and source. Propagating the AvroError directly preserves error classification and message.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct that the error is already an AvroError instance, so there is no need to wrap it in another AvroError. Prevents useless complexity.
| return out | ||
| .write_all(&src_be[extra..]) | ||
| .map_err(|e| ArrowError::IoError(format!("write decimal fixed: {e}"), e)); | ||
| .map_err(|e| AvroError::General(format!("write decimal fixed: {e}"))); |
There was a problem hiding this comment.
Mapping write_all errors to AvroError::General drops the underlying io::Error source; preserving it (e.g., via ? or an External variant) would retain diagnostics (also applies to similar write_all calls below).
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct that by using the Display implementation of the error it will discard its source (std::io::Error). Prevents losing an important information which might be helpful to resolve the root cause of an error.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
arrow-avro/src/reader/mod.rs (1)
679-707: Blocker: AvroError → ArrowError conversion is relying on implicit From/Into; add explicit mapper and use it at all boundaries.Several places assume
ArrowError::from(AvroError)/e.into()/?will work. That impl can’t live in this crate (orphan rules), so this will fail or be brittle. Define a local converter and usemap_err(...)where public API returnsArrowError.Apply the following changes:
- Add a small helper (just below
is_incomplete_data):fn is_incomplete_data(err: &AvroError) -> bool { matches!(err, AvroError::EOF(_) | AvroError::NeedMoreData(_) | AvroError::NeedMoreDataRange(_)) } + +// Convert AvroError to ArrowError for public APIs that must return ArrowError. +fn to_arrow_error(err: AvroError) -> ArrowError { + match err { + AvroError::ArrowError(e) => *e, + // Default: surface as parse error with message + other => ArrowError::ParseError(other.to_string()), + } +}
- Decoder::decode: avoid
e.into()and?on AvroError:- Err(e) => return Err(e.into()), + Err(e) => return Err(to_arrow_error(e)), ... - match self.handle_prefix(&data[total_consumed..])? { + match self + .handle_prefix(&data[total_consumed..]) + .map_err(to_arrow_error)? {
- Decoder::flush:
- batch.map_err(ArrowError::from) + batch.map_err(to_arrow_error)
- ReaderBuilder::build:
- let header = read_header(&mut reader)?; - let decoder = self.make_decoder(Some(&header), self.reader_schema.as_ref())?; + let header = read_header(&mut reader).map_err(to_arrow_error)?; + let decoder = self + .make_decoder(Some(&header), self.reader_schema.as_ref()) + .map_err(to_arrow_error)?;
- ReaderBuilder::build_decoder:
- self.make_decoder(None, self.reader_schema.as_ref()) - .map_err(ArrowError::from) + self.make_decoder(None, self.reader_schema.as_ref()) + .map_err(to_arrow_error)
- Reader::read (all AvroError-returning calls):
- let consumed = self.block_decoder.decode(buf)?; + let consumed = self.block_decoder.decode(buf).map_err(to_arrow_error)?; ... - self.block_data = if let Some(ref codec) = self.header.compression()? { - codec.decompress(&block.data)? + self.block_data = if let Some(ref codec) = self.header.compression().map_err(to_arrow_error)? { + codec.decompress(&block.data).map_err(to_arrow_error)? } else { block.data }; ... - let (consumed, records_decoded) = self - .decoder - .decode_block(&self.block_data[self.block_cursor..], self.block_count)?; + let (consumed, records_decoded) = self + .decoder + .decode_block(&self.block_data[self.block_cursor..], self.block_count) + .map_err(to_arrow_error)?; ... - self.decoder.flush_block().map_err(ArrowError::from) + self.decoder.flush_block().map_err(to_arrow_error)This makes conversions explicit and compilation-safe.
Also applies to: 834-839, 1149-1151, 1171-1179, 1219-1256
🧹 Nitpick comments (2)
arrow-avro/src/reader/mod.rs (2)
1013-1020: Don’t wrap an AvroError into AvroError::External here.
Header::schema()already returnsResult<_, AvroError>, with parse context. Wrapping that AvroError asExternalloses semantics.Use
?directly:- let writer_schema = hdr - .schema() - .map_err(|e| AvroError::External(Box::new(e)))? + let writer_schema = hdr + .schema()? .ok_or_else(|| { AvroError::ParseError("No Avro schema present in file header".into()) })?;
1045-1069: Small tidy: avoid re-callingstore.fingerprints()and reuse the snapshot.You collect
let fingerprints = store.fingerprints();but iteratefor fingerprint in store.fingerprints(). Iterate the snapshot to avoid redundant calls and ensure a consistent snapshot.- let mut cache = IndexMap::with_capacity(fingerprints.len().saturating_sub(1)); - let mut active_decoder: Option<RecordDecoder> = None; - for fingerprint in store.fingerprints() { + let mut cache = IndexMap::with_capacity(fingerprints.len().saturating_sub(1)); + let mut active_decoder: Option<RecordDecoder> = None; + for fingerprint in fingerprints.iter().copied() {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
arrow-avro/src/errors.rs(1 hunks)arrow-avro/src/lib.rs(1 hunks)arrow-avro/src/reader/block.rs(3 hunks)arrow-avro/src/reader/cursor.rs(2 hunks)arrow-avro/src/reader/header.rs(6 hunks)arrow-avro/src/reader/mod.rs(18 hunks)arrow-avro/src/reader/record.rs(81 hunks)arrow-avro/src/writer/encoder.rs(87 hunks)arrow-avro/src/writer/format.rs(7 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
arrow-avro/src/reader/block.rs (2)
arrow-avro/src/reader/header.rs (1)
decode(178-268)arrow-avro/src/reader/mod.rs (1)
decode(678-708)
arrow-avro/src/reader/header.rs (3)
arrow-avro/src/reader/record.rs (4)
schema(147-149)decode(152-169)decode(853-991)decode(1536-1566)arrow-avro/src/schema.rs (4)
schema(362-365)schema(456-460)v(1274-1274)e(976-980)arrow-avro/src/reader/block.rs (1)
decode(78-127)
arrow-avro/src/reader/cursor.rs (1)
arrow-avro/src/reader/vlq.rs (1)
read_varint(50-61)
arrow-avro/src/writer/format.rs (2)
arrow-avro/src/reader/header.rs (2)
schema(116-124)compression(99-113)arrow-avro/src/writer/encoder.rs (1)
write_long(53-67)
arrow-avro/src/reader/record.rs (3)
arrow-avro/src/reader/block.rs (1)
decode(78-127)arrow-avro/src/reader/header.rs (1)
decode(178-268)arrow-avro/src/writer/encoder.rs (12)
try_new(1313-1364)try_new(1435-1464)try_new(1486-1508)try_new(1555-1580)try_new(1616-1641)try_new(1668-1693)b(2033-2035)new(725-731)new(1885-1887)new(1924-1938)build(740-766)build(850-1081)
arrow-avro/src/writer/encoder.rs (2)
arrow-avro/src/writer/mod.rs (13)
out(1501-1504)out(1547-1550)out(1587-1590)out(1673-1676)out(1732-1735)out(1777-1780)out(1821-1824)out(1865-1868)new(98-105)new(293-295)new(331-333)build(128-172)a(1679-1682)arrow-avro/src/errors.rs (1)
fmt(64-86)
arrow-avro/src/reader/mod.rs (2)
arrow-avro/src/errors.rs (5)
from(100-102)from(106-108)from(112-114)from(118-120)from(124-126)arrow-avro/src/reader/header.rs (1)
schema(116-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: codex
🔇 Additional comments (11)
arrow-avro/src/writer/format.rs (5)
21-21: LGTM! Import updated for new error model.The import correctly brings in the new
AvroErrorandResulttype alias, aligning with the centralized error handling introduced in this PR.
39-44: Public API change: trait method signature updated.The return type change from
Result<(), ArrowError>toResult<()>(using the newAvroError-based type alias) is a breaking change for any external implementors of this trait. This is expected as part of the error model migration.
116-128: LGTM! SOE compression validation updated correctly.The error construction now uses
AvroError::InvalidArgumentwith a clear, descriptive message. The validation logic remains unchanged and correctly rejects compression for Single-Object Encoding format.
136-145: LGTM! Helper functions updated consistently.Both
write_stringandwrite_bytesnow use the newResult<()>type alias with direct error propagation. The explicitOk(())at line 144 is clear and correct.
58-97: Fromio::Error for AvroError is correctly implemented.Verification confirms that
From<io::Error> for AvroErrorexists inarrow-avro/src/errors.rs, convertingio::ErrortoAvroError::External. The direct use of?onwrite_all()calls (lines 77, 95) will correctly propagate errors through the From trait implementation. Code is correct as written.arrow-avro/src/reader/mod.rs (1)
503-508: Confirm AvroError variants exist (EOF/NeedMoreData/NeedMoreDataRange).
is_incomplete_datamatches these variants. If any were renamed/removed inerrors.rs, the guard will never trigger and decoding will wrongly fail instead of asking for more bytes. Please confirm the variants’ exact names/fields.arrow-avro/src/reader/record.rs (4)
24-24: LGTM: Clean import of new error types.The import of
AvroErrorandResultfrom the errors module enables the consistent error handling refactoring throughout this file.Also applies to: 35-38
140-143: LGTM: Appropriate use of AvroError variants.The code uses distinct
AvroErrorvariants appropriately:
ParseErrorfor data parsing issues (e.g., invalid record type, decimal overflow)SchemaErrorfor schema validation issues (e.g., union branch count limits)NYIfor not-yet-implemented features (e.g., sparse unions)InvalidArgumentfor configuration errors (e.g., unsupported RunEndEncoded width)Error messages are descriptive and include relevant context.
Also applies to: 372-375, 388-391, 455-460, 466-471, 483-486, 496-500
4323-4338: LGTM: Test functions updated to use Result type.Test functions now return
Result<()>(which resolves toResult<(), AvroError>), enabling the use of the?operator for cleaner error propagation in tests. This is consistent with the centralized error handling approach.Also applies to: 4342-4366, 4370-4395, 4399-4427, 4430-4442
118-144: LGTM: Public API updated to use centralized error types with proper error conversion support.The public methods
try_new_with_options,decode, andflushnow returnResult<T>using the crate's error type alias. The error conversions at line 178 use.map_err(Into::into)which is supported by the confirmedFrom<ArrowError> for AvroErrorimplementation atarrow-avro/src/errors.rs:123.arrow-avro/src/writer/encoder.rs (1)
21-21: LGTM! Clean error handling migration.The refactoring from
ArrowErrortoAvroErrorthroughout this encoder module is well-executed and consistent. Key observations:
- All helper functions (
write_long,write_int,write_len_prefixed,write_bool,write_sign_extended) correctly returnResult<()>using the new error type- Error variant choices are appropriate:
InvalidArgumentfor validation/overflow,SchemaErrorfor type mismatches,NYIfor unimplemented features,Generalfor I/O errors- Overflow checks in time conversion functions (lines 1225-1227, 1238-1240) properly use
checked_mulwith descriptive error messages- All encoder constructors follow the
try_newpattern returningResult<Self>- Comprehensive test coverage validates the error type changes (tests explicitly check for
AvroError::InvalidArgumentandAvroError::SchemaError)The migration maintains the existing logic while improving error handling consistency across the crate.
Also applies to: 53-67, 70-72, 75-79, 82-85, 131-184, 191-199, 227-646, 740-836, 850-1082, 1143-1969
| impl From<AvroError> for ArrowError { | ||
| fn from(e: AvroError) -> Self { | ||
| match e { | ||
| AvroError::External(inner) => ArrowError::from_external_error(inner), | ||
| AvroError::ArrowError(inner) => ArrowError::from_external_error(inner), | ||
| other => ArrowError::AvroError(other.to_string()), | ||
| } |
There was a problem hiding this comment.
Preserve original ArrowError variant when converting back
AvroError::ArrowError(inner) currently re-emits as ArrowError::ExternalError, so any downstream code that previously matched on precise ArrowError variants (e.g., SchemaError, ComputeError) will now only see ExternalError, breaking existing error handling. Just return the boxed ArrowError you already own to keep behavior identical to the pre-refactor code.
impl From<AvroError> for ArrowError {
fn from(e: AvroError) -> Self {
match e {
AvroError::External(inner) => ArrowError::from_external_error(inner),
- AvroError::ArrowError(inner) => ArrowError::from_external_error(inner),
+ AvroError::ArrowError(inner) => *inner,
other => ArrowError::AvroError(other.to_string()),
}
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| impl From<AvroError> for ArrowError { | |
| fn from(e: AvroError) -> Self { | |
| match e { | |
| AvroError::External(inner) => ArrowError::from_external_error(inner), | |
| AvroError::ArrowError(inner) => ArrowError::from_external_error(inner), | |
| other => ArrowError::AvroError(other.to_string()), | |
| } | |
| impl From<AvroError> for ArrowError { | |
| fn from(e: AvroError) -> Self { | |
| match e { | |
| AvroError::External(inner) => ArrowError::from_external_error(inner), | |
| AvroError::ArrowError(inner) => *inner, | |
| other => ArrowError::AvroError(other.to_string()), | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
In arrow-avro/src/errors.rs around lines 138 to 144, the
AvroError::ArrowError(inner) arm currently re-wraps the boxed ArrowError as an
ExternalError; change it to return the original boxed ArrowError (unbox/return
the inner ArrowError directly) so existing downstream pattern matches
(SchemaError, ComputeError, etc.) continue to work unchanged.
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct that there is no need to wrap an ArrowError in another ArrowError. Prevents a needless allocation and duplication.
There was a problem hiding this comment.
@martin-augment, thank you for confirming! I'm glad the analysis is helpful. The suggestion to return *inner directly will indeed avoid the unnecessary re-wrapping and preserve the original ArrowError variant for downstream consumers. Feel free to apply the suggested change when you're ready to address it.
|
Findings
|
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct that there is no need to wrap an ArrowError in another ArrowError. Prevents a needless allocation and duplication. |
value:good-to-have; category:bug; feedback: The Codex AI reviewer is correct that there is no need to wrap an ArrowError in another ArrowError. Prevents a needless allocation and duplication. |
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct that some methods wrongly still use ArrowError in their return type instead of the new specialized AvroError type. The finding prevents code inconsistency |
8759: To review by AI
Note
Adds a new
errorsmodule withAvroErrorand refactors reader/writer code to use it end-to-end, converting toArrowErroronly at public boundaries.errors::AvroErrorenum (e.g.,General,NYI,EOF,InvalidArgument,ParseError,SchemaError,External,NeedMoreData*) andResult<T>alias.From<io::Error/Utf8Error/FromUtf8Error/TryFromIntError/ArrowError> for AvroError, andFrom<AvroError> for io::Error/ArrowError.pub mod errors;inlib.rs.reader/*):ArrowErrorwithAvroErrorin internal APIs (HeaderDecoder,BlockDecoder,AvroCursor,RecordDecoder, union handling, projection/skipping, helpers).AvroError::{EOF,NeedMoreData*}; map toArrowErroronly in public returns (Decoder::flush,ReaderBuilder::build*_,Reader::read).writer/*):Result<_, AvroError>; map I/O and validation failures toAvroError.Result<_, AvroError>; public constructors/readers still exposeArrowErrorby converting fromAvroError.Written by Cursor Bugbot for commit b40bb9c. This will update automatically on new commits. Configure here.