fix(parquet): converting parquet schema with backward compatible repeated struct/primitive with provided arrow schema#12
…ated struct/primitive with provided arrow schema closes: - apache#8495
…primitive-with-inferred-schema # Conflicts: # parquet/src/arrow/schema/complex.rs
…primitive-with-inferred-schema
Walkthrough
Enhanced Parquet-to-Arrow schema conversion with hint-driven logic for repeated types and lists. Introduced a context flag to track list handling behavior, added a new helper function for list conversion, revised the …
Actionable comments posted: 0
🧹 Nitpick comments (4)
parquet/src/arrow/schema/complex.rs (4)
75-121: New into_list_with_arrow_list_hint: solid, matches semantics; a couple of nits
- Logic correctly copies element metadata (when hinted), keeps element non-nullable, and adds field-id only to the outer list via add_field_id=false. LGTM.
- Nits:
- Consider using the crate Result alias for consistency: Result<Self> instead of Result<Self, ParquetError>.
- The list-child extraction pattern appears multiple times across the file; a small helper (e.g., fn list_child(hint: &DataType) -> Option<&Field>) would DRY things up.
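A minimal sketch of the suggested list_child helper follows. It is std-only so that it compiles on its own: HintType and HintField are hypothetical, simplified stand-ins for arrow_schema's DataType and Field, not the real types, and the variant set is deliberately incomplete.

```rust
// Hypothetical stand-ins for arrow_schema::DataType and arrow_schema::Field.
// Only the variants needed for the illustration are modelled.
#[derive(Debug, Clone, PartialEq)]
pub enum HintType {
    Int32,
    Utf8,
    List(Box<HintField>),
    LargeList(Box<HintField>),
    FixedSizeList(Box<HintField>, i32),
}

#[derive(Debug, Clone, PartialEq)]
pub struct HintField {
    pub name: String,
    pub data_type: HintType,
    pub nullable: bool,
}

/// Returns the element field of any list-like hint, or None for every
/// other type, so callers do not repeat the three-way match themselves.
pub fn list_child(hint: &HintType) -> Option<&HintField> {
    match hint {
        HintType::List(child)
        | HintType::LargeList(child)
        | HintType::FixedSizeList(child, _) => Some(&**child),
        _ => None,
    }
}
```

With a helper of this shape, each call site reduces to a single list_child(hint) call followed by its own error handling for the None case.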
313-319: Propagate the repeated-as-list flag instead of forcing true for struct children
Hard-coding treat_repeated_as_list_arrow_hint: true for all struct children may change behavior when convert_type() (which starts with false) traverses nested structs. Prefer propagating the parent flag to avoid surprising conversions.
Apply this diff:
-            let child_ctx = VisitorContext {
+            let child_ctx = VisitorContext {
                 rep_level,
                 def_level,
                 data_type,
-                treat_repeated_as_list_arrow_hint: true,
+                treat_repeated_as_list_arrow_hint: context.treat_repeated_as_list_arrow_hint,
             };
If you intended to always enable list-hint unwrapping under structs, please add a brief comment explaining why, and consider a targeted test for convert_type() on nested repeated fields without hints.
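The difference between forcing the flag and inheriting it can be sketched in isolation. SketchContext below is a hypothetical, trimmed-down stand-in for the visitor context (the data_type field is omitted), and for_struct_child is an illustrative name, not a function in the crate.

```rust
/// Hypothetical, reduced version of the visitor context.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SketchContext {
    pub rep_level: i16,
    pub def_level: i16,
    pub treat_repeated_as_list_arrow_hint: bool,
}

impl SketchContext {
    /// Builds the context for a struct child. The flag is copied from the
    /// parent instead of being set to true unconditionally, so a traversal
    /// that started with the flag off keeps it off in nested structs.
    pub fn for_struct_child(&self, rep_level: i16, def_level: i16) -> Self {
        SketchContext {
            rep_level,
            def_level,
            treat_repeated_as_list_arrow_hint: self.treat_repeated_as_list_arrow_hint,
        }
    }
}
```

Under this scheme only the entry points decide whether list-hint unwrapping is active; intermediate struct levels never flip it.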
656-694: convert_field: consider applying extension metadata even when a hint is present
In the Some(hint) branch you preserve dict metadata and copy hint metadata, but you don’t run try_add_extension_type. If the hint lacks extension metadata derivable from parquet_type (e.g., logical/extension types), you may miss it.
Suggestion: call try_add_extension_type after merging hint metadata, so parquet-derived extension metadata is still applied unless the hint already specifies it.
Apply this diff:
-        Some(hint) => {
+        Some(hint) => {
             // If the inferred type is a dictionary, preserve dictionary metadata
             #[allow(deprecated)]
-            let field = match (&data_type, hint.dict_id(), hint.dict_is_ordered()) {
+            let field = match (&data_type, hint.dict_id(), hint.dict_is_ordered()) {
                 (DataType::Dictionary(_, _), Some(id), Some(ordered)) => {
                     #[allow(deprecated)]
                     Field::new_dict(name, data_type, nullable, id, ordered)
                 }
                 _ => Field::new(name, data_type, nullable),
             };
-
-            Ok(field.with_metadata(hint.metadata().clone()))
+            // Merge hint metadata first, then attempt to add extension metadata from parquet_type
+            let merged = field.with_metadata(hint.metadata().clone());
+            try_add_extension_type(merged, parquet_type)
         }
If precedence should always favor the embedded Arrow schema, document that decision and add a test asserting no extension metadata is added when a hint is present.
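The precedence described above (hint metadata wins, Parquet-derived metadata only fills gaps) can be illustrated with plain maps. This is a sketch of the intended merge rule only: merge_metadata is a hypothetical helper, and whether try_add_extension_type actually leaves existing keys untouched is an assumption that would need checking against the crate.

```rust
use std::collections::HashMap;

/// Merges Parquet-derived metadata into the hint's metadata.
/// Keys already present in the hint are kept as-is; derived entries are
/// added only for keys the hint does not set.
pub fn merge_metadata(
    hint: &HashMap<String, String>,
    derived: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = hint.clone();
    for (key, value) in derived {
        // entry().or_insert_with() inserts only when the key is absent.
        merged.entry(key.clone()).or_insert_with(|| value.clone());
    }
    merged
}
```

If the opposite precedence is wanted (the hint is authoritative and nothing is derived), the loop is simply dropped, which is the decision the review asks to document and test.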
739-1798: Tests: great coverage; consider a couple of extras
- Coverage is strong for back-compat lists/maps, nested repeated, field-id placement, and list type (List/LargeList/FixedSizeList) inference.
- Please add:
- A negative test asserting the specific error when a repeated field receives a non-list hint.
- A test for extension metadata behavior when a hint is present vs absent (to lock in the desired precedence).
If you want, I can sketch those tests quickly.
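As a rough illustration of the first requested test, the sketch below models the negative case with std-only stand-ins. ArrowHint, element_hint_for_repeated, and the error wording are all hypothetical; a real test would call the crate's conversion entry point and match on the actual ParquetError it returns.

```rust
/// Hypothetical, simplified Arrow type hint.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrowHint {
    Int32,
    Utf8,
    List(Box<ArrowHint>),
}

/// For a REPEATED Parquet field the hint must be list-like. Returns the
/// element hint, or a descriptive error when the hint is not a list.
pub fn element_hint_for_repeated(hint: &ArrowHint) -> Result<&ArrowHint, String> {
    match hint {
        ArrowHint::List(element) => Ok(&**element),
        other => Err(format!(
            "incompatible arrow schema: expected list for repeated field, got {:?}",
            other
        )),
    }
}

/// Negative check: a non-list hint must be rejected with a specific message.
/// In a real crate this function would carry the #[test] attribute.
pub fn non_list_hint_is_rejected() {
    let err = element_hint_for_repeated(&ArrowHint::Utf8).unwrap_err();
    assert!(err.contains("expected list"), "unexpected error: {}", err);
}
```

Asserting on a fragment of the message, rather than only on is_err(), is what locks in the specific error the review asks for.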
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
parquet/src/arrow/schema/complex.rs(16 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
parquet/src/arrow/schema/complex.rs (2)
arrow-schema/src/field.rs (4)
new(192-202)
metadata(373-375)
metadata(963-967)
with_metadata(366-369)
parquet/src/arrow/schema/primitive.rs (1)
convert_primitive(27-36)
🔇 Additional comments (14)
parquet/src/arrow/schema/complex.rs (14)
152-158: Context flag docs are clear
The flag and its docs make intent explicit. No code change needed.
201-219: Deriving primitive arrow type from list hints is correct
For REPEATED primitives with a list hint, unwrapping the inner field type before convert_primitive is the right call and surfaces helpful errors on mismatches.
Please confirm apply_hint in convert_primitive tolerates all element-level coercions you expect here (e.g., BYTE_ARRAY→Utf8, Decimal, etc.).
233-236: Choosing into_list_with_arrow_list_hint only when hints apply
Branching to the new list helper only when the flag is set avoids behavior changes for callers that don’t use hints.
253-271: Struct element hint unwrapping looks right
Unwrapping the list hint to a Struct inner type and validating arity catches common schema mismatches early.
321-325: Preserving field IDs at struct children
Passing add_field_id=true here ensures parquet field ids land on Arrow fields at the correct (outer) level. Matches tests.
342-345: Struct→list conversion with hints mirrors primitive path
Consistent with primitive handling; inner element gets no field-id, outer list does.
439-453: Map context propagation: key=false, value=true
Correct: keys can’t be repeated; values may contain repeated structures requiring list-hint handling.
460-467: Field-id placement for map key/value
- Key: explicitly non-nullable and add_field_id=true.
- Value: add_field_id=true.
Both align with spec and your tests.
549-550: List primitive branch: disabling list-hint unwrapping is correct
Inside an explicit LIST, we shouldn’t unwrap a second list level unless the child itself is repeated; this respects the spec.
582-582: List one-tuple/compat branch: also correct to disable unwrapping
Same rationale as above.
602-603: Enable list-hint unwrapping for nested list items
Turning it back on for the item traversal allows nested lists to use LargeList/FixedSizeList hints.
606-608: add_field_id=true for list item field
Ensures field-id is attached to the array field (outer list) not the element; aligns with tests asserting only the list carries the id.
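The field-id placement rule can be sketched with a small stand-in type. SketchField and list_field are hypothetical and std-only; the metadata key "PARQUET:field_id" is the one used for Parquet field ids in Arrow field metadata.

```rust
use std::collections::HashMap;

/// Metadata key under which a Parquet field id is stored on an Arrow field.
pub const FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Hypothetical stand-in for an Arrow field: a name, an optional list
/// element, and string metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchField {
    pub name: String,
    pub element: Option<Box<SketchField>>,
    pub metadata: HashMap<String, String>,
}

/// Builds a list field whose outer (array) field carries the field id,
/// while the inner element field is created without any id metadata.
pub fn list_field(name: &str, element_name: &str, field_id: Option<i32>) -> SketchField {
    let element = SketchField {
        name: element_name.to_string(),
        element: None,
        metadata: HashMap::new(), // element: no field id
    };
    let mut metadata = HashMap::new();
    if let Some(id) = field_id {
        metadata.insert(FIELD_ID_KEY.to_string(), id.to_string());
    }
    SketchField {
        name: name.to_string(),
        element: Some(Box::new(element)),
        metadata,
    }
}
```

A test in the spirit of the ones mentioned above would then check that the id appears on the list field and that the element's metadata stays empty.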
711-716: Top-level: enable list-hint unwrapping
Setting the flag to true at the root matches the PR goal of honoring embedded Arrow list forms.
728-734: convert_type: disable list-hint unwrapping
Correct: when no embedded Arrow schema is provided, don’t unwrap by default.
cursor review
Pull Request Review: Backward Compatible Parquet Schema Conversion
Summary
This PR adds support for converting backward-compatible repeated Parquet fields to Arrow lists using embedded Arrow schema hints. The implementation handles repeated primitives, structs, and nested structures while properly managing field ID metadata propagation.
Code Quality & Best Practices ✅
Strengths:
Suggestions:
Potential Bugs & Issues
8496: To review by AI
Note
Adds Arrow-hinted handling for repeated Parquet fields as lists (incl. nested), updates field-id propagation to list containers, and introduces extensive list/map compatibility tests.
- … treat_repeated_as_list_arrow_hint.
- ParquetField::into_list_with_arrow_list_hint to build List/LargeList/FixedSizeList from hints.
- … (visit_primitive, visit_struct, visit_list, visit_map) to unwrap list hints, validate types, and construct appropriate Arrow types.
- convert_field now takes &ParquetField and supports add_field_id to control metadata propagation.
Written by Cursor Bugbot for commit 4d7485a.