bug: build_fallback_field_id_map produces incorrect column indices for schemas with nested types #2306
Description
Describe the bug
build_fallback_field_id_map maps Iceberg field IDs to wrong Parquet leaf column indices when the schema contains nested types (struct, list, map). This causes predicate evaluation to crash on migrated Parquet files (files without embedded field IDs).
Error:
"Leave column id in predicates isn't a root column in Parquet schema"
This affects migrated tables where Parquet files were written by Spark/Hive without Iceberg field IDs, then imported via add_files or importSparkTable().
Root Cause
How fallback field IDs work
When a Parquet file lacks embedded field IDs, iceberg-rust assigns position-based fallback IDs. Two functions must agree on the mapping:
- add_fallback_field_ids_to_arrow_schema: assigns field IDs 1, 2, 3, ... to top-level Arrow schema fields
- build_fallback_field_id_map: maps those field IDs to Parquet leaf column indices for predicate evaluation
What goes wrong
build_fallback_field_id_map iterates over parquet_schema.columns() (leaf columns) instead of top-level fields. Nested types expand into multiple leaves,
causing the mapping to diverge from the Arrow schema's field IDs.
Example: name: string, address: struct(street: string, city: string), id: int
| | Arrow top-level fields | Parquet leaf columns |
|---|---|---|
| Fields | name, address, id | name, street, city, id |
| Assigned field IDs | 1, 2, 3 | 1, 2, 3, 4 (bug) |
When a predicate references id (field_id=3 from Arrow), the column map returns leaf index 2 (city, inside the address group). PredicateConverter::bound_reference then calls get_column_root(2).is_group() → true → error.
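The divergence can be reproduced with a small self-contained model. This is only a sketch: `Node`, `leaf_names`, `buggy_fallback_map`, and `example_schema` are hypothetical stand-ins invented for illustration, not types or functions from iceberg-rust or the parquet crate. It assumes the buggy behaviour is "one fallback ID per leaf column", as described above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a Parquet schema node (not the parquet crate's type).
#[allow(dead_code)]
enum Node {
    Primitive(&'static str),
    Group(&'static str, Vec<Node>),
}

/// Flattens the schema into leaf column names, in Parquet leaf order.
fn leaf_names(fields: &[Node]) -> Vec<&'static str> {
    let mut out = Vec::new();
    for field in fields {
        match field {
            Node::Primitive(name) => out.push(*name),
            Node::Group(_, children) => out.extend(leaf_names(children)),
        }
    }
    out
}

/// Models the buggy mapping: fallback ID N is given to the N-th *leaf* column,
/// even though the Arrow side numbers only top-level fields.
fn buggy_fallback_map(fields: &[Node]) -> HashMap<i32, usize> {
    (0..leaf_names(fields).len())
        .map(|leaf_idx| (leaf_idx as i32 + 1, leaf_idx))
        .collect()
}

/// The example schema: name: string, address: struct(street, city), id: int.
fn example_schema() -> Vec<Node> {
    vec![
        Node::Primitive("name"),
        Node::Group(
            "address",
            vec![Node::Primitive("street"), Node::Primitive("city")],
        ),
        Node::Primitive("id"),
    ]
}
```

In this model, looking up field ID 3 in `buggy_fallback_map(&example_schema())` yields leaf index 2, and `leaf_names` reports that leaf as `city` rather than `id`.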
How Iceberg Java handles this
Java's ParquetSchemaUtil.addFallbackIds() iterates top-level fields, not leaf columns:
    public static MessageType addFallbackIds(MessageType fileSchema) {
      MessageTypeBuilder builder = org.apache.parquet.schema.Types.buildMessage();
      int ordinal = 1;
      for (Type type : fileSchema.getFields()) {
        builder.addField(type.withId(ordinal));
        ordinal += 1;
      }
      return builder.named(fileSchema.getName());
    }

Additionally, Java's ParquetMetricsRowGroupFilter (https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java) gracefully handles nested types: predicates on nested columns return ROWS_MIGHT_MATCH instead of crashing.
Proposed Fix
Change build_fallback_field_id_map to iterate over parquet_schema.root_schema().get_fields() (top-level fields) instead of parquet_schema.columns() (leaf columns).
For each top-level field:
- If primitive: map ordinal → leaf_column_index
- If group (struct/list/map): skip the mapping, advance the leaf counter past all leaves in that group
This makes build_fallback_field_id_map consistent with add_fallback_field_ids_to_arrow_schema, which already correctly iterates top-level Arrow fields.
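The proposed iteration can be sketched against a toy schema model. `Node`, `leaf_count`, and `fallback_field_id_map` are hypothetical names used for illustration only; the real change would walk the top-level fields of the Parquet schema in crates/iceberg/src/arrow/reader.rs, and its exact shape may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a Parquet schema node (not the parquet crate's type).
#[allow(dead_code)]
enum Node {
    Primitive(&'static str),
    Group(&'static str, Vec<Node>),
}

/// Number of leaf columns contained in a node (1 for a primitive).
fn leaf_count(node: &Node) -> usize {
    match node {
        Node::Primitive(_) => 1,
        Node::Group(_, children) => children.iter().map(leaf_count).sum(),
    }
}

/// Sketch of the proposed mapping: fallback IDs follow top-level position,
/// and only primitive top-level fields receive a leaf column index.
fn fallback_field_id_map(top_level: &[Node]) -> HashMap<i32, usize> {
    let mut map = HashMap::new();
    let mut leaf_idx = 0usize;
    for (pos, field) in top_level.iter().enumerate() {
        let field_id = pos as i32 + 1; // ordinal starts at 1, as in the Java code
        match field {
            Node::Primitive(_) => {
                map.insert(field_id, leaf_idx);
                leaf_idx += 1;
            }
            // Groups get no entry; skip past all of their leaves.
            Node::Group(..) => leaf_idx += leaf_count(field),
        }
    }
    map
}
```

For the example schema (name, address with two leaves, id), this sketch maps field ID 1 to leaf 0 and field ID 3 to leaf 3, and leaves field ID 2 unmapped.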
PredicateConverter::bound_reference already validates that the resolved column is a root column and rejects groups, so no changes are needed there.
Files to modify
crates/iceberg/src/arrow/reader.rs — build_fallback_field_id_map
Related