bug: build_fallback_field_id_map produces incorrect column indices for schemas with nested types #2306

@mbutrovich

Describe the bug

build_fallback_field_id_map maps Iceberg field IDs to wrong Parquet leaf column indices when the schema contains nested types (struct, list, map). This causes predicate evaluation to crash on migrated Parquet files (files without embedded field IDs).

Error:
"Leave column id in predicates isn't a root column in Parquet schema"

This affects migrated tables where Parquet files were written by Spark/Hive without Iceberg field IDs, then imported via add_files or importSparkTable().

Root Cause

How fallback field IDs work

When a Parquet file lacks embedded field IDs, iceberg-rust assigns position-based fallback IDs. Two functions must agree on the mapping:

  1. add_fallback_field_ids_to_arrow_schema — assigns field IDs 1, 2, 3... to top-level Arrow schema fields
  2. build_fallback_field_id_map — maps those field IDs to Parquet leaf column indices for predicate evaluation

What goes wrong

build_fallback_field_id_map iterates over parquet_schema.columns() (leaf columns) instead of top-level fields. Nested types expand into multiple leaves,
causing the mapping to diverge from the Arrow schema's field IDs.

Example: name: string, address: struct(street: string, city: string), id: int

|                    | Arrow top-level fields | Parquet leaf columns   |
|--------------------|------------------------|------------------------|
| Fields             | name, address, id      | name, street, city, id |
| Assigned field IDs | 1, 2, 3                | 1, 2, 3, 4 (bug)       |

When a predicate references id (field_id=3 from Arrow), the column map returns leaf index 2 (city, inside the address group). PredicateConverter::bound_reference then calls get_column_root(2).is_group(), which returns true because the root of that leaf is the address group, and the error above is raised.
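The mismatch can be reproduced with a small self-contained model. This is only a sketch: `Node`, `leaf_columns`, `top_level_field_ids`, `leaf_based_field_id_map` and `example_schema` are hypothetical names invented for illustration, and the toy schema type stands in for the real Parquet and Arrow types rather than using the `parquet` crate.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Parquet schema node (hypothetical, not the `parquet` crate's types).
enum Node {
    Leaf(&'static str),
    Group(&'static str, Vec<Node>),
}

impl Node {
    fn name(&self) -> &'static str {
        match self {
            Node::Leaf(name) | Node::Group(name, _) => *name,
        }
    }
}

/// Leaf columns in file order, each paired with the position of its top-level ancestor.
fn leaf_columns(top_level: &[Node]) -> Vec<(&'static str, usize)> {
    fn walk(node: &Node, root: usize, out: &mut Vec<(&'static str, usize)>) {
        match node {
            Node::Leaf(name) => out.push((*name, root)),
            Node::Group(_, children) => {
                for child in children {
                    walk(child, root, out);
                }
            }
        }
    }
    let mut out = Vec::new();
    for (root, node) in top_level.iter().enumerate() {
        walk(node, root, &mut out);
    }
    out
}

/// Arrow side: fallback IDs 1, 2, 3... assigned to top-level fields, keyed by field name.
fn top_level_field_ids(top_level: &[Node]) -> HashMap<&'static str, i32> {
    top_level
        .iter()
        .enumerate()
        .map(|(pos, node)| (node.name(), pos as i32 + 1))
        .collect()
}

/// Models the behaviour described in this issue: one fallback ID per leaf column.
fn leaf_based_field_id_map(top_level: &[Node]) -> HashMap<i32, usize> {
    (0..leaf_columns(top_level).len())
        .map(|leaf_idx| (leaf_idx as i32 + 1, leaf_idx))
        .collect()
}

/// The example schema: name: string, address: struct(street, city), id: int.
fn example_schema() -> Vec<Node> {
    vec![
        Node::Leaf("name"),
        Node::Group("address", vec![Node::Leaf("street"), Node::Leaf("city")]),
        Node::Leaf("id"),
    ]
}
```

In this model, looking up `id` in `top_level_field_ids` yields field ID 3, and `leaf_based_field_id_map` resolves 3 to leaf index 2, which is `city` under the `address` group rather than the `id` column.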

How Iceberg Java handles this

Java's ParquetSchemaUtil.addFallbackIds() iterates top-level fields, not leaf columns:

public static MessageType addFallbackIds(MessageType fileSchema) {
    MessageTypeBuilder builder = org.apache.parquet.schema.Types.buildMessage();
    int ordinal = 1;
    for (Type type : fileSchema.getFields()) {
        builder.addField(type.withId(ordinal));
        ordinal += 1;
    }
    return builder.named(fileSchema.getName());
}

Additionally, Java's ParquetMetricsRowGroupFilter (https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java) handles nested types gracefully: predicates on nested columns return ROWS_MIGHT_MATCH instead of crashing.

Proposed Fix

Change build_fallback_field_id_map to iterate over parquet_schema.root_schema().get_fields() (top-level fields) instead of parquet_schema.columns() (leaf columns).
For each top-level field:

  • If primitive: map ordinal → leaf_column_index
  • If group (struct/list/map): skip the mapping, advance the leaf counter past all leaves in that group

This makes build_fallback_field_id_map consistent with add_fallback_field_ids_to_arrow_schema, which already correctly iterates top-level Arrow fields.

PredicateConverter::bound_reference already validates that the resolved column is a root column and rejects groups, so no changes are needed there.
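The proposed iteration can be sketched on a toy schema model. This is a minimal illustration, not the actual patch: `Node`, `leaf_count` and `fallback_field_id_map` are hypothetical names, and the real change would walk the Parquet schema's top-level fields instead of this stand-in enum.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Parquet schema node (hypothetical, not the `parquet` crate's types).
enum Node {
    Leaf,
    Group(Vec<Node>),
}

/// Number of leaf columns underneath a node (1 for a primitive).
fn leaf_count(node: &Node) -> usize {
    match node {
        Node::Leaf => 1,
        Node::Group(children) => children.iter().map(leaf_count).sum(),
    }
}

/// Maps fallback field IDs (1-based top-level position) to leaf column indices.
/// Groups get no entry, but the leaf counter still advances past their leaves.
fn fallback_field_id_map(top_level: &[Node]) -> HashMap<i32, usize> {
    let mut map = HashMap::new();
    let mut leaf_idx = 0usize;
    for (pos, node) in top_level.iter().enumerate() {
        let field_id = pos as i32 + 1;
        match node {
            Node::Leaf => {
                map.insert(field_id, leaf_idx);
                leaf_idx += 1;
            }
            Node::Group(_) => leaf_idx += leaf_count(node),
        }
    }
    map
}
```

For the example schema (name, address: struct(street, city), id), this sketch maps field ID 1 to leaf 0 and field ID 3 to leaf 3, and leaves field ID 2 (the address struct) without an entry. How a missing entry should be treated by the caller is outside the scope of the sketch.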

Files to modify

  1. crates/iceberg/src/arrow/reader.rs — build_fallback_field_id_map
