bug: incorrect Parquet INT96 values from ArrowReader #2299
Description
Describe the bug
iceberg-rust reads INT96 timestamps incorrectly, resulting in ~1170 year offset for dates outside the nanosecond i64 range (~1677-2262).
Example:
- Correct (Iceberg Java): 3332-12-14 11:33:10.965
- iceberg-rust: 2163-11-05 13:24:03.545896

This affects migrated tables whose Parquet files were written with INT96 timestamps (common for Spark/Hive migrations via `add_files` or `importSparkTable`).
Root Cause
INT96 in Parquet
INT96 is 12 bytes: 8 bytes of nanoseconds-within-day + 4 bytes of Julian day number.
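As a self-contained illustration of that layout (the helper function below is mine, not an arrow-rs or iceberg-rust API), the 12 bytes split like this:

```rust
/// Splits a raw Parquet INT96 value into its two components
/// (illustrative helper, not a real arrow-rs/iceberg-rust API).
fn decode_int96(raw: [u8; 12]) -> (u64, u32) {
    // Bytes 0..8: nanoseconds elapsed within the day, little-endian.
    let time_of_day_nanos = u64::from_le_bytes(raw[0..8].try_into().unwrap());
    // Bytes 8..12: Julian day number, little-endian.
    let julian_day = u32::from_le_bytes(raw[8..12].try_into().unwrap());
    (time_of_day_nanos, julian_day)
}

fn main() {
    // Midnight at the Unix epoch (1970-01-01) is Julian day 2_440_588.
    let mut raw = [0u8; 12];
    raw[8..12].copy_from_slice(&2_440_588u32.to_le_bytes());
    assert_eq!(decode_int96(raw), (0, 2_440_588));
}
```

Note that the day count and the intra-day nanoseconds are separate fields; the overflow only appears when a reader collapses them into one nanoseconds-since-epoch i64.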
What happens today
- arrow-rs defaults INT96 to `Timestamp(Nanosecond, None)` (parquet/src/arrow/schema/primitive.rs:122). For dates outside ~1677-2262, nanoseconds-since-epoch overflows i64, producing garbage values.
- iceberg-rust's `RecordBatchTransformer` later casts to `Timestamp(Microsecond)` to match the Iceberg schema, but by then the data has already been corrupted by the overflow.
- arrow-rs PR #7285 added support for reading INT96 as other TimeUnits: if you pass `Timestamp(Microsecond)` via `ArrowReaderOptions::with_schema()`, arrow-rs converts correctly without overflow.
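The overflow is easy to demonstrate with plain i64 arithmetic (the day count below is an assumed illustrative value near year 3332, not taken from the affected file):

```rust
fn main() {
    // Roughly 1362 years' worth of days past 1970-01-01, i.e. around year 3332.
    // (Assumed illustrative value, not derived from the bug report's file.)
    let days: i64 = 497_700;

    // Nanoseconds-since-epoch: overflows i64 for anything past ~2262-04-11.
    assert_eq!(days.checked_mul(86_400_000_000_000), None);

    // Microseconds-since-epoch: fits with enormous headroom
    // (i64 micros covers roughly +/- 292,000 years).
    assert_eq!(days.checked_mul(86_400_000_000), Some(43_001_280_000_000_000));
}
```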
Why iceberg-rust doesn't pass the right schema hint
In reader.rs, the schema is only overridden via `ArrowReaderOptions::with_schema()` when Parquet files lack field IDs (branches 2/3 of the schema resolution strategy). Even then, the overridden schema is derived from the Parquet file metadata, which has `Timestamp(Nanosecond)` for INT96 columns, not from the Iceberg table schema, which correctly specifies `Timestamp(Microsecond)`.
For files with embedded field IDs (branch 1), no schema override is passed at all.
How Iceberg Java handles this
Iceberg Java avoids this entirely by using a custom INT96 column reader that bypasses parquet-mr's default decoding. The reader factory receives the Iceberg expected schema as the authority via readerFuncWithSchema.apply(expectedSchema, fileType) (Parquet.java:1366-1371).
When BaseParquetReaders.primitive() encounters INT96, it dispatches to a TimestampInt96Reader that reads the raw 12 bytes and converts safely:
```java
// GenericParquetReaders.java:172-191
final ByteBuffer byteBuffer =
    column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
final long timeOfDayNanos = byteBuffer.getLong();
final int julianDay = byteBuffer.getInt();
return Instant.ofEpochMilli(TimeUnit.DAYS.toMillis(julianDay - UNIX_EPOCH_JULIAN))
    .plusNanos(timeOfDayNanos)
    .atOffset(ZoneOffset.UTC);
```

This avoids overflow by keeping days and nanos separate: it never tries to cram the full value into a single i64 of nanoseconds-since-epoch.
iceberg-rust can't easily replicate this custom column reader approach since it delegates to arrow-rs for Parquet reading. The equivalent fix is to pass the correct schema hint so arrow-rs decodes INT96 as microseconds.
Proposed Fix
When building the Arrow schema to pass to `ArrowReaderOptions::with_schema()`, overlay the Iceberg table schema's timestamp types onto the Parquet-derived schema. For any column where:
- the Parquet physical type is INT96, and
- the Iceberg type is Timestamp or Timestamptz,

replace `Timestamp(Nanosecond, ...)` with `Timestamp(Microsecond, ...)` in the schema hint. This triggers arrow-rs's INT96 conversion logic from PR #7285.
This is the same approach DataFusion uses via its coerce_int96_to_resolution() function (datafusion PR #15537), except the source of truth for the target TimeUnit is the Iceberg schema rather than a user config.
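The overlay step can be modeled with simplified stand-ins for Arrow's DataType and the Parquet physical type (the enums and function below are illustrative, not the real arrow-rs or iceberg-rust APIs):

```rust
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit { Nanosecond, Microsecond }

#[derive(Debug, Clone, PartialEq)]
enum DataType { Timestamp(TimeUnit, Option<String>), Other }

#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType { Int96, Other }

/// For each column whose physical type is INT96 and whose Arrow type came out
/// as Timestamp(Nanosecond, _), rewrite the schema hint to microseconds so the
/// reader converts during decoding instead of overflowing.
fn overlay_int96_hint(columns: &mut [(PhysicalType, DataType)]) {
    for (physical, dt) in columns.iter_mut() {
        if *physical == PhysicalType::Int96 {
            if let DataType::Timestamp(TimeUnit::Nanosecond, tz) = dt.clone() {
                *dt = DataType::Timestamp(TimeUnit::Microsecond, tz);
            }
        }
    }
}

fn main() {
    let mut cols = vec![
        (PhysicalType::Int96, DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into()))),
        (PhysicalType::Other, DataType::Other),
    ];
    overlay_int96_hint(&mut cols);
    // The INT96 column's hint is now microseconds; other columns are untouched.
    assert_eq!(cols[0].1, DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())));
    assert_eq!(cols[1].1, DataType::Other);
}
```

The real implementation would walk the Arrow `Schema` built from Parquet metadata alongside the Iceberg schema, but the matching logic is the same shape as above.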
Files to modify
- crates/iceberg/src/arrow/reader.rs: after building the Arrow schema from Parquet metadata, walk the INT96 timestamp columns and replace their types with the Iceberg schema's timestamp type.
- This applies to all three branches of the schema resolution strategy (with/without field IDs, with/without name mapping).
Related
- arrow-rs #7285: Support different TimeUnits and timezones when reading Timestamps from INT96
- datafusion #15537: INT96 handling in DataFusion
- datafusion-comet #3856: Downstream issue in Comet