Skip to content

Conversation

@brancz
Copy link
Contributor

@brancz brancz commented Dec 16, 2025

Which issue does this PR close?

Closes #9005

Rationale for this change

Not break something in a patch release.

What changes are included in this PR?

Bring back in-order casting for structs that have equal field numbers.

Are these changes tested?

Yes, the tests that were modified in #8871 were reverted back.

Are there any user-facing changes?

It brings back functionality.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 16, 2025
Ok(casted_fields) => casted_fields,
Err(e) => {
// If it's Field not found, we cast field by field
if !e.to_string().starts_with("Field '")
Copy link
Contributor Author

@brancz brancz Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this, let me know if you have a better idea. The only other idea I had was to build a hashmap of indexes and field names first, but didn't like that because I didn't want to introduce allocations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up with this. But it requires an additional cast 🤔

This type of thing is perhaps why @tustvold suggested initially keeping the struct casting in arrow-rs simple and making something more complicated in datafusion (that can handle this, and more schema evolution things in DataFusion 🤔 )

fn cast_struct(
    array: &StructArray,
    from_fields: &Fields,
    to_fields: &Fields,
    cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
    let fields = cast_struct_inner(array.columns(), from_fields, to_fields, cast_options)?;
    let array = StructArray::try_new(to_fields.clone(), fields, array.nulls().cloned())?;
    Ok(Arc::new(array) as ArrayRef)
}

fn cast_struct_inner(
        columns: &[ArrayRef],
        from_fields: &Fields,
        to_fields: &Fields,
        cast_options: &CastOptions,
    ) -> Result<Vec<ArrayRef>, ArrowError> {

        // Fast path: if field names are in the same order, we can just zip and cast
    let fields_match_order = from_fields.len() == to_fields.len()
        && from_fields
        .iter()
        .zip(to_fields.iter())
        .all(|(f1, f2)| f1.name() == f2.name());

    if fields_match_order {
        // Fast path: cast columns in order
        return columns
            .iter()
            .zip(to_fields.iter())
            .map(|(column, field)| {
                cast_with_options(column, field.data_type(), cast_options)
            })
            .collect();
    }

    // Slow path 1: match fields by name and reorder
    let mut result = Vec::with_capacity(to_fields.len());
    let mut first_missing_field = None;
    for to_field in to_fields {
        let Some(from_field_idx) = from_fields
            .iter()
            .position(|from_field| from_field.name() == to_field.name()) else {
            first_missing_field = Some(to_field);
            break;
        };
        result.push(cast_with_options(&columns[from_field_idx], to_field.data_type(), cast_options)?);
    }

    let Some(first_missing_field) = first_missing_field else {
        return Ok(result);
    };

    // slow path 2: field names don't match, try to cast field by field
    // TODO avoid recasting fields that were already casted
    result.clear();
    for (l, field) in columns.iter().zip(to_fields) {
        let cast_field = cast_with_options(l, field.data_type(), cast_options)
            .map_err(|e| ArrowError::CastError(
                format!("Field '{}' not found in source struct, and failed position casting of '{}': {}",
                    first_missing_field.name(),
                    field.name(),
                        e)))?;
        result.push(cast_field)
    }
    Ok(result)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR also may recast some columns twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think recasting multiple times is the thing we should avoid at all cost. I changed the approach a bit and I think I'm happy enough with this. Now the worst thing that can happen is that we try to match names twice.

@alamb alamb changed the title arrow-cast: Bring back in-order casting arrow-cast: Bring back in-order field casting for StructArray Dec 16, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brancz

Ok(casted_fields) => casted_fields,
Err(e) => {
// If it's Field not found, we cast field by field
if !e.to_string().starts_with("Field '")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up with this. But it requires an additional cast 🤔

This type of thing is perhaps why @tustvold suggested initially keeping the struct casting in arrow-rs simple and making something more complicated in datafusion (that can handle this, and more schema evolution things in DataFusion 🤔 )

fn cast_struct(
    array: &StructArray,
    from_fields: &Fields,
    to_fields: &Fields,
    cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
    let fields = cast_struct_inner(array.columns(), from_fields, to_fields, cast_options)?;
    let array = StructArray::try_new(to_fields.clone(), fields, array.nulls().cloned())?;
    Ok(Arc::new(array) as ArrayRef)
}

fn cast_struct_inner(
        columns: &[ArrayRef],
        from_fields: &Fields,
        to_fields: &Fields,
        cast_options: &CastOptions,
    ) -> Result<Vec<ArrayRef>, ArrowError> {

        // Fast path: if field names are in the same order, we can just zip and cast
    let fields_match_order = from_fields.len() == to_fields.len()
        && from_fields
        .iter()
        .zip(to_fields.iter())
        .all(|(f1, f2)| f1.name() == f2.name());

    if fields_match_order {
        // Fast path: cast columns in order
        return columns
            .iter()
            .zip(to_fields.iter())
            .map(|(column, field)| {
                cast_with_options(column, field.data_type(), cast_options)
            })
            .collect();
    }

    // Slow path 1: match fields by name and reorder
    let mut result = Vec::with_capacity(to_fields.len());
    let mut first_missing_field = None;
    for to_field in to_fields {
        let Some(from_field_idx) = from_fields
            .iter()
            .position(|from_field| from_field.name() == to_field.name()) else {
            first_missing_field = Some(to_field);
            break;
        };
        result.push(cast_with_options(&columns[from_field_idx], to_field.data_type(), cast_options)?);
    }

    let Some(first_missing_field) = first_missing_field else {
        return Ok(result);
    };

    // slow path 2: field names don't match, try to cast field by field
    // TODO avoid recasting fields that were already casted
    result.clear();
    for (l, field) in columns.iter().zip(to_fields) {
        let cast_field = cast_with_options(l, field.data_type(), cast_options)
            .map_err(|e| ArrowError::CastError(
                format!("Field '{}' not found in source struct, and failed position casting of '{}': {}",
                    first_missing_field.name(),
                    field.name(),
                        e)))?;
        result.push(cast_field)
    }
    Ok(result)
}

@brancz brancz force-pushed the field-in-order-cast branch from d2a8ae2 to 1a0ef18 Compare December 17, 2025 09:43
@brancz brancz force-pushed the field-in-order-cast branch from 1a0ef18 to 9993672 Compare December 17, 2025 09:48
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @brancz -- this looks good to me


let fields = if fields_match_order {
// Fast path: cast columns in order
// Fast path: cast columns in order if their names match
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor suggestion would be to pull this code into its own function to make it easier to read (as this giant match statement is already pretty big)

Copy link
Contributor Author

@brancz brancz Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! That also showed the possibility of deduplicating the "cast in order" code that's used in two of the branches.

@alamb
Copy link
Contributor

alamb commented Dec 18, 2025

I re-tested this change against the DataFusion test cases in apache/datafusion#19355 and I think it is working well. Thank you @brancz

@alamb alamb merged commit 15b18c1 into apache:main Dec 18, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression in struct casting in 57.2.0 (not yet released)

2 participants