-
Notifications
You must be signed in to change notification settings - Fork 1.1k
arrow-cast: Bring back in-order field casting for StructArray
#9007
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
arrow-cast/src/cast/mod.rs
Outdated
| Ok(casted_fields) => casted_fields, | ||
| Err(e) => { | ||
| // If it's Field not found, we cast field by field | ||
| if !e.to_string().starts_with("Field '") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this, let me know if you have a better idea. The only other idea I had was to build a hashmap of indexes and field names first, but didn't like that because I didn't want to introduce allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up with this. But it requires an additional cast 🤔
This type of thing is perhaps why @tustvold suggested initially keeping the struct casting in arrow-rs simple and making something more complicated in datafusion (that can handle this, and more schema evolution things in DataFusion 🤔 )
fn cast_struct(
array: &StructArray,
from_fields: &Fields,
to_fields: &Fields,
cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
let fields = cast_struct_inner(array.columns(), from_fields, to_fields, cast_options)?;
let array = StructArray::try_new(to_fields.clone(), fields, array.nulls().cloned())?;
Ok(Arc::new(array) as ArrayRef)
}
fn cast_struct_inner(
columns: &[ArrayRef],
from_fields: &Fields,
to_fields: &Fields,
cast_options: &CastOptions,
) -> Result<Vec<ArrayRef>, ArrowError> {
// Fast path: if field names are in the same order, we can just zip and cast
let fields_match_order = from_fields.len() == to_fields.len()
&& from_fields
.iter()
.zip(to_fields.iter())
.all(|(f1, f2)| f1.name() == f2.name());
if fields_match_order {
// Fast path: cast columns in order
return columns
.iter()
.zip(to_fields.iter())
.map(|(column, field)| {
cast_with_options(column, field.data_type(), cast_options)
})
.collect();
}
// Slow path 1: match fields by name and reorder
let mut result = Vec::with_capacity(to_fields.len());
let mut first_missing_field = None;
for to_field in to_fields {
let Some(from_field_idx) = from_fields
.iter()
.position(|from_field| from_field.name() == to_field.name()) else {
first_missing_field = Some(to_field);
break;
};
result.push(cast_with_options(&columns[from_field_idx], to_field.data_type(), cast_options)?);
}
let Some(first_missing_field) = first_missing_field else {
return Ok(result);
};
// slow path 2: field names don't match, try to cast field by field
// TODO avoid recasting fields that were already casted
result.clear();
for (l, field) in columns.iter().zip(to_fields) {
let cast_field = cast_with_options(l, field.data_type(), cast_options)
.map_err(|e| ArrowError::CastError(
format!("Field '{}' not found in source struct, and failed position casting of '{}': {}",
first_missing_field.name(),
field.name(),
e)))?;
result.push(cast_field)
}
Ok(result)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR also may recast some columns twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think recasting multiple times is the thing we should avoid at all cost. I changed the approach a bit and I think I'm happy enough with this. Now the worst thing that can happen is that we try to match names twice.
StructArray
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @brancz
arrow-cast/src/cast/mod.rs
Outdated
| Ok(casted_fields) => casted_fields, | ||
| Err(e) => { | ||
| // If it's Field not found, we cast field by field | ||
| if !e.to_string().starts_with("Field '") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up with this. But it requires an additional cast 🤔
This type of thing is perhaps why @tustvold suggested initially keeping the struct casting in arrow-rs simple and making something more complicated in datafusion (that can handle this, and more schema evolution things in DataFusion 🤔 )
fn cast_struct(
array: &StructArray,
from_fields: &Fields,
to_fields: &Fields,
cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
let fields = cast_struct_inner(array.columns(), from_fields, to_fields, cast_options)?;
let array = StructArray::try_new(to_fields.clone(), fields, array.nulls().cloned())?;
Ok(Arc::new(array) as ArrayRef)
}
fn cast_struct_inner(
columns: &[ArrayRef],
from_fields: &Fields,
to_fields: &Fields,
cast_options: &CastOptions,
) -> Result<Vec<ArrayRef>, ArrowError> {
// Fast path: if field names are in the same order, we can just zip and cast
let fields_match_order = from_fields.len() == to_fields.len()
&& from_fields
.iter()
.zip(to_fields.iter())
.all(|(f1, f2)| f1.name() == f2.name());
if fields_match_order {
// Fast path: cast columns in order
return columns
.iter()
.zip(to_fields.iter())
.map(|(column, field)| {
cast_with_options(column, field.data_type(), cast_options)
})
.collect();
}
// Slow path 1: match fields by name and reorder
let mut result = Vec::with_capacity(to_fields.len());
let mut first_missing_field = None;
for to_field in to_fields {
let Some(from_field_idx) = from_fields
.iter()
.position(|from_field| from_field.name() == to_field.name()) else {
first_missing_field = Some(to_field);
break;
};
result.push(cast_with_options(&columns[from_field_idx], to_field.data_type(), cast_options)?);
}
let Some(first_missing_field) = first_missing_field else {
return Ok(result);
};
// slow path 2: field names don't match, try to cast field by field
// TODO avoid recasting fields that were already casted
result.clear();
for (l, field) in columns.iter().zip(to_fields) {
let cast_field = cast_with_options(l, field.data_type(), cast_options)
.map_err(|e| ArrowError::CastError(
format!("Field '{}' not found in source struct, and failed position casting of '{}': {}",
first_missing_field.name(),
field.name(),
e)))?;
result.push(cast_field)
}
Ok(result)
}d2a8ae2 to
1a0ef18
Compare
1a0ef18 to
9993672
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @brancz -- this looks good to me
arrow-cast/src/cast/mod.rs
Outdated
|
|
||
| let fields = if fields_match_order { | ||
| // Fast path: cast columns in order | ||
| // Fast path: cast columns in order if their names match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor suggestion would be to pull this code into its own function to make it easier to read (as this giant match statement is already pretty big)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! That also showed the possibility of deduplicating the "cast in order" code that's used in two of the branches.
|
I re-tested this change against the DataFusion test cases in apache/datafusion#19355 and I think it is working well. Thank you @brancz |
Which issue does this PR close?
Closes #9005
Rationale for this change
Not break something in a patch release.
What changes are included in this PR?
Bring back in-order casting for structs that have equal field numbers.
Are these changes tested?
Yes, the tests that were modified in #8871 were reverted back.
Are there any user-facing changes?
It brings back functionality.