
Conversation

@feniljain
Contributor

@feniljain feniljain commented Dec 19, 2025

Which issue does this PR close?

Rationale for this change

Mentioned in issue.

What changes are included in this PR?

The RowConverter API seems to have changed since the initial implementation: the columns must exactly match the schema passed in, so when we were handling the case of l_values or r_values being None, passing empty columns caused an assert to fail inside RowConverter.
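
To illustrate the constraint (a minimal standalone sketch, not the PR's code), arrow's RowConverter expects the slice passed to convert_columns to line up one-to-one with the SortFields it was built with:

use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), arrow::error::ArrowError> {
    // The converter is built for exactly one field...
    let converter = RowConverter::new(vec![SortField::new(DataType::Int32)])?;

    // ...so convert_columns must receive exactly one column.
    let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let _rows = converter.convert_columns(&[col])?;

    // Passing an empty slice (what happened when l_values/r_values was None)
    // trips the converter's column-count check instead:
    // let _ = converter.convert_columns(&[]);

    Ok(())
}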

Are these changes tested?

cargo test --test sqllogictests -- array passes

For the query mentioned in the original issue, the current result is:

> select array_intersect(column1, column2) from array_intersect_table;
+------------------------------------------------------------------------------+
| array_intersect(array_intersect_table.column1,array_intersect_table.column2) |
+------------------------------------------------------------------------------+
| [2, 3]                                                                       |
| [3]                                                                          |
| [3]                                                                          |
| []                                                                           |
| []                                                                           |
| []                                                                           |
+------------------------------------------------------------------------------+
6 row(s) fetched. 
Elapsed 0.012 seconds.

Are there any user-facing changes?

No

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 19, 2025
@feniljain feniljain changed the title from "wip(arrow_intersect, arrow_union): row converter panic" to "wip: NULL handling in arrow_intersect and arrow_union" Dec 19, 2025
@feniljain feniljain changed the title from "wip: NULL handling in arrow_intersect and arrow_union" to "fix: NULL handling in arrow_intersect and arrow_union" Dec 19, 2025
@feniljain feniljain marked this pull request as ready for review December 19, 2025 18:43
@feniljain
Contributor Author

Hey @Jefffrey 👋🏻

I think the PR should be ready for review. Since you had left the message in the issue, could you please review it as well?

Contributor

@Jefffrey Jefffrey left a comment


Thanks for looking into this, the fix makes sense.

I do wonder about the null semantics at play here though; it seems even if both arrays are null we still return an empty list. Interestingly, DuckDB is a bit inconsistent:

D CREATE TABLE array_intersect_table_1D_NULL
  AS VALUES
    ([1, 2, 2, 3], [2, 3, 4]),
    ([2, 3, 3], [3]),
    ([3], [3, 3, 4]),
    (null, [3, 4]),
    ([1, 2], null),
    (null, null)
  ;
D select array_intersect(col0, col1)
  from array_intersect_table_1D_NULL;
┌─────────────────────────────┐
│ array_intersect(col0, col1) │
│           int32[]           │
├─────────────────────────────┤
│ [3, 2]                      │
│ [3]                         │
│ [3]                         │
│ NULL                        │
│ []                          │
│ NULL                        │
└─────────────────────────────┘
  • It seems DuckDB returns null based on whether the first array being intersected is null

Meanwhile for Spark:

>>> spark.sql("select array_intersect(cast(null as array<string>), array())").show()
+------------------------------+
|array_intersect(NULL, array())|
+------------------------------+
|                          NULL|
+------------------------------+

>>> spark.sql("select array_intersect(cast(null as array<string>), cast(null as array<string>))").show()
+---------------------------+
|array_intersect(NULL, NULL)|
+---------------------------+
|                       NULL|
+---------------------------+

>>> spark.sql("select array_intersect(array(), cast(null as array<string>))").show()
+------------------------------+
|array_intersect(array(), NULL)|
+------------------------------+
|                          NULL|
+------------------------------+

>>> spark.sql("select array_intersect(array(), array())").show()
+---------------------------------+
|array_intersect(array(), array())|
+---------------------------------+
|                               []|
+---------------------------------+
  • If either array is null, it returns null

I lean towards what Spark would do here; would it be a big change to see if we can implement this semantic?

@feniljain
Contributor Author

I lean towards what Spark would do here; would it be a big change to see if we can implement this semantic?

Can definitely give it a try!

@feniljain
Contributor Author

Hey @Jefffrey 👋🏻

Upon looking into this a bit more, it seems one of the limitations to returning NULL here stems from make_scalar_function expecting a function that returns Result<ArrayRef> (source)

We can't change this function as it's widely used by a lot of functions; dumping the list I received here:

datafusion/functions-nested/src/array_has.rs|38 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/array_has.rs|541 col 9-29| make_scalar_function(array_has_all_inner)(&args.args)
datafusion/functions-nested/src/array_has.rs|615 col 9-29| make_scalar_function(array_has_any_inner)(&args.args)
datafusion/functions-nested/src/cardinality.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/cardinality.rs|112 col 9-29| make_scalar_function(cardinality_inner)(&args.args)
datafusion/functions-nested/src/concat.rs|24 col 61-81| use crate::utils::{align_array_dimensions, check_datatypes, make_scalar_function};
datafusion/functions-nested/src/concat.rs|124 col 9-29| make_scalar_function(array_append_inner)(&args.args)
datafusion/functions-nested/src/concat.rs|213 col 9-29| make_scalar_function(array_prepend_inner)(&args.args)
datafusion/functions-nested/src/concat.rs|333 col 9-29| make_scalar_function(array_concat_inner)(&args.args)
datafusion/functions-nested/src/dimension.rs|33 col 40-60| use crate::utils::{compute_array_dims, make_scalar_function};
datafusion/functions-nested/src/dimension.rs|108 col 9-29| make_scalar_function(array_dims_inner)(&args.args)
datafusion/functions-nested/src/dimension.rs|180 col 9-29| make_scalar_function(array_ndims_inner)(&args.args)
datafusion/functions-nested/src/distance.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/distance.rs|132 col 9-29| make_scalar_function(array_distance_inner)(&args.args)
datafusion/functions-nested/src/empty.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/empty.rs|101 col 9-29| make_scalar_function(array_empty_inner)(&args.args)
datafusion/functions-nested/src/except.rs|20 col 37-57| use crate::utils::{check_datatypes, make_scalar_function};
datafusion/functions-nested/src/except.rs|117 col 9-29| make_scalar_function(array_except_inner)(&args.args)
datafusion/functions-nested/src/extract.rs|51 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/extract.rs|179 col 9-29| make_scalar_function(array_element_inner)(&args.args)
datafusion/functions-nested/src/extract.rs|402 col 9-29| make_scalar_function(array_slice_inner)(&args.args)
datafusion/functions-nested/src/extract.rs|849 col 9-29| make_scalar_function(array_pop_front_inner)(&args.args)
datafusion/functions-nested/src/extract.rs|945 col 9-29| make_scalar_function(array_pop_back_inner)(&args.args)
datafusion/functions-nested/src/extract.rs|1050 col 9-29| make_scalar_function(array_any_value_inner)(&args.args)
datafusion/functions-nested/src/flatten.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/flatten.rs|121 col 9-29| make_scalar_function(flatten_inner)(&args.args)
datafusion/functions-nested/src/length.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/length.rs|124 col 9-29| make_scalar_function(array_length_inner)(&args.args)
datafusion/functions-nested/src/make_array.rs|24 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/make_array.rs|120 col 9-29| make_scalar_function(make_array_inner)(&args.args)
datafusion/functions-nested/src/map_entries.rs|20 col 41-61| use crate::utils::{get_map_entry_field, make_scalar_function};
datafusion/functions-nested/src/map_entries.rs|118 col 9-29| make_scalar_function(map_entries_inner)(&args.args)
datafusion/functions-nested/src/map_extract.rs|20 col 41-61| use crate::utils::{get_map_entry_field, make_scalar_function};
datafusion/functions-nested/src/map_extract.rs|117 col 9-29| make_scalar_function(map_extract_inner)(&args.args)
datafusion/functions-nested/src/map_keys.rs|20 col 41-61| use crate::utils::{get_map_entry_field, make_scalar_function};
datafusion/functions-nested/src/map_keys.rs|108 col 9-29| make_scalar_function(map_keys_inner)(&args.args)
datafusion/functions-nested/src/map_values.rs|20 col 41-61| use crate::utils::{get_map_entry_field, make_scalar_function};
datafusion/functions-nested/src/map_values.rs|118 col 9-29| make_scalar_function(map_values_inner)(&args.args)
datafusion/functions-nested/src/min_max.rs|19 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/min_max.rs|104 col 9-29| make_scalar_function(array_max_inner)(&args.args)
datafusion/functions-nested/src/min_max.rs|190 col 9-29| make_scalar_function(array_min_inner)(&args.args)
datafusion/functions-nested/src/position.rs|45 col 45-65| use crate::utils::{compare_element_to_list, make_scalar_function};
datafusion/functions-nested/src/position.rs|132 col 9-29| make_scalar_function(array_position_inner)(&args.args)
datafusion/functions-nested/src/position.rs|283 col 9-29| make_scalar_function(array_positions_inner)(&args.args)
datafusion/functions-nested/src/range.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/range.rs|266 col 17-37| make_scalar_function(|args| self.gen_range_inner(args))(args)
datafusion/functions-nested/src/range.rs|269 col 17-37| make_scalar_function(|args| self.gen_range_date(args))(args)
datafusion/functions-nested/src/range.rs|272 col 17-37| make_scalar_function(|args| self.gen_range_timestamp(args))(args)
datafusion/functions-nested/src/remove.rs|21 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/remove.rs|117 col 9-29| make_scalar_function(array_remove_inner)(&args.args)
datafusion/functions-nested/src/remove.rs|212 col 9-29| make_scalar_function(array_remove_n_inner)(&args.args)
datafusion/functions-nested/src/remove.rs|296 col 9-29| make_scalar_function(array_remove_all_inner)(&args.args)
datafusion/functions-nested/src/repeat.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/repeat.rs|122 col 9-29| make_scalar_function(array_repeat_inner)(&args.args)
datafusion/functions-nested/src/replace.rs|37 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/replace.rs|136 col 9-29| make_scalar_function(array_replace_inner)(&args.args)
datafusion/functions-nested/src/replace.rs|218 col 9-29| make_scalar_function(array_replace_n_inner)(&args.args)
datafusion/functions-nested/src/replace.rs|298 col 9-29| make_scalar_function(array_replace_all_inner)(&args.args)
datafusion/functions-nested/src/resize.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/resize.rs|143 col 9-29| make_scalar_function(array_resize_inner)(&args.args)
datafusion/functions-nested/src/reverse.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/reverse.rs|111 col 9-29| make_scalar_function(array_reverse_inner)(&args.args)
datafusion/functions-nested/src/set_ops.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/set_ops.rs|149 col 9-29| make_scalar_function(array_union_inner)(&args.args)
datafusion/functions-nested/src/set_ops.rs|234 col 9-29| make_scalar_function(array_intersect_inner)(&args.args)
datafusion/functions-nested/src/set_ops.rs|299 col 9-29| make_scalar_function(array_distinct_inner)(&args.args)
datafusion/functions-nested/src/sort.rs|20 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/sort.rs|155 col 9-29| make_scalar_function(array_sort_inner)(&args.args)
datafusion/functions-nested/src/string.rs|33 col 19-39| use crate::utils::make_scalar_function;
datafusion/functions-nested/src/string.rs|216 col 9-29| make_scalar_function(array_to_string_inner)(&args.args)
datafusion/functions-nested/src/string.rs|315 col 32-52| Utf8 | Utf8View => make_scalar_function(string_to_array_inner::<i32>)(args),
datafusion/functions-nested/src/string.rs|316 col 26-46| LargeUtf8 => make_scalar_function(string_to_array_inner::<i64>)(args),
datafusion/functions-nested/src/utils.rs|53 col 15-35| pub(crate) fn make_scalar_function<F>(

This raises two points in my head:

  • Should we check the implementation of each of these functions above and see how NULL is handled in each of them?
  • If we decide to fix just array_intersect and array_union, we would have to create a new function similar to make_scalar_function which could return NULL too

Either way, this seems like it needs a new PR. I can add an array_union test here, and we can continue work on fixing support for all of these (I can file a new issue). What do you think?

@feniljain
Contributor Author

For array_union I checked the behaviour of Spark:

spark-sql (default)> select array_union(cast(null as array<string>), array());
NULL
Time taken: 0.693 seconds, Fetched 1 row(s)
spark-sql (default)> select array_intersect(cast(null as array<string>), cast(null as array<string>));
NULL
Time taken: 0.058 seconds, Fetched 1 row(s)
spark-sql (default)> select array_intersect(array(), cast(null as array<string>));
NULL
Time taken: 0.075 seconds, Fetched 1 row(s)
spark-sql (default)> select array_intersect(array(), array());
[]
Time taken: 0.071 seconds, Fetched 1 row(s)

Seems consistent with the result above: if either array is null, it returns null

And current behaviour of array_union with this PR:

query ?
select array_union(column1, column2)
from array_intersect_table_1D_NULL;
----
[1, 2, 3, 4]
[2, 3]
[3, 4]
[3, 4]
[1, 2]
[]

i.e. it follows the same behaviour as array_intersect

@Jefffrey
Contributor

Upon looking into this a bit more, it seems one of the limitations to returning NULL here stems from make_scalar_function expecting a function that returns Result<ArrayRef> (source)

Could you elaborate on this point? I'm having trouble seeing how it is related to the issue at hand 🤔

@feniljain
Contributor Author

Could you elaborate on this point? I'm having trouble seeing how it is related to the issue at hand 🤔

I was talking about the case of making the behaviour the same as Spark; for that we would need to return NULL itself rather than the current behaviour of empty arrays. Just to be sure, I also checked the documentation of Array, and it is not implemented by Null

Am I still missing your question?

@Jefffrey
Contributor

Could you elaborate on this point? I'm having trouble seeing how it is related to the issue at hand 🤔

I was talking about the case of making the behaviour the same as Spark; for that we would need to return NULL itself rather than the current behaviour of empty arrays. Just to be sure, I also checked the documentation of Array, and it is not implemented by Null

Am I still missing your question?

I still don't quite follow; what I mean is we return a null value instead of an empty array value. Keep in mind we are dealing with ListArrays, so each element is a list itself (or null). There is nothing about make_scalar_function or the Array trait that should prevent this.
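
For illustration only (a small sketch, not code from this PR), a ListArray already carries per-row validity, so an output row can be NULL rather than an empty list:

use arrow::array::{Array, ListArray};
use arrow::datatypes::Int32Type;

fn main() {
    // Row 1 is NULL while rows 0 and 2 are lists (row 2 an empty one);
    // "returning NULL" just means clearing that row's validity bit.
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
        Some(vec![Some(1), Some(2)]),
        None,
        Some(vec![]),
    ]);
    assert!(list.is_null(1));
    assert_eq!(list.value_length(2), 0);
}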

@feniljain
Contributor Author

I was trying to write a detailed writeup on how I think this change would be made, and I think I now understand why you were confused by my initial observation. I must have read something wrong at that time; it is clear to me now.

But this small exercise still leaves one doubt that still holds: if I make a change in the handling of args with NULL here, that would mean making the same change for all of the functions I listed in this message here. Is that okay with you?

@Jefffrey
Contributor

I think we only need to look at the set_ops.rs file that was modified in this PR; specifically this line:

let arr = GenericListArray::<OffsetSize>::try_new(field, offsets, values, None)?;

The last argument there, None, is usually where the null buffer would be provided, if any. By having it as None we essentially can never have nulls in the returned array, which is what we want to change.

See the distinct function in the same file:

    let offsets = OffsetBuffer::new(offsets.into());
    let new_arrays_ref = new_arrays.iter().map(|v| v.as_ref()).collect::<Vec<_>>();
    let values = compute::concat(&new_arrays_ref)?;
    Ok(Arc::new(GenericListArray::<OffsetSize>::try_new(
        Arc::clone(field),
        offsets,
        values,
        // Keep the list nulls
        array.nulls().cloned(),
    )?))
}

But in our case, we would need to consider the nulls of both input arrays; I recommend exploring the APIs available to see how this would be possible.
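
One possible direction (a rough sketch only; build_result is a hypothetical helper, and field/offsets/values are assumed to be computed as set_ops.rs does today): arrow's NullBuffer::union combines the validity of both inputs so a row becomes null whenever either side is null, and that buffer can be passed as the fourth argument to try_new:

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, GenericListArray, OffsetSizeTrait};
use arrow::buffer::{NullBuffer, OffsetBuffer};
use arrow::datatypes::FieldRef;
use datafusion_common::Result;

fn build_result<OffsetSize: OffsetSizeTrait>(
    field: FieldRef,
    offsets: OffsetBuffer<OffsetSize>,
    values: ArrayRef,
    l_array: &GenericListArray<OffsetSize>,
    r_array: &GenericListArray<OffsetSize>,
) -> Result<ArrayRef> {
    // NullBuffer::union intersects the validity bitmaps, i.e. the result row
    // is null if either input row is null (the Spark semantic discussed above).
    let nulls = NullBuffer::union(l_array.nulls(), r_array.nulls());
    let arr = GenericListArray::<OffsetSize>::try_new(field, offsets, values, nulls)?;
    Ok(Arc::new(arr))
}

Note that try_new still validates the offsets against the array length, so null rows would still need their own (typically zero-length) offsets entry.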



Development

Successfully merging this pull request may close these issues.

array_union and array_intersect cannot handle NULL columnar data
