-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: NULL handling in arrow_intersect and arrow_union #19415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hey @Jefffrey 👋🏻 I think PR should be ready for review, as you had left the message in issue, can you also review it please? |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this, the fix makes sense.
I do wonder about the null semantics at play here though; it seems even if both arrays are null we still return an empty list. Interestingly, DuckDB is a bit inconsistent:
D CREATE TABLE array_intersect_table_1D_NULL
AS VALUES
([1, 2, 2, 3], [2, 3, 4]),
([2, 3, 3], [3]),
([3], [3, 3, 4]),
(null, [3, 4]),
([1, 2], null),
(null, null)
;
D select array_intersect(col0, col1)
from array_intersect_table_1D_NULL;
┌─────────────────────────────┐
│ array_intersect(col0, col1) │
│ int32[] │
├─────────────────────────────┤
│ [3, 2] │
│ [3] │
│ [3] │
│ NULL │
│ [] │
│ NULL │
└─────────────────────────────┘- It seems DuckDB will return null based on if the first array being intersected is null
Meanwhile for Spark:
>>> spark.sql("select array_intersect(cast(null as array<string>), array())").show()
+------------------------------+
|array_intersect(NULL, array())|
+------------------------------+
| NULL|
+------------------------------+
>>> spark.sql("select array_intersect(cast(null as array<string>), cast(null as array<string>))").show()
+---------------------------+
|array_intersect(NULL, NULL)|
+---------------------------+
| NULL|
+---------------------------+
>>> spark.sql("select array_intersect(array(), cast(null as array<string>))").show()
+------------------------------+
|array_intersect(array(), NULL)|
+------------------------------+
| NULL|
+------------------------------+
>>> spark.sql("select array_intersect(array(), array())").show()
+---------------------------------+
|array_intersect(array(), array())|
+---------------------------------+
| []|
+---------------------------------+- If either array is null, it returns null
I lean towards what Spark would do here; would it be a big change to see if we can implement this semantic?
Can definitely give it a try! |
|
Hey @Jefffrey 👋🏻 Upon looking into this a bit more, it seems one of the limitations to returning We can't change this function as its widely used by a lot of functions, dumping list I received here: This raises two points in my head:
Either way, this seems like it needs a new PR, I can add |
|
For Seems consistent with above result: And current behaviour of i.e. follows array same like |
Could you elaborate on this point? I'm having trouble seeing how it is related to the issue at hand 🤔 |
I was talking about the case of making the behaviour same as spark, for that we would need to return Am I still missing your question? |
I still don't quite follow; what I mean is we return a null value instead of an empty array value. Keep in mind we are dealing with ListArrays, so each element is a list itself (or null). There is nothing about |
|
I was trying to write a detailed writeup on how I think this change would be made, and I think I now understand why you were confused by my initial observation. I must have read something wrong at that time, it is clear to me now. But this small exercise still leaves one doubt which still holds true, if I make a change in handling of args with |
|
I think we only need to look at the
The last argument there, See the distinct function in the same file: datafusion/datafusion/functions-nested/src/set_ops.rs Lines 549 to 559 in b818f93
But in our case, we would need to consider the nulls of both input arrays; I recommend exploring the APIs available to see how this would be possible. |
Which issue does this PR close?
array_unionandarray_intersectcannot handle NULL columnar data #9706Rationale for this change
Mentioned in issue.
What changes are included in this PR?
RowConverterAPI seems to have changed since initial implementation. Columns needs to be exactly same as schema passed in, so when we were handling case ofl_valuesorr_valuesbeingNone, us passing empty columns was causing an assert to fail inRowConverter.Are these changes tested?
cargo test --test sqllogictests -- arraypassesFor query mentioned in original issue, current result is:
Are there any user-facing changes?
No