@bhanreddy1973

Rationale for this change

Spark's explode_outer function treats empty arrays [] the same as NULL arrays: both produce an output row with NULL. Currently, DataFusion's unnest with preserve_nulls=true emits such a row only for NULL arrays; empty arrays produce no output rows.

This makes it tricky to achieve Spark-compatible behavior when migrating workloads.

What changes are included in this PR?

Added a new preserve_empty_as_null flag to UnnestOptions:

  • When false (default): empty arrays produce 0 rows (existing behavior, backwards compatible)
  • When true: empty arrays produce 1 row with NULL value (Spark's explode_outer behavior)

Example with preserve_nulls=true and preserve_empty_as_null=true:

Input:

| Column 1 | Column 2 |
|----------|----------|
| {1, 2}   | A        |
| null     | B        |
| {}       | C        |
| {3}      | D        |

Output:

| Column 1 | Column 2 |
|----------|----------|
| 1        | A        |
| 2        | A        |
| null     | B        |
| null     | C        |
| 3        | D        |
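The example above can be reproduced with a minimal sketch over plain Rust vectors. This is not DataFusion's implementation: `Opts` and `unnest_rows` are hypothetical names, and the sketch assumes an empty list is first mapped to NULL and then subject to `preserve_nulls`, as the author describes later in this thread.

```rust
/// Hypothetical stand-in for the two flags discussed in this PR.
/// Not the real `UnnestOptions` type.
#[derive(Clone, Copy)]
struct Opts {
    preserve_nulls: bool,
    preserve_empty_as_null: bool,
}

/// Unnest a column of optional lists, carrying a label column alongside.
/// `None` models a NULL list; `Some(vec![])` models an empty list.
fn unnest_rows(
    input: &[(Option<Vec<i32>>, &'static str)],
    opts: Opts,
) -> Vec<(Option<i32>, &'static str)> {
    let mut out = Vec::new();
    for (list, label) in input {
        match list {
            // NULL list: one NULL row only when preserve_nulls is set.
            None => {
                if opts.preserve_nulls {
                    out.push((None, *label));
                }
            }
            // Empty list: treated as NULL only when the new flag is set
            // (assumption), and then still subject to preserve_nulls.
            Some(v) if v.is_empty() => {
                if opts.preserve_empty_as_null && opts.preserve_nulls {
                    out.push((None, *label));
                }
            }
            // Non-empty list: one row per element.
            Some(v) => {
                for x in v {
                    out.push((Some(*x), *label));
                }
            }
        }
    }
    out
}
```

With both flags set to true, the four input rows above yield the five output rows shown; with `preserve_empty_as_null` set to false, the row for `C` is dropped.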

Files changed:

  • datafusion/common/src/unnest.rs - added the new option and builder method
  • datafusion/physical-plan/src/unnest.rs - updated find_longest_length() to handle empty arrays
  • datafusion/proto/* - updated proto definitions for serialization

How are these changes tested?

Added a new unit test, test_longest_list_length_preserve_empty_as_null, that verifies:

  • Empty arrays get length 1 when the flag is enabled
  • NULL arrays still behave correctly based on preserve_nulls setting
  • The two flags work independently

Are these changes safe?

Yes - the default value is false, so existing behavior is unchanged. Users have to explicitly opt in to the new behavior.

…ompatibility

This PR adds a new flag to UnnestOptions that treats empty arrays
the same as NULL arrays (both produce an output row with NULL).
This enables Spark-compatible explode_outer behavior.

Closes apache#19053
@github-actions github-actions bot added common Related to common crate proto Related to proto crate physical-plan Changes to the physical-plan crate labels Dec 31, 2025
Contributor

@Jefffrey Jefffrey left a comment


Documentation for preserve_nulls states:

/// If `preserve_nulls` is false, nulls and empty lists
/// from the input column are not carried through to the output. This
/// is the default behavior for other systems such as ClickHouse and
/// DuckDB
///
/// If `preserve_nulls` is true (the default), nulls from the input
/// column are carried through to the output.

So what is the expected behaviour if preserve_nulls is false (nulls and empty lists aren't carried over) but preserve_empty_as_null is true?

@bhanreddy1973
Author

Good question @Jefffrey!

If preserve_nulls = false and preserve_empty_as_null = true:

  • Empty arrays [] are first treated as NULL
  • Then, since preserve_nulls = false, NULLs produce 0 output rows
  • So empty arrays would produce 0 rows

The two flags work independently:

  1. preserve_empty_as_null converts [] → NULL
  2. preserve_nulls determines if NULLs produce output
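
| preserve_nulls | preserve_empty_as_null | Empty [] produces |
|----------------|------------------------|-------------------|
| true           | true                   | 1 row (NULL)      |
| true           | false                  | 0 rows            |
| false          | true                   | 0 rows            |
| false          | false                  | 0 rows            |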
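
The two steps described above can be sketched as a small pipeline to check the table. The function names are hypothetical and the sketch only models row counts for a single list value; it is not taken from the PR's code.

```rust
/// Step 1 (as described above): optionally rewrite an empty list to NULL.
/// `None` models a NULL list.
fn normalize_empty(list: Option<Vec<i32>>, preserve_empty_as_null: bool) -> Option<Vec<i32>> {
    match list {
        Some(v) if v.is_empty() && preserve_empty_as_null => None,
        other => other,
    }
}

/// Step 2 (as described above): count output rows for one list value.
/// A NULL list yields one row only when preserve_nulls is set.
fn row_count(list: &Option<Vec<i32>>, preserve_nulls: bool) -> usize {
    match list {
        None => usize::from(preserve_nulls),
        Some(v) => v.len(),
    }
}

/// Rows produced by an empty list under a given flag combination.
fn empty_list_rows(preserve_nulls: bool, preserve_empty_as_null: bool) -> usize {
    let normalized = normalize_empty(Some(Vec::new()), preserve_empty_as_null);
    row_count(&normalized, preserve_nulls)
}
```

Under this model, the result for an empty list depends on both flags together: only the combination with both set to true yields a row.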

Should I add a test case for this combination, or update the documentation to clarify?

@Jefffrey
Contributor

Jefffrey commented Jan 2, 2026

> Good question @Jefffrey!
>
> If preserve_nulls = false and preserve_empty_as_null = true:
>
> * Empty arrays `[]` are first treated as NULL
> * Then, since `preserve_nulls = false`, NULLs produce 0 output rows
> * So empty arrays would produce 0 rows
>
> The two flags work independently:
>
> 1. `preserve_empty_as_null` converts `[]` → NULL
> 2. `preserve_nulls` determines if NULLs produce output
>
> | preserve_nulls | preserve_empty_as_null | Empty [] produces |
> |----------------|------------------------|-------------------|
> | true           | true                   | 1 row (NULL)      |
> | true           | false                  | 0 rows            |
> | false          | true                   | 0 rows            |
> | false          | false                  | 0 rows            |
>
> Should I add a test case for this combination, or update the documentation to clarify?

Having two flags interact in this way doesn't seem intuitive; we might be better off with an enum approach so we have more obvious behaviour states.
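
For illustration only, an enum along these lines might look like the sketch below. The type name, variant names, and helper function are hypothetical and do not come from the PR or from DataFusion; the point is that each behaviour state is named once, so no combination of booleans needs to be interpreted.

```rust
/// Hypothetical enum; names are illustrative and not part of DataFusion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NullHandling {
    /// NULL lists and empty lists both produce no output rows.
    DropNullsAndEmpty,
    /// NULL lists produce one NULL row; empty lists produce no rows.
    PreserveNulls,
    /// NULL lists and empty lists each produce one NULL row
    /// (the Spark explode_outer behaviour described in the PR).
    PreserveNullsAndEmpty,
}

/// Number of output rows for one list value; `None` models a NULL list.
fn output_rows(list: Option<&[i32]>, mode: NullHandling) -> usize {
    match (list, mode) {
        (None, NullHandling::DropNullsAndEmpty) => 0,
        (None, _) => 1,
        (Some([]), NullHandling::PreserveNullsAndEmpty) => 1,
        // Non-empty lists yield one row per element; an empty list in the
        // remaining modes yields zero rows.
        (Some(items), _) => items.len(),
    }
}
```

The three variants correspond to the three distinct outcomes in the table above, where two of the four boolean combinations collapse into the same behaviour.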

I also have to ask, how much of this PR (code, PR body + replies) are LLM generated? Please disclose if LLMs are being used here.
