ilicmarkodb
Contributor

@ilicmarkodb ilicmarkodb commented Jul 28, 2025

What changes were proposed in this pull request?

Fix Python UDFs not accepting collated strings as input parameters or return types.

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 2584dab to 2f1bee5 Compare July 28, 2025 14:35
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 2f1bee5 to 0f47248 Compare July 28, 2025 17:26
@ilicmarkodb ilicmarkodb requested a review from stefankandic July 28, 2025 17:29
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 0f47248 to 94be795 Compare July 28, 2025 17:33
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 2 times, most recently from cc8c888 to b2761fe Compare July 29, 2025 10:23
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from b2761fe to a1f4c93 Compare July 29, 2025 12:06
@HyukjinKwon HyukjinKwon changed the title from "[SPARK-52976][Python] Fix Python UDF not accepting collated string as input param" to "[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param" Jul 29, 2025
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 11 times, most recently from 01ce67e to e1309ad Compare August 1, 2025 22:13
@ilicmarkodb ilicmarkodb changed the title from "[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param" to "[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type" Aug 2, 2025
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 4 times, most recently from 90d978d to 4e73996 Compare August 4, 2025 11:42
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from e5fa93c to b220cde Compare August 5, 2025 12:01
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 4 times, most recently from b594689 to 72237fa Compare August 5, 2025 13:05
cloud-fan pushed a commit that referenced this pull request Aug 6, 2025
…ypes

### What changes were proposed in this pull request?
Change collated string types to return their collation in the `toJson` methods. To keep backwards compatibility with older engine versions reading tables with collations, the fix is propagated upstream to `StructField`, where the collation is removed from the type but still kept in the metadata.

### Why are the changes needed?
The old way of handling `toJson` meant that collated string types could not be serialized and deserialized correctly unless they were part of a `StructField`. Initially we thought this was not a big deal, but we later ran into issues with it, especially in PySpark, which relies primarily on JSON to parse types back and forth.
This could avoid hacky changes in the future like the one in #51688, without changing any behavior for how tables/schemas work.
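
The round-trip problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark's implementation: `ToyStringType`, `ToyStructField`, the `"string collate <name>"` spelling, and the `__COLLATIONS` metadata key are all illustrative stand-ins chosen for this example.

```python
import json
from dataclasses import dataclass, field

DEFAULT_COLLATION = "UTF8_BINARY"
COLLATIONS_KEY = "__COLLATIONS"  # assumed metadata key, for illustration only


@dataclass
class ToyStringType:
    """Hypothetical stand-in for a string type that may carry a collation."""
    collation: str = DEFAULT_COLLATION

    def json_value(self) -> str:
        # Standalone type: the collation travels inside the type string.
        if self.collation == DEFAULT_COLLATION:
            return "string"
        return f"string collate {self.collation}"

    @staticmethod
    def from_json_value(value: str) -> "ToyStringType":
        prefix = "string collate "
        if value == "string":
            return ToyStringType()
        if value.startswith(prefix):
            return ToyStringType(value[len(prefix):])
        raise ValueError(f"not a string type: {value!r}")


@dataclass
class ToyStructField:
    """Hypothetical stand-in for a struct field holding a string column."""
    name: str
    data_type: ToyStringType
    metadata: dict = field(default_factory=dict)

    def json_value(self) -> dict:
        # Inside a field: the type is written as plain "string" so that
        # older readers still understand it; the collation goes to metadata.
        metadata = dict(self.metadata)
        if self.data_type.collation != DEFAULT_COLLATION:
            metadata[COLLATIONS_KEY] = {self.name: self.data_type.collation}
        return {"name": self.name, "type": "string", "metadata": metadata}

    @staticmethod
    def from_json_value(obj: dict) -> "ToyStructField":
        metadata = dict(obj["metadata"])
        collations = metadata.pop(COLLATIONS_KEY, {})
        collation = collations.get(obj["name"], DEFAULT_COLLATION)
        return ToyStructField(obj["name"], ToyStringType(collation), metadata)


# Round trip of a standalone collated type through JSON text.
standalone = ToyStringType("UTF8_LCASE")
restored_type = ToyStringType.from_json_value(
    json.loads(json.dumps(standalone.json_value()))
)

# Round trip of the same type wrapped in a field.
col = ToyStructField("col1", ToyStringType("UTF8_LCASE"))
restored_field = ToyStructField.from_json_value(
    json.loads(json.dumps(col.json_value()))
)
```

In this sketch both representations survive a JSON round trip, while the field-level JSON still shows a plain `"string"` type to readers that do not know about collations.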

### Does this PR introduce _any_ user-facing change?
Technically yes, but it is a small change that should not impact any queries; it only affects how `StringType` is represented when it is not inside a `StructField` object.

### How was this patch tested?
New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #51850 from stefankandic/fixStringJson.

Authored-by: Stefan Kandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Aug 6, 2025
(cherry picked from commit 19ea6ff)
Signed-off-by: Wenchen Fan <[email protected]>
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 2 times, most recently from a7a2161 to 3aeac3b Compare August 6, 2025 21:13
Contributor

@stefankandic stefankandic left a comment

LGTM!

@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 3aeac3b to 767264d Compare August 7, 2025 09:11
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 4 times, most recently from 7fec8b9 to b178220 Compare August 7, 2025 13:22
@ilicmarkodb ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from b178220 to 41eb246 Compare August 7, 2025 14:15
@cloud-fan
Contributor

cloud-fan commented Aug 8, 2025

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6b1f1a6 Aug 8, 2025
@cloud-fan
Contributor

@ilicmarkodb can you open a backport PR against branch-4.0?

@@ -3490,6 +3490,41 @@ def eval(self):
udtf(TestUDTF, returnType=ret_type)().collect()


def test_udtf_with_collated_string_types(self):
Contributor

@zhengruifeng zhengruifeng Aug 27, 2025

@ilicmarkodb the indentation here is wrong, so this test is actually skipped. It should be placed in a mixin class like `BaseUDTFTestsMixin`.

Contributor Author

#52001

I opened a PR to fix this. I’ll finish it and tag you for review once the CI is green.
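
For readers unfamiliar with how a mis-indented test ends up silently skipped, here is a minimal sketch using only the standard `unittest` module. It shows one way this can happen (a test accidentally nested inside another method) and the mixin pattern suggested above. The class and test names are hypothetical, and the exact mis-indentation in the PR may have differed.

```python
import unittest


class BrokenTests(unittest.TestCase):
    def test_outer(self):
        self.assertTrue(True)

        # Over-indented: this is just a local function inside test_outer.
        # It is never bound to the class, so the loader never collects it.
        def test_nested(self):
            self.fail("never runs")


class BaseTestsMixin:
    """Shared tests live on the mixin, at class-body indentation."""

    def test_outer(self):
        self.assertTrue(True)

    def test_fixed(self):
        self.assertEqual(1 + 1, 2)


class FixedTests(BaseTestsMixin, unittest.TestCase):
    pass


def collected(case):
    """Return the test method names unittest would collect for a class."""
    return list(unittest.TestLoader().getTestCaseNames(case))
```

Calling `collected(BrokenTests)` lists only `test_outer`, whereas `collected(FixedTests)` lists both tests inherited from the mixin. Comparing the collected names against the expected ones is a cheap way to notice a test that never ran.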

)
df = self.spark.createDataFrame([("hello",) * 4], schema=schema)

df_out = df.select(MyUDTF(df.col1, df.col2, df.col3, df.col4).alias("out"))
Contributor

does this query work? I guess it should be a lateralJoin?

Contributor Author

It doesn’t. I just didn’t realize that, since the test wasn't executed.

Contributor Author

PTAL #51688
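
As background for the lateral-join remark in this thread: a Python UDTF that takes columns of another relation as arguments is typically invoked through a lateral join rather than inside `select`. The sketch below uses the SQL `LATERAL` form and is not the test from this PR. `SplitWords`, `split_words`, `QUERY`, and `run` are hypothetical names, and `run` assumes PySpark is installed and an active `SparkSession` is passed in.

```python
class SplitWords:
    """Plain handler class; eval yields one output row per word."""

    def eval(self, text: str):
        for word in text.split():
            yield (word,)


# Each input row of t is joined with the rows the UDTF produces for it.
QUERY = (
    "SELECT * FROM VALUES ('Hello World'), ('Apache Spark') t(text), "
    "LATERAL split_words(text)"
)


def run(spark):
    """Register the UDTF on the given session and run the lateral join."""
    # Imported lazily so the handler class can be exercised without PySpark.
    from pyspark.sql.functions import udtf

    spark.udtf.register("split_words", udtf(SplitWords, returnType="word: string"))
    return spark.sql(QUERY).collect()
```

Keeping the handler a plain class makes its `eval` logic checkable on its own, independently of whether the query shape is valid.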
