-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type #51688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type #51688
Conversation
2584dab
to
2f1bee5
Compare
2f1bee5
to
0f47248
Compare
0f47248
to
94be795
Compare
cc8c888
to
b2761fe
Compare
b2761fe
to
a1f4c93
Compare
01ce67e
to
e1309ad
Compare
90d978d
to
4e73996
Compare
e5fa93c
to
b220cde
Compare
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
Outdated
Show resolved
Hide resolved
b594689
to
72237fa
Compare
…ypes ### What changes were proposed in this pull request? Changing the behavior of collated string types to return their collation in the `toJson` methods and to still keep backwards compatibility with older engine versions reading tables with collations by propagating this fix upstream in `StructField` where the collation will be removed from the type but still kept in the metadata. ### Why are the changes needed? Old way of handling `toJson` meant that collated string types will not be able to be serialized and deserialized correctly unless they are a part of `StructField`. Initially, we thought that this is not a big deal, but then later we faced some issues regarding this, especially in pyspark which uses json primarily to parse types back and forth. This could avoid hacky changes in future like the one in #51688 without changing any behavior for how tables/schemas work. ### Does this PR introduce _any_ user-facing change? Technically yes, but it is a small change that should not impact any queries, just how StringType is represented when not in a StructField object. ### How was this patch tested? New and existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51850 from stefankandic/fixStringJson. Authored-by: Stefan Kandic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ypes ### What changes were proposed in this pull request? Changing the behavior of collated string types to return their collation in the `toJson` methods and to still keep backwards compatibility with older engine versions reading tables with collations by propagating this fix upstream in `StructField` where the collation will be removed from the type but still kept in the metadata. ### Why are the changes needed? Old way of handling `toJson` meant that collated string types will not be able to be serialized and deserialized correctly unless they are a part of `StructField`. Initially, we thought that this is not a big deal, but then later we faced some issues regarding this, especially in pyspark which uses json primarily to parse types back and forth. This could avoid hacky changes in future like the one in #51688 without changing any behavior for how tables/schemas work. ### Does this PR introduce _any_ user-facing change? Technically yes, but it is a small change that should not impact any queries, just how StringType is represented when not in a StructField object. ### How was this patch tested? New and existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51850 from stefankandic/fixStringJson. Authored-by: Stefan Kandic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 19ea6ff) Signed-off-by: Wenchen Fan <[email protected]>
a7a2161
to
3aeac3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
Outdated
Show resolved
Hide resolved
3aeac3b
to
767264d
Compare
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
Outdated
Show resolved
Hide resolved
7fec8b9
to
b178220
Compare
b178220
to
41eb246
Compare
thanks, merging to master! |
@ilicmarkodb can you open a backport PR against branch-4.0? |
@@ -3490,6 +3490,41 @@ def eval(self): | |||
udtf(TestUDTF, returnType=ret_type)().collect() | |||
|
|||
|
|||
def test_udtf_with_collated_string_types(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilicmarkodb the indent here is wrong, this test is actually skipped. It should be put into a Mixin class like BaseUDTFTestsMixin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened a PR to fix this. I’ll finish it and tag you for review once the CI is green.
) | ||
df = self.spark.createDataFrame([("hello",) * 4], schema=schema) | ||
|
||
df_out = df.select(MyUDTF(df.col1, df.col2, df.col3, df.col4).alias("out")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this query work? I guess it should be a lateralJoin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn’t. I just didn’t realize that, since the test wasn't executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL #51688
What changes were proposed in this pull request?
Fix Python UDF not accepting collated strings as input param/return type.
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?
No.