
[SPARK-52249][PS] Enable divide-by-zero for truediv with ANSI enabled #50972


Open
xinrong-meng wants to merge 10 commits into master from divide_0

Conversation

xinrong-meng (Member) commented May 21, 2025

What changes were proposed in this pull request?

Enable divide-by-zero for truediv with ANSI enabled

Why are the changes needed?

Part of https://issues.apache.org/jira/browse/SPARK-52169

Does this PR introduce any user-facing change?

Yes. With ANSI enabled, divide-by-zero for truediv no longer raises DIVIDE_BY_ZERO; it returns ±inf or NaN, matching pandas:

>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> pdf = pd.DataFrame({"a": [1.0, -1.0, 0.0, np.nan], "b": [0.0, 0.0, 0.0, 0.0]})
>>> psdf = ps.from_pandas(pdf)

FROM

>>> psdf["a"] / psdf["b"]
...
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"__div__" was called from
<stdin>:1

TO

>>> psdf["a"] / psdf["b"]
0    inf                                                                        
1   -inf
2    NaN
3    NaN
dtype: float64
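
For context, here is a minimal sketch (not code from this PR; the column names and branch layout are illustrative) of the kind of guarded CASE WHEN expression that reproduces the pandas semantics above without tripping ANSI's DIVIDE_BY_ZERO check:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (float("nan"), 0.0)], ["a", "b"]
)
left, right = F.col("a"), F.col("b")

# CASE WHEN evaluates its branches per row, so the ANSI division is only
# reached when the divisor is non-zero (or NULL); zero divisors are mapped
# to +inf / -inf / NaN the way pandas does, instead of raising an error.
result = (
    F.when((right != 0) | right.isNull(), left / right)
    .when(F.isnan(left) | (left == 0), F.lit(float("nan")))
    .when(left > 0, F.lit(float("inf")))
    .otherwise(F.lit(float("-inf")))
)
sdf.select(result.alias("a_div_b")).show()

Against the four rows in the sketch this yields inf, -inf, NaN, NaN, the same values as the TO output above.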

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@xinrong-meng xinrong-meng changed the title from "[WIP][SPARK-52249][PS] Enable divide-by-zero with ANSI enabled" to "[SPARK-52249][PS] Enable divide-by-zero with ANSI enabled" on May 23, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review May 23, 2025 20:17
xinrong-meng (Member, Author) commented May 23, 2025

Should be merged after #51035

@xinrong-meng xinrong-meng force-pushed the divide_0 branch 2 times, most recently from 2c60487 to 443e641, on May 27, 2025 20:41
F.lit(right != 0) | F.lit(right).isNull(),
left.__div__(right),
).otherwise(F.lit(np.inf).__div__(left))
if not get_option("compute.ansi_mode_support"):
Contributor

Can we optimize out this get_option, which needs a separate Config RPC? I guess we can just use the new branch.

xinrong-meng (Member, Author)

Sorry, would you mind clarifying what you meant?
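
One possible reading of the suggestion above, written out as a hedged sketch with placeholder names (the thread leaves the question open, so this is only an interpretation): branch once on an ANSI flag the caller already has, instead of paying a separate config round trip for get_option.

def choose_truediv_branch(ansi_enabled: bool):
    # Hypothetical sketch; the function names are placeholders, not code from this PR.
    def ansi_safe_truediv(left, right):
        ...  # guarded CASE WHEN expression, as in the hunk quoted above
    def legacy_truediv(left, right):
        ...  # existing non-ANSI path
    # Branch on a flag the caller already knows, rather than calling
    # get_option("compute.ansi_mode_support"), which costs an extra config RPC.
    return ansi_safe_truediv if ansi_enabled else legacy_truediv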

@xinrong-meng xinrong-meng requested a review from ueshin May 28, 2025 18:19
@@ -111,7 +111,6 @@ def test_binary_operator_sub(self):
psdf = ps.DataFrame({"a": ["x"], "b": ["y"]})
self.assertRaisesRegex(TypeError, ks_err_msg, lambda: psdf["a"] - psdf["b"])

@unittest.skipIf(is_ansi_mode_test, ansi_mode_not_supported_message)
ueshin (Member) commented May 28, 2025

Could you also try to find where we can remove the skips caused by the division error?

xinrong-meng (Member, Author)

Can I follow up on that in https://issues.apache.org/jira/browse/SPARK-52349, if you don't mind, in order to unblock the other PR?

Member

Sure, that's fine. 👍
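
For readers unfamiliar with the test convention under discussion: ANSI-incompatible tests are guarded with a skip decorator, and the follow-up in SPARK-52349 is about dropping such skips where division now behaves. A rough, self-contained illustration (the flag values and test names here are made up):

import unittest

# Illustrative stand-ins; in the real suite these come from testing utilities.
is_ansi_mode_test = True
ansi_mode_not_supported_message = "Not supported under ANSI mode yet."

class BinaryOperatorTests(unittest.TestCase):
    @unittest.skipIf(is_ansi_mode_test, ansi_mode_not_supported_message)
    def test_not_yet_ansi_compatible(self):
        ...  # still skipped when running with ANSI mode on

    def test_truediv_divide_by_zero(self):
        ...  # a skip like the one above can be dropped once truediv works under ANSI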

@xinrong-meng xinrong-meng changed the title from "[SPARK-52249][PS] Enable divide-by-zero with ANSI enabled" to "[SPARK-52249][PS] Enable divide-by-zero for truediv with ANSI enabled" on May 28, 2025
@@ -1070,6 +1070,14 @@ def xor(df1: PySparkDataFrame, df2: PySparkDataFrame) -> PySparkDataFrame:
)


def is_ansi_mode_enabled() -> bool:
Member

Shall we explicitly pass spark? SparkSession.getActiveSession() is not light.

xinrong-meng (Member, Author)

Adjusted.
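
A sketch of the shape the helper might take after that adjustment, with the session passed in explicitly; the exact body in the PR may differ, and the conf lookup below is an assumption:

from pyspark.sql import SparkSession

def is_ansi_mode_enabled(spark: SparkSession) -> bool:
    # The caller supplies its SparkSession so the helper does not have to call
    # SparkSession.getActiveSession() on every binary operation.
    return spark.conf.get("spark.sql.ansi.enabled", "false").lower() == "true"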

@xinrong-meng xinrong-meng requested a review from ueshin May 29, 2025 18:08