
[SPARK-33975][SQL] Include add_hours function #31000

Closed
wants to merge 2 commits

Conversation

MrPowers
Contributor

@MrPowers MrPowers commented Jan 3, 2021

What changes were proposed in this pull request?

This PR adds an add_hours function. Here's how users currently need to add hours to a timestamp column:

df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))

We don't want to make users manipulate strings in their Scala code. We also don't want to force users to pass around column names when they should be passing around Column objects.

The add_hours function will make this logic a lot more intuitive and consistent with the rest of the API:

df.withColumn("plus_2_hours", add_hours(col("first_datetime"), lit(2)))

The Stack Overflow question on this issue has 21,000 views, so this feature will be useful for a lot of users.

Spark 3 made some awesome improvements to the date/time APIs, and this PR is one more step toward making those APIs easier to use.

Why are the changes needed?

There are INTERVAL and UDF workarounds, so this isn't strictly needed, but it makes the API much easier to work with for hour-based computations. It also makes the answer easier to find: the INTERVAL solution is hard to locate in the docs.
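
For reference, the UDF workaround usually looks something like this (an illustrative sketch; the UDF name and types are assumptions, not code from this PR):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit, udf}

// Hypothetical UDF workaround: adds a fixed number of hours as a physical
// duration (so it ignores DST transitions, unlike an interval-aware function).
val addHoursUdf = udf((ts: Timestamp, hours: Int) =>
  if (ts == null) null else new Timestamp(ts.getTime + hours * 3600L * 1000L))

df.withColumn("plus_2_hours", addHoursUdf(col("first_datetime"), lit(2)))
```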

Does this PR introduce any user-facing change?

Yes, this adds the add_hours function to the org.apache.spark.sql.functions object, which is a public-facing API. The @since annotation will need to be updated with the right version if this ends up getting merged.
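
Roughly, the new public entry looks like the sketch below (the `AddHours` expression class name is an assumption here; the parameter names come from the function's doc comment):

```scala
// Sketch of the entry added to org.apache.spark.sql.functions; `withExpr` is
// the object's private helper for wrapping a Catalyst expression in a Column,
// and `AddHours` is the assumed name of the new expression class.

/**
 * Adds `numHours` hours to `startTime`.
 *
 * @group datetime_funcs
 * @since TBD
 */
def add_hours(startTime: Column, numHours: Column): Column = withExpr {
  AddHours(startTime.expr, numHours.expr)
}
```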

How was this patch tested?

The function was unit tested. The unit tests follow the testing patterns of similar SQL functions.

@github-actions github-actions bot added the SQL label Jan 3, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

cc @MaxGekk FYI

@HyukjinKwon
Member

@MrPowers, can we file a JIRA?

@MrPowers MrPowers changed the title from [SPARK-XXXXX][SQL] Include add_hours function to [SPARK-33975][SQL] Include add_hours function on Jan 4, 2021
@MrPowers
Contributor Author

MrPowers commented Jan 4, 2021

@HyukjinKwon - thank you for commenting. I created a JIRA ticket. If you need anything else, just let me know and I'll get right on it!

Member

@MaxGekk MaxGekk left a comment


As an alternative possible solution, we could expose the make_interval() function (see #26446) to public Scala APIs (and to PySpark and R). For your example:

df.withColumn("plus_2_hours", $"first_datetime" + make_interval(hours = 2))

add_hours() could be added as an "alias" for + make_interval(hour = _)

My concerns about the functions are:

  1. We will probably need to add other functions for days, minutes, etc., which extends the surface of the public APIs.
  2. It is a partial solution from my point of view. What if you need to add minutes, days, etc.?
  3. Maintainability costs of a custom implementation of + interval.

@MrPowers
Contributor Author

MrPowers commented Jan 4, 2021

@MaxGekk - thanks for the review and for your excellent work on this project.

I really like the idea of exposing the make_interval() function. make_interval() is way more powerful than just add_hours. Lots of Spark users are struggling with datetime addition (and then drowning in UDF complexity), so I know this'll be a high-value function for a lot of users.

I created a JIRA where we can discuss in more detail.

My concern with exposing the make_interval() function with the current implementation is that it takes a lot of arguments. make_interval(hours = 2) would rely on default arguments, which I haven't seen in other org.apache.spark.sql.functions functions. I'm not sure the named-parameter workaround to avoid listing all 7 args would translate well to the Java, Python, and R APIs. Datetime objects in most languages are instantiated with 6 args (not including weeks), so that might be another confusing part of make_interval() for end users.
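
To make the arity concern concrete, a Scala wrapper over the existing SQL function would need roughly this shape (a sketch only; the parameter names mirror the SQL make_interval arguments and the body is elided):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

// Sketch: seven interval components, all defaulted, so callers could write
// make_interval(hours = lit(2)) without listing every argument.
def make_interval(
    years: Column = lit(0),
    months: Column = lit(0),
    weeks: Column = lit(0),
    days: Column = lit(0),
    hours: Column = lit(0),
    mins: Column = lit(0),
    secs: Column = lit(0)): Column = ???  // implementation intentionally omitted
```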

The JIRA outlines different ways we can expose this functionality. I look forward to working with you to decide on the best user interface, and I'll be happy to implement it. These functions will make users really happy :)

Comment on lines 1448 to 1449


Member


Could you remove the empty lines, please.

Comment on lines +1451 to +1452
extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
with TimeZoneAwareExpression {
Member


Contributor Author


I read the scala-style-guide and wasn't able to decipher the best practice for this specific scenario. I opened a scala-style-guide issue to clarify. Do you think this is better?

  extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
  with TimeZoneAwareExpression {

I'm glad you made the comment because it's really important to manage whitespace consistently throughout the codebase.

Comment on lines 583 to 585
start: Long,
hours: Int,
zoneId: ZoneId): Long = {
Member


Suggested change (4-space indentation for the parameters, per the style guide linked below):
    start: Long,
    hours: Int,
    zoneId: ZoneId): Long = {

see https://github.com/databricks/scala-style-guide#spacing-and-indentation

* @param numHours The number of hours to add to `startTime`, can be negative to subtract hours
* @return A timestamp, or null if `startTime` was a string that could not be cast to a timestamp
* @group datetime_funcs
* @since TBD
Member


3.2.0

test("function add_hours") {
val t1 = Timestamp.valueOf("2015-10-01 00:00:01")
val t2 = Timestamp.valueOf("2016-02-29 00:00:02")
val df = Seq((1, t1), (2, t2)).toDF("n", "t")
Member


Do you use the n column somewhere?

@MaxGekk
Member

MaxGekk commented Jan 5, 2021

Also, I think we should strictly define the semantics of adding hours. Are we adding hours to "physical" or "local" time? For example, https://www.timeanddate.com/time/change/usa/los-angeles?year=2019

Sunday, 3 November 2019, 02:00:00 clocks were turned backward 1 hour to
Sunday, 3 November 2019, 01:00:00 local standard time instead.

add_hours(timestamp '2019-11-03 00:30:00 America/Los_Angeles', 2) should return:

  1. timestamp '2019-11-03 02:30:00 America/Los_Angeles' or
  2. timestamp '2019-11-03 01:30:00 America/Los_Angeles' (2 hours in "physical" time)
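
For illustration, the two candidate semantics in plain java.time (a sketch, not Spark code):

```scala
import java.time.{Duration, LocalDateTime, ZoneId}

val zone = ZoneId.of("America/Los_Angeles")
val start = LocalDateTime.of(2019, 11, 3, 0, 30).atZone(zone)

// Option 1, "local" (wall-clock) semantics: add 2 hours to the local
// date-time and re-resolve it in the zone -> 2019-11-03 02:30 local time.
val local = start.toLocalDateTime.plusHours(2).atZone(zone)

// Option 2, "physical" (instant) semantics: move 2 hours along the instant
// time-line; the clocks were turned back at 02:00, so this lands on
// 2019-11-03 01:30 local standard time.
val physical = start.plus(Duration.ofHours(2))
```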

@srowen
Member

srowen commented Jan 5, 2021

Also, does this suggest we need add_minutes, add_days, etc.?

@MrPowers
Contributor Author

MrPowers commented Jan 5, 2021

@srowen - yep, we need to expose functions that make it easy to perform datetime addition with any combination of years, months, days, weeks, hours, minutes, and seconds. This PR for add_hours is just part of the larger goal. See this ticket for the different ways we can expose this functionality. Feel free to comment with your thoughts.

All the hard foundational work is already done. We just need to agree on the best public-facing APIs and write a bit of code to expose the arbitrary datetime addition functionality.

@srowen
Member

srowen commented Jan 5, 2021

I get it, it's just a lot of new API surface for a small gain IMHO. I don't feel strongly about it. Is this in SQL too?

@MrPowers
Contributor Author

MrPowers commented Jan 5, 2021

@srowen - Maybe it's best to start by exposing the make_interval() function so we're only expanding the API by a single function, but giving users the power to properly add datetimes. I'll create another PR with the make_interval() function and we can see how that feels. If make_interval(hours = 2) is sufficient, then we can close this PR and just go with the single function. Any implementation that lets users perform datetime addition without making them write UDFs will be a huge win.

@MaxGekk
Member

MaxGekk commented Jan 5, 2021

I'll create another PR with the make_interval() function ...

@cloud-fan @HyukjinKwon @dongjoon-hyun Are you ok with exposing this function via Scala, Python and R APIs?

@cloud-fan
Contributor

I think df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hour = 2)) is better.

@HyukjinKwon
Member

I'm okay with exposing make_interval.

@MrPowers
Contributor Author

MrPowers commented Jan 6, 2021

Opened a PR for the make_interval function. Thanks everyone for the valuable feedback. make_interval is better than separate add_hours, add_seconds, etc. functions.

@MrPowers MrPowers closed this Jan 6, 2021
cloud-fan pushed a commit that referenced this pull request Feb 10, 2021
### What changes were proposed in this pull request?

This pull request exposes the `make_interval` function, [as suggested here](#31000 (review)), and as agreed to [here](#31000 (comment)) and [here](#31000 (comment)).

This powerful little function allows for idiomatic datetime arithmetic via the Scala API:

```scala
// add two hours
df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hours = lit(2)))

// subtract one week and 30 seconds
col("d") - make_interval(weeks = lit(1), secs = lit(30))
```

The `make_interval` [SQL function](#26446) already exists.

Here is [the JIRA ticket](https://issues.apache.org/jira/browse/SPARK-33995) for this PR.

### Why are the changes needed?

The Spark API makes it easy to perform datetime addition / subtraction with months (`add_months`) and days (`date_add`).  Users need to write code like this to perform datetime addition with years, weeks, hours, minutes, or seconds:

```scala
df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))
```

We don't want to force users to manipulate SQL strings when they're using the Scala API.
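
For comparison, illustrative usage of the existing month- and day-level helpers mentioned above (not part of this change):

```scala
// add_months and date_add already provide first-class month / day arithmetic.
df.withColumn("plus_2_months", add_months(col("first_datetime"), 2))
df.withColumn("plus_2_days", date_add(col("first_datetime"), 2))
```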

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds `make_interval` to the `org.apache.spark.sql.functions` API.

This single function will benefit a lot of users.  It's a small increase in the surface of the API for a big gain.

### How was this patch tested?

This was tested via unit tests.

cc: MaxGekk

Closes #31073 from MrPowers/SPARK-33995.

Authored-by: MrPowers <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>