
[SPARK-33975][SQL] Include add_hours function #31000

Closed
wants to merge 2 commits

Conversation

MrPowers
Contributor

@MrPowers MrPowers commented Jan 3, 2021

What changes were proposed in this pull request?

This PR adds an add_hours function. Here's how users currently need to add hours to a timestamp column:

df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))

We don't want to make users manipulate strings in their Scala code. We also don't want to force users to pass around column names when they should be passing around Column objects.

The add_hours function will make this logic a lot more intuitive and consistent with the rest of the API:

df.withColumn("plus_2_hours", add_hours(col("first_datetime"), lit(2)))

The Stack Overflow question on this issue has 21,000 views, so this feature will be useful for a lot of users.

Spark 3 made some awesome improvements to the date/time APIs, and this PR is one more step toward making those APIs easier to use.

Why are the changes needed?

There are INTERVAL and UDF workarounds, so this isn't strictly needed, but it makes the API much easier to work with for hour-based computations. It also makes the answer easier to find: the INTERVAL solution is hard to locate in the docs.
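
For reference, the UDF workaround usually looks something like this (an illustrative sketch; the UDF name and types are assumptions, not code from this PR):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit, udf}

// Hypothetical UDF workaround: adds a fixed number of hours as a physical
// duration (so it ignores DST transitions, unlike an interval-aware function).
val addHoursUdf = udf((ts: Timestamp, hours: Int) =>
  if (ts == null) null else new Timestamp(ts.getTime + hours * 3600L * 1000L))

df.withColumn("plus_2_hours", addHoursUdf(col("first_datetime"), lit(2)))
```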

Does this PR introduce any user-facing change?

Yes, this adds the add_hours function to the org.apache.spark.sql.functions object, which is a public-facing API. The @since annotation will need to be updated with the right version if this ends up getting merged.
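
Roughly, the new public entry looks like the sketch below (the `AddHours` expression class name is an assumption here; the parameter names come from the function's doc comment):

```scala
// Sketch of the entry added to org.apache.spark.sql.functions; `withExpr` is
// the object's private helper for wrapping a Catalyst expression in a Column,
// and `AddHours` is the assumed name of the new expression class.

/**
 * Adds `numHours` hours to `startTime`.
 *
 * @group datetime_funcs
 * @since TBD
 */
def add_hours(startTime: Column, numHours: Column): Column = withExpr {
  AddHours(startTime.expr, numHours.expr)
}
```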

How was this patch tested?

The function was unit tested. The unit tests follow the testing patterns of similar SQL functions.

@github-actions github-actions bot added the SQL label Jan 3, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

cc @MaxGekk FYI

@HyukjinKwon
Member

@MrPowers, can we file a JIRA?

@MrPowers MrPowers changed the title from [SPARK-XXXXX][SQL] Include add_hours function to [SPARK-33975][SQL] Include add_hours function on Jan 4, 2021
@MrPowers
Contributor Author

MrPowers commented Jan 4, 2021

@HyukjinKwon - thank you for commenting. I created a JIRA ticket. If you need anything else, just let me know and I'll get right on it!

Member

@MaxGekk MaxGekk left a comment


As an alternative possible solution, we could expose the make_interval() function (see #26446) to public Scala APIs (and to PySpark and R). For your example:

df.withColumn("plus_2_hours", $"first_datetime" + make_interval(hours = 2))

add_hours() could be added as an "alias" for + make_interval(hour = _)

My concerns about the functions are:

  1. We will probably need to add other functions for days, minutes, etc., which extends the surface of the public APIs.
  2. It is a partial solution from my point of view. What if you need to add minutes, days, etc.?
  3. Maintainability costs of a custom implementation of + interval.

@MrPowers
Contributor Author

MrPowers commented Jan 4, 2021

@MaxGekk - thanks for the review and for your excellent work on this project.

I really like the idea of exposing the make_interval() function. make_interval() is way more powerful than just add_hours. Lots of Spark users are struggling with datetime addition (and then drowning in UDF complexity), so I know this'll be a high-value function for a lot of users.

I created a JIRA where we can discuss in more detail.

My concern with exposing the make_interval() function with the current implementation is that it takes a lot of arguments. make_interval(hours = 2) would rely on default arguments, which I haven't seen in other org.apache.spark.sql.functions functions. I'm not sure the named-parameter workaround to avoid listing all 7 args would translate well to the Java, Python, and R APIs. Datetime objects in most languages are instantiated with 6 args (not including weeks), so that might be another confusing part of make_interval() for end users.
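
To make the arity concern concrete, a Scala wrapper over the existing SQL function would need roughly this shape (a sketch only; the parameter names mirror the SQL make_interval arguments and the body is elided):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

// Sketch: seven interval components, all defaulted, so callers could write
// make_interval(hours = lit(2)) without listing every argument.
def make_interval(
    years: Column = lit(0),
    months: Column = lit(0),
    weeks: Column = lit(0),
    days: Column = lit(0),
    hours: Column = lit(0),
    mins: Column = lit(0),
    secs: Column = lit(0)): Column = ???  // implementation intentionally omitted
```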

The JIRA outlines different ways we can expose this functionality. I look forward to working with you to decide on the best user interface, and I'll be happy to implement it. These functions will make users really happy :)

Comment on lines 1448 to 1449


Member


Could you remove the empty lines, please.

Comment on lines +1451 to +1452
extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
with TimeZoneAwareExpression {
Member


Contributor Author


I read the scala-style-guide and wasn't able to decipher the best practice for this specific scenario. I opened a scala-style-guide issue to clarify. Do you think this is better?

  extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
  with TimeZoneAwareExpression {

I'm glad you made the comment because it's really important to manage whitespace consistently throughout the codebase.

Comment on lines 583 to 585
start: Long,
hours: Int,
zoneId: ZoneId): Long = {
Member


Suggested change (4-space indentation for the parameters, per the style guide linked below):
    start: Long,
    hours: Int,
    zoneId: ZoneId): Long = {

see https://github.com/databricks/scala-style-guide#spacing-and-indentation

* @param numHours The number of hours to add to `startTime`, can be negative to subtract hours
* @return A timestamp, or null if `startTime` was a string that could not be cast to a timestamp
* @group datetime_funcs
* @since TBD
Member


3.2.0

test("function add_hours") {
val t1 = Timestamp.valueOf("2015-10-01 00:00:01")
val t2 = Timestamp.valueOf("2016-02-29 00:00:02")
val df = Seq((1, t1), (2, t2)).toDF("n", "t")
Member


Do you use the n column somewhere?

@MaxGekk
Member

MaxGekk commented Jan 5, 2021

Also, I think we should strictly define the semantics of adding hours. Are we adding hours to "physical" or "local" time? For example, https://www.timeanddate.com/time/change/usa/los-angeles?year=2019

Sunday, 3 November 2019, 02:00:00 clocks were turned backward 1 hour to
Sunday, 3 November 2019, 01:00:00 local standard time instead.

add_hours(timestamp '2019-11-03 00:30:00 America/Los_Angeles', 2) should return:

  1. timestamp '2019-11-03 02:30:00 America/Los_Angeles' or
  2. timestamp '2019-11-03 01:30:00 America/Los_Angeles' (2 hours in "physical" time)
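
For illustration, the two candidate semantics in plain java.time (a sketch, not Spark code):

```scala
import java.time.{Duration, LocalDateTime, ZoneId}

val zone = ZoneId.of("America/Los_Angeles")
val start = LocalDateTime.of(2019, 11, 3, 0, 30).atZone(zone)

// Option 1, "local" (wall-clock) semantics: add 2 hours to the local
// date-time and re-resolve it in the zone -> 2019-11-03 02:30 local time.
val local = start.toLocalDateTime.plusHours(2).atZone(zone)

// Option 2, "physical" (instant) semantics: move 2 hours along the instant
// time-line; the clocks were turned back at 02:00, so this lands on
// 2019-11-03 01:30 local standard time.
val physical = start.plus(Duration.ofHours(2))
```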

@srowen
Member

srowen commented Jan 5, 2021

Also, does this suggest we need add_minutes, add_days, etc.?

@MrPowers
Contributor Author

MrPowers commented Jan 5, 2021

@srowen - yep, we need to expose functions that make it easy to perform datetime addition with any combination of years, months, days, weeks, hours, minutes, and seconds. This PR for add_hours is just part of the larger goal. See this ticket for the different ways we can expose this functionality. Feel free to comment with your thoughts.

All the hard foundational work is already done. We just need to agree on the best public-facing APIs and write a bit of code to expose the arbitrary datetime addition functionality.

@srowen
Member

srowen commented Jan 5, 2021

I get it, it's just a lot of new API surface for a small gain IMHO. I don't feel strongly about it. Is this in SQL too?

@MrPowers
Contributor Author

MrPowers commented Jan 5, 2021

@srowen - Maybe it's best to start by exposing the make_interval() function so we're only expanding the API by a single function, but giving users the power to properly add datetimes. I'll create another PR with the make_interval() function and we can see how that feels. If make_interval(hours = 2) is sufficient, then we can close this PR and just go with the single function. Any implementation that lets users perform datetime addition without making them write UDFs will be a huge win.

@MaxGekk
Member

MaxGekk commented Jan 5, 2021

I'll create another PR with the make_interval() function ...

@cloud-fan @HyukjinKwon @dongjoon-hyun Are you ok with exposing this function via Scala, Python and R APIs?

@cloud-fan
Contributor

I think df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hour = 2)) is better.

@HyukjinKwon
Member

I'm okay with exposing make_interval.

@MrPowers
Contributor Author

MrPowers commented Jan 6, 2021

Opened a PR for the make_interval function. Thanks everyone for the valuable feedback. make_interval is better than separate add_hours, add_seconds, etc. functions.

@MrPowers MrPowers closed this Jan 6, 2021
cloud-fan pushed a commit that referenced this pull request Feb 10, 2021
### What changes were proposed in this pull request?

This pull request exposes the `make_interval` function, [as suggested here](#31000 (review)), and as agreed to [here](#31000 (comment)) and [here](#31000 (comment)).

This powerful little function allows for idiomatic datetime arithmetic via the Scala API:

```scala
// add two hours
df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hours = lit(2)))

// subtract one week and 30 seconds
col("d") - make_interval(weeks = lit(1), secs = lit(30))
```

The `make_interval` [SQL function](#26446) already exists.

Here is [the JIRA ticket](https://issues.apache.org/jira/browse/SPARK-33995) for this PR.

### Why are the changes needed?

The Spark API makes it easy to perform datetime addition / subtraction with months (`add_months`) and days (`date_add`).  Users need to write code like this to perform datetime addition with years, weeks, hours, minutes, or seconds:

```scala
df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))
```

We don't want to force users to manipulate SQL strings when they're using the Scala API.
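
For comparison, illustrative usage of the existing month- and day-level helpers mentioned above (not part of this change):

```scala
// add_months and date_add already provide first-class month / day arithmetic.
df.withColumn("plus_2_months", add_months(col("first_datetime"), 2))
df.withColumn("plus_2_days", date_add(col("first_datetime"), 2))
```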

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds `make_interval` to the `org.apache.spark.sql.functions` API.

This single function will benefit a lot of users.  It's a small increase in the surface of the API for a big gain.

### How was this patch tested?

This was tested via unit tests.

cc: MaxGekk

Closes #31073 from MrPowers/SPARK-33995.

Authored-by: MrPowers <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>