[SPARK-33975][SQL] Include add_hours function #31000
Conversation
Can one of the admins verify this patch?
cc @MaxGekk FYI
@MrPowers, can we file a JIRA?
@HyukjinKwon - thank you for commenting. I created a JIRA ticket. If you need anything else, just let me know and I'll get right on it!
As an alternative solution, we could expose the `make_interval()` function (see #26446) to the public Scala API (and to PySpark and R). For your example:
df.withColumn("plus_2_hours", $"first_datetime" + make_interval(hours = 2))
`add_hours()` could be added as an "alias" for `+ make_interval(hours = _)`.
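For illustration, a minimal sketch of what such an alias could look like, assuming the Column-based `make_interval` signature with named, defaulted arguments from #31073; this is not code from either PR:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, make_interval}

// Hypothetical alias: shift a timestamp column by a fixed number of hours
// by adding an interval rather than introducing a dedicated expression.
def add_hours(startTime: Column, numHours: Int): Column =
  startTime + make_interval(hours = lit(numHours))

// Usage (assuming `df` has a timestamp column named "first_datetime"):
// df.withColumn("plus_2_hours", add_hours(col("first_datetime"), 2))
```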
My concerns about the functions are:
- We will probably need to add other functions for days, minutes, etc., which extends the surface of the public APIs.
- It is a partial solution from my point of view. What if you need to add minutes, days, etc.?
- Maintainability costs of a custom implementation of `+ interval`.
@MaxGekk - thanks for the review and for your excellent work on this project. I really like the idea of exposing the `make_interval()` function. I created a JIRA where we can discuss in more detail; my concern about how it is exposed is described there, and the JIRA outlines different ways we can expose this functionality. I look forward to working with you to decide on the best user interface and will then be happy to implement it. These functions will make users really happy :)
Could you remove the empty lines, please?
extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
with TimeZoneAwareExpression {
Please fix the indentation according to https://github.com/databricks/scala-style-guide#spacing-and-indentation
I read the scala-style-guide and wasn't able to decipher the best practice for this specific scenario. I opened a scala-style-guide issue to clarify. Do you think this is better?
extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
with TimeZoneAwareExpression {
I'm glad you made the comment because it's really important to manage whitespace consistently throughout the codebase.
start: Long,
hours: Int,
zoneId: ZoneId): Long = {
Suggested change:

    start: Long,
    hours: Int,
    zoneId: ZoneId): Long = {
see https://github.com/databricks/scala-style-guide#spacing-and-indentation
* @param numHours The number of hours to add to `startTime`, can be negative to subtract hours
* @return A timestamp, or null if `startTime` was a string that could not be cast to a timestamp
* @group datetime_funcs
* @since TBD
3.2.0
test("function add_hours") { | ||
val t1 = Timestamp.valueOf("2015-10-01 00:00:01") | ||
val t2 = Timestamp.valueOf("2016-02-29 00:00:02") | ||
val df = Seq((1, t1), (2, t2)).toDF("n", "t") |
Do you use the `n` column somewhere?
Also, I think we should strictly define the semantics of adding hours. Are we adding hours to "physical" or "local" time? For example (see https://www.timeanddate.com/time/change/usa/los-angeles?year=2019), what should add_hours(timestamp '2019-11-03 00:30:00 America/Los_Angeles', 2) return?
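To make the distinction concrete, here is a small java.time sketch (not from the PR) of the two possible semantics around that DST fall-back, where clocks go from 02:00 PDT back to 01:00 PST:

```scala
import java.time.{LocalDateTime, ZoneId, ZonedDateTime}
import java.time.temporal.ChronoUnit

val la = ZoneId.of("America/Los_Angeles")
// 2019-11-03 00:30 PDT, 1.5 hours before the DST transition.
val start = ZonedDateTime.of(LocalDateTime.parse("2019-11-03T00:30:00"), la)

// "Physical" time: add exactly 2 * 3600 seconds on the instant timeline.
val physical = start.plus(2, ChronoUnit.HOURS)
// -> 2019-11-03T01:30-08:00 (01:30 PST, two elapsed hours later)

// "Local" time: add 2 hours to the wall clock, then re-resolve the zone.
val local = ZonedDateTime.of(start.toLocalDateTime.plusHours(2), la)
// -> 2019-11-03T02:30-08:00 (02:30 PST, three elapsed hours later)
```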
Also, does this suggest we need add_minutes, add_days, etc.?
@srowen - yep, we need to expose functions that make it easy to perform datetime addition with any combination of years, months, days, weeks, hours, minutes, and seconds. This PR for `add_hours` is only one piece of that. All the hard foundational work is already done. We just need to agree on the best public-facing APIs and write a bit of code to expose the arbitrary datetime addition functionality.
I get it, it's just a lot of new API surface for a small gain IMHO. I don't feel strongly about it. Is this in SQL too?
@srowen - Maybe it's best to start by exposing the `make_interval` function.
@cloud-fan @HyukjinKwon @dongjoon-hyun Are you ok with exposing this function via Scala, Python and R APIs?
I think exposing `make_interval` makes sense.
I'm okay with exposing `make_interval`.
Opened a PR for the `make_interval` function: #31073.
### What changes were proposed in this pull request?

This pull request exposes the `make_interval` function, [as suggested here](#31000 (review)), and as agreed to [here](#31000 (comment)) and [here](#31000 (comment)). This powerful little function allows for idiomatic datetime arithmetic via the Scala API:

```scala
// add two hours
df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hours = lit(2)))

// subtract one week and 30 seconds
col("d") - make_interval(weeks = lit(1), secs = lit(30))
```

The `make_interval` [SQL function](#26446) already exists. Here is [the JIRA ticket](https://issues.apache.org/jira/browse/SPARK-33995) for this PR.

### Why are the changes needed?

The Spark API makes it easy to perform datetime addition / subtraction with months (`add_months`) and days (`date_add`). Users need to write code like this to perform datetime addition with years, weeks, hours, minutes, or seconds:

```scala
df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))
```

We don't want to force users to manipulate SQL strings when they're using the Scala API.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds `make_interval` to the `org.apache.spark.sql.functions` API. This single function will benefit a lot of users. It's a small increase in the surface of the API for a big gain.

### How was this patch tested?

This was tested via unit tests.

cc: MaxGekk

Closes #31073 from MrPowers/SPARK-33995.

Authored-by: MrPowers <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR adds an `add_hours` function. Here's how users currently need to add hours to a time column (see the sketch below). We don't want to make users manipulate strings in their Scala code. We also don't want to force users to pass around column names when they should be passing around Column objects.
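The current workaround looks roughly like this (mirroring the `INTERVAL` example quoted in the follow-up PR description above; `df` is assumed to have a timestamp column named `first_datetime`):

```scala
import org.apache.spark.sql.functions.expr

// Current workaround: datetime arithmetic via a SQL interval string.
df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))
```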
The `add_hours` function will make this logic a lot more intuitive and consistent with the rest of the API (see the sketch below). The Stack Overflow question on this issue has 21,000 views, so this feature will be useful for a lot of users.
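A sketch of the intended usage, assuming an `add_hours(startTime: Column, numHours: Int)` signature consistent with the Scaladoc and test snippet shown earlier in this thread (the final signature is not confirmed here):

```scala
import org.apache.spark.sql.functions.col

// `add_hours` is the function proposed by this PR; it is not part of the
// released org.apache.spark.sql.functions API. `df` is assumed to contain
// a timestamp column named "first_datetime".
df.withColumn("plus_2_hours", add_hours(col("first_datetime"), 2))
```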
Spark 3 made some awesome improvements to the dates / times APIs and this PR is one example of an improvement that'll continue making these APIs easier to use.
Why are the changes needed?
There are the `INTERVAL` and UDF workarounds (a sketch of the UDF approach is shown below), so this isn't strictly needed, but it makes the API a lot easier to work with when performing hour-based computations. It'll also make the answer easier to find. It's not easy to find the `INTERVAL` solution in the docs.
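For reference, a minimal sketch of the UDF workaround mentioned above (an illustration, not code from the PR; it adds hours on the instant timeline):

```scala
import java.sql.Timestamp
import java.time.Duration
import org.apache.spark.sql.functions.{col, lit, udf}

// UDF workaround: wrap java.time arithmetic and apply it row by row.
val addHoursUdf = udf((ts: Timestamp, hours: Int) =>
  if (ts == null) null else Timestamp.from(ts.toInstant.plus(Duration.ofHours(hours))))

// Usage (assuming `df` has a timestamp column named "first_datetime"):
// df.withColumn("plus_2_hours", addHoursUdf(col("first_datetime"), lit(2)))
```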
Does this PR introduce any user-facing change?
Yes, this adds the `add_hours` function to the `org.apache.spark.sql.functions` object, which is a public-facing API. The `@since` annotation will need to be updated with the right version if this ends up getting merged in.

How was this patch tested?
The function was unit tested. The unit tests follow the testing patterns of similar SQL functions.
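For illustration, here is how the truncated test snippet shown earlier might be completed, assuming an `add_hours(startTime: Column, numHours: Int)` signature and the `checkAnswer` helper from Spark's SQL test suites (the snippet is meant to live inside a date-functions test class where `Timestamp`, `Row`, `col`, and the DataFrame implicits are in scope; the expected values are computed by hand, not taken from the PR):

```scala
test("function add_hours") {
  val t1 = Timestamp.valueOf("2015-10-01 00:00:01")
  val t2 = Timestamp.valueOf("2016-02-29 00:00:02")
  val df = Seq((1, t1), (2, t2)).toDF("n", "t")
  // Adding 2 hours shifts each timestamp by exactly two wall-clock hours
  // (no DST boundary is crossed for these dates in common session time zones).
  checkAnswer(
    df.select(add_hours(col("t"), 2)),
    Seq(
      Row(Timestamp.valueOf("2015-10-01 02:00:01")),
      Row(Timestamp.valueOf("2016-02-29 02:00:02"))))
}
```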