Support converting large dates (i.e. +10999-12-31) from string to Date32 #7074

Open
wants to merge 2 commits into base: main

@phillipleblanc phillipleblanc commented Feb 4, 2025

Which issue does this PR close?

Rationale for this change

Support for casting large dates from string to Date32.

What changes are included in this PR?

Extend the parse_date function, which is used in the impl Parser for Date32Type, to handle dates that are prefixed with + or -. If the date is not prefixed with + or -, the existing logic is used unmodified.

This code isn't as optimized as the code for processing more common date formats - but given that these extended dates are relatively rare in practice, I don't think it matters all that much.
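
For readers following along, here is a minimal std-only sketch of what parsing a sign-prefixed date can look like. It is not the code in this PR: the names parse_ymd, parse_digits, days_in_month and is_leap are hypothetical, and the rule that unsigned years must have exactly four digits is an assumption made for this sketch.

```rust
/// Proleptic Gregorian leap-year rule; rem_euclid keeps it correct for negative years.
/// (Hypothetical helper, not part of arrow-rs.)
fn is_leap(year: i32) -> bool {
    year.rem_euclid(4) == 0 && (year.rem_euclid(100) != 0 || year.rem_euclid(400) == 0)
}

/// Number of days in a month, or None if the month is outside 1..=12.
fn days_in_month(year: i32, month: u32) -> Option<u32> {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => Some(31),
        4 | 6 | 9 | 11 => Some(30),
        2 => Some(if is_leap(year) { 29 } else { 28 }),
        _ => None,
    }
}

/// Parse a non-empty run of ASCII digits (rejects signs, spaces and overflow).
fn parse_digits(s: &str) -> Option<u32> {
    if s.is_empty() || !s.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    s.parse().ok()
}

/// Split "[+|-]YYYY...-MM-DD" into (year, month, day).
/// Assumption for this sketch: a signed year needs at least four digits,
/// an unsigned year exactly four.
fn parse_ymd(input: &str) -> Option<(i32, u32, u32)> {
    let (sign, rest) = match input.as_bytes().first().copied()? {
        b'+' => (Some(1), &input[1..]),
        b'-' => (Some(-1), &input[1..]),
        _ => (None, input),
    };
    let mut parts = rest.splitn(3, '-');
    let (y, m, d) = (parts.next()?, parts.next()?, parts.next()?);
    let year_len_ok = match sign {
        Some(_) => y.len() >= 4,
        None => y.len() == 4,
    };
    if !year_len_ok {
        return None;
    }
    let year = sign.unwrap_or(1) * i32::try_from(parse_digits(y)?).ok()?;
    let month = parse_digits(m)?;
    let day = parse_digits(d)?;
    if day == 0 || day > days_in_month(year, month)? {
        return None;
    }
    Some((year, month, day))
}
```

With this sketch, parse_ymd("+10999-12-31") yields Some((10999, 12, 31)) and parse_ymd("-0010-02-28") yields Some((-10, 2, 28)), while the unsigned "10999-12-31" is rejected.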

Are there any user-facing changes?

Aside from the desired fix, no.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 4, 2025

@alamb alamb left a comment

Thank you @phillipleblanc -- this looks reasonable to me. I think the PR only needs a few more tests and then we can merge.

I am sure we could make parsing dates like this faster but we can do that type of optimization as a follow up.

I am also running the cast benchmarks just to be sure this doesn't accidentally introduce a regression, and I will post the results to this PR.

Some("-0010-02-28"),
])) as ArrayRef;
let to_type = DataType::Date32;
let options = CastOptions {

Without the changes in this PR, this code fails like this (so the test does cover it):

called Result::unwrap() on an Err value: CastError("Cannot cast string '+10999-12-31' to value of Date32 type")

fn test_cast_string_with_large_date_to_date32() {
let array = Arc::new(StringArray::from(vec![
Some("+10999-12-31"),
Some("-0010-02-28"),

Could you also please add tests for:

            Some("0010-02-28"),

(to show that the negative value actually results in a different parsed value)

"0000-00-00"
"-0000-01-01"
"-0001-01-01"

(for boundary cases)

Also, could you add a test showing that trying to parse Some("10999-12-31") (more than 4 year digits, no sign) correctly results in an error?
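
To illustrate why the sign matters for the resulting value: Date32 stores days since the UNIX epoch, so "0010-02-28" and "-0010-02-28" should map to different day counts. Below is a std-only sketch using the well-known days-from-civil arithmetic for the proleptic Gregorian calendar. The function name days_from_civil is hypothetical and this is not the conversion arrow-rs or chrono actually uses.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
/// Assumes month is in 1..=12 and day is valid for that month; no validation here.
/// (Hypothetical helper for illustration, not part of arrow-rs.)
fn days_from_civil(year: i32, month: u32, day: u32) -> i64 {
    // Treat January and February as the last months of the previous year,
    // so the leap day falls at the end of the shifted year.
    let y = i64::from(year) - if month <= 2 { 1 } else { 0 };
    let era = y.div_euclid(400); // 400-year cycle index, floors for negative years
    let yoe = y.rem_euclid(400); // year within the cycle, 0..=399
    let mp = (i64::from(month) + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + i64::from(day) - 1; // day within the shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day within the cycle
    era * 146_097 + doe - 719_468 // 719_468 days from 0000-03-01 to 1970-01-01
}
```

Under these assumptions, days_from_civil(1970, 1, 1) is 0 and days_from_civil(10, 2, 28) is strictly greater than days_from_civil(-10, 2, 28), which is the distinction the extra test case is meant to pin down. The result is an i64; a real implementation would still need to check that it fits in the i32 used by Date32.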

@@ -595,6 +595,26 @@ const EPOCH_DAYS_FROM_CE: i32 = 719_163;
const ERR_NANOSECONDS_NOT_SUPPORTED: &str = "The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804";

fn parse_date(string: &str) -> Option<NaiveDate> {
// If the date has an extended (signed) year such as "+10999-12-31" or "-0012-05-06"

It was not at all clear to me at first that the sign results in different rules. Can we add some comments to clarify? Something like the suggestion below.

Also, can you include a link to the reference where you got that sentence?

Suggested change
// If the date has an extended (signed) year such as "+10999-12-31" or "-0012-05-06"
// If the date has an extended (signed) year such as "+10999-12-31" or "-0012-05-06"
//
// According to ISO 8601, years have:
// Four digits or more for the year. Years in the range 0000 to 9999 will be pre-padded by
// zero to ensure four digits. Years outside that range will have a prefixed positive or negative symbol.
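
As a small illustration of the rule quoted in the suggested comment, here is a sketch that formats a year the way that sentence describes: zero-padded to four digits inside 0000 to 9999, and sign-prefixed outside that range. The function name format_iso_year is hypothetical and is not part of arrow-rs or chrono.

```rust
/// Format a year following the rule quoted above (hypothetical helper).
fn format_iso_year(year: i32) -> String {
    if (0..=9999).contains(&year) {
        // In range: no sign, zero-padded to four digits.
        format!("{:04}", year)
    } else if year < 0 {
        // Negative: leading '-', magnitude padded to at least four digits.
        format!("-{:04}", year.unsigned_abs())
    } else {
        // Above 9999: leading '+', already more than four digits.
        format!("+{}", year)
    }
}
```

So 10 becomes "0010", -10 becomes "-0010", and 10999 becomes "+10999", which matches the strings used in the test above.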

alamb commented Feb 6, 2025

++ critcmp main phillip_250205-handle-large-dates
group         main                                   phillip_250205-handle-large-dates
-----         ----                                   ---------------------------------
2020-09-08    1.00     21.5±0.05ns        ? ?/sec    1.04     22.2±0.05ns        ? ?/sec
2020-09-8     1.01     19.0±0.08ns        ? ?/sec    1.00     18.8±0.03ns        ? ?/sec
2020-9-08     1.00     18.6±0.04ns        ? ?/sec    1.04     19.4±0.15ns        ? ?/sec
2020-9-8      1.00     17.4±0.02ns        ? ?/sec    1.01     17.5±0.02ns        ? ?/sec

Seems ok to me

Development

Successfully merging this pull request may close these issues.

Support casting strings to Date32 that contain large dates
3 participants