[BugFix] Fix pre-1970 Parquet timestamp load corrupting sub-second DATETIME#75207
Merged
xiangguangyxg merged 1 commit intoJun 26, 2026
Merged
Conversation
…TETIME When loading a Parquet INT64 TIMESTAMP (isAdjustedToUTC=false) into a DATETIME, Int64ToDateTimeConverter split the signed epoch tick with C++ truncating division. For a pre-1970 tick with a nonzero sub-second part the remainder is negative, so nanoseconds was negative; timestamp::of_epoch_second then packs it via a bitwise OR (from_julian_and_time), and the negative microsecond corrupts the packed Julian field — the loaded DATETIME became a garbage value (e.g. year 41222) instead of the real wall clock. Borrow a whole second when the sub-second remainder is negative so nanoseconds stays in [0, NANOSECS_PER_SEC), matching the floor split the FE boundary computation uses. Whole-second and post-1970 values are unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
|
@codex review |
1 similar comment
Contributor
Author
|
@codex review |
|
Codex Review: Didn't find any major issues. Delightful! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Contributor
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
Contributor
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
Contributor
[BE Incremental Coverage Report]✅ pass : 3 / 3 (100.00%) file detail
|
kevincai
approved these changes
Jun 25, 2026
srlch
approved these changes
Jun 26, 2026
Contributor
|
Tick the box to add this pull request to the merge queue (same as
|
Contributor
|
@Mergifyio backport branch-4.1 |
Contributor
✅ Backports have been createdDetails
|
This was referenced Jun 26, 2026
xiangguangyxg
added a commit
to xiangguangyxg/starrocks
that referenced
this pull request
Jun 26, 2026
OrcTimestampHelper::orc_ts_to_native_ts_before_unix_epoch hardcoded the microsecond argument to 0, so loading any pre-1970 ORC TIMESTAMP with a non-zero sub-second dropped the fraction (e.g. 1965-03-02 12:00:00.500000 loaded as 1965-03-02 12:00:00.000000) on both the plain and the instant (timestamp with local time zone) read paths. Pass nanoseconds / NANOSECS_PER_USEC instead, mirroring the after-epoch path. The shared helper is also used by the ORC stripe min/max statistics decoder; keep its pre-1970 bounds byte-for-byte unchanged by dropping the sub-second for negative-epoch bounds there, leaving the decoder's separate pre-existing stats-nanos handling for a follow-up. Companion to the merged Parquet fix StarRocks#75207. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
xiangguangyxg
added a commit
to xiangguangyxg/starrocks
that referenced
this pull request
Jun 26, 2026
OrcTimestampHelper::orc_ts_to_native_ts_before_unix_epoch hardcoded the microsecond argument to 0, so loading any pre-1970 ORC TIMESTAMP with a non-zero sub-second dropped the fraction (e.g. 1965-03-02 12:00:00.500000 loaded as 1965-03-02 12:00:00.000000) on both the plain and the instant (timestamp with local time zone) read paths. Pass nanoseconds / NANOSECS_PER_USEC instead, mirroring the after-epoch path. The shared helper is also used by the ORC stripe min/max statistics decoder; keep its pre-1970 bounds byte-for-byte unchanged by dropping the sub-second for negative-epoch bounds there, leaving the decoder's separate pre-existing stats-nanos handling for a follow-up. Companion to the merged Parquet fix StarRocks#75207. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why I'm doing:
When loading a Parquet
INT64column annotatedTIMESTAMP(isAdjustedToUTC=false) into a StarRocksDATETIME, a pre-1970 value with a nonzero sub-second part was decoded to a corrupt garbage value instead of the real wall clock.Int64ToDateTimeConverter::convertsplits the signed epoch tick with C++ truncating division:For a negative tick whose sub-second remainder is nonzero,
nanosecondsis negative.timestamp::of_epoch_secondthen packs the result via a bitwise OR (from_julian_and_time), so the negative microsecond corrupts the packed Julian field — e.g.1969-12-31 23:59:59.500loaded as a year-41222 garbage value. Whole-second negatives and all post-1970 values were unaffected.What I'm doing:
Borrow a whole second when the sub-second remainder is negative, so
nanosecondsstays in[0, NANOSECS_PER_SEC)— the floor split the FE boundary computation already uses (Math.floorDiv/Math.floorMod).of_epoch_secondthen receives a non-negative sub-second and packs the correct value. The borrow is unit-agnostic (MILLIS/MICROS/NANOS) and runs before the UTC whole-second offset, so it composes with the timezone-adjusted branch unchanged.Added a regression test (
Int64PreEpochTimestampSubSecond) that loads a Parquet file holding1969-12-31 23:59:59.500000in both MILLIS and MICROS units: it decoded to a garbage value before the fix and to the correct wall clock after. The existing post-1970Int64_2_Timestamptest is unchanged (14/14ColumnConverterTestpass).What type of PR is this:
Does this PR entail a change in behavior?
Checklist:
Bugfix cherry-pick branch check: