[BugFix] Fix pre-1970 Parquet timestamp load corrupting sub-second DATETIME#75207

Merged
xiangguangyxg merged 1 commit into
StarRocks:mainfrom
xiangguangyxg:be-parquet-pre1970-ts
Jun 26, 2026

@xiangguangyxg

Why I'm doing:

When loading a Parquet INT64 column annotated TIMESTAMP (isAdjustedToUTC=false) into a StarRocks DATETIME, a pre-1970 value with a nonzero sub-second part was decoded to a garbage value instead of the actual wall-clock time.

Int64ToDateTimeConverter::convert splits the signed epoch tick with C++ truncating division:

int64_t seconds     = src_data[i] / _second_mask;
int64_t nanoseconds = (src_data[i] % _second_mask) * _scale_to_nano_factor;

For a negative tick whose sub-second remainder is nonzero, the remainder is negative, and so is nanoseconds. timestamp::of_epoch_second then packs the result via a bitwise OR (from_julian_and_time), so the negative microsecond value, whose high bits are all set in two's complement, corrupts the packed Julian field. For example, 1969-12-31 23:59:59.500 loaded as a garbage value in year 41222. Whole-second negative ticks and all post-1970 values were unaffected.
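To make the failure mode concrete, here is a minimal sketch of the truncating split and of a bitwise-OR pack. The 40-bit time field, `pack_with_or`, `unpack_day`, and the `SplitTick` struct are assumptions made for illustration only; they are not the actual StarRocks layout or API, so the corrupted value here differs from the year-41222 value observed in the bug.

```cpp
#include <cassert>
#include <cstdint>

struct SplitTick {
    int64_t seconds;
    int64_t nanoseconds;
};

// Truncating split, mirroring the two lines quoted above.
// C++ integer division rounds toward zero, so a negative tick
// yields a negative (or zero) remainder.
inline SplitTick truncating_split(int64_t tick, int64_t ticks_per_second,
                                  int64_t scale_to_nano) {
    assert(ticks_per_second > 0);
    return {tick / ticks_per_second, (tick % ticks_per_second) * scale_to_nano};
}

// Hypothetical packed layout (an assumption for this sketch):
// day number in the high bits, microsecond-of-day in the low 40 bits.
constexpr int kTimeBits = 40;

inline uint64_t pack_with_or(uint64_t day_number, int64_t micros_of_day) {
    // A negative micros_of_day converts to an unsigned value with all
    // high bits set, so the OR overwrites the day-number field.
    return (day_number << kTimeBits) | static_cast<uint64_t>(micros_of_day);
}

inline uint64_t unpack_day(uint64_t packed) {
    return packed >> kTimeBits;
}
```

With a MILLIS tick of -500, `truncating_split(-500, 1000, 1000000)` gives zero seconds and a negative nanosecond count. Feeding a negative sub-second value into `pack_with_or` leaves a day field that no longer matches the one passed in, which is the same class of corruption described above.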

What I'm doing:

Borrow a whole second when the sub-second remainder is negative, so that nanoseconds stays in [0, NANOSECS_PER_SEC). This is the same floor split the FE boundary computation already uses (Math.floorDiv/Math.floorMod). of_epoch_second then receives a non-negative sub-second part and packs the correct value. The borrow is unit-agnostic (MILLIS/MICROS/NANOS) and runs before the UTC whole-second offset is applied, so it composes with the timezone-adjusted branch unchanged.
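The borrow can be sketched as follows. This is a standalone illustration rather than the patched converter: `floor_split`, `FloorSplit`, and `kNanosPerSecond` are hypothetical names, and the parameters stand in for `_second_mask` and `_scale_to_nano_factor`.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kNanosPerSecond = 1000000000;

struct FloorSplit {
    int64_t seconds;
    int64_t nanoseconds;
};

// Split an epoch tick into whole seconds and a non-negative nanosecond part.
// Starts from the truncating split, then borrows one second whenever the
// remainder is negative, which yields the floor-division result.
inline FloorSplit floor_split(int64_t tick, int64_t ticks_per_second,
                              int64_t scale_to_nano) {
    assert(ticks_per_second > 0 && scale_to_nano > 0);
    int64_t seconds = tick / ticks_per_second;
    int64_t nanoseconds = (tick % ticks_per_second) * scale_to_nano;
    if (nanoseconds < 0) {
        seconds -= 1;                    // borrow a whole second
        nanoseconds += kNanosPerSecond;  // now in [0, kNanosPerSecond)
    }
    return {seconds, nanoseconds};
}
```

For the MILLIS tick -500 this returns -1 second and 500000000 nanoseconds, that is, half a second after 1969-12-31 23:59:59. Whole-second negative ticks and non-negative ticks never enter the branch, so their results match the truncating split.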

Added a regression test (Int64PreEpochTimestampSubSecond) that loads a Parquet file holding 1969-12-31 23:59:59.500000 in both MILLIS and MICROS units: it decoded to a garbage value before the fix and to the correct wall clock after. The existing post-1970 Int64_2_Timestamp test is unchanged (all 14 ColumnConverterTest cases pass).
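For reference, the raw INT64 ticks implied by that wall clock can be derived as below. The contents of the actual test file are not shown in this PR, so this only computes what the timestamp encoding implies; `to_epoch_ticks` and `kMicrosPerSecond` are hypothetical names used for this sketch.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kMicrosPerSecond = 1000000;

// Convert (floor seconds since epoch, microsecond within that second)
// into a signed epoch tick for the given unit.
// ticks_per_second is 1000 for MILLIS, 1000000 for MICROS,
// and 1000000000 for NANOS.
inline int64_t to_epoch_ticks(int64_t floor_seconds, int64_t micros_of_second,
                              int64_t ticks_per_second) {
    assert(micros_of_second >= 0 && micros_of_second < kMicrosPerSecond);
    return floor_seconds * ticks_per_second +
           micros_of_second * ticks_per_second / kMicrosPerSecond;
}
```

1969-12-31 23:59:59.500000 is floor second -1 plus 500000 microseconds, so `to_epoch_ticks(-1, 500000, 1000)` is -500 for MILLIS and `to_epoch_ticks(-1, 500000, 1000000)` is -500000 for MICROS. Both are negative ticks with a nonzero remainder, which is exactly the case the truncating split mishandled.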

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5

…TETIME

When loading a Parquet INT64 TIMESTAMP (isAdjustedToUTC=false) into a
DATETIME, Int64ToDateTimeConverter split the signed epoch tick with C++
truncating division. For a pre-1970 tick with a nonzero sub-second part the
remainder is negative, so nanoseconds was negative; timestamp::of_epoch_second
then packs it via a bitwise OR (from_julian_and_time), and the negative
microsecond corrupts the packed Julian field — the loaded DATETIME became a
garbage value (e.g. year 41222) instead of the real wall clock.

Borrow a whole second when the sub-second remainder is negative so nanoseconds
stays in [0, NANOSECS_PER_SEC), matching the floor split the FE boundary
computation uses. Whole-second and post-1970 values are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
@CelerData-Reviewer

@codex review

@xiangguangyxg

@codex review

@github-actions github-actions Bot requested review from dirtysalt and trueeyu June 23, 2026 07:50
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Delightful!

Reviewed commit: 3abdcc04ec

@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[BE Incremental Coverage Report]

pass : 3 / 3 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 be/src/formats/parquet/column_converter.cpp | 3 | 3 | 100.00% | [] |

@xiangguangyxg xiangguangyxg requested a review from kevincai June 24, 2026 13:20
@xiangguangyxg xiangguangyxg merged commit 0f52fa0 into StarRocks:main Jun 26, 2026
109 of 112 checks passed
@xiangguangyxg xiangguangyxg deleted the be-parquet-pre1970-ts branch June 26, 2026 02:11
@github-actions

@Mergifyio backport branch-4.1

@github-actions github-actions Bot removed the 4.1 label Jun 26, 2026
@mergify

mergify Bot commented Jun 26, 2026

backport branch-4.1

✅ Backports have been created

wanpengfei-git pushed a commit that referenced this pull request Jun 26, 2026
…TETIME (backport #75207) (#75385)

Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
Co-authored-by: xiangguangyxg <110401425+xiangguangyxg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
xiangguangyxg added a commit to xiangguangyxg/starrocks that referenced this pull request Jun 26, 2026
OrcTimestampHelper::orc_ts_to_native_ts_before_unix_epoch hardcoded the
microsecond argument to 0, so loading any pre-1970 ORC TIMESTAMP with a
non-zero sub-second dropped the fraction (e.g. 1965-03-02 12:00:00.500000
loaded as 1965-03-02 12:00:00.000000) on both the plain and the instant
(timestamp with local time zone) read paths. Pass nanoseconds /
NANOSECS_PER_USEC instead, mirroring the after-epoch path.

The shared helper is also used by the ORC stripe min/max statistics decoder;
keep its pre-1970 bounds byte-for-byte unchanged by dropping the sub-second
for negative-epoch bounds there, leaving the decoder's separate pre-existing
stats-nanos handling for a follow-up.

Companion to the merged Parquet fix StarRocks#75207.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>