[BugFix] Preserve sub-second when loading pre-1970 ORC TIMESTAMP (backport #75432)#75506
Merged
Merged
Conversation
) Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 29ef34a)
23 tasks
xiangguangyxg
approved these changes
Jun 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why I'm doing:
OrcTimestampHelper::orc_ts_to_native_ts_before_unix_epoch(be/src/formats/orc/utils.h) hardcoded the microsecond argument to0, so loading any pre-1970 ORCTIMESTAMPwith a non-zero sub-second dropped the fraction, e.g.1965-03-02 12:00:00.500000was loaded as1965-03-02 12:00:00.000000. It affects both the plainTIMESTAMPand theTIMESTAMP WITH LOCAL TIME ZONE(instant) read paths and is independent of tablet pre-split.The
after_unix_epochpath already carries the sub-second through (nanoseconds / 1000); only thebefore_unix_epochbranch threw it away. liborc hands the load path a clean, floored(seconds, nanoseconds)pair (nanosecondsin[0, 1e9),instant = data + nanoseconds/1e9): the writer's+1(ColumnWriter.cc) and reader's-1(ColumnReader.cc) for negative sub-second values cancel.What I'm doing:
be/src/formats/orc/utils.h, inorc_ts_to_native_ts_before_unix_epoch: passnanoseconds / NANOSECS_PER_USECas the microsecond instead of0, mirroring the unguardedafter_unix_epochsibling. The row load path always supplies a non-negativenanoseconds, so this simply restores the sub-second. The dispatcher and theafter_unix_epochpath are untouched, so timezone composition, whole-second negatives, and post-1970 values are unchanged.This helper is also shared by the ORC stripe min/max statistics decoder (
orc_min_max_decoder.cpp), whose pre-1970 sub-second handling has separate, pre-existing quirks (the ORC nanos field is stored with a+1offset the decoder does not undo; its truncating-division remainder can be negative; the before-epoch instant branch ignores the reader offset). To keep this change scoped to the data load path and leave predicate-pushdown bounds byte-for-byte unchanged, the decoder now explicitly drops the sub-second for negative-epoch bounds (its prior behavior) rather than letting the now-sub-second-preserving helper expose those quirks. Correctly decoding pre-1970 stripe-stats sub-second is left as a follow-up.BE-only; FE and the pre-split pipeline are not touched.
Behavior change: No — values only, correcting a lossy result (the old output was a bug, not intended behavior); no SQL syntax / config / interface change. Same rationale as the companion Parquet fix #75207.
Tests
be/test/formats/orc/orc_chunk_reader_test.cpp—TestTimestampPreEpochSubSeconddrives the realOrcChunkReaderload path: pre-1970 sub-second on the plain (.500000,.123456) and instant (.500000) paths, plus whole-second-negative and post-1970 sub-second as no-regression guards. Verified RED→GREEN on a real build (before the fix the sub-second cases dropped to.000000and the instant case dropped its fraction; after the fix all pass and the pre-existingTestTimestampstill passes).OrcMinMaxDecoderPreEpochTimestampDropsSubSecondcovers the decoder guard: a pre-1970 stripe-stats bound decodes to its whole second (sub-second dropped, pruning unchanged). BE module-boundary check clean.Companion to the merged Parquet fix #75207; together they cover pre-1970 sub-second temporal loads.
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check:
This is an automatic backport of pull request #75432 done by [Mergify](https://mergify.com).