[BugFix] Fix ORC min/max timestamp stats decode for pre-1970 and sub-second bounds (backport #75543) #75589
Merged
xiangguangyxg
approved these changes
Jun 30, 2026
Why I'm doing:
The ORC stripe min/max statistics decoder produced TIMESTAMP pruning bounds that could exclude matching rows. For a negative-epoch (pre-1970) bound it dropped the sub-second entirely; it never undid the ORC nanos `+1` serialization offset (so an absent maximum-nanos understated the upper bound by up to ~1 ms); it split the milliseconds with truncating division (a negative remainder for pre-1970 values); it ignored the reader timezone offset on the before-epoch instant branch; and it truncated nanoseconds to microseconds in both directions. Any of these can shrink `[min, max]` below the true value range, so predicate pushdown wrongly skips row groups/stripes and drops rows.

This is the stripe-stats (pruning) counterpart to the row-load sub-second fixes #75432 (ORC) and #75207 (Parquet); that ORC PR left a placeholder in this decoder with a TODO, addressed here.
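To make the arithmetic concrete, here is a minimal C++17 sketch of the decoding steps this PR describes: floor-based splitting of the milliseconds, undoing the `+1` offset on the stored nanos with the conservative defaults (0 for the minimum, 999999 for the maximum), and rounding the minimum down and the maximum up to microseconds. Every name in it (`floor_div`, `floor_mod`, `decode_sub_milli_nanos`, `bound_to_micros`) is hypothetical and is not the helper added by this PR. It also assumes that a stored nanos value outside `[1, 1000000]` counts as malformed, and it leaves out the reader timezone offset.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helpers for illustration only; not the StarRocks decoder.

// Floor division for a positive divisor: rounds toward negative infinity,
// unlike C++ '/', which truncates toward zero.
inline int64_t floor_div(int64_t a, int64_t b) {
    int64_t q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

// Matching remainder, always in [0, b) for a positive divisor.
inline int64_t floor_mod(int64_t a, int64_t b) {
    int64_t r = a % b;
    return r < 0 ? r + b : r;
}

// Undo the +1 serialization offset on the stored sub-millisecond nanos.
// Assumption: valid stored values lie in [1, 1000000]; an absent or
// out-of-range field falls back to the conservative default
// (0 for a minimum, 999999 for a maximum).
inline int64_t decode_sub_milli_nanos(std::optional<int64_t> raw, bool is_max) {
    const int64_t fallback = is_max ? 999999 : 0;
    if (!raw.has_value() || *raw < 1 || *raw > 1000000) return fallback;
    return *raw - 1;
}

// Convert a (millis, stored nanos) bound to microseconds since the epoch,
// rounding a minimum down and a maximum up so that [min, max] never shrinks.
inline int64_t bound_to_micros(int64_t millis, std::optional<int64_t> raw_nanos, bool is_max) {
    const int64_t sub_milli_nanos = decode_sub_milli_nanos(raw_nanos, is_max);
    assert(sub_milli_nanos >= 0 && sub_milli_nanos < 1000000);

    // Floor split: seconds may be negative, nanos-of-second never is.
    const int64_t seconds = floor_div(millis, 1000);
    const int64_t nanos = floor_mod(millis, 1000) * 1000000 + sub_milli_nanos;

    int64_t micros = seconds * 1000000 + nanos / 1000;  // floor, since nanos >= 0
    if (is_max && nanos % 1000 != 0) micros += 1;       // ceil for the upper bound
    return micros;
}
```

For a bound of -1500 ms, truncating division gives -1 s with a remainder of -500 ms, while the floor split gives -2 s plus 500 ms, which keeps the sub-second part non-negative. With the maximum nanos absent, the sketch widens the upper bound to the end of that millisecond instead of its start.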
What I'm doing:
Decode each bound so `[min, max]` stays a superset of the true value range:
- undo the nanos `+1` offset, falling back to the conservative default when the field is absent or malformed (0 for the minimum, 999999 for the maximum); this fixes the understated max for millisecond-precision files;
- apply the reader timezone offset (including the `TIMESTAMP_INSTANT` branch that dropped the offset);

The two inline blocks are replaced by one shared helper. The conversion helpers in `utils.h` and the row-load path are unchanged.

Pre-1970 `TIMESTAMP_INSTANT` bounds in a named zone whose historical offset differs from its epoch offset remain approximate under the existing scalar-offset model (the row-load path uses a per-instant cctz conversion); this is a pre-existing limitation, documented in the decoder, and strictly better than the prior behavior, which dropped the offset entirely.

Unit tests (RED to GREEN) in `starrocks_test` cover: pre-1970 sub-second preservation with min-floor/max-ceil, the `+1` undo and absent/malformed defaults, the instant offset fold (including crossing the Unix epoch), and post-1970 / whole-second-negative no-regression.

What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check:
This is an automatic backport of pull request [BugFix] Fix ORC min/max timestamp stats decode for pre-1970 and sub-second bounds #75543 done by Mergify.