Skip to content

[BugFix] Fix ORC min/max timestamp stats decode for pre-1970 and sub-second bounds (backport #75543)#75589

Merged
wanpengfei-git merged 1 commit into
branch-4.1from
mergify/bp/branch-4.1/pr-75543
Jun 30, 2026
Merged

[BugFix] Fix ORC min/max timestamp stats decode for pre-1970 and sub-second bounds (backport #75543)#75589
wanpengfei-git merged 1 commit into
branch-4.1from
mergify/bp/branch-4.1/pr-75543

Conversation

@mergify

@mergify mergify Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Why I'm doing:

The ORC stripe min/max statistics decoder produced TIMESTAMP pruning bounds that could exclude matching rows. For a negative-epoch (pre-1970) bound it dropped the sub-second entirely; it never undid the ORC nanos +1 serialization offset (so an absent maximum-nanos understated the upper bound by up to ~1 ms); it split the milliseconds with truncating division (a negative remainder for pre-1970 values); it ignored the reader timezone offset on the before-epoch instant branch; and it truncated nanoseconds to microseconds in both directions. Any of these can shrink [min, max] below the true value range, so predicate pushdown wrongly skips row groups/stripes and drops rows.

This is the stripe-stats (pruning) counterpart to the row-load sub-second fixes #75432 (ORC) and #75207 (Parquet); that ORC PR left a placeholder in this decoder with a TODO, addressed here.

What I'm doing:

Decode each bound so [min, max] stays a superset of the true value range:

  • undo the ORC nanos +1 offset, falling back to the conservative default when the field is absent or malformed (0 for the minimum, 999999 for the maximum) — fixes the understated max for millisecond-precision files;
  • floor the millisecond-to-second split toward -inf so the remainder (and nanoseconds) stay non-negative for pre-1970 values;
  • fold the instant timezone offset into the seconds and decode as plain UTC (fixes the before-epoch TIMESTAMP_INSTANT branch that dropped the offset);
  • round the minimum down and the maximum up to microsecond precision (StarRocks is microsecond-precision, ORC is nanosecond-precision).

The two inline blocks are replaced by one shared helper. The conversion helpers in utils.h and the row-load path are unchanged.

Pre-1970 TIMESTAMP_INSTANT bounds in a named zone whose historical offset differs from its epoch offset remain approximate under the existing scalar-offset model (the row-load path uses a per-instant cctz conversion); this is a pre-existing limitation, documented in the decoder, and strictly better than the prior behavior which dropped the offset entirely.

Unit tests (RED to GREEN) in starrocks_test cover: pre-1970 sub-second preservation with min-floor/max-ceil, the +1 undo and absent/malformed defaults, the instant offset fold (including crossing the Unix epoch), and post-1970 / whole-second-negative no-regression.

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

…second bounds (#75543)

Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 13e7fdd)
@wanpengfei-git wanpengfei-git merged commit c37932d into branch-4.1 Jun 30, 2026
39 of 40 checks passed
@wanpengfei-git wanpengfei-git deleted the mergify/bp/branch-4.1/pr-75543 branch June 30, 2026 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants