Skip to content

[BugFix] Fix ORC min/max timestamp stats decode for pre-1970 and sub-second bounds#75543

Merged
dirtysalt merged 1 commit into
StarRocks:mainfrom
xiangguangyxg:be-orc-pre1970-stats-decoder
Jun 30, 2026
Merged

[BugFix] Fix ORC min/max timestamp stats decode for pre-1970 and sub-second bounds#75543
dirtysalt merged 1 commit into
StarRocks:mainfrom
xiangguangyxg:be-orc-pre1970-stats-decoder

Conversation

@xiangguangyxg

Copy link
Copy Markdown
Contributor

Why I'm doing:

The ORC stripe min/max statistics decoder produced TIMESTAMP pruning bounds that could exclude matching rows. For a negative-epoch (pre-1970) bound it dropped the sub-second entirely; it never undid the ORC nanos +1 serialization offset (so an absent maximum-nanos understated the upper bound by up to ~1 ms); it split the milliseconds with truncating division (a negative remainder for pre-1970 values); it ignored the reader timezone offset on the before-epoch instant branch; and it truncated nanoseconds to microseconds in both directions. Any of these can shrink [min, max] below the true value range, so predicate pushdown wrongly skips row groups/stripes and drops rows.

This is the stripe-stats (pruning) counterpart to the row-load sub-second fixes #75432 (ORC) and #75207 (Parquet); that ORC PR left a placeholder in this decoder with a TODO, addressed here.

What I'm doing:

Decode each bound so [min, max] stays a superset of the true value range:

  • undo the ORC nanos +1 offset, falling back to the conservative default when the field is absent or malformed (0 for the minimum, 999999 for the maximum) — fixes the understated max for millisecond-precision files;
  • floor the millisecond-to-second split toward -inf so the remainder (and nanoseconds) stay non-negative for pre-1970 values;
  • fold the instant timezone offset into the seconds and decode as plain UTC (fixes the before-epoch TIMESTAMP_INSTANT branch that dropped the offset);
  • round the minimum down and the maximum up to microsecond precision (StarRocks is microsecond-precision, ORC is nanosecond-precision).

The two inline blocks are replaced by one shared helper. The conversion helpers in utils.h and the row-load path are unchanged.

Pre-1970 TIMESTAMP_INSTANT bounds in a named zone whose historical offset differs from its epoch offset remain approximate under the existing scalar-offset model (the row-load path uses a per-instant cctz conversion); this is a pre-existing limitation, documented in the decoder, and strictly better than the prior behavior which dropped the offset entirely.

Unit tests (RED to GREEN) in starrocks_test cover: pre-1970 sub-second preservation with min-floor/max-ceil, the +1 undo and absent/malformed defaults, the instant offset fold (including crossing the Unix epoch), and post-1970 / whole-second-negative no-regression.

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5

…second bounds

The ORC stripe min/max statistics decoder produced timestamp pruning
bounds that could exclude matching rows. For a negative-epoch bound it
dropped the sub-second entirely; it never undid the ORC nanos +1
serialization offset (so an absent maximum nanos understated the upper
bound by up to ~1 ms); it split the milliseconds with truncating
division (a negative remainder for pre-1970 values); it ignored the
reader timezone offset on the before-epoch instant branch; and it
truncated nanoseconds to microseconds in both directions.

Decode each bound so [min, max] stays a superset of the true value
range: undo the +1 nanos offset with the conservative default when the
field is absent or malformed (0 for the minimum, 999999 for the
maximum), floor the millisecond split toward -inf, fold the instant
timezone offset into the seconds and decode as plain UTC, and round the
minimum down and the maximum up to microsecond precision.

The shared conversion helpers in utils.h and the row-load path are
unchanged. Pre-1970 TIMESTAMP_INSTANT bounds in a named zone whose
historical offset differs from its epoch offset remain approximate
under the existing scalar-offset model, documented in the decoder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
@CelerData-Reviewer

Copy link
Copy Markdown

@codex review

@github-actions github-actions Bot requested review from dirtysalt and trueeyu June 29, 2026 12:22

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd4ab791a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread be/src/formats/orc/orc_min_max_decoder.cpp
@github-actions

Copy link
Copy Markdown
Contributor

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

Copy link
Copy Markdown
Contributor

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

Copy link
Copy Markdown
Contributor

[BE Incremental Coverage Report]

pass : 22 / 22 (100.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/formats/orc/orc_min_max_decoder.cpp 22 22 100.00% []

@dirtysalt dirtysalt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

@dirtysalt dirtysalt enabled auto-merge (squash) June 29, 2026 21:41
@dirtysalt dirtysalt merged commit 13e7fdd into StarRocks:main Jun 30, 2026
70 checks passed
@github-actions

Copy link
Copy Markdown
Contributor

@Mergifyio backport branch-4.1

@github-actions github-actions Bot removed the 4.1 label Jun 30, 2026
@mergify

mergify Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

backport branch-4.1

✅ Backports have been created

Details

@xiangguangyxg xiangguangyxg deleted the be-orc-pre1970-stats-decoder branch June 30, 2026 08:34
wanpengfei-git pushed a commit that referenced this pull request Jun 30, 2026
…second bounds (backport #75543) (#75589)

Signed-off-by: xiangguangyxg <xiangguangyxg@gmail.com>
Co-authored-by: xiangguangyxg <110401425+xiangguangyxg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants