fix(ut,orc): fix orc read ut under debian and adjust some options#37

Closed
SGZW wants to merge 6 commits into alibaba:main from SGZW:fix_orc_ts

@SGZW SGZW commented Dec 26, 2025

Contributor

Purpose

Linked issue: close #xxx

  1. src/paimon/format/orc/orc_file_batch_reader.cpp
refer: https://github.com/apache/arrow/pull/34591/files
// Orc timestamp type is error-prone since it serializes values in the writer timezone
// and reads them back in the reader timezone. To avoid this, both the Apache Orc C++
// writer and reader set the timezone to GMT by default to avoid any conversion.
// We follow the same practice here explicitly to make sure readers are aware of this.
  2. rest of fix:
// refer: https://github.com/eggert/tz/blob/main/asia#L653
// When using the Asia/Shanghai timezone under Debian, timestamps prior to 1901 have an
// additional offset of 5 minutes and 43 seconds

To verify this, I added debug logging to the ORC timestamp reader, focusing on the code that converts timestamps between time zones. The instrumented function is as follows:

  void TimestampColumnReader::next(ColumnVectorBatch& rowBatch, uint64_t numValues, char* notNull) {
    ColumnReader::next(rowBatch, numValues, notNull);
    notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
    TimestampVectorBatch& timestampBatch = dynamic_cast<TimestampVectorBatch&>(rowBatch);
    int64_t* secsBuffer = timestampBatch.data.data();
    secondsRle_->next(secsBuffer, numValues, notNull);
    int64_t* nanoBuffer = timestampBatch.nanoseconds.data();
    nanoRle_->next(nanoBuffer, numValues, notNull);

    // Construct the values
    for (uint64_t i = 0; i < numValues; i++) {
      if (notNull == nullptr || notNull[i]) {
        uint64_t zeros = nanoBuffer[i] & 0x7;
        nanoBuffer[i] >>= 3;
        if (zeros != 0) {
          for (uint64_t j = 0; j <= zeros; ++j) {
            nanoBuffer[i] *= 10;
          }
        }

        // ORC-306: compensate -1s for JDK bug in java.sql.Timestamp
        int64_t writerTime = secsBuffer[i] + epochOffset_;
        if (writerTime < 0 && nanoBuffer[i] > 999999) {
            writerTime -= 1;
        }
        if (!sameTimezone_) {
          std::stringstream s1,s2;
          writerTimezone_->print(s1);
          readerTimezone_->print(s2);
          std::cerr << "### writer zone ### \n" << s1.str() << std::endl;
          std::cerr << "### reader zone ### \n" << s2.str() << std::endl;
          // adjust timestamp value to same wall clock time if writer and reader
          // time zones have different rules, which is required for Apache Orc.
          const auto& wv = writerTimezone_->getVariant(writerTime);
          const auto& rv = readerTimezone_->getVariant(writerTime);
          std::cerr << "wv: " << wv.toString() << ", rv: " << rv.toString() << std::endl;
          if (!wv.hasSameTzRule(rv)) {
            // If the timezone adjustment moves the millis across a DST boundary,
            // we need to reevaluate the offsets.
            int64_t adjustedTime = writerTime + wv.gmtOffset - rv.gmtOffset;
            const auto& adjustedReader = readerTimezone_->getVariant(adjustedTime);
            writerTime = writerTime + wv.gmtOffset - adjustedReader.gmtOffset;
          }
        }
        std::cerr << "epochOffset_: " << epochOffset_ << std::endl;
        std::cerr << "secsBuffer[i]: " << secsBuffer[i] << ", writerTime: " << writerTime << std::endl;
        secsBuffer[i] = writerTime;
      }
    }
  }

On the Debian machine, I ran the instrumented reader twice, pointing the TZDIR environment variable first at Debian's own tzdata and then at Ubuntu 24.04's tzdata (wget http://security.ubuntu.com/ubuntu/pool/main/t/tzdata/tzdata_2025b-0ubuntu0.24.04.1_all.deb). The results are as follows.

Debian:

### writer zone ###
Timezone file: /usr/share/zoneinfo/Asia/Shanghai
  Version: 2
  Future rule: CST-8
  standard CST 28800
  Variant 0: LMT 29143
  Variant 1: CDT 32400 (dst)
  Variant 2: CST 28800
  Transition: null (-576460752303423488) -> LMT
  Transition: 1900-12-31 15:54:17 (-2177481943) -> CST
  Transition: 1919-04-12 16:00:00 (-1600675200) -> CDT
  Transition: 1919-09-30 15:00:00 (-1585904400) -> CST
  Transition: 1940-05-31 16:00:00 (-933667200) -> CDT
  Transition: 1940-10-12 15:00:00 (-922093200) -> CST
  Transition: 1941-03-14 16:00:00 (-908870400) -> CDT
  Transition: 1941-11-01 15:00:00 (-888829200) -> CST
  Transition: 1942-01-30 16:00:00 (-881049600) -> CDT
  Transition: 1945-09-01 15:00:00 (-767869200) -> CST
  Transition: 1946-05-14 16:00:00 (-745833600) -> CDT
  Transition: 1946-09-30 15:00:00 (-733827600) -> CST
  Transition: 1947-04-14 16:00:00 (-716889600) -> CDT
  Transition: 1947-10-31 15:00:00 (-699613200) -> CST
  Transition: 1948-04-30 16:00:00 (-683884800) -> CDT
  Transition: 1948-09-30 15:00:00 (-670669200) -> CST
  Transition: 1949-04-30 16:00:00 (-652348800) -> CDT
  Transition: 1949-05-27 15:00:00 (-650019600) -> CST
  Transition: 1986-05-03 18:00:00 (515527200) -> CDT
  Transition: 1986-09-13 17:00:00 (527014800) -> CST
  Transition: 1987-04-11 18:00:00 (545162400) -> CDT
  Transition: 1987-09-12 17:00:00 (558464400) -> CST
  Transition: 1988-04-16 18:00:00 (577216800) -> CDT
  Transition: 1988-09-10 17:00:00 (589914000) -> CST
  Transition: 1989-04-15 18:00:00 (608666400) -> CDT
  Transition: 1989-09-16 17:00:00 (621968400) -> CST
  Transition: 1990-04-14 18:00:00 (640116000) -> CDT
  Transition: 1990-09-15 17:00:00 (653418000) -> CST
  Transition: 1991-04-13 18:00:00 (671565600) -> CDT
  Transition: 1991-09-14 17:00:00 (684867600) -> CST

### reader zone ###
Timezone file: /usr/share/zoneinfo/GMT
  Version: 2
  Future rule: GMT0
  standard GMT 0
  Variant 0: GMT 0
  Transition: null (-576460752303423488) -> GMT

wv: LMT 29143, rv: GMT 0
epochOffset_: 1420041600
secsBuffer[i]: -3660591639, writerTime: -2240520897

Ubuntu 24.04:

### writer zone ###
Timezone file: /home/zhangwei.95/zoneinfo/usr/share/zoneinfo//Asia/Shanghai
  Version: 2
  Future rule: CST-8
  standard CST 28800
  Variant 0: LMT 29143
  Variant 1: CDT 32400 (dst)
  Variant 2: CST 28800
  Transition: 1900-12-31 15:54:17 (-2177481943) -> CST
  Transition: 1919-04-12 16:00:00 (-1600675200) -> CDT
  Transition: 1919-09-30 15:00:00 (-1585904400) -> CST
  Transition: 1940-05-31 16:00:00 (-933667200) -> CDT
  Transition: 1940-10-12 15:00:00 (-922093200) -> CST
  Transition: 1941-03-14 16:00:00 (-908870400) -> CDT
  Transition: 1941-11-01 15:00:00 (-888829200) -> CST
  Transition: 1942-01-30 16:00:00 (-881049600) -> CDT
  Transition: 1945-09-01 15:00:00 (-767869200) -> CST
  Transition: 1946-05-14 16:00:00 (-745833600) -> CDT
  Transition: 1946-09-30 15:00:00 (-733827600) -> CST
  Transition: 1947-04-14 16:00:00 (-716889600) -> CDT
  Transition: 1947-10-31 15:00:00 (-699613200) -> CST
  Transition: 1948-04-30 16:00:00 (-683884800) -> CDT
  Transition: 1948-09-30 15:00:00 (-670669200) -> CST
  Transition: 1949-04-30 16:00:00 (-652348800) -> CDT
  Transition: 1949-05-27 15:00:00 (-650019600) -> CST
  Transition: 1986-05-03 18:00:00 (515527200) -> CDT
  Transition: 1986-09-13 17:00:00 (527014800) -> CST
  Transition: 1987-04-11 18:00:00 (545162400) -> CDT
  Transition: 1987-09-12 17:00:00 (558464400) -> CST
  Transition: 1988-04-16 18:00:00 (577216800) -> CDT
  Transition: 1988-09-10 17:00:00 (589914000) -> CST
  Transition: 1989-04-15 18:00:00 (608666400) -> CDT
  Transition: 1989-09-16 17:00:00 (621968400) -> CST
  Transition: 1990-04-14 18:00:00 (640116000) -> CDT
  Transition: 1990-09-15 17:00:00 (653418000) -> CST
  Transition: 1991-04-13 18:00:00 (671565600) -> CDT
  Transition: 1991-09-14 17:00:00 (684867600) -> CST

### reader zone ###
Timezone file: /home/zhangwei.95/zoneinfo/usr/share/zoneinfo//GMT
  Version: 2
  Future rule: GMT0
  standard GMT 0
  Variant 0: GMT 0

wv: CST 28800, rv: GMT 0
epochOffset_: 1420041600
secsBuffer[i]: -3660591639, writerTime: -2240521240

Analysis

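The logs show the difference: Debian's tzdata contains an extra initial transition (null -> LMT), so the pre-1901 timestamp resolves to the LMT variant (offset 29143 s) instead of CST (offset 28800 s). The 343-second gap between the two offsets, that is 5 minutes and 43 seconds, is exactly the difference between the two writerTime values.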
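
The size of the gap can be checked with plain arithmetic on the numbers printed in the two logs above. The sketch below is only a sanity check; the constant names are made up for illustration and are not part of ORC or paimon-cpp.

```cpp
#include <cstdint>

// Offsets printed in the zone dumps ("Variant 0: LMT 29143", "Variant 2: CST 28800").
constexpr int64_t kLmtOffset = 29143;
constexpr int64_t kCstOffset = 28800;

// Final writerTime values printed in the Debian and Ubuntu 24.04 runs.
constexpr int64_t kDebianWriterTime = -2240520897;
constexpr int64_t kUbuntuWriterTime = -2240521240;

// Gap between the two candidate writer offsets, in seconds.
constexpr int64_t kOffsetGap = kLmtOffset - kCstOffset;

// 343 seconds is 5 minutes and 43 seconds.
static_assert(kOffsetGap == 343, "LMT and CST differ by 343 seconds");
static_assert(kOffsetGap / 60 == 5 && kOffsetGap % 60 == 43, "343 s = 5 min 43 s");

// The two runs differ by exactly that gap.
static_assert(kDebianWriterTime - kUbuntuWriterTime == kOffsetGap,
              "the writerTime difference equals the offset gap");
```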

Root cause: a flaw in ORC timestamp conversion

First, it is essential to understand how ORC stores Timestamps:

During Writing

The stored value is the offset in seconds relative to epoch_.

Calculation of epoch_

time_t utcEpoch = timegm(&epochStruct);  // 2015-01-01 00:00:00 UTC
epoch_ = utcEpoch - getVariant(utcEpoch).gmtOffset;

The critical point here is that epoch_ is calculated using the time zone variant offset in effect at 2015-01-01.
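
As a rough illustration of the calculation above, the sketch below recomputes the epoch by hand. It assumes the writer zone is Asia/Shanghai with the CST offset of 28800 seconds on 2015-01-01 (taken from the zone dump above). `OrcEpochFor` and the constants are hypothetical names used only here, not ORC identifiers.

```cpp
#include <cstdint>

constexpr int64_t kSecondsPerDay = 86400;

// Days from 1970-01-01 to 2015-01-01: 45 years plus 11 leap days
// (1972, 1976, ..., 2012).
constexpr int64_t kDays1970To2015 = 45 * 365 + 11;

// 2015-01-01 00:00:00 UTC as a Unix timestamp (the value timegm would return).
constexpr int64_t kUtcEpoch = kDays1970To2015 * kSecondsPerDay;

// Mirrors "epoch_ = utcEpoch - getVariant(utcEpoch).gmtOffset" for a given
// writer offset at 2015-01-01. Hypothetical helper for illustration.
constexpr int64_t OrcEpochFor(int64_t gmt_offset_at_2015) {
  return kUtcEpoch - gmt_offset_at_2015;
}

static_assert(kUtcEpoch == 1420070400, "2015-01-01T00:00:00Z");
// With the CST offset this reproduces the logged "epochOffset_: 1420041600".
static_assert(OrcEpochFor(28800) == 1420041600, "Asia/Shanghai epoch");
// A GMT writer has no offset, so its epoch is the UTC epoch itself.
static_assert(OrcEpochFor(0) == kUtcEpoch, "GMT epoch");
```

The epoch depends only on the offset at 2015-01-01, which is CST under both tzdata packages, so both runs log the same epochOffset_.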

Core Conflict

writerTime = secsBuffer[i] + epochOffset_;

This formula assumes that secsBuffer[i] is an offset relative to a "local time epoch", yet that epoch is computed with the time zone rules in effect on 2015-01-01. Errors occur when the actual time point (epoch + secsBuffer[i]) falls under different time zone rules.
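
The two logged results can be reproduced from this formula with a simplified model of the reader. The sketch assumes a GMT reader (offset 0, single variant, so the DST re-evaluation step changes nothing) and assumes the ORC-306 compensation of -1 s fired for this row; the nanosecond value is not in the logs, but the logged numbers only match with that step applied. `ToReaderTime` is a hypothetical name, not an ORC function.

```cpp
#include <cstdint>

constexpr int64_t kSecs = -3660591639;        // logged secsBuffer[i]
constexpr int64_t kEpochOffset = 1420041600;  // logged epochOffset_

// Simplified version of the adjustment in TimestampColumnReader::next.
// writer_gmt_offset is the offset of the variant chosen for the writer zone,
// reader_gmt_offset the offset of the reader zone (0 for GMT).
constexpr int64_t ToReaderTime(int64_t secs, int64_t epoch_offset,
                               bool jdk_compensation,
                               int64_t writer_gmt_offset,
                               int64_t reader_gmt_offset) {
  int64_t writer_time = secs + epoch_offset;
  if (jdk_compensation) {
    writer_time -= 1;  // ORC-306 compensation, assumed to apply here
  }
  return writer_time + writer_gmt_offset - reader_gmt_offset;
}

// Debian tzdata: the pre-1901 instant resolves to LMT (29143 s).
static_assert(ToReaderTime(kSecs, kEpochOffset, true, 29143, 0) == -2240520897,
              "matches the Debian log");
// Ubuntu 24.04 tzdata: the same instant resolves to CST (28800 s).
static_assert(ToReaderTime(kSecs, kEpochOffset, true, 28800, 0) == -2240521240,
              "matches the Ubuntu log");
```

The epoch was built with the CST offset, but on Debian the value is shifted back with the LMT offset, which is where the 343-second discrepancy comes from.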

Fix
For the time being, I have worked around this issue by detecting the OS distribution and adjusting the test expectations accordingly. To resolve the problem at its root, the writer zone of the ORC file must be set to the UTC/GMT time zone when data is written, which avoids this class of issue entirely.
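
For readers unfamiliar with distribution detection, one common approach is to read the ID field of /etc/os-release. The sketch below is not the detector added in this PR; `ParseOsReleaseId` and `IsDebian` are hypothetical names, and the function only parses text that the caller has already read from the file.

```cpp
#include <string_view>

// Hypothetical helper: returns the value of the ID= field from
// os-release-style text, or an empty view if the field is absent.
constexpr std::string_view ParseOsReleaseId(std::string_view text) {
  while (!text.empty()) {
    // Split off one line.
    const auto eol = text.find('\n');
    const std::string_view line = text.substr(0, eol);
    text = (eol == std::string_view::npos) ? std::string_view{}
                                           : text.substr(eol + 1);

    // Match "ID=" exactly, so that ID_LIKE= is not picked up.
    if (line.substr(0, 3) != "ID=") continue;

    std::string_view value = line.substr(3);
    // Values may be wrapped in single or double quotes.
    if (value.size() >= 2 && (value.front() == '"' || value.front() == '\'') &&
        value.back() == value.front()) {
      value = value.substr(1, value.size() - 2);
    }
    return value;
  }
  return {};
}

// True only for Debian itself; derivatives such as Ubuntu report their own ID.
constexpr bool IsDebian(std::string_view os_release_text) {
  return ParseOsReleaseId(os_release_text) == "debian";
}
```

A test would read /etc/os-release into a std::string and pass it to IsDebian. Note that this keys on the distribution name rather than on the tzdata contents, so it is a proxy for the real cause.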

Tests

API and Format

Documentation

@SGZW

SGZW commented Dec 26, 2025

Contributor Author

@lucasfang @lxy-9602 @ChaomingZhangCN PTAL, thanks!

@lucasfang

Collaborator

Nice and thorough work! We should check if such changes would introduce any compatibility issues, and whether the results are consistent with the behavior of Java Paimon. @lxy-9602 @zjw1111 @lszskye

@SGZW

SGZW commented Dec 30, 2025

Contributor Author

CI reports the following errors:

  Error: error: failed to fetch some objects from 'https://github.com/alibaba/paimon-cpp.git/info/lfs'
  The process '/usr/bin/git' failed with exit code 2
  Waiting 16 seconds before trying again
  /usr/bin/git lfs fetch origin refs/remotes/pull/37/merge
  Fetching reference refs/remotes/pull/37/merge
  batch response: This repository exceeded its LFS budget. The account responsible for the budget should increase it to restore access.
  Error: error: failed to fetch some objects from 'https://github.com/alibaba/paimon-cpp.git/info/lfs'
  Error: The process '/usr/bin/git' failed with exit code 2

@lucasfang

Collaborator

Problem solved.

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses ORC timestamp reading issues caused by timezone data differences between Debian and other Linux distributions, specifically for timestamps prior to 1901 when using the Asia/Shanghai timezone.

Key changes:

  • Sets the ORC reader timezone to GMT to avoid timezone conversion issues during reading
  • Adds OS detection utility to handle different expected test values on Debian vs other platforms
  • Updates test expectations to account for the 5 minutes and 43 seconds timezone offset present in Debian's timezone data

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/paimon/format/orc/orc_file_batch_reader.cpp: Sets ORC reader timezone to GMT to prevent timezone conversion errors
  • src/paimon/testing/utils/testharness.h: Adds OsReleaseDetector utility class to detect Debian OS
  • test/inte/read_inte_test.cpp: Adds conditional test expectations for Debian vs non-Debian platforms; includes unintentional whitespace change
  • src/paimon/format/orc/orc_file_batch_reader_test.cpp: Adds conditional test expectations for Debian vs non-Debian platforms
  • src/paimon/format/orc/complex_predicate_test.cpp: Adds conditional test expectations and removes unused field declarations from SetUp

Comment thread src/paimon/testing/utils/testharness.h
Comment thread test/inte/read_inte_test.cpp Outdated
Comment on lines +245 to +246
// refer: https://github.com/apache/arrow/pull/34591
row_reader_options.setTimezoneName("GMT");

Copilot AI Dec 30, 2025

While setting the reader timezone to GMT fixes the reading issue, the ORC writer should also be configured to write timestamps with GMT timezone to ensure consistency. Based on the PR description, this is the proper long-term solution. Consider adding writer_options.setTimezoneName("GMT"); in the PrepareWriterOptions method (around line 230 in orc_format_writer.cpp) to ensure both reading and writing use the same timezone, which would eliminate the need for OS-specific test expectations.

Collaborator

In fact, C++ Paimon currently does not support user-configurable readerTimezone. When it is not configured, the default setting is already GMT, so there is no need to add this line here.

Contributor Author

Thank you very much. Your analysis is correct, but I believe we should still set the timezone explicitly: it serves as a marker of best practice and makes the code easier for other maintainers to understand.

Collaborator

OK. I'll explicitly set the timezone in both the reader and the writer later.

Comment thread src/paimon/testing/utils/testharness.h
@SGZW

SGZW commented Dec 30, 2025

Contributor Author

@lucasfang PTAL and merge it.

@lucasfang

Collaborator

Thanks a lot for the PR — awesome work! Since this change might affect compatibility, I’d like to get some more reviewers to weigh in before merging, just to make sure we’re on the safe side.

@SGZW

SGZW commented Dec 31, 2025

Contributor Author

Thanks. I've reviewed the Java implementation and confirmed it should be compatible. However, regarding the implementation for determining the OS distribution version, there is actually a more accurate approach we can adopt. We may refer to https://github.com/apache/orc/blob/main/c%2B%2B/src/Timezone.hh (which aligns with the implementation of Java's standard library) to precisely retrieve the variants of "Asia/Shanghai" as an auxiliary means of judgment.

@lxy-9602

lxy-9602 commented Jan 2, 2026

Collaborator

Thank you very much for your thorough investigation and fix! I think the core of this issue lies in two aspects:

  1. paimon-cpp should explicitly set both the reader and writer timezones to GMT/UTC (even though this is already the current default);
  2. The failing tests all use data generated via the Java Paimon API, which should also be updated to write timestamps in GMT/UTC timezone.

@SGZW

SGZW commented Jan 4, 2026

Contributor Author

Yes, you're absolutely right. The component responsible for writing ORC files should ensure this. Could we merge this PR for now? I can generate the new data in the next PR.

@lxy-9602

lxy-9602 commented Jan 4, 2026

Collaborator

We are generating the new data with the correct Java API, as tests involving this DB are quite complex and have many dependencies. Additionally, in the new PR, we'll explicitly enforce GMT/UTC for read/write operations to avoid the issue mentioned above. Once the new PR is submitted, I'll @ you for review.

@SGZW

SGZW commented Jan 4, 2026

Contributor Author

Thanks, this PR can be closed for now.

@SGZW SGZW closed this Jan 4, 2026