fix(ut,orc): fix orc read ut under debian and adjust some options#37
fix(ut,orc): fix orc read ut under debian and adjust some options#37SGZW wants to merge 6 commits into
Conversation
|
@lucasfang @lxy-9602 @ChaomingZhangCN PTAL, thanks! |
CI report errors as follows: |
Problem solved. |
There was a problem hiding this comment.
Pull request overview
This PR addresses ORC timestamp reading issues caused by timezone data differences between Debian and other Linux distributions, specifically for timestamps prior to 1901 when using the Asia/Shanghai timezone.
Key changes:
- Sets the ORC reader timezone to GMT to avoid timezone conversion issues during reading
- Adds OS detection utility to handle different expected test values on Debian vs other platforms
- Updates test expectations to account for the 5 minutes and 43 seconds timezone offset present in Debian's timezone data
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/format/orc/orc_file_batch_reader.cpp | Sets ORC reader timezone to GMT to prevent timezone conversion errors |
| src/paimon/testing/utils/testharness.h | Adds OsReleaseDetector utility class to detect Debian OS |
| test/inte/read_inte_test.cpp | Adds conditional test expectations for Debian vs non-Debian platforms; includes unintentional whitespace change |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Adds conditional test expectations for Debian vs non-Debian platforms |
| src/paimon/format/orc/complex_predicate_test.cpp | Adds conditional test expectations and removes unused field declarations from SetUp |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // refer: https://github.com/apache/arrow/pull/34591 | ||
| row_reader_options.setTimezoneName("GMT"); |
There was a problem hiding this comment.
While setting the reader timezone to GMT fixes the reading issue, the ORC writer should also be configured to write timestamps with GMT timezone to ensure consistency. Based on the PR description, this is the proper long-term solution. Consider adding writer_options.setTimezoneName("GMT"); in the PrepareWriterOptions method (around line 230 in orc_format_writer.cpp) to ensure both reading and writing use the same timezone, which would eliminate the need for OS-specific test expectations.
There was a problem hiding this comment.
In fact, C++ Paimon currently does not support user-configurable readerTimezone. When it is not configured, the default setting is already GMT, so there is no need to add this line here.
There was a problem hiding this comment.
Thank you very much. Your analysis is correct,But I believe we should explicitly set the timezone – this serves as a marker of best practice and also makes it easier for other maintainers to understand the code.
There was a problem hiding this comment.
ok. I'll explicitly set timezone in reader&writer later
|
@lucasfang PTAL and merge it. |
Thanks a lot for the PR — awesome work! Since this change might affect compatibility, I’d like to get some more reviewers to weigh in before merging, just to make sure we’re on the safe side. |
Thanks. I've reviewed the Java implementation and confirmed it should be compatible. However, regarding the implementation for determining the OS distribution version, there is actually a more accurate approach we can adopt. We may refer to https://github.com/apache/orc/blob/main/c%2B%2B/src/Timezone.hh (which aligns with the implementation of Java's standard library) to precisely retrieve the variants of "Asia/Shanghai" as an auxiliary means of judgment. |
|
Thank you very much for your thorough investigation and fix! I think the core of this issue lies in two aspects:
|
Yes, you're absolutely right. The component responsible for writing ORC files should ensure this. Could we merge this PR for now? I can generate the new data in the next PR. |
We are generating the new data with correct Java API, as tests involving this DB are quite complex and have many dependencies. Additionally, in the new PR, we'll explicitly enforce GMT/UTC for read/write operations to avoid the issue mentioned above. Once the new PR is submitted, I'll @ you for review. |
thanks, this PR can be closed for now. |
Purpose
Linked issue: close #xxx
The specific verification steps are as follows: I modified the code related to ORC format and added some debug logs, focusing primarily on the code for timestamp time zone conversion. The details are as follows:
In the Debian environment, I used Ubuntu 24.04's tzdata (wget http://security.ubuntu.com/ubuntu/pool/main/t/tzdata/tzdata_2025b-0ubuntu0.24.04.1_all.deb) and Debian's tzdata for the TZDIR environment variable separately, with the results as follows.
Debian:
Ubuntu 24.04
Analysis
The result is self-evident: Debian’s tzdata contains an extra LMT Timezone entry, which ultimately leads to a 5 minutes and 30 seconds offset in the final result.
Root Cause of the Issue: ORC Timestamp conversion has bugs
First, it is essential to understand how ORC stores Timestamps:
During Writing
What is stored is the second offset relative to epoch_.
Calculation of epoch_
A critical issue here is:epoch_ is calculated using the time zone variant offset at the moment of 2015-01-01.
Core Conflict
cppwriterTime = secsBuffer[i] + epochOffset_;This formula implies the following assumption:secsBuffer[i] is an offset relative to a "local time epoch",yet this "local time epoch" uses the time zone rules valid as of 2015-01-01.Errors occur when the actual time point (epoch + secsBuffer[i]) applies different time zone rules.
Fix
For the time being, I have bypassed this issue by detecting the OS version. To resolve the problem at its root, we need to ensure that the writer zone of the ORC file is set to the UTC/GMT time zone for data writing, thus avoiding similar issues.
Tests
API and Format
Documentation