-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29201: Fix masking and re-enable query_iceberg_metadata_of_unpartitioned_table.q #6075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@check-spelling-bot Report🔴 Please reviewSee the files view or the action log for details. Unrecognized words (3)bucketedtables Previously acknowledged words that are now absentaarry bytecode cwiki HIVEFETCHOUTPUTSERDE timestamplocal yyyyTo accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands... in a clone of the [email protected]:armitage420/hive.git repository If the flagged items do not appear to be textIf items relate to a ...
|
5317723 to
17de6b9
Compare
@check-spelling-bot Report🔴 Please reviewSee the files view or the action log for details. Unrecognized words (3)bucketedtables Previously acknowledged words that are now absentaarry bytecode cwiki HIVEFETCHOUTPUTSERDE timestamplocal yyyyTo accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands... in a clone of the [email protected]:armitage420/hive.git repository If the flagged items do not appear to be textIf items relate to a ...
|
|
I would rather not select columns that are going to be masked |
|
@deniskuzZ Not selecting masked columns is not feasible for this particular test, as the there are part of the column values are masked and not the column(related to metadata itself) as a whole. |
oh, ok.
why would they change? |
Thank you for your time @deniskuzZ ! Total size properties might change with file format upgrade, and in our case, it's orc here. Here's the jira for reference: HIVE-25607 Followed by the above mentioned jira, there was another jira that introduced masking for the same reason in iceberg qfiles: HIVE-25658 |
|
I had a look at this flaky test, too. If you look at the expected query result, the first columns are the same, but the first different column is not sorted lexicographically: The problem is that it is sorted on the original value of Changing the query to make this deterministic is a workaround for this particular q file. A proposal for a more general fix: refactor the masking so that it is done before the sorting ( |
|
@thomasrebele Thank you for your input! @deniskuzZ @thomasrebele Do let me know what both of you think! |
|
I've been working on a draft of applying the masking before the sorting (in addition to applying the masking at the end of the processing) in https://github.com/thomasrebele/hive/tree/tr/HIVE-29201-v1. The design of |
|
i think masking in this specific test isn’t very effective, as it bypasses validation for several iceberg metadata fields |
The masking is only done for HDFS paths, file_size_in_bytes and total file size of table properties, the masking doesn't really effect the validation of the test. |
@armitage420 test adds some additional masking as well, try removing and see for yourself. Why do we mask row count instead of size_in_bytes? |
My bad, I have fixed the wrong masking related to file format. I have also masked the file_size_in_bytes columns for the non-json rows that were missed. Although fixing this has no effect on the flaky nature of the output for this qfile. |
|
Hey everyone, many thanks for working on this. Many recent runs are affected by this flaky test that makes our CI unstable. Since we haven't fully converged on the solution is it reasonable to disable this temporarily till the fix finally lands? PR for disabling the test: #6098 |
True, however, now we have tons of disabled, flaky functional tests no one is looking at. By the way, I couldn’t find any recent master branch builds where this test failed |
|
@armitage420, @thomasrebele, I agree that a better approach would be to apply masking before sorting the results. @armitage420, just wondering, was the test failing a few months back? |
|
@deniskuzZ I had only seen this flakiness once in one of the PRs. While investigating, I found this JIRA that addressed a similar issue across Iceberg q-tests: HIVE-25607 I have also encountered other flaky tests (which seem fairly common in PRs run, although unrelated to current one) and created JIRAs in case anyone wants to address them: HIVE-29158, HIVE-29157 Regarding the approach of masking before sorting, I have some concerns:
|
|
@armitage420, i would call this test broken (by some library version bump, etc), not flaky. |
|
@armitage420: the QOutProcessor#maskPatterns applies the masking by copying the q.out file, reading the copy and overwriting the old file. Applying the masking on the string in memory will be much faster. I've created HIVE-29227 to remove the masking from the post processing. |
|
HIVE-29226 is merged |
|
UPD: rebased and fixed masking |
|
@deniskuzZ Thanks for your help :D |
|
deniskuzZ
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1



What changes were proposed in this pull request?
Added explicit order by in queries
Why are the changes needed?
The test in itself uses SORT_QUERY_RESULTS to keep a deterministic ordering of queries. But there is still scope of non determinism, as SORT_QUERY_RESULTS sorts the output of each query lexicographically on unmasked rows, and if the present masked values change, then the output ordering changes as well. Hence, we need to add explicit order by on queries.
Does this PR introduce any user-facing change?
No
How was this patch tested?
q.out file results, and test pipeline