GH-48334: Support reading encrypted bloom filters #49334
fenfeng9 wants to merge 2 commits into apache:main
@pitrou Could you please take a look when you have a moment? The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc; see arrow/cpp/src/parquet/file_writer.cc, lines 484 to 489 in aea1ad3). Because of that, the tests here build an encrypted payload in memory to exercise the reader path. Do you think we should add writer-side support for encrypted bloom filters in this PR as well, or handle that in a follow-up?
How did you generate the encrypted payload? Ideally we should add a test file in https://github.com/apache/parquet-testing/tree/master/data (perhaps generated with another Parquet implementation?)
It depends on whether you feel comfortable doing it!
Thanks! I will update the tests with a real encrypted test file generated by another Parquet implementation and add it to parquet-testing.
Marked as draft. Will add a real test file to parquet-testing and ping you when ready for review.
Thanks for tackling this @fenfeng9 :)
Rationale for this change
Reading bloom filters from encrypted Parquet files previously raised an exception. This change implements encrypted bloom filter deserialization by decrypting the Thrift header (module id 8) and bitset (module id 9) separately, and adds the necessary validation and tests.
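For readers unfamiliar with why the header and bitset are decrypted separately: in Parquet modular encryption each module is authenticated with its own AAD, so the two parts need different AAD values. The sketch below is not the code in this PR. It only illustrates, under the assumption that the AAD suffix for bloom filter modules is the module type byte followed by the row group and column ordinals as 2-byte little-endian values (with no page ordinal), how the two AADs would differ. The names ModuleType, AppendOrdinalLE and MakeBloomFilterAad are hypothetical, and file_aad stands for the AAD prefix plus the file unique identifier.

```cpp
#include <cstdint>
#include <string>

// Module type codes for the two bloom filter modules (8 and 9, as cited above).
enum class ModuleType : std::uint8_t {
  kBloomFilterHeader = 8,
  kBloomFilterBitset = 9,
};

// Hypothetical helper: appends a 16-bit ordinal in little-endian byte order.
inline void AppendOrdinalLE(std::string* out, std::int16_t ordinal) {
  const auto value = static_cast<std::uint16_t>(ordinal);
  out->push_back(static_cast<char>(value & 0xFF));
  out->push_back(static_cast<char>((value >> 8) & 0xFF));
}

// Hypothetical helper: builds the AAD for one bloom filter module.
// Assumed layout: file_aad | module type (1 byte) | row group ordinal (2 bytes)
// | column ordinal (2 bytes). No page ordinal is appended for these modules.
inline std::string MakeBloomFilterAad(const std::string& file_aad,
                                      ModuleType type,
                                      std::int16_t row_group_ordinal,
                                      std::int16_t column_ordinal) {
  std::string aad = file_aad;
  aad.push_back(static_cast<char>(type));
  AppendOrdinalLE(&aad, row_group_ordinal);
  AppendOrdinalLE(&aad, column_ordinal);
  return aad;
}
```

A reader following this layout would compute one AAD with kBloomFilterHeader before decrypting the Thrift header and a second one with kBloomFilterBitset before decrypting the bitset, so a ciphertext swapped between the two positions should fail authentication.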
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.