
GH-48334: Support reading encrypted bloom filters #49334

Draft
fenfeng9 wants to merge 2 commits into apache:main from fenfeng9:parquet/gh-48334-encrypted-bloom-filter


fenfeng9 (Contributor) commented Feb 18, 2026

Rationale for this change

Reading bloom filters from encrypted Parquet files previously raised an exception. This change implements encrypted bloom filter deserialization by decrypting the Thrift header (module id 8) and bitset (module id 9) separately, and adds the necessary validation and tests.
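As a rough illustration of why the header and bitset are decrypted separately: in Parquet modular encryption each module is authenticated with its own AAD, built from the file AAD followed by a one-byte module type and little-endian 16-bit row group and column ordinals. The sketch below assumes that layout; `MakeBloomFilterModuleAad` and the constant names are hypothetical and are not the functions used in this PR.

```cpp
#include <cstdint>
#include <string>

// Module type ids for bloom filters, as cited in the PR description.
constexpr int8_t kBloomFilterHeader = 8;
constexpr int8_t kBloomFilterBitset = 9;

// Hypothetical helper (not Arrow's API): builds the per-module AAD as
// file_aad || module_type (1 byte) || row group ordinal (2 bytes, LE)
//          || column ordinal (2 bytes, LE).
// Assumption: bloom filter modules carry no page ordinal.
std::string MakeBloomFilterModuleAad(const std::string& file_aad,
                                     int8_t module_type,
                                     int16_t row_group_ordinal,
                                     int16_t column_ordinal) {
  std::string aad = file_aad;
  aad.push_back(static_cast<char>(module_type));
  const int16_t ordinals[] = {row_group_ordinal, column_ordinal};
  for (int16_t ordinal : ordinals) {
    const uint16_t v = static_cast<uint16_t>(ordinal);
    aad.push_back(static_cast<char>(v & 0xFF));         // low byte first
    aad.push_back(static_cast<char>((v >> 8) & 0xFF));  // then high byte
  }
  return aad;
}
```

Because the module type byte differs (8 versus 9), the header and bitset of the same column chunk get different AADs, so ciphertext for one cannot be substituted for the other without failing authentication.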

What changes are included in this PR?

  • Wire decryptor creation and AAD setup into the bloom filter reader
  • Add encrypted deserialization path to BlockSplitBloomFilter::Deserialize
  • Remove the fuzzer workaround that swallowed encrypted bloom filter exceptions

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

fenfeng9 (Contributor Author) commented Feb 18, 2026

@pitrou Could you please take a look when you have a moment?

The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc). Because of that, the tests here build an encrypted payload in memory to exercise the reader path.

void WriteBloomFilter() {
  if (bloom_filter_builder_ != nullptr) {
    if (properties_->file_encryption_properties()) {
      ParquetException::NYI("Encryption is not currently supported with bloom filter");
    }
    // Serialize bloom filter after all row groups have been written and report

Do you think we should add writer-side support for encrypted bloom filters in this PR as well, or handle that in a follow-up?

pitrou (Member) commented Feb 19, 2026

> The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc). Because of that, the tests here build an encrypted payload in memory to exercise the reader path.

How did you generate the encrypted payload? Ideally we should add a test file in https://github.com/apache/parquet-testing/tree/master/data (perhaps generated with another Parquet implementation?)

> Do you think we should add writer-side support for encrypted bloom filters in this PR as well, or handle that in a follow-up?

It depends on whether you feel comfortable doing it!

fenfeng9 (Contributor Author) commented

> The C++ writer still rejects bloom filters when file encryption is enabled (ParquetException::NYI in file_writer.cc). Because of that, the tests here build an encrypted payload in memory to exercise the reader path.

> How did you generate the encrypted payload? Ideally we should add a test file in https://github.com/apache/parquet-testing/tree/master/data (perhaps generated with another Parquet implementation?)

Thanks!
I haven't run an end-to-end test yet. The test builds an encrypted payload in memory (it serializes a BloomFilterHeader, encrypts the header and the bitset separately, then concatenates the two) to exercise the reader path without relying on the C++ writer.

I will update the tests with a real encrypted test file generated by another Parquet implementation and add it to parquet-testing.
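The in-memory payload construction described above can be sketched roughly as follows. This is not the PR's test code: it assumes each encrypted module is framed with a 4-byte little-endian length prefix, as in the Parquet modular encryption format, and it treats the nonce, ciphertext and tag as opaque bytes. The helper names (`AppendFramedModule`, `ReadFramedModule`, `SplitBloomFilterPayload`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Appends one already-encrypted module to `out`, preceded by its
// 4-byte little-endian length (assumed framing, see lead-in).
void AppendFramedModule(const Bytes& module, Bytes* out) {
  const uint32_t len = static_cast<uint32_t>(module.size());
  for (int shift = 0; shift < 32; shift += 8) {
    out->push_back(static_cast<uint8_t>((len >> shift) & 0xFF));
  }
  out->insert(out->end(), module.begin(), module.end());
}

// Reads one framed module starting at *pos and advances *pos past it.
// Returns false if the buffer is too short for the prefix or the body.
bool ReadFramedModule(const Bytes& buf, size_t* pos, Bytes* module) {
  constexpr size_t kLengthBytes = 4;
  if (*pos > buf.size() || buf.size() - *pos < kLengthBytes) return false;
  uint32_t len = 0;
  for (size_t i = 0; i < kLengthBytes; ++i) {
    len |= static_cast<uint32_t>(buf[*pos + i]) << (8 * i);
  }
  const size_t body = *pos + kLengthBytes;
  if (buf.size() - body < len) return false;
  const auto first = buf.begin() + static_cast<std::ptrdiff_t>(body);
  module->assign(first, first + static_cast<std::ptrdiff_t>(len));
  *pos = body + len;
  return true;
}

// Splits a concatenated payload into the encrypted header module
// followed by the encrypted bitset module.
bool SplitBloomFilterPayload(const Bytes& payload, Bytes* header,
                             Bytes* bitset) {
  size_t pos = 0;
  return ReadFramedModule(payload, &pos, header) &&
         ReadFramedModule(payload, &pos, bitset);
}
```

A test in this style would call `AppendFramedModule` twice (header ciphertext, then bitset ciphertext) and hand the result to the reader. A truncated buffer makes `SplitBloomFilterPayload` return false, which mirrors the kind of validation a reader needs before decrypting.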

fenfeng9 marked this pull request as draft February 19, 2026 08:40

fenfeng9 (Contributor Author) commented

Marked as draft. Will add a real test file to parquet-testing and ping you when ready for review.

pitrou (Member) commented Feb 19, 2026

Thanks for tackling this @fenfeng9 :)

