@liushengxuan liushengxuan commented Jan 9, 2026

What problem does this PR solve?

Issue Number: close #63

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Describe your changes in detail.
For complex logic, explain the "Why" and "How".

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a crash in `substr` when input is null.
- Optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@liushengxuan liushengxuan changed the title from "WIP Introduce Parquet Decryptor" to "Support Paruqet Reader Decryption" Jan 21, 2026
@liushengxuan liushengxuan force-pushed the shengxuan_decryption branch 4 times, most recently from f5d8c51 to dea8c74 Compare January 22, 2026 15:46
This commit introduces the decryption feature for Parquet reader. To use
the decryption feature, you need to implement the KMS class for key
retrieval.

@ZacBlanco ZacBlanco left a comment

Thanks for these changes @liushengxuan! This is a large PR, so for this round I only looked over the newly added code. It also seems there are still some CI jobs that failed. I will hold off on another round of review until CI is passing, and then review the remaining parts.

Most of my comments are minor, but there are a few higher-level ones about how the code is structured to support both CTR and GCM. Let me know what you think.

Comment on lines +554 to +555
  std::shared_ptr<::parquet::FileDecryptionProperties::Builder>
      fileDecryptionPropertiesBuilder_{nullptr};

Does this need to be initialized to nullptr?

And is there a reason this is initialized here rather than the constructor?
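
For reference on the `nullptr` question: a default-constructed `std::shared_ptr` is already empty, so the explicit `{nullptr}` is redundant but harmless. A minimal sketch, using hypothetical stand-in types (`Builder`, `Options`) rather than the real Parquet classes:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for ::parquet::FileDecryptionProperties::Builder.
struct Builder {};

// Hypothetical stand-in for the options class under review.
struct Options {
  // Explicit default member initializer, as written in the PR.
  std::shared_ptr<Builder> explicitNull{nullptr};
  // No initializer: shared_ptr's default constructor also yields an empty pointer.
  std::shared_ptr<Builder> implicitNull;
};

// Both members start out empty, so the two spellings behave identically.
inline void demoDefaultsAreEmpty() {
  Options options;
  assert(options.explicitNull == nullptr);
  assert(options.implicitNull == nullptr);
}
```

Either spelling works; the choice is mostly about whether the class keeps its defaults next to the member declarations or in the constructor.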

Comment on lines +567 to +569
      fileSchema(nullptr) {
  cryptoFactory_ = std::make_shared<::parquet::encryption::CryptoFactory>();
}

Since no args are required we can use the initializer list?

Suggested change
- fileSchema(nullptr) {
- cryptoFactory_ = std::make_shared<::parquet::encryption::CryptoFactory>();
- }
+ fileSchema(nullptr),
+ cryptoFactory_(std::make_shared<::parquet::encryption::CryptoFactory>())
+ { }

return *this;
}

// Set default Footer Key for Parquet Decryption. If the Footer Key is empty,

Wondering if setting to empty string is a good idea. I feel like this would be a good place to use std::optional to denote the presence of a key or not. This way we don't need to check for some kind of special value of the string itself.
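
To make the `std::optional` idea concrete, here is a rough sketch. `ReaderOptionsSketch` and its accessor are hypothetical names for illustration only, not the PR's actual class; the point is that "no key" is represented by `std::nullopt` instead of a special string value.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the reader options class; not the PR's code.
class ReaderOptionsSketch {
 public:
  // Stores the key. Even an empty string counts as "a key was provided".
  ReaderOptionsSketch& setFooterDecryptionKey(std::string footerKey) {
    footerKey_ = std::move(footerKey);
    return *this;
  }

  // std::nullopt means no footer key was ever set.
  const std::optional<std::string>& footerDecryptionKey() const {
    return footerKey_;
  }

 private:
  std::optional<std::string> footerKey_;
};

// Usage: presence is checked on the optional, not on the string contents.
inline void demoOptionalFooterKey() {
  ReaderOptionsSketch options;
  assert(!options.footerDecryptionKey().has_value());
  options.setFooterDecryptionKey("0123456789abcdef");
  assert(options.footerDecryptionKey().has_value());
}
```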


std::shared_ptr<::parquet::FileDecryptionProperties::Builder>
getFileDecryptionPropertiesBuilder() const {
return fileDecryptionPropertiesBuilder_;

Should we comment that this could potentially be null?

ReaderOptions& setFooterDecryptionKey(std::string footerKey) {
// fileDecryptionPropertiesBuilder_ is initiated when footer key or column
// keys are set directly.
if (fileDecryptionPropertiesBuilder_ == nullptr) {

Can we not just initialize the decryptionPropertiesBuilder in the constructor? Why do we lazily initialize it?
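
A sketch of what eager initialization could look like, assuming nothing downstream relies on a null builder to mean "decryption not configured" (if it does, that is a reason to keep the lazy version and document it). `BuilderSketch` and `EagerOptionsSketch` are hypothetical stand-ins, not the real Parquet or PR types.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for ::parquet::FileDecryptionProperties::Builder.
class BuilderSketch {
 public:
  void footerKey(std::string key) { footerKey_ = std::move(key); }
  const std::string& footerKey() const { return footerKey_; }

 private:
  std::string footerKey_;
};

// Hypothetical options class that constructs its builder up front.
class EagerOptionsSketch {
 public:
  EagerOptionsSketch() : builder_(std::make_shared<BuilderSketch>()) {}

  EagerOptionsSketch& setFooterDecryptionKey(std::string key) {
    builder_->footerKey(std::move(key));  // No null check needed.
    return *this;
  }

  // Never returns null, because the builder is created in the constructor.
  std::shared_ptr<BuilderSketch> builder() const { return builder_; }

 private:
  std::shared_ptr<BuilderSketch> builder_;
};

// Usage: the getter is safe to dereference before any key is set.
inline void demoEagerBuilder() {
  EagerOptionsSketch options;
  assert(options.builder() != nullptr);
  options.setFooterDecryptionKey("footer-key");
  assert(options.builder()->footerKey() == "footer-key");
}
```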

Comment on lines +41 to +50
/// \brief Constructor function of AesDecryptor.
///
/// \param encryptionType the encryption algorithm to use.
/// \param keyLen can only serve one key length. Possible values: 16, 24, 32
/// bytes. \param hasMetadataDecryptor if true then this is a metadata
/// decryptor. \param containsLength If it is true, expect ciphertext length
/// prepended to the ciphertext.
explicit AesDecryptor(
ParquetCipher::type algId,
bool metadata,

This comment doesn't contain documentation on the metadata field

ctx_ = nullptr;
lengthBufferLength_ = containsLength ? kBufferSizeLength : 0;
ciphertextSizeDelta_ = lengthBufferLength_ + arrow::encryption::kNonceLength;
if (metadata || (ParquetCipher::AES_GCM_V1 == algId)) {

Is this the only use of this metadata parameter? Can we not just provide a specialized constructor/factory method which defaults to using AES_GCM_V1 algorithm? Call it something like createAesMetadataDecryptor?
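
Roughly what I have in mind, as a sketch only. `CipherSketch`, `AesDecryptorSketch`, and `createAesDataDecryptor` are hypothetical names, and the real class also sets up the cipher context and length handling. The branch mirrors the quoted condition: metadata always uses GCM, while data uses GCM only for `AES_GCM_V1`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ParquetCipher::type.
enum class CipherSketch { AES_GCM_V1, AES_GCM_CTR_V1 };

// Hypothetical stand-in for AesDecryptor; only the mode selection is modeled.
class AesDecryptorSketch {
 public:
  // Metadata decryptor: GCM regardless of the file's algorithm.
  static AesDecryptorSketch createAesMetadataDecryptor(std::int32_t keyLen) {
    return AesDecryptorSketch(keyLen, /*useGcm=*/true);
  }

  // Data decryptor: GCM for AES_GCM_V1, CTR for AES_GCM_CTR_V1.
  static AesDecryptorSketch createAesDataDecryptor(
      CipherSketch algId, std::int32_t keyLen) {
    return AesDecryptorSketch(keyLen, algId == CipherSketch::AES_GCM_V1);
  }

  bool usesGcm() const { return useGcm_; }
  std::int32_t keyLength() const { return keyLen_; }

 private:
  AesDecryptorSketch(std::int32_t keyLen, bool useGcm)
      : keyLen_(keyLen), useGcm_(useGcm) {}

  std::int32_t keyLen_;
  bool useGcm_;
};

// Usage: callers no longer pass a bare `bool metadata` flag.
inline void demoFactories() {
  auto metadata = AesDecryptorSketch::createAesMetadataDecryptor(16);
  auto data = AesDecryptorSketch::createAesDataDecryptor(
      CipherSketch::AES_GCM_CTR_V1, 16);
  assert(metadata.usesGcm());
  assert(!data.usesGcm());
}
```

With named factories the call site states its intent, and the boolean parameter disappears from the public constructor.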

Comment on lines +37 to +38
uint32_t clen = *len;
int64_t allocateSize = clen - decryptor->ciphertextSizeDelta();

const? Also why not just use *len directly on L38?

template <class T>
bool DeserializeMessage(
const uint8_t* buf,
uint32_t* len,

Is this a constant value? Also is there a reason to use uint32_t* instead of uint32_t?

template <class T>
void DeserializeUnencryptedMessage(
const uint8_t* buf,
uint32_t* len,

Similar question here about using a pointer as opposed to the value directly.
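
To illustrate the distinction: a pointer parameter is usually justified only when the callee writes through it, for example to report how many bytes it consumed. Whether the functions in this PR do that is not visible in the quoted diff, so the helpers below (`payloadLengthByValue`, `consumeHeader`) are hypothetical and only show the two calling conventions.

```cpp
#include <cassert>
#include <cstdint>

// By value: the length is a pure input, so the caller's variable is untouched.
inline std::uint32_t payloadLengthByValue(
    std::uint32_t len, std::uint32_t headerLen) {
  return len >= headerLen ? len - headerLen : 0;
}

// In/out pointer: on entry *len is the number of available bytes; on success
// it is overwritten with the number of bytes consumed.
inline bool consumeHeader(
    const std::uint8_t* buf, std::uint32_t* len, std::uint32_t headerLen) {
  if (buf == nullptr || len == nullptr || *len < headerLen) {
    return false;  // *len is left unchanged on failure.
  }
  *len = headerLen;
  return true;
}

// Usage: only the in/out variant changes the caller's length.
inline void demoLengthParameters() {
  const std::uint8_t buffer[8] = {0};
  std::uint32_t available = 8;
  assert(payloadLengthByValue(available, 4) == 4);
  assert(available == 8);
  assert(consumeHeader(buffer, &available, 4));
  assert(available == 4);
}
```

If the length is never modified, taking `uint32_t` by value (and marking locals `const`) makes that explicit.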

@ZacBlanco ZacBlanco changed the title from "Support Paruqet Reader Decryption" to "Support Parquet Reader Decryption" Jan 23, 2026
Development

Successfully merging this pull request may close these issues.

[Feature] Support Reading Encrypted Parquet Files

2 participants