-
Notifications
You must be signed in to change notification settings - Fork 45
Support Parquet Reader Decryption #101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
22496e2 to
886e598
Compare
f5d8c51 to
dea8c74
Compare
This commit introduces the decryption feature for Parquet reader. To use the decryption feature, you need to implement the KMS class for key retrieval.
dea8c74 to
7c4ae3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these changes @liushengxuan ! This is a large PR, so for this round I only looked over the newly-added code for now. It also seems there is still some CI jobs which failed. I will hold off on another round of review until it is passing, and then review the remaining parts.
Most comments I had are minor, but I had a few higher level ones in there about the structure of the code to support Both CTR and GCM. Let me know what you think
| std::shared_ptr<::parquet::FileDecryptionProperties::Builder> | ||
| fileDecryptionPropertiesBuilder_{nullptr}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be initialized to nullptr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And is there a reason this is initialized here rather than the constructor?
| fileSchema(nullptr) { | ||
| cryptoFactory_ = std::make_shared<::parquet::encryption::CryptoFactory>(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since no args are required we can use the initializer list?
| fileSchema(nullptr) { | |
| cryptoFactory_ = std::make_shared<::parquet::encryption::CryptoFactory>(); | |
| } | |
| fileSchema(nullptr), | |
| cryptoFactory_(std::make_shared<::parquet::encryption::CryptoFactory>()) | |
| { } |
| return *this; | ||
| } | ||
|
|
||
| // Set default Footer Key for Parquet Decryption. If the Footer Key is empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if setting to empty string is a good idea. I feel like this would be a good place to use std::optional to denote the presence of a key or not. This way we don't need to check for some kind of special value of the string itself.
|
|
||
| std::shared_ptr<::parquet::FileDecryptionProperties::Builder> | ||
| getFileDecryptionPropertiesBuilder() const { | ||
| return fileDecryptionPropertiesBuilder_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we comment that this could potentially be null?
| ReaderOptions& setFooterDecryptionKey(std::string footerKey) { | ||
| // fileDecryptionPropertiesBuilder_ is initiated when footer key or column | ||
| // keys are set directly. | ||
| if (fileDecryptionPropertiesBuilder_ == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not just initialize the decryptionPropertiesBuilder in the constructor? Why do we lazily initialize it?
| /// \brief Constructor function of AesDecryptor. | ||
| /// | ||
| /// \param encryptionType the encryption algorithm to use. | ||
| /// \param keyLen can only serve one key length. Possible values: 16, 24, 32 | ||
| /// bytes. \param hasMetadataDecryptor if true then this is a metadata | ||
| /// decryptor. \param containsLength If it is true, expect ciphertext length | ||
| /// prepended to the ciphertext. | ||
| explicit AesDecryptor( | ||
| ParquetCipher::type algId, | ||
| bool metadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't contain documentation on the metadata field
| ctx_ = nullptr; | ||
| lengthBufferLength_ = containsLength ? kBufferSizeLength : 0; | ||
| ciphertextSizeDelta_ = lengthBufferLength_ + arrow::encryption::kNonceLength; | ||
| if (metadata || (ParquetCipher::AES_GCM_V1 == algId)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only use of this metadata parameter? Can we not just provide a specialized constructor/factory method which defaults to using AES_GCM_V1 algorithm? Call it something like createAesMetadataDecryptor?
| uint32_t clen = *len; | ||
| int64_t allocateSize = clen - decryptor->ciphertextSizeDelta(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const? Also why not just use *len directly on L38?
| template <class T> | ||
| bool DeserializeMessage( | ||
| const uint8_t* buf, | ||
| uint32_t* len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a constant value? Also is there a reason to use uint32_t* instead of uint32_t?
| template <class T> | ||
| void DeserializeUnencryptedMessage( | ||
| const uint8_t* buf, | ||
| uint32_t* len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar question on using this pointer as opposed to the value directly
What problem does this PR solve?
Issue Number: close #63
Type of Change
Description
Describe your changes in detail.
For complex logic, explain the "Why" and "How".
Performance Impact
No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Click to view Benchmark Results
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes