Adding basic structure for both ExternalDecryptionConfig and External…#38
Adding basic structure for both ExternalDecryptionConfig and External…#38sofia-tekdatum merged 3 commits intodev_phase2from
Conversation
…FileDecryptionProperties
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
argmarco-tkd
left a comment
There was a problem hiding this comment.
mostly LGTM, left a few comments with one question around ColumnDecryptionProperties
| /// enforce robust access control. The values sent to the external service depend on each | ||
| /// implementation. | ||
| /// This value must be a valid JSON-formatted string. | ||
| /// Validation of the string will be done by the external decryption service, Arrow will only |
There was a problem hiding this comment.
not always a service :)
There was a problem hiding this comment.
Changed everywhere.
| /// For security, these values should never be sent in this config, only the locations of | ||
| /// the files that the external service will know how to access. | ||
| std::unordered_map<ParquetCipher::type, std::unordered_map<std::string, std::string>> | ||
| connection_config; |
There was a problem hiding this comment.
I still think this should be "instantiation_config" :) (not every external agent will be network based). But again, I'm OK with a later conversation on the topic (independent of this PR)
There was a problem hiding this comment.
Punting for later discussion.
There was a problem hiding this comment.
Yes, we can do a global renaming if we find it necessary. Can be an isolated change.
| ColumnDecryptionProperties::ColumnDecryptionProperties( | ||
| const std::string& column_path, const std::string& key, | ||
| std::optional<ParquetCipher::type> parquet_cipher) | ||
| : column_path_(column_path), parquet_cipher_(parquet_cipher) { |
There was a problem hiding this comment.
why is it that parquet_cipher is optional?
I understand if the config (passed from the App, IIRC) does not have a per-column cipher already in-place, but when we resolve the properties, it should be part of each column, isn't it? (maybe I'm missing something).
If it must not be optional, I believe the proper order of the params would be column_path, parquet_cipher, key.
There was a problem hiding this comment.
We thought about propagating the parquet_cipher to each column even if no per_column_encryption was specified, but decided against it.
We don't want to change too much Arrow's current way of life, and the parquet_cipher is defined somewhere else for all these cases.
There was a problem hiding this comment.
Got that. But it reads a bit weird that parquet_cipher is optional. Could you add a comment why this is? (basically that is an overwrite of the file-level algorithm value) <= right?
There was a problem hiding this comment.
In encryption.h I already have put a comment above its definition saying that if the value is not set, then the ParquetCipher defined in the FileEncryptionProperties or the InternalFileDecryptor will be used.
I added a bit more comment to make it clearer though.
There was a problem hiding this comment.
I know this PR has been merged, but I think we need a bit of discussion still. We're conflating two dimensions here: (a) how users express the configuration (i.e. how configuration is generated by the App), (b) how the configuration is 'parsed' into an Arrow-usable object. Having the hierarchical stuff at the App level is fine IMO. Having to understand the hierarchical stuff "everywhere" that the config is read, not so OK IMO.
| ExternalFileDecryptionProperties::Builder* ExternalFileDecryptionProperties::Builder::app_context( | ||
| const std::string& context) { | ||
| if (!app_context_.empty()) { | ||
| throw ParquetException("App context already set"); |
There was a problem hiding this comment.
Add one-line comment to say in what conditions could the app_context hit this condition. Basically what this is guarding against.
There was a problem hiding this comment.
This is also following Arrow's builder patterns, it only allows the builder to set a particular property once.
| private: | ||
| const std::string column_path_; | ||
| std::string key_; | ||
| std::optional<ParquetCipher::type> parquet_cipher_; |
There was a problem hiding this comment.
Or maybe the comment on the overwrite could go here. (Please 2-check my assumption that is this actually the column-level overwrite of the file-level attribute. If it's something else, let me know)
There was a problem hiding this comment.
The comments are above the builder attributes (following the rest of the code).
avalerio-tkd
left a comment
There was a problem hiding this comment.
Left a few comments. We can quickly chat offline if it's quicker.
avalerio-tkd
left a comment
There was a problem hiding this comment.
Discussed offline. Thanks.
Add the basic structs and classes for ExternalDecryptionConfig and ExternalFileDecryptionProperties.
Following definitions in Internal Config and Properties doc.
Also adding the ParquetCipher::type to the ColumnDecryptionProperties.
This change is analogous to the work already done for EncryptionConfig and FileProperties.
Leaving unit testing of the new structs for the next PR when the changes to the CryptoFactory are made.