Adding basic structure for both ExternalDecryptionConfig and External… #38

Merged
sofia-tekdatum merged 3 commits into dev_phase2 from external_decryption_config
Aug 11, 2025
Conversation

@sofia-tekdatum
Collaborator

Add the basic structs and classes for ExternalDecryptionConfig and ExternalFileDecryptionProperties.

Following definitions in Internal Config and Properties doc.

Also adding the ParquetCipher::type to the ColumnDecryptionProperties.

This change is analogous to the work already done for EncryptionConfig and FileProperties.

Leaving unit testing of the new structs for the next PR when the changes to the CryptoFactory are made.

@github-actions

github-actions bot commented Aug 8, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Collaborator

@argmarco-tkd argmarco-tkd left a comment

mostly LGTM, left a few comments with one question around ColumnDecryptionProperties

/// enforce robust access control. The values sent to the external service depend on each
/// implementation.
/// This value must be a valid JSON-formatted string.
/// Validation of the string will be done by the external decryption service, Arrow will only
Collaborator

not always a service :)

Collaborator Author

Changed everywhere.

/// For security, these values should never be sent in this config, only the locations of
/// the files that the external service will know how to access.
std::unordered_map<ParquetCipher::type, std::unordered_map<std::string, std::string>>
connection_config;
Collaborator

I still think this should be "instantiation_config" :) (not every external agent will be network based). But again, I'm OK with a later conversation on the topic (independent of this PR)

Collaborator Author

Punting for later discussion.

Collaborator

Yes, we can do a global renaming if we find it necessary. Can be an isolated change.
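
For readers unfamiliar with the shape of `connection_config`, here is a minimal sketch of how a per-cipher map of settings could be populated and queried. `Cipher` is a stand-in for `ParquetCipher::type`, and `FindSetting`, the `"config_file"` key, and the path used with it are hypothetical names chosen for illustration; none of them come from the PR. In line with the doc comment above, the value stored is a file location, not a secret.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Stand-in for ParquetCipher::type (hypothetical, for illustration only).
enum class Cipher { AES_GCM_V1, AES_GCM_CTR_V1 };

// Same shape as the connection_config member quoted in the review:
// cipher -> (setting name -> setting value).
using ConnectionConfig =
    std::unordered_map<Cipher, std::unordered_map<std::string, std::string>>;

// Hypothetical helper: looks up one setting for one cipher.
// Returns std::nullopt if the cipher or the setting name is missing.
inline std::optional<std::string> FindSetting(const ConnectionConfig& config,
                                              Cipher cipher,
                                              const std::string& name) {
  auto by_cipher = config.find(cipher);
  if (by_cipher == config.end()) return std::nullopt;
  auto setting = by_cipher->second.find(name);
  if (setting == by_cipher->second.end()) return std::nullopt;
  return setting->second;
}
```

A caller might fill the map with `config[Cipher::AES_GCM_V1]["config_file"] = "/path/to/agent.json";` and read it back through `FindSetting`. Nothing in this shape is network-specific, which is the point behind the "instantiation_config" naming suggestion.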

Comment on lines +272 to +275
ColumnDecryptionProperties::ColumnDecryptionProperties(
const std::string& column_path, const std::string& key,
std::optional<ParquetCipher::type> parquet_cipher)
: column_path_(column_path), parquet_cipher_(parquet_cipher) {
Collaborator

why is it that parquet_cipher is optional?

I understand if the config (passed from the App, IIRC) does not have a per-column cipher already in place, but when we resolve the properties, it should be part of each column, shouldn't it? (Maybe I'm missing something.)

If it must not be optional, I believe the proper order of the params would be column_path, parquet_cipher, key.

Collaborator Author

We thought about propagating the parquet_cipher to each column even if no per_column_encryption was specified, but decided against it.

We don't want to change Arrow's current way of doing things too much, and the parquet_cipher is already defined elsewhere for all these cases.

Collaborator

Got that. But it reads a bit weird that parquet_cipher is optional. Could you add a comment explaining why? (Basically, that it is an override of the file-level algorithm value, right?)

Collaborator Author

In encryption.h I have already put a comment above its definition saying that if the value is not set, the ParquetCipher defined in the FileEncryptionProperties or the InternalFileDecryptor will be used.

I expanded the comment a bit to make it clearer, though.
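
The fallback described here can be sketched in a few lines. This is an illustration under stated assumptions, not the code in the PR: `Cipher` stands in for `ParquetCipher::type`, and `ColumnProps` and `ResolveCipher` are hypothetical names.

```cpp
#include <optional>
#include <string>

// Stand-in for ParquetCipher::type (hypothetical, for illustration only).
enum class Cipher { AES_GCM_V1, AES_GCM_CTR_V1 };

// Hypothetical, simplified column properties: the cipher is set only when
// the column overrides the file-level algorithm.
struct ColumnProps {
  std::string column_path;
  std::optional<Cipher> parquet_cipher;
};

// Returns the column's own cipher if one was set, otherwise the file-level one.
inline Cipher ResolveCipher(const ColumnProps& column, Cipher file_level) {
  return column.parquet_cipher.value_or(file_level);
}
```

With this shape, an unset `parquet_cipher` means "use whatever the file says", which is why the field is an `std::optional` rather than a required constructor argument.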

Collaborator

I know this PR has been merged, but I think we need a bit of discussion still. We're conflating two dimensions here: (a) how users express the configuration (i.e. how configuration is generated by the App), (b) how the configuration is 'parsed' into an Arrow-usable object. Having the hierarchical stuff at the App level is fine IMO. Having to understand the hierarchical stuff "everywhere" that the config is read, not so OK IMO.
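
One way to picture the split between (a) and (b) is a single resolution step that turns the hierarchical, user-facing config into flat per-column values, so downstream readers never have to know about the hierarchy. The sketch below is only an illustration of that idea; `Cipher` stands in for `ParquetCipher::type`, and `ColumnConfig`, `FileConfig`, `ResolvedColumn`, and `Resolve` are hypothetical names that do not appear in the PR.

```cpp
#include <optional>
#include <string>
#include <vector>

// Stand-in for ParquetCipher::type (hypothetical, for illustration only).
enum class Cipher { AES_GCM_V1, AES_GCM_CTR_V1 };

// (a) What the app expresses: a file-level default plus optional overrides.
struct ColumnConfig {
  std::string column_path;
  std::optional<Cipher> cipher;  // unset means "inherit the file-level value"
};
struct FileConfig {
  Cipher default_cipher;
  std::vector<ColumnConfig> columns;
};

// (b) What readers consume: every column carries a concrete cipher.
struct ResolvedColumn {
  std::string column_path;
  Cipher cipher;
};

// Applies the file-level default once, at parse time.
inline std::vector<ResolvedColumn> Resolve(const FileConfig& config) {
  std::vector<ResolvedColumn> resolved;
  resolved.reserve(config.columns.size());
  for (const auto& column : config.columns) {
    resolved.push_back(
        {column.column_path, column.cipher.value_or(config.default_cipher)});
  }
  return resolved;
}
```

The trade-off is that the resolved objects no longer record whether a value was inherited or explicitly set, which may or may not matter for the code paths discussed in this thread.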

ExternalFileDecryptionProperties::Builder* ExternalFileDecryptionProperties::Builder::app_context(
const std::string& context) {
if (!app_context_.empty()) {
throw ParquetException("App context already set");
Collaborator

Add a one-line comment saying under what conditions app_context could hit this check, i.e. what it is guarding against.

Collaborator Author

This also follows Arrow's builder pattern: the builder only allows a particular property to be set once.
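
As a minimal sketch of the set-once guard being discussed: the builder below rejects a second call to the same setter instead of silently replacing the earlier value. `ContextBuilder` is a hypothetical class written for illustration, and it throws `std::runtime_error` where the quoted code throws `ParquetException`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical builder, for illustration only; not the Arrow class.
class ContextBuilder {
 public:
  // Sets the app context. A second call after a non-empty value has been
  // stored is rejected rather than silently replacing the first value.
  ContextBuilder* app_context(const std::string& context) {
    if (!app_context_.empty()) {
      throw std::runtime_error("App context already set");
    }
    app_context_ = context;
    return this;
  }

  // Returns the stored context.
  const std::string& build() const { return app_context_; }

 private:
  std::string app_context_;
};
```

Because the guard tests for emptiness, passing an empty string first does not count as "set", and a later call would still succeed. Whether that edge case matters depends on how the real builder validates its input.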

private:
const std::string column_path_;
std::string key_;
std::optional<ParquetCipher::type> parquet_cipher_;
Collaborator

Or maybe the comment on the override could go here. (Please double-check my assumption that this is actually the column-level override of the file-level attribute. If it's something else, let me know.)

Collaborator Author

The comments are above the builder attributes (following the rest of the code).

Collaborator

@avalerio-tkd avalerio-tkd left a comment

Left a few comments. We can quickly chat offline if it's quicker.

Collaborator

@avalerio-tkd avalerio-tkd left a comment

Discussed offline. Thanks.

@sofia-tekdatum sofia-tekdatum merged commit 3644690 into dev_phase2 Aug 11, 2025
22 of 61 checks passed
@sofia-tekdatum sofia-tekdatum deleted the external_decryption_config branch August 11, 2025 20:15

3 participants