Skip to content

GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API#34616

Merged
jorisvandenbossche merged 45 commits intoapache:mainfrom
tolleybot:dataset_encryption
Oct 11, 2023
Merged

GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API#34616
jorisvandenbossche merged 45 commits intoapache:mainfrom
tolleybot:dataset_encryption

Conversation

@tolleybot
Copy link
Contributor

@tolleybot tolleybot commented Mar 17, 2023

Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API. See https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit# for supporting document.

What changes are included in this PR?

I made improvements to the C++ and Python code to enable the Dataset API to have per-file settings for each file saved. Previously, the Dataset API applied the same encryption properties to all saved files, but now I've updated the code to allow for greater flexibility. In the Python code, I've added support for the changes by updating the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.

Are these changes tested?

Yes, unit tests are included. I have also included a python sample project.

Are there any user-facing changes?

Yes, as stated above the ParquetFormat class has optional parameters for DatasetEncryptionConfiguration and DatasetDecryptionConfiguration through setters and getters. The Dataset now has the option using this to set different file encryption properties per file

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #29238 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #29238 has been automatically assigned in GitHub to PR creator.

1 similar comment
@github-actions
Copy link

⚠️ GitHub issue #29238 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Remember using clang-format to format cpp code

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 17, 2023
@tolleybot tolleybot marked this pull request as ready for review March 17, 2023 21:41
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this @tolleybot. I did a first pass looking through it, and it seems like we could use a bit more testing, both in C++ and in Python.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 20, 2023
@wjones127
Copy link
Member

On the C++ formatting: If you install clang-tools 14 (must be that version) and set the path the binaries with CLANG_TOOLS_PATH , then when you reconfigure CMake it will create a format target and a lint target that can be used to check style. (You'll know it's configured right when the CMake configure output says Found ClangTools: /some/path/bin/clang-format-14)

For Python formatting, see these instructions: https://arrow.apache.org/docs/developers/python.html#coding-style

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 22, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Mar 23, 2023
@jorisvandenbossche
Copy link
Member

Do you really need such a module? I'd just try to drop it completely. If you use Python references, you can just check them for None.

It turns out that indeed that is not needed at all (I originally had it, because it was a cimport, because I was using a C++ type, but after making it a def, I should also have removed the pxd ..). Thanks for the help @scoder!

(got it working after some more fun runtime errors, like "AttributeError: module 'pyarrow._parquet_encryption' has no attribute 'pyx_capi'", clean rebuilds!)

Version that seems to be working locally -> tolleybot#4

I keep forgetting the rules and options for this, but there are definitely ways to define the target name. In any case, Cython needs to know the final module name when generating the C/C++ code because it's part of the module's C-API. Renaming modules between in and out is usually not a good idea because it adds complexity to the build and makes things harder to find.

AFAIK I was using standard cython features to specify the result and source file name, like python -m cython --gdb --warning-errors --output-file .../_dataset_parquet_encryption.c .../_dataset_parquet_no_encryption.pyx. Although, that is using --output-file, and probably should also set --module-name, which wasn't done by our cmake wrapper around python -m cython.
(anyway, not needed here anymore, but it is a confusing error)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit -g python

@jorisvandenbossche

This comment has been minimized.

@github-actions
Copy link

Revision: d7a6b55

Submitted crossbow builds: ursacomputing/crossbow @ actions-f652f7cf6d

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-cython2 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.10-pandas-latest

@github-actions
Copy link

Revision: b323457

Submitted crossbow builds: ursacomputing/crossbow @ actions-6e2d0464d5

Task Status
test-conda-python-3.10-pandas-latest Github Actions

@anjakefala
Copy link
Contributor

I updated the branch with the latest main, so we can see if that helps with the unrelated failures.

@ianmcook
Copy link
Member

@github-actions crossbow submit -g python

@github-actions
Copy link

Revision: ced2ed2

Submitted crossbow builds: ursacomputing/crossbow @ actions-6cee43f1c5

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-cython2 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.5.0 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.5.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

@pitrou
Copy link
Member

pitrou commented Oct 10, 2023

@github-actions crossbow submit verifypython*

@github-actions
Copy link

Revision: ced2ed2

Submitted crossbow builds: ursacomputing/crossbow @ actions-379629a2a0

Task Status
verify-rc-source-python-linux-almalinux-8-amd64 Github Actions
verify-rc-source-python-linux-conda-latest-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-python-macos-amd64 Github Actions
verify-rc-source-python-macos-arm64 Github Actions
verify-rc-source-python-macos-conda-amd64 Github Actions

@jorisvandenbossche
Copy link
Member

It seems that all triggered builds are green!

Thanks @tolleybot and all others involved to get this over the finish line

@tolleybot
Copy link
Contributor Author

Thanks, everyone for all the reviews and work on this!

@pitrou
Copy link
Member

pitrou commented Oct 11, 2023

Thank you @tolleybot for contributing this in the first place :-)

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 0793432.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche
Copy link
Member

FYI I checked the reports above and they are all flakes (the benchmarks reports are stable beyond this commit)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API