feat!: Support compression codecs for Avro Files (including manifest and manifest lists) #1851
emkornfield wants to merge 39 commits into apache:main
Previously these properties were not honored on table properties.
- Adds table properties for these values.
- Plumbs them through for writers.
Looks like a clippy issue in CI.
Yeah, working on a fix.
@liurenjie1024 @kevinjqliu would you have time to review?
kevinjqliu
left a comment
Generally LGTM. I left a few comments about using the same string as the Java implementation.
crates/iceberg/src/spec/avro_util.rs
Outdated
///
/// # Compression Levels
///
/// The compression level mapping is based on miniz_oxide's CompressionLevel enum:
clarify this is for gzip.
nit: maybe merge this comment with the one above (L61-L66), the compression levels are explained twice
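To make the gzip level discussion concrete, here is a minimal sketch of how a numeric level might be mapped onto named levels. The `GzipLevel` enum is a local stand-in that borrows variant names from miniz_oxide's `CompressionLevel`; the function name and the exact mapping policy (named levels match exactly, everything else falls back to the default) are assumptions, not the PR's actual code.

```rust
/// Local stand-in that borrows the variant names of miniz_oxide's
/// `CompressionLevel`; defined here so the sketch has no dependencies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GzipLevel {
    NoCompression,
    BestSpeed,
    DefaultLevel,
    BestCompression,
    UberCompression,
}

/// Assumed mapping policy: the named levels match exactly,
/// and every other value falls back to the default level.
fn gzip_level_from_u8(level: u8) -> GzipLevel {
    match level {
        0 => GzipLevel::NoCompression,
        1 => GzipLevel::BestSpeed,
        9 => GzipLevel::BestCompression,
        10 => GzipLevel::UberCompression,
        _ => GzipLevel::DefaultLevel,
    }
}
```

Documenting a table like this in one place, and stating that it applies to gzip only, would address both review comments.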
kevinjqliu
left a comment
this looks great, thanks!
adding the "breaking" label since we're modifying a few public functions
@kevinjqliu thanks for the reviews. I also added more e2e tests for manifest and manifest list writers to confirm compression.
I'm not sure that I can access labels as a contributor, so I put a breaking change note in the PR description.
liurenjie1024
left a comment
Thanks @emkornfield for this PR. In general, I think this is a useful feature implementation. But we should not mix two things in one PR; I would suggest splitting it into two PRs: one for table metadata, and one for manifests.
fn parse_optional_property<T: std::str::FromStr>(
    properties: &HashMap<String, String>,
    key: &str,
) -> Result<Option<T>, anyhow::Error>
We should use this crate's error.
Do we really need this method? I think passing a None to default value in parse_property would be enough?
I'm new to Rust, but would that work? Doesn't parse_property always return a value parsed from the String instead of an Option? It is useful to know whether the value was actually configured (hence the Option return).
I think in Iceberg all table properties are optional. If a property has a default value, we should return the default; otherwise we should return an error.
I'll see if I can make this work with empty string.
> I think in Iceberg all table properties are optional. If a property has a default value, we should return the default; otherwise we should return an error.

The problem here is that the default level depends on the value chosen for compression, so baking it into the parsing layer couples two concerns: the parsing logic and the actual values. To eliminate this method I've used a sentinel value with the old method. Let me know if that seems OK.
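The two shapes discussed in this thread can be sketched side by side. This is a minimal illustration, not the crate's code: `PropertyError` is a hypothetical stand-in for the crate's own error type, and the function bodies are assumptions based only on the signatures quoted in the review.

```rust
use std::collections::HashMap;
use std::fmt;
use std::str::FromStr;

/// Hypothetical stand-in for the crate's own error type.
#[derive(Debug, PartialEq)]
pub struct PropertyError {
    pub key: String,
    pub value: String,
}

impl fmt::Display for PropertyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid value {:?} for table property {:?}", self.value, self.key)
    }
}

impl std::error::Error for PropertyError {}

/// Returns Ok(None) when the key is absent, Ok(Some(v)) when the value
/// parses, and an error when the value is present but malformed.
pub fn parse_optional_property<T: FromStr>(
    properties: &HashMap<String, String>,
    key: &str,
) -> Result<Option<T>, PropertyError> {
    match properties.get(key) {
        None => Ok(None),
        Some(raw) => raw.parse::<T>().map(Some).map_err(|_| PropertyError {
            key: key.to_string(),
            value: raw.clone(),
        }),
    }
}

/// Falls back to `default` when the key is absent; a malformed value is
/// still reported as an error rather than silently replaced.
pub fn parse_property<T: FromStr>(
    properties: &HashMap<String, String>,
    key: &str,
    default: T,
) -> Result<T, PropertyError> {
    Ok(parse_optional_property(properties, key)?.unwrap_or(default))
}
```

The `Option` variant keeps "not configured" visible to the caller, which matters when the default depends on another property. The sentinel approach gets the same information out of `parse_property` alone, at the cost of reserving one value.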
// Helper function to parse a property from a HashMap
// If the property is not found, use the default value
fn parse_property<T: std::str::FromStr>(
This is cleanup, since it sounds like the crate's errors are preferred. I can revert this.
@liurenjie1024 I split this PR to be just about Avro. Also, based on feedback, I cleaned up some additional imports/errors that weren't using the crate's error before.
#[tokio::test]
async fn test_manifest_list_writer_with_compression() {
    use std::fs;
move these to top level.
Resolved conflict in crates/iceberg/src/spec/manifest/writer.rs by keeping both tests:
- test_manifest_writer_with_compression (from fix_compression)
- test_v3_delete_manifest_delete_file_roundtrip (from databricks/main)

Also added missing Codec::Null parameter to ManifestWriterBuilder::new call in the incoming test.
let level_raw = parse_property(
    props,
    TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL,
    255u8,
maybe make a constant?
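One way to act on this suggestion is to name the sentinel and convert it to an `Option` right after parsing. This is a rough sketch only: the constant name `COMPRESSION_LEVEL_UNSET` and the helper `configured_level` are hypothetical, and 255 simply mirrors the literal in the snippet under review.

```rust
/// Hypothetical sentinel meaning "no compression level configured".
/// 255 mirrors the literal in the reviewed snippet; the name is an assumption.
const COMPRESSION_LEVEL_UNSET: u8 = 255;

/// Converts the raw parsed value into an Option so callers never compare
/// against the magic number directly.
fn configured_level(level_raw: u8) -> Option<u8> {
    if level_raw == COMPRESSION_LEVEL_UNSET {
        None
    } else {
        Some(level_raw)
    }
}
```

Keeping the comparison in one helper means the sentinel can later be changed, or replaced with a real `Option`, without touching call sites.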
Moving to draft until we finalize #1876
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Still waiting on #1876 to be finalized; I'll resolve conflicts.
## Which issue does this PR close?

Split off from #1851

- Partially fixes #1731.

## What changes are included in this PR?

This change honors the compression setting for metadata.json file (`write.metadata.compression-codec`).

## Are these changes tested?

Add unit test to verify files are gzipped when the flag is enabled.

BREAKING CHANGE: Make `write_to` take `MetadataLocation`

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
## Which issue does this PR close?

This is an intermediate PR for #1731. I'm splitting out changes from #1851 to the compression codec to make it easier to review. Once we decide on the approach here and merge it, I'll update #1851 accordingly.

## What changes are included in this PR?

- Add optional compression level to gzip and zstd (needed for Avro compression).
- Add Snappy as a compression codec (also will be used for Avro).
- Manually code up some previously auto-generated methods as a result.

AI helped with an initial version of this PR.

## Are these changes tested?

Additional unit tests
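As an illustration of the shape described above (an optional level on gzip and zstd, plus Snappy), here is a small self-contained sketch. `SketchCodec` and `from_properties` are hypothetical names, not the crate's types, and the accepted codec strings are an assumption about what the Java implementation uses.

```rust
/// Hypothetical codec type: gzip and zstd carry an optional level,
/// Snappy and the uncompressed case do not.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchCodec {
    Uncompressed,
    Gzip { level: Option<u8> },
    Zstd { level: Option<u8> },
    Snappy,
}

impl SketchCodec {
    /// Builds a codec from a codec name and an optional level.
    /// The accepted names are assumed, not taken from the PR.
    /// A level passed for Snappy or the uncompressed case is ignored.
    pub fn from_properties(name: &str, level: Option<u8>) -> Result<Self, String> {
        match name.to_ascii_lowercase().as_str() {
            "uncompressed" => Ok(Self::Uncompressed),
            "gzip" => Ok(Self::Gzip { level }),
            "zstd" => Ok(Self::Zstd { level }),
            "snappy" => Ok(Self::Snappy),
            other => Err(format!("unsupported compression codec: {other}")),
        }
    }
}
```

Carrying the level inside the variant is one reason derived methods may need to be written by hand, since string conversions can no longer be generated from unit variants alone.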
Which issue does this PR close?
What changes are included in this PR?
Previously these properties were not honored on table properties.
Are these changes tested?
Added unit tests
BREAKING CHANGE: Adds a codec parameter to some public functions. Manifests and manifest lists are now compressed by default.