feat!: Support compression codecs for Avro Files (including manifest and manifest lists) #1851

Draft
emkornfield wants to merge 39 commits into apache:main from emkornfield:fix_compression

@emkornfield
Contributor

@emkornfield emkornfield commented Nov 13, 2025

Which issue does this PR close?

What changes are included in this PR?

Previously, these properties were not honored when set as table properties.

  • Adds table properties for these values.
  • Plumbs them through for writers.

Are these changes tested?

Added unit tests

BREAKING CHANGE: Adds a codec parameter to some public functions. Manifests and manifest lists are now compressed by default.

@kevinjqliu
Contributor

looks like a clippy issue in CI

error: this function has too many arguments (8/7)
   --> crates/iceberg/src/spec/manifest_list.rs:243:5
    |
243 | /     fn new(
244 | |         format_version: FormatVersion,
245 | |         output_file: OutputFile,
246 | |         metadata: HashMap<String, String>,
...   |
251 | |         compression_level: u8,
252 | |     ) -> Self {
    | |_____________^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#too_many_arguments
    = note: `-D clippy::too-many-arguments` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::too_many_arguments)]`

@emkornfield
Contributor Author

> looks like a clippy issue in CI

Yeah, working on a fix.
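For context, one common way to satisfy clippy's `too_many_arguments` lint without adding an `#[allow]` is to bundle related parameters into a struct. The sketch below is only an illustration of that idea, not the fix applied in this PR; `SketchCodec`, `SketchCompression`, and `SketchListWriter` are hypothetical names that do not exist in the crate.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Avro codec type; not the real crate enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchCodec {
    Null,
    Deflate,
    Zstandard,
}

/// Groups the compression settings so they travel as a single argument.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchCompression {
    pub codec: SketchCodec,
    /// `None` means "let the codec pick its own default level".
    pub level: Option<u8>,
}

/// Hypothetical writer whose constructor stays well under the lint threshold.
pub struct SketchListWriter {
    path: String,
    metadata: HashMap<String, String>,
    compression: SketchCompression,
}

impl SketchListWriter {
    /// Codec and level arrive together, so adding compression support
    /// costs one parameter instead of two.
    pub fn new(
        path: &str,
        metadata: HashMap<String, String>,
        compression: SketchCompression,
    ) -> Self {
        Self {
            path: path.to_string(),
            metadata,
            compression,
        }
    }

    pub fn path(&self) -> &str {
        &self.path
    }

    pub fn metadata(&self) -> &HashMap<String, String> {
        &self.metadata
    }

    pub fn compression(&self) -> SketchCompression {
        self.compression
    }
}
```

A caller would build a `SketchCompression { codec: SketchCodec::Deflate, level: Some(9) }` once and pass it wherever a writer is created.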

@emkornfield
Contributor Author

@liurenjie1024 @kevinjqliu would you have time to review?

Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM. I left a few comments about using the same strings as the Java implementation.

///
/// # Compression Levels
///
/// The compression level mapping is based on miniz_oxide's CompressionLevel enum:
Contributor Author

clarify this is for gzip.

Contributor

nit: maybe merge this comment with the one above (L61-L66), the compression levels are explained twice

Contributor

@kevinjqliu kevinjqliu left a comment

this looks great, thanks!
adding the "breaking" label since we're modifying a few public functions

@emkornfield
Contributor Author

@kevinjqliu thanks for the reviews. I also added more e2e tests for the manifest and manifest list writers to confirm that compression is applied.

@emkornfield
Contributor Author

> adding the "breaking" label since we're modifying a few public functions

I'm not sure I can set labels as a contributor, so I put a breaking-change note in the PR description.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @emkornfield for this PR. In general, I think this is a useful feature. However, we should not mix two things in one PR; I suggest splitting it into two PRs: one for table metadata and one for manifests.

fn parse_optional_property<T: std::str::FromStr>(
    properties: &HashMap<String, String>,
    key: &str,
) -> Result<Option<T>, anyhow::Error>
Contributor

We should use this crate's error.

Contributor

Do we really need this method? I think passing None as the default value to parse_property would be enough.

Contributor Author

I'm new to Rust, but would that work? Doesn't parse_property always return a value parsed from the string rather than an Option? It is useful to know whether the value was actually configured (hence the Option return).

Contributor

I think all table properties in Iceberg are optional. If a property has a default value, we should return that default; otherwise we should return an error.

Contributor Author

I'll see if I can make this work with an empty string.

Contributor Author

> I think all table properties in Iceberg are optional. If a property has a default value, we should return that default; otherwise we should return an error.

The problem here is that the default compression level depends on the configured compression codec. Baking this into the parsing layer seems to couple two concerns: the parsing logic and the business logic that picks the actual values. To eliminate this method, I've passed a sentinel value as the default to the old method. Let me know if that seems OK.
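The sentinel approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `PropertyError` is a hypothetical stand-in for the crate's error type, `resolve_level` is a hypothetical helper, and the per-codec defaults are placeholder values chosen only for the example. The property key follows the `write.avro.compression-level` name used by the Java implementation, and 255 mirrors the `255u8` default visible in the diff.

```rust
use std::collections::HashMap;
use std::fmt;
use std::str::FromStr;

/// Hypothetical stand-in for the crate's own error type.
#[derive(Debug, PartialEq)]
pub struct PropertyError(String);

/// Parses `key` from `props`, falling back to `default` when the key is absent.
pub fn parse_property<T: FromStr>(
    props: &HashMap<String, String>,
    key: &str,
    default: T,
) -> Result<T, PropertyError>
where
    T::Err: fmt::Display,
{
    match props.get(key) {
        Some(raw) => raw
            .parse::<T>()
            .map_err(|e| PropertyError(format!("invalid value for {}: {}", key, e))),
        None => Ok(default),
    }
}

/// Assumed property key (matches the Java implementation's name).
pub const LEVEL_KEY: &str = "write.avro.compression-level";
/// Sentinel meaning "not configured".
pub const LEVEL_UNSET: u8 = 255;
/// Placeholder defaults for the sketch; not taken from the PR.
pub const SKETCH_GZIP_DEFAULT: u8 = 9;
pub const SKETCH_ZSTD_DEFAULT: u8 = 1;

/// Parsing stays generic; the codec-dependent default is chosen afterwards.
pub fn resolve_level(
    props: &HashMap<String, String>,
    codec: &str,
) -> Result<Option<u8>, PropertyError> {
    let raw = parse_property(props, LEVEL_KEY, LEVEL_UNSET)?;
    if raw != LEVEL_UNSET {
        return Ok(Some(raw));
    }
    Ok(match codec {
        "gzip" => Some(SKETCH_GZIP_DEFAULT),
        "zstd" => Some(SKETCH_ZSTD_DEFAULT),
        _ => None, // codecs without a level
    })
}
```

One trade-off of this design is that a user who explicitly configures 255 is indistinguishable from one who configured nothing, which is why an `Option` return was considered in the first place.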


// Helper function to parse a property from a HashMap
// If the property is not found, use the default value
fn parse_property<T: std::str::FromStr>(
Contributor Author

This is cleanup, since it sounds like the crate's errors are preferred. I can revert this.

@emkornfield emkornfield changed the title from "feat!: Support compression codecs for JSON metadata and Avro" to "feat!: Support compression codecs for Avro Files (including manifest and manifest lists)" on Nov 19, 2025
@emkornfield
Contributor Author

@liurenjie1024 I split this PR so that it is only about Avro. Also, based on feedback, I cleaned up some additional imports and errors that weren't using the crate's error type before.


#[tokio::test]
async fn test_manifest_list_writer_with_compression() {
    use std::fs;
Contributor Author

move these to top level.

Resolved conflict in crates/iceberg/src/spec/manifest/writer.rs by keeping both tests:
- test_manifest_writer_with_compression (from fix_compression)
- test_v3_delete_manifest_delete_file_roundtrip (from databricks/main)

Also added missing Codec::Null parameter to ManifestWriterBuilder::new call in the incoming test.

let level_raw = parse_property(
    props,
    TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL,
    255u8,
Contributor Author

maybe make a constant?

@emkornfield
Contributor Author

Moving to draft until we finalize #1876

@emkornfield emkornfield marked this pull request as draft February 2, 2026 04:27
@github-actions
Contributor

github-actions bot commented Mar 6, 2026

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 6, 2026
@emkornfield
Contributor Author

Still waiting on #1876 to be finalized. I'll resolve conflicts.

@github-actions github-actions bot removed the stale label Mar 7, 2026
blackmwk pushed a commit that referenced this pull request Mar 11, 2026
## Which issue does this PR close?

Split off from #1851

- Partially fixes #1731.

## What changes are included in this PR?

This change honors the compression setting for metadata.json file
(`write.metadata.compression-codec`).

## Are these changes tested?

Add unit test to verify files are gzipped when the flag is enabled.

BREAKING CHANGE: Make `write_to` take `MetadataLocation`

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
blackmwk pushed a commit that referenced this pull request Mar 31, 2026
## Which issue does this PR close?

This is an intermediate PR for #1731 

I'm splitting out changes from
#1851 to the compression
codec to make it easier to review. Once we decide on approach here and
merge it I'll update #1851
accordingly.

## What changes are included in this PR?

- Add an optional compression level to gzip and zstd (needed for Avro
compression).
- Add Snappy as a compression codec (also will be used for Avro)
- Manually code up some previously auto-generated methods as a result.
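
As a rough illustration of the list above, a codec enum that carries an optional level can no longer rely on simple derived string conversions, so `Display` and `FromStr` are written by hand. The sketch below uses a hypothetical `SketchCodec` type and is not the enum from the referenced commit; the codec names and the choice to parse names without a level are assumptions made for the example.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical codec enum; the level is optional for gzip and zstd.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchCodec {
    None,
    Gzip(Option<u8>),
    Zstd(Option<u8>),
    Snappy,
}

impl SketchCodec {
    /// Canonical lowercase name, independent of the level.
    pub fn name(&self) -> &'static str {
        match self {
            SketchCodec::None => "none",
            SketchCodec::Gzip(_) => "gzip",
            SketchCodec::Zstd(_) => "zstd",
            SketchCodec::Snappy => "snappy",
        }
    }

    /// Configured level, if the codec has one and it was set.
    pub fn level(&self) -> Option<u8> {
        match self {
            SketchCodec::Gzip(level) | SketchCodec::Zstd(level) => *level,
            SketchCodec::None | SketchCodec::Snappy => None,
        }
    }
}

/// Hand-written because the payload-carrying variants need custom handling.
impl fmt::Display for SketchCodec {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.name())
    }
}

/// Parses a codec name case-insensitively; the level starts out unset.
impl FromStr for SketchCodec {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "none" => Ok(SketchCodec::None),
            "gzip" => Ok(SketchCodec::Gzip(None)),
            "zstd" => Ok(SketchCodec::Zstd(None)),
            "snappy" => Ok(SketchCodec::Snappy),
            other => Err(format!("unsupported compression codec: {}", other)),
        }
    }
}
```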

AI helped with an initial version of this PR.

## Are these changes tested?

Additional unit tests
big-mac-slice pushed a commit to perpetualsystems/iceberg-rust that referenced this pull request Apr 2, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] iceberg-rust does not respect compression settings for metadata & avro

4 participants