
Conversation

@colinmarc
Contributor

The spec mentions this naming convention here:

https://iceberg.apache.org/spec/#naming-for-gzip-compressed-metadata-json-files

Which issue does this PR close?

What changes are included in this PR?

Support for reading compressed metadata.

Are these changes tested?

Yes.

@colinmarc colinmarc force-pushed the metadata-compressed branch 2 times, most recently from 654de6b to cd16381 on October 29, 2025 21:26
let metadata_content = input_file.read().await?;
let metadata = serde_json::from_slice::<TableMetadata>(&metadata_content)?;

let metadata = if metadata_location.as_ref().ends_with(".gz.metadata.json") {
Contributor

Do we want to optionally support the Java Iceberg alternative?

The Java reference implementation can additionally read GZIP compressed files with the suffix metadata.json.gz.
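As a rough illustration of what accepting both suffixes could look like, here is a minimal sketch. The helper name `has_gzip_metadata_suffix` is hypothetical and is not taken from this PR or from iceberg-rust; it only checks the two suffixes mentioned in this thread.

```rust
/// Hypothetical helper (not part of iceberg-rust): reports whether a metadata
/// location uses either gzip naming convention mentioned in this thread.
fn has_gzip_metadata_suffix(location: &str) -> bool {
    // Suffix checked by the first revision of this PR.
    location.ends_with(".gz.metadata.json")
        // Alternative suffix that the Java reference implementation can also read.
        || location.ends_with("metadata.json.gz")
}
```

A plain `metadata.json` location matches neither suffix, so it would still be parsed as uncompressed JSON.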

Contributor Author

Seems better to have one convention, to me, but happy either way.

Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it. The wording of the spec ("some implementations require") seems to suggest it would be better to have no naming requirement at all.
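For reference, a minimal sketch of the magic-number check, assuming the metadata bytes are already in memory. The names `GZIP_MAGIC` and `looks_gzipped` are hypothetical and used only for illustration; the two byte values are the gzip header identifiers defined in RFC 1952.

```rust
/// First two bytes of every gzip stream (ID1 and ID2 in RFC 1952).
const GZIP_MAGIC: [u8; 2] = [0x1F, 0x8B];

/// Hypothetical helper: true when `bytes` begins with the gzip magic number.
/// Inputs shorter than two bytes never match.
fn looks_gzipped(bytes: &[u8]) -> bool {
    bytes.starts_with(&GZIP_MAGIC)
}
```

Uncompressed JSON metadata cannot be mistaken for gzip by this check, because a JSON document may only begin with a value or with whitespace, and 0x1F is neither.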

Contributor

> Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it.

That would be a really elegant solution, I think.

Contributor Author

Ok, done!

@colinmarc colinmarc force-pushed the metadata-compressed branch 2 times, most recently from 9892bae to 011512a on October 30, 2025 07:37
Contributor

@mbutrovich mbutrovich left a comment

Minor performance nit.


use _serde::TableMetadataEnum;
use chrono::{DateTime, Utc};
use flate2::read::GzDecoder;
Contributor

When you go to read metadata_content it's already in memory as a &[u8] so I think we should use flate2::bufread::GzDecoder here. It might be an imperceptible performance difference, but you never know how big metadata might get :)

Contributor Author

Hm, should be the opposite, no? With bufread we'll pay for an extra copy, but the "syscalls" (read) are free.

Contributor

Yeah you're right, I had it backwards in my head, sorry about that!

Contributor

@mbutrovich mbutrovich left a comment

Thanks @colinmarc!

@colinmarc colinmarc force-pushed the metadata-compressed branch 2 times, most recently from 2d87efe to 453dadc on October 30, 2025 19:25
@colinmarc
Contributor Author

Just found one case (StaticTable) that wasn't using TableMetadata::read_from. Fixed now.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @colinmarc for this PR!

let metadata = if metadata_content.len() > 2
&& metadata_content[0] == 0x1F
&& metadata_content[1] == 0x8B
{
Contributor

Add a debug log here to explain why we chose to try decompressing it?

Contributor Author

Would you like me to pull in a dependency? Neither tracing nor log is available here.

Contributor

I think tracing is already there. Anyway, I think the error message is good enough.

The spec mentions that metadata files "may be compressed with GZIP",
here:

    https://iceberg.apache.org/spec/#table-metadata-and-snapshots
@colinmarc colinmarc force-pushed the metadata-compressed branch from 453dadc to dee387b on November 4, 2025 12:47
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @colinmarc for this fix!

@liurenjie1024 liurenjie1024 merged commit 76d8e2d into apache:main Nov 5, 2025
17 checks passed


Development

Successfully merging this pull request may close these issues.

FR: support compressed metadata

3 participants