Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print Parquet BasicTypeInfo id when present #7094

Merged
merged 3 commits into from
Feb 8, 2025

Conversation

devinrsmith
Copy link
Contributor

@devinrsmith devinrsmith commented Feb 7, 2025

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 7, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devinrsmith -- looks good to me

Any chance you can also improve documentation of print_schema with an example of the output too? 🙏 (not needed to merge this PR, just a request)

So for example, update https://arrow.apache.org/rust/parquet/schema/printer/fn.print_schema.html with the output of your test:

message schema [2] {
 OPTIONAL group field [1] {
    REQUIRED INT32 f1 [0] (INT_32);
    OPTIONAL BYTE_ARRAY f2 [1] (UTF8);
    OPTIONAL BYTE_ARRAY f3 [1] (STRING);
  }
  REPEATED FIXED_LEN_BYTE_ARRAY (12) f4 [2] (INTERVAL);
}";

.build();
let f3 = Type::primitive_type_builder("f3", PhysicalType::BYTE_ARRAY)
.with_logical_type(Some(LogicalType::String))
.with_id(Some(1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a problem that this is the same id as f2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet itself treats the ids as metadata, it does not care about them. Higher level systems that map their own schemas into parquet schemas (such as Iceberg) may have stricter requirements and care.

.unwrap();
p.print(&message);
}
let expected = "message schema [2] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that without the changes in this PR the output looks like the following (no ids) 👍

Screenshot 2025-02-07 at 8 00 10 AM

@devinrsmith
Copy link
Contributor Author

Thanks @alamb for the quick review!

Any chance you can also improve documentation of print_schema with an example of the output too?

I can add that into this PR, no problem; I'll have that out to you late today.

@devinrsmith devinrsmith requested a review from alamb February 8, 2025 06:13
///
/// outputs
///
/// ```text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb alamb merged commit 4e34203 into apache:main Feb 8, 2025
16 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 8, 2025

Thanks again @devinrsmith ❤️

phillipleblanc pushed a commit to spiceai/arrow-rs that referenced this pull request Feb 10, 2025
* Print Parquet BasicTypeInfo id when present

* Improve print_schema documentation

* tiny cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[parquet] Print Parquet BasicTypeInfo id when present
2 participants