parquet-concat: handle large number of files. #8651
Conversation
This is done by iterating over the file set. We check that the schemas agree before concatenating.
How large are these files that are being concatenated? I ask because `parquet-concat` copies row groups as-is; ideally, row groups should be on the order of 1MB per column. I worry this might just move the problem and generate very degenerate parquet files...
Hi, good point. It is true that the proposed change does not affect the output row group size distribution, so if you pass a "degenerate" dataset to `parquet-concat`, you will get a degenerate output. It might be that with greater power comes greater responsibility, but I don't see that as a strong argument against making our tools more powerful. Indeed, if there were one true way of doing compute, you would likely not need a tool like `parquet-concat` at all. The suggested change brings the behaviour of `parquet-concat` closer to what a user would expect when passing a large number of input files.
Let's say you concatenate 1000 files, each with 10 columns. For the output not to be degenerate (i.e. not to have column chunks smaller than 1MB), the resulting parquet file would need to be at least 10GB. To phrase it more explicitly: this tool is not meant for concatenating large numbers of small files into larger files. You would need a different tool that also rewrites the actual row group contents. As an aside, your comment reads a lot like something written by ChatGPT; it's generally courteous to disclose usage.
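To sanity-check the 10GB figure, here is a tiny sketch of the arithmetic (1MB is read as 1 MiB, and each input file is assumed to contribute one row group to the output):

```rust
// Smallest output where no column chunk falls below ~1 MiB, for 1000 inputs
// that each contribute one row group with 10 columns.
fn main() {
    let files: u64 = 1_000;
    let columns: u64 = 10;
    let min_column_chunk: u64 = 1 << 20; // ~1 MiB per column chunk

    let min_output = files * columns * min_column_chunk;
    println!("{:.1} GiB", min_output as f64 / (1u64 << 30) as f64);
    // Prints about 9.8 GiB, i.e. the ~10GB mentioned above.
}
```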
@tustvold I have some trouble understanding why the user should not be allowed to concatenate into a file larger than 10GB, as you describe above. Please help me understand what I'm missing.

The central parquet documentation site discusses recommended row group sizes here: https://parquet.apache.org/docs/file-format/configurations/

As for the napkin calculation on row group contents: I think we can assume the user of `parquet-concat` knows their data. Setting the output file size is an unlikely choice for the user.
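For reference, one way to see whether a given output is "degenerate" in the sense discussed above is to dump its row group and column chunk sizes from the footer. A small sketch (the file name is a placeholder):

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;

fn main() -> Result<()> {
    // Placeholder path: point this at e.g. the output of parquet-concat.
    let file = File::open("out.parquet")?;
    let metadata = ParquetMetaDataReader::new().parse_and_finish(&file)?;
    for (i, rg) in metadata.row_groups().iter().enumerate() {
        println!("row group {i}: {} rows", rg.num_rows());
        for (j, col) in rg.columns().iter().enumerate() {
            // Column chunks far below ~1MB compressed are the "degenerate" case.
            println!("  column {j}: {} bytes compressed", col.compressed_size());
        }
    }
    Ok(())
}
```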
As an orthogonal concern, and if there is interest, let me know if I should add a few integration tests.
Thank you @torgebo -- I think this is a reasonable change.
While I agree with @tustvold that concatenating 1000s of files may result in other problems, I see no reason not to allow this tool to do that work if requested by the user.
> As an orthogonal concern, and if there is interest, let me know if I should add a few integration tests.
Yes, please -- I think with some integration tests we could merge this PR
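If useful, here is a rough sketch of what one such integration test could look like. It is only an illustration under several assumptions: the test lives in the crate that builds the `parquet-concat` binary (so `CARGO_BIN_EXE_parquet-concat` is available, which also requires the `cli` feature), the argument order is assumed to be `<output> <inputs...>`, and `arrow` and `tempfile` are assumed to be available as dev-dependencies.

```rust
// Hypothetical integration test: write two small parquet files with the same
// schema, run the pre-built parquet-concat binary over them, and check that
// the output contains the rows of both inputs.
use std::{fs::File, path::Path, process::Command, sync::Arc};

use arrow::array::{ArrayRef, Int32Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn write_input(path: &Path, values: Vec<i32>) {
    let col: ArrayRef = Arc::new(Int32Array::from(values));
    let batch = RecordBatch::try_from_iter(vec![("a", col)]).unwrap();
    let file = File::create(path).unwrap();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}

#[test]
fn concat_two_files() {
    let dir = tempfile::tempdir().unwrap();
    let a = dir.path().join("a.parquet");
    let b = dir.path().join("b.parquet");
    let out = dir.path().join("out.parquet");
    write_input(&a, vec![1, 2, 3]);
    write_input(&b, vec![4, 5]);

    // Assumed invocation: parquet-concat <output> <inputs...>
    let status = Command::new(env!("CARGO_BIN_EXE_parquet-concat"))
        .args([&out, &a, &b])
        .status()
        .unwrap();
    assert!(status.success());

    // Read the concatenated file back and count rows across all batches.
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(&out).unwrap())
        .unwrap()
        .build()
        .unwrap();
    let rows: usize = reader.map(|batch| batch.unwrap().num_rows()).sum();
    assert_eq!(rows, 5);
}
```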
```rust
            Ok((reader, metadata))
        })
        .collect::<Result<Vec<_>>>()?;

    let schema = {
```
Since there are no tests, I think a comment here explaining the rationale for not keeping the files open is probably good
```rust
    let schema = {
```
```suggestion
    // Check schemas in a first pass to make sure they all match
    // and then do the work in a second pass after the validation
    let schema = {
```
```rust
        .iter()
        .map(|x| {
            let reader = File::open(x)?;
            let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
```
This is still parsing the metadata from all the files into memory before checking the schema.
If you are looking to support the large file use case better, it would require fewer resources (memory) to read the schema from the first file and then verify the schema of the remaining files one at a time, rather than reading the metadata for all files before validating.
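A minimal sketch of that one-at-a-time check, not the PR's actual code (the helper names and error messages are made up for illustration; it reuses the `ParquetMetaDataReader` call visible in the diff above):

```rust
use std::fs::File;
use std::path::{Path, PathBuf};

use parquet::errors::{ParquetError, Result};
use parquet::file::metadata::ParquetMetaDataReader;
use parquet::schema::types::Type;

/// Parse only one footer at a time: read the first file's schema, then compare
/// each remaining file against it before moving on to the next one.
fn validate_schemas(inputs: &[PathBuf]) -> Result<Type> {
    let first = inputs
        .first()
        .ok_or_else(|| ParquetError::General("no input files".to_string()))?;
    let expected = read_schema(first)?;
    for path in &inputs[1..] {
        // read_schema drops the parsed footer before returning, so only one
        // file's metadata is ever resident in memory.
        if read_schema(path)? != expected {
            return Err(ParquetError::General(format!(
                "schema of {} does not match the first input",
                path.display()
            )));
        }
    }
    Ok(expected)
}

fn read_schema(path: &Path) -> Result<Type> {
    let file = File::open(path)?;
    let metadata = ParquetMetaDataReader::new().parse_and_finish(&file)?;
    Ok(metadata.file_metadata().schema().clone())
}
```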
Marking as draft, as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.
This is done by iterating over the file set.
We check that the schemas agree before concatenating.
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Are these changes tested?
Tested manually by checking file outputs.
Are there any user-facing changes?
No changes to CLI.