
Conversation


@torgebo torgebo commented Oct 18, 2025

This is done by iterating over the file set.
We check that the schemas agree before concatenating.

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Are these changes tested?

Tested manually by checking file outputs.

Are there any user-facing changes?

No changes to CLI.

github-actions bot added the parquet label (Changes to the parquet crate) Oct 18, 2025
@tustvold
Contributor

How large are the files being concatenated? I ask because parquet-concat copies row groups as-is, whereas ideally row groups should be on the order of 1MB per column. I worry this might just move the problem and generate very degenerate parquet files...

@torgebo
Author

torgebo commented Oct 19, 2025

Hi, good point.

It is true that the proposed change does not affect the row-group size distribution of the output. So if you pass a "degenerate" dataset to parquet-concat, its output file will exhibit those same degeneracies.

It might be that with greater power comes greater responsibility. I don't see that as a strong argument against making our tools more powerful. Indeed, if there were one true way of doing compute, you would likely not need a tool like parquet-concat at all.

The suggested change brings the behaviour of parquet-concat closer to that of the traditional Unix cat tool by handling the files as a "stream". The default open-file ulimit is 1024 or even lower on many Linux systems, and many compute professionals (e.g. university researchers) work on shared (time-sharing) systems where they cannot change system settings (or need to reserve file descriptors for other uses). It seems reasonable to let them concatenate their parquet files even so.
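
To illustrate the shape of that "stream" handling (a minimal sketch of the idea only, not the diff in this PR; `for_each_input` is a hypothetical helper): each input is opened, processed, and closed before the next one, so the tool never holds more than a handful of descriptors regardless of how many paths are passed.

```rust
use std::fs::File;
use std::path::PathBuf;

use parquet::errors::Result;

/// Visit each input with at most one input file open at a time.
fn for_each_input(
    inputs: &[PathBuf],
    mut process: impl FnMut(&File) -> Result<()>,
) -> Result<()> {
    for path in inputs {
        let file = File::open(path)?;
        process(&file)?;
        // `file` is dropped here, releasing its descriptor before the next open.
    }
    Ok(())
}
```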

@tustvold
Contributor

tustvold commented Oct 19, 2025

Let's say you concatenate 1000 files, each with 10 columns. Each input contributes at least one row group to the output, so for the output not to be degenerate (i.e. for no column chunk to be smaller than 1MB) you would need at least 1000 row groups × 10 columns × 1MB, a parquet file of at least 10GB.

Or to phrase it more explicitly, this tool is not meant for concatenating large numbers of small files into larger files. You would need to use a different tool that also rewrites the actual row group contents.

As an aside, your comment reads a lot like something written by ChatGPT; it's generally courteous to disclose such usage.

@torgebo
Author

torgebo commented Oct 23, 2025

@tustvold I have some trouble understanding why the user should not be allowed to concatenate into a file larger than 10 GB, as you describe above. Please help me understand what I'm missing. The official Parquet documentation (https://parquet.apache.org/docs/file-format/configurations/) speaks of row groups of around 1 GB.

For the napkin calculations on row group contents:
Take the valid boundary case of a 1-column dataset. Assume each file is 128 MB and that there are 1000 files. Then the output would be 128 GB, which is well within what the new version of the tool can generate on a single laptop. The row group size can be up towards 128 MB, which should not be too bad (optimally it should be larger rather than smaller?). I would not call that "degenerate".


I think we can assume the user of parquet-concat needs their data combined into a single file, perhaps as part of integrating with an external system. They likely already have their data on disk. They might choose parquet-concat over alternatives because it (a) copies the data correctly through a simple interface, (b) preserves the schema of the original dataset, and (c) is reasonably performant*.

Choosing a particular output file size is an unlikely concern for that user.

* Except for the limit on the number of open files.

@torgebo
Author

torgebo commented Oct 23, 2025

As an orthogonal concern, and if there is interest, let me know if I should add a few integration tests.

Contributor

@alamb alamb left a comment


Thank you @torgebo -- I think this is a reasonable change

While I agree with @tustvold that concatenating 1000s of files may result in other problems, I see no reason not to allow this tool to do that work if requested by the user

> As an orthogonal concern, and if there is interest, let me know if I should add a few integration tests.

Yes, please -- I think with some integration tests we could merge this PR
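
For what it's worth, here is a rough sketch of what one such integration test could look like. It assumes (rather than verifies) that the test runs with the features required to build the parquet-concat binary and the arrow writer, that tempfile is available as a dev-dependency, and that the CLI takes the output path followed by the input paths; adjust to the actual interface as needed.

```rust
use std::fs::File;
use std::process::Command;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Write a single-column Int32 parquet file with the given values.
fn write_input(path: &std::path::Path, values: Vec<i32>) {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let column: ArrayRef = Arc::new(Int32Array::from(values));
    let batch = RecordBatch::try_new(schema.clone(), vec![column]).unwrap();
    let mut writer = ArrowWriter::try_new(File::create(path).unwrap(), schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}

#[test]
fn concat_preserves_row_count() {
    let dir = tempfile::tempdir().unwrap();
    let a = dir.path().join("a.parquet");
    let b = dir.path().join("b.parquet");
    let out = dir.path().join("out.parquet");
    write_input(&a, vec![1, 2, 3]);
    write_input(&b, vec![4, 5]);

    // CARGO_BIN_EXE_* is set by cargo for binaries built in the same package.
    let status = Command::new(env!("CARGO_BIN_EXE_parquet-concat"))
        .args([&out, &a, &b]) // assumed argument order: output first, then inputs
        .status()
        .unwrap();
    assert!(status.success());

    // The concatenated file should contain all rows from both inputs.
    let reader = SerializedFileReader::new(File::open(&out).unwrap()).unwrap();
    assert_eq!(reader.metadata().file_metadata().num_rows(), 5);
}
```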

Ok((reader, metadata))
})
.collect::<Result<Vec<_>>>()?;
let schema = {
Contributor


Since there are no tests, I think a comment here explaining the rationale for not keeping the files open is probably good

Suggested change
let schema = {
// Check schemas in a first pass to make sure they all match
// and then do the work in a second pass after the validation
let schema = {

.iter()
.map(|x| {
let reader = File::open(x)?;
let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
Contributor


This is still parsing the metadata from all the files into memory before checking the schema

If you are looking to better support the large-file use case, it would require fewer resources (memory) to read the schema from the first file and then verify the schema of the remaining files one at a time, rather than reading the metadata for all files before validating.
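
Something along those lines might look like this rough sketch (`validate_schemas` is a hypothetical name; it uses the `ParquetMetaDataReader` and `file_metadata().schema()` comparison already present in the tool, and keeps only the first file's metadata around):

```rust
use std::fs::File;
use std::path::PathBuf;

use parquet::errors::{ParquetError, Result};
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

/// Check that every input matches the schema of the first file, reading
/// metadata one file at a time instead of collecting it for all inputs first.
fn validate_schemas(inputs: &[PathBuf]) -> Result<()> {
    let mut first: Option<ParquetMetaData> = None;
    for path in inputs {
        let file = File::open(path)?;
        let metadata = ParquetMetaDataReader::new().parse_and_finish(&file)?;

        if let Some(expected) = &first {
            if expected.file_metadata().schema() != metadata.file_metadata().schema() {
                return Err(ParquetError::General(format!(
                    "inputs must have the same schema: {} differs from the first input",
                    path.display()
                )));
            }
        } else {
            // Keep only the first file's metadata as the reference schema.
            first = Some(metadata);
        }
        // In the real tool the row groups of `file` would be copied to the
        // output at this point, before the handle is dropped.
    }
    Ok(())
}
```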

@alamb alamb marked this pull request as draft October 27, 2025 18:55
@alamb
Copy link
Contributor

alamb commented Oct 27, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look


Labels

parquet (Changes to the parquet crate)


Development

Successfully merging this pull request may close these issues.

parquet-concat - handle large number of files

3 participants