parquet-concat: handle large number of files. #8651
Conversation
This is done by iterating over the file set. We check that the schemas agree before concatenating.
How large are these files that are being concatenated? I ask because `parquet-concat` copies row groups as-is; ideally, row groups should be on the order of 1MB per column. I worry this might just move the problem and generate very degenerate parquet files...
Hi, good point. It is true that the proposed change does not affect the output row group size distribution, so if you pass a "degenerate" dataset to `parquet-concat`, you will get a degenerate output. It might be that with greater power comes greater responsibility, but I don't see that as a strong argument against making our tools more powerful. Indeed, if there were one true way of doing compute, you would likely not need a tool like `parquet-concat` at all. The suggested change brings the behaviour of `parquet-concat` closer to what a user would expect when passing a large number of input files.
Let's say you concatenate 1000 files, each with 10 columns. For the output not to be degenerate (i.e. not to have column chunks smaller than 1MB), the resulting parquet file would need to be at least 10GB. To phrase it more explicitly: this tool is not meant for concatenating large numbers of small files into larger files. You would need a different tool that also rewrites the actual row group contents. As an aside, your comment reads a lot like something written by ChatGPT; it's generally courteous to disclose usage.
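To sanity-check the 10GB figure, here is a tiny sketch of the arithmetic (1MB is read as 1 MiB, and each input file is assumed to contribute one row group to the output):

```rust
// Smallest output where no column chunk falls below ~1 MiB, for 1000 inputs
// that each contribute one row group with 10 columns.
fn main() {
    let files: u64 = 1_000;
    let columns: u64 = 10;
    let min_column_chunk: u64 = 1 << 20; // ~1 MiB per column chunk

    let min_output = files * columns * min_column_chunk;
    println!("{:.1} GiB", min_output as f64 / (1u64 << 30) as f64);
    // Prints about 9.8 GiB, i.e. the ~10GB mentioned above.
}
```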
@tustvold I have some trouble understanding why the user should not be allowed to concatenate into a file larger than 10GB, as you describe above. Please help me understand what I'm missing.

The central parquet documentation site discusses recommended row group sizes here: https://parquet.apache.org/docs/file-format/configurations/

As for the napkin calculation on row group contents: I think we can assume the user of `parquet-concat` knows their data. Setting the output file size is an unlikely choice for the user.
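For reference, one way to see whether a given output is "degenerate" in the sense discussed above is to dump its row group and column chunk sizes from the footer. A small sketch (the file name is a placeholder):

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;

fn main() -> Result<()> {
    // Placeholder path: point this at e.g. the output of parquet-concat.
    let file = File::open("out.parquet")?;
    let metadata = ParquetMetaDataReader::new().parse_and_finish(&file)?;
    for (i, rg) in metadata.row_groups().iter().enumerate() {
        println!("row group {i}: {} rows", rg.num_rows());
        for (j, col) in rg.columns().iter().enumerate() {
            // Column chunks far below ~1MB compressed are the "degenerate" case.
            println!("  column {j}: {} bytes compressed", col.compressed_size());
        }
    }
    Ok(())
}
```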
As an orthogonal concern, and if there is interest, let me know if I should add a few integration tests.
Thank you @torgebo -- I think this is a reasonable change.
While I agree with @tustvold that concatenating 1000s of files may result in other problems, I see no reason not to allow this tool to do that work if requested by the user.
> As an orthogonal concern, and if there is interest, let me know if I should add a few integration tests.
Yes, please -- I think with some integration tests we could merge this PR
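If useful, here is a rough sketch of what one such integration test could look like. It is only an illustration under several assumptions: the test lives in the crate that builds the `parquet-concat` binary (so `CARGO_BIN_EXE_parquet-concat` is available, which also requires the `cli` feature), the argument order is assumed to be `<output> <inputs...>`, and `arrow` and `tempfile` are assumed to be available as dev-dependencies.

```rust
// Hypothetical integration test: write two small parquet files with the same
// schema, run the pre-built parquet-concat binary over them, and check that
// the output contains the rows of both inputs.
use std::{fs::File, path::Path, process::Command, sync::Arc};

use arrow::array::{ArrayRef, Int32Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn write_input(path: &Path, values: Vec<i32>) {
    let col: ArrayRef = Arc::new(Int32Array::from(values));
    let batch = RecordBatch::try_from_iter(vec![("a", col)]).unwrap();
    let file = File::create(path).unwrap();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}

#[test]
fn concat_two_files() {
    let dir = tempfile::tempdir().unwrap();
    let a = dir.path().join("a.parquet");
    let b = dir.path().join("b.parquet");
    let out = dir.path().join("out.parquet");
    write_input(&a, vec![1, 2, 3]);
    write_input(&b, vec![4, 5]);

    // Assumed invocation: parquet-concat <output> <inputs...>
    let status = Command::new(env!("CARGO_BIN_EXE_parquet-concat"))
        .args([&out, &a, &b])
        .status()
        .unwrap();
    assert!(status.success());

    // Read the concatenated file back and count rows across all batches.
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(&out).unwrap())
        .unwrap()
        .build()
        .unwrap();
    let rows: usize = reader.map(|batch| batch.unwrap().num_rows()).sum();
    assert_eq!(rows, 5);
}
```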
```rust
            Ok((reader, metadata))
        })
        .collect::<Result<Vec<_>>>()?;

    let schema = {
```
Since there are no tests, I think a comment here explaining the rationale for not keeping the files open is probably good
```rust
    let schema = {
```
```suggestion
    // Check schemas in a first pass to make sure they all match
    // and then do the work in a second pass after the validation
    let schema = {
```
```rust
        .iter()
        .map(|x| {
            let reader = File::open(x)?;
            let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
```
This is still parsing the metadata from all the files into memory before checking the schema.
If you are looking to support the large file use case better, it would require fewer resources (memory) to read the schema from the first file and then verify the schema of the remaining files one at a time, rather than reading the metadata for all files before validating.
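A minimal sketch of that one-at-a-time check, not the PR's actual code (the helper names and error messages are made up for illustration; it reuses the `ParquetMetaDataReader` call visible in the diff above):

```rust
use std::fs::File;
use std::path::{Path, PathBuf};

use parquet::errors::{ParquetError, Result};
use parquet::file::metadata::ParquetMetaDataReader;
use parquet::schema::types::Type;

/// Parse only one footer at a time: read the first file's schema, then compare
/// each remaining file against it before moving on to the next one.
fn validate_schemas(inputs: &[PathBuf]) -> Result<Type> {
    let first = inputs
        .first()
        .ok_or_else(|| ParquetError::General("no input files".to_string()))?;
    let expected = read_schema(first)?;
    for path in &inputs[1..] {
        // read_schema drops the parsed footer before returning, so only one
        // file's metadata is ever resident in memory.
        if read_schema(path)? != expected {
            return Err(ParquetError::General(format!(
                "schema of {} does not match the first input",
                path.display()
            )));
        }
    }
    Ok(expected)
}

fn read_schema(path: &Path) -> Result<Type> {
    let file = File::open(path)?;
    let metadata = ParquetMetaDataReader::new().parse_and_finish(&file)?;
    Ok(metadata.file_metadata().schema().clone())
}
```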
Marking as draft, as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.
This is done by iterating over the file set.
We check that the schemas agree before concatenating.
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Are these changes tested?
Tested manually by checking file outputs.
Are there any user-facing changes?
No changes to CLI.