Skip to content

feat: Flush row group by buffered bytes in parquet writer#15751

Closed
wecharyu wants to merge 7 commits intofacebookincubator:mainfrom
wecharyu:enhance_flush_row_group
Closed

feat: Flush row group by buffered bytes in parquet writer#15751
wecharyu wants to merge 7 commits intofacebookincubator:mainfrom
wecharyu:enhance_flush_row_group

Conversation

@wecharyu
Copy link
Contributor

#5442 check bytesInRowGroup based on uncompressed bytes, which will cause the final compressed row group is much more smaller than config bytesInRowGroup.

In this patch we flush row group based on buffered size on arrow, it could reduce the row group numbers and improve the read performance.

@netlify
Copy link

netlify bot commented Dec 11, 2025

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 13bc3d4
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/695dbddccb6bba000897012f
😎 Deploy Preview https://deploy-preview-15751--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 11, 2025
@meta-codesync
Copy link

meta-codesync bot commented Feb 6, 2026

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D92526575.

@meta-codesync
Copy link

meta-codesync bot commented Feb 7, 2026

@xiaoxmeng merged this pull request in 6e01ab2.

yaooqinn added a commit to yaooqinn/velox that referenced this pull request Feb 9, 2026
The test passes 0 as bytesInRowGroup to parquet::DefaultFlushPolicy,
which flows into Arrow's WriterProperties::Builder::maxRowGroupBytes().
Since PR facebookincubator#15751 added ARROW_CHECK_GT(maxRowGroupBytes, 0) validation,
passing 0 causes a SIGABRT crash.

Use 128MB (the default value) instead, so the test controls flushing
by row count only while satisfying the positive-value constraint.
@facebook-github-bot
Copy link
Contributor

This pull request has been reverted by 62c4a06.

@xiaoxmeng
Copy link
Contributor

@wecharyu sorry, but I merged this diff by mistake. It is now reverted by #16317. You can continue work on this. Thanks!

@wecharyu
Copy link
Contributor Author

@xiaoxmeng nvm, I'll align the code with arrow-48467.

arup-chauhan pushed a commit to arup-chauhan/velox that referenced this pull request Feb 23, 2026
…cubator#15751)

Summary:
facebookincubator#5442 check `bytesInRowGroup` based on uncompressed bytes, which will cause the final compressed row group is much more smaller than config `bytesInRowGroup`.

In this patch we flush row group based on buffered size on arrow, it could reduce the row group numbers and improve the read performance.

Pull Request resolved: facebookincubator#15751

Reviewed By: tanjialiang

Differential Revision: D92526575

Pulled By: xiaoxmeng

fbshipit-source-id: b9285e585ed631b75bac2d8c580efbd1f5de9587
Copy link
Collaborator

@PingLiuPing PingLiuPing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wecharyu Thanks for the code.
Do you have further plan to rework on this?

if (batch_size > 0) {
RETURN_NOT_OK(WriteBatch(offset, batch_size));
offset += batch_size;
} else if (offset < batch.num_rows()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to read compare to original code, since it implies when "batch_size <= 0" && "offset < batch.num_rows()" it writes a new group.
Can we add a var such as int64_t available_rows = max_row_group_length - group_rows; and then adjust this value based on your new code
while {
...

if (group_rows > 0) {
if (buffered_bytes >= max_row_group_bytes) {
available_rows = 0;
} else if (buffered_bytes > 0) {
double avg_row_size = buffered_bytes * 1.0 / group_rows;
int64_t rows_by_bytes =
static_cast<int64_t>((max_row_group_bytes - buffered_bytes) / avg_row_size);
available_rows = std::min(available_rows, rows_by_bytes);
}

if (available_rows <= 0) {
RETURN_NOT_OK(NewBufferedRowGroup());
}
}

int64_t batch_size = std::min(available_rows, batch.num_rows() - offset);
RETURN_NOT_OK(WriteBatch(offset, batch_size));
offset += batch_size;
}

@wecharyu
Copy link
Contributor Author

Hi @PingLiuPing, I want first make apache/arrow#48468 ready, then we can cherry-pick it and make little additional change in velox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants