feat: Flush row group by buffered bytes in parquet writer #15751
wecharyu wants to merge 7 commits into facebookincubator:main
Conversation
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D92526575.
@xiaoxmeng merged this pull request in 6e01ab2.
The test passes 0 as bytesInRowGroup to parquet::DefaultFlushPolicy, which flows into Arrow's WriterProperties::Builder::maxRowGroupBytes(). Since PR facebookincubator#15751 added ARROW_CHECK_GT(maxRowGroupBytes, 0) validation, passing 0 causes a SIGABRT crash. Use 128MB (the default value) instead, so the test controls flushing by row count only while satisfying the positive-value constraint.
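The constraint described above can be sketched as follows. This is a minimal illustration with hypothetical names (`validateMaxRowGroupBytes` is not an Arrow API); it throws instead of aborting so the positive-value rule is easy to demonstrate, and uses the 128MB default mentioned in the fix.

```cpp
#include <cstdint>
#include <stdexcept>

// Default used when flushing should be driven by row count only
// (the value the fixed test passes instead of 0).
constexpr int64_t kDefaultRowGroupBytes = 128LL << 20;  // 128MB

// Stand-in for the ARROW_CHECK_GT(maxRowGroupBytes, 0) validation:
// a non-positive limit is rejected up front rather than aborting later.
inline int64_t validateMaxRowGroupBytes(int64_t bytes) {
  if (bytes <= 0) {
    throw std::invalid_argument("maxRowGroupBytes must be > 0");
  }
  return bytes;
}
```

With this rule in place, a caller that only wants row-count-based flushing still has to pass a positive byte limit, which is why the test uses the 128MB default rather than 0.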
This pull request has been reverted by 62c4a06.
@xiaoxmeng nvm, I'll align the code with arrow-48467.
…cubator#15751)

Summary: facebookincubator#5442 checks `bytesInRowGroup` against uncompressed bytes, which causes the final compressed row group to be much smaller than the configured `bytesInRowGroup`. In this patch we flush the row group based on Arrow's buffered size, which can reduce the number of row groups and improve read performance.

Pull Request resolved: facebookincubator#15751
Reviewed By: tanjialiang
Differential Revision: D92526575
Pulled By: xiaoxmeng
fbshipit-source-id: b9285e585ed631b75bac2d8c580efbd1f5de9587
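A small arithmetic sketch of the problem the summary describes: if the flush decision compares the uncompressed buffered size against the configured `bytesInRowGroup`, the row group that actually lands on disk is smaller by roughly the compression ratio. The helper name and the ratios below are illustrative assumptions, not values from the patch.

```cpp
#include <cstdint>

// Approximate on-disk row group size when flushing is triggered by
// uncompressed bytes and the data compresses at `ratio`:1.
inline int64_t approxCompressedGroupBytes(int64_t uncompressedLimit, double ratio) {
  return static_cast<int64_t>(uncompressedLimit / ratio);
}

// E.g. with a 128MB uncompressed target and 4:1 compression, each
// flushed row group ends up around 32MB on disk, a quarter of the
// configured size; flushing on Arrow's buffered (compressed) size
// instead keeps groups near the configured target.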
PingLiuPing left a comment:
@wecharyu Thanks for the code. Do you plan to rework this further?
```cpp
if (batch_size > 0) {
  RETURN_NOT_OK(WriteBatch(offset, batch_size));
  offset += batch_size;
} else if (offset < batch.num_rows()) {
```
This is hard to read compared to the original code, since it implies that when `batch_size <= 0 && offset < batch.num_rows()` it writes a new group.
Can we add a variable such as `int64_t available_rows = max_row_group_length - group_rows;` and then adjust this value based on your new code:
```cpp
while (...) {
  ...
  if (group_rows > 0) {
    if (buffered_bytes >= max_row_group_bytes) {
      available_rows = 0;
    } else if (buffered_bytes > 0) {
      double avg_row_size = buffered_bytes * 1.0 / group_rows;
      int64_t rows_by_bytes = static_cast<int64_t>(
          (max_row_group_bytes - buffered_bytes) / avg_row_size);
      available_rows = std::min(available_rows, rows_by_bytes);
    }
    if (available_rows <= 0) {
      RETURN_NOT_OK(NewBufferedRowGroup());
    }
  }
  int64_t batch_size = std::min(available_rows, batch.num_rows() - offset);
  RETURN_NOT_OK(WriteBatch(offset, batch_size));
  offset += batch_size;
}
```
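The row-budget computation in that suggestion can be pulled out into a standalone, compilable sketch. Names here (`availableRows` and its parameters) are assumptions for illustration, not identifiers from the Arrow or Velox sources; the function caps the rows admitted into the current row group both by the remaining row budget and by an estimate derived from the remaining byte budget.

```cpp
#include <algorithm>
#include <cstdint>

// Returns how many more rows the current row group can accept before a
// flush; 0 means a new buffered row group should be started first.
inline int64_t availableRows(
    int64_t max_row_group_length,
    int64_t max_row_group_bytes,
    int64_t group_rows,
    int64_t buffered_bytes) {
  // Remaining row budget for the current group.
  int64_t available = max_row_group_length - group_rows;
  if (group_rows > 0) {
    if (buffered_bytes >= max_row_group_bytes) {
      // Byte limit already reached: force a flush.
      return 0;
    }
    if (buffered_bytes > 0) {
      // Estimate how many more rows fit in the remaining byte budget,
      // assuming rows keep their average size so far.
      double avg_row_size = static_cast<double>(buffered_bytes) / group_rows;
      int64_t rows_by_bytes = static_cast<int64_t>(
          (max_row_group_bytes - buffered_bytes) / avg_row_size);
      available = std::min(available, rows_by_bytes);
    }
  }
  return std::max<int64_t>(available, 0);
}
```

For example, with a 1000-byte limit, 10 buffered rows, and 500 buffered bytes, the average row is 50 bytes, so roughly 10 more rows fit before the byte limit is reached.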
Hi @PingLiuPing, I want to first make apache/arrow#48468 ready, then we can cherry-pick it and make a small additional change in Velox.