refactor: use iceberg-rust rolling writer directly #134
Conversation
Pull request overview
Refactors compaction output writing to rely on iceberg-rust’s upstream rolling writer implementation (after bumping workspace deps), removing the local rolling writer wrapper while keeping deprecated config knobs as compatibility no-ops.
Changes:
- Bump `iceberg-rust` git workspace dependencies to `baaa9c7b2deb3e744db21712e4b6ced5891a6012`.
- Remove the local `RollingIcebergWriter` module and switch DataFusion execution to `RollingFileWriterBuilder` directly (including `target_file_size_bytes` + `max_concurrent_closes` passthrough).
- Update integration test expectations for upstream rolling behavior and deprecate the dynamic size-estimation config fields/setters as no-op shims.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| integration-tests/src/integration_tests.rs | Adjusts rolling compaction assertions to match upstream rolling behavior by counting outputs per partition bucket. |
| core/src/executor/mod.rs | Stops exporting the removed iceberg_writer module. |
| core/src/executor/iceberg_writer/rolling_iceberg_writer.rs | Deletes the local rolling writer implementation and its unit tests. |
| core/src/executor/iceberg_writer/mod.rs | Removes the now-defunct iceberg_writer module entrypoint. |
| core/src/executor/datafusion/mod.rs | Constructs writers using RollingFileWriterBuilder directly and passes through rolling-related config. |
| core/src/config/mod.rs | Deprecates dynamic size-estimation config fields and builder setters as no-op compatibility shims. |
| Cargo.toml | Updates iceberg-related git dependency revisions. |
| Cargo.lock | Locks updated transitive dependency graph for the new iceberg revision. |
```rust
output_files_per_partition.values().all(|count| *count == 3),
"Compaction should produce exactly 3 files per partition with upstream rolling: {output_files_per_partition:?}"
```
This test’s goal (per the comment) is to ensure rolling within a partition doesn’t panic/error, but asserting an exact == 3 files per partition is likely brittle (minor parquet/iceberg writer changes or different compression can shift rollover boundaries). Consider relaxing this to assert that each bucket produced >1 output file (or a small acceptable range), while still asserting every partition bucket is present.
Suggested change:

```diff
-output_files_per_partition.values().all(|count| *count == 3),
-"Compaction should produce exactly 3 files per partition with upstream rolling: {output_files_per_partition:?}"
+output_files_per_partition.values().all(|count| *count > 1),
+"Compaction should produce more than one file per partition with upstream rolling: {output_files_per_partition:?}"
```
hi @vovacf201 @nagraham @chenzl25 We previously discussed a better size-switching strategy. I have now found that the upstream has exposed the `RollingFileWriterBuilder`. According to my tests, we no longer need to wrap an additional writer externally, and I have retained concurrent close so that throughput is not affected. After this PR is merged we will deprecate two config items, for which I have added the `#[deprecated]` attribute.
nagraham left a comment:
Excellent refactor! I see this is going to give us greater accuracy in file sizes, and it greatly reduces complexity in the code base. This makes me very happy.
The problem before was that the old writer was forced to use Arrow's `RecordBatch::get_array_memory_size()`, which is inaccurate (its own docs admit as much). It doesn't account for Parquet's columnar encoding or its compression, and the writer had to rely on this memory size for the whole file, so we would get wild inaccuracies. For some data sets, we would target 128MB but get files of 20-30MB.
The iceberg-rust rolling writer addresses the problem because the writer exposes a `current_written_size()` function, which returns `inner.bytes_written() + inner.in_progress_size()`:

- `bytes_written`: the ACTUAL compressed/encoded bytes already flushed to the Parquet file.
- `in_progress_size`: an estimate of the size of the data still buffered in the Arrow column writers.
IIRC, that is exactly how the Java library estimates size as well.
Summary
- Bump `iceberg-rust` workspace deps to `baaa9c7b2deb3e744db21712e4b6ced5891a6012`
- Remove the local `RollingIcebergWriter` wrapper and use upstream `RollingFileWriterBuilder` directly
- Pass `target_file_size_bytes` and `max_concurrent_closes` through to the upstream rolling writer
- Deprecate `enable_dynamic_size_estimation` and `size_estimation_smoothing_factor`; both settings are now no-op compatibility shims and their builder setters emit deprecation warnings

Test
`cargo test -p iceberg-compaction-core --lib -- --nocapture`
`cargo test -p iceberg-compaction-integration-tests -- --nocapture`

Note