feat(datafusion): Add TaskWriter for DataFusion #1769

CTTY · 2025-10-20T23:40:27Z

Which issue does this PR close?

Closes #1770 (Add TaskWriter)

What changes are included in this PR?

Add TaskWriter, which leverages RecordBatchPartitionSplitter and projected partition values
Add UnpartitionedWriter for writing unpartitioned data

Are these changes tested?

Added unit tests

crates/iceberg/src/writer/task/mod.rs

crates/iceberg/src/arrow/record_batch_partition_splitter.rs

crates/iceberg/src/writer/partitioning/unpartitioned_writer.rs


liurenjie1024

I think we are on the right track. I left some comments, and we need to split this into smaller PRs.

crates/iceberg/src/arrow/record_batch_partition_splitter.rs

liurenjie1024 · 2025-10-23T09:54:45Z

crates/integrations/datafusion/src/writer/task.rs

+    Fanout(FanoutWriter<B>),
+    /// Writer for partitioned tables with sorted data (maintains single active writer)
+    Clustered(ClusteredWriter<B>),


We could simplify this as

Partitioned { splitter: RecordBatchSplitter, partitioned_writer: Arc<dyn PartitionedWriter> }
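To make the suggestion concrete, here is a minimal, std-only sketch of what collapsing the two partitioned variants into one could look like. Everything in it is an illustrative assumption rather than the PR's code: `WriterKind`, `FanoutLike`, `ClusteredLike`, the `Batch` alias, and the trait methods are hypothetical stand-ins, and it uses `Box<dyn PartitionedWriter>` instead of the `Arc` in the comment only because writing here needs `&mut` access.

```rust
use std::collections::BTreeMap;

/// Stand-in for an Arrow RecordBatch: rows of (partition value, payload).
type Batch = Vec<(String, i64)>;

/// Hypothetical object-safe trait shared by both partitioned strategies.
trait PartitionedWriter {
    fn write(&mut self, partition: &str, batch: Batch);
    fn rows_written(&self) -> usize;
}

/// Fanout-like strategy: keeps one open buffer per partition.
#[derive(Default)]
struct FanoutLike {
    open: BTreeMap<String, Vec<i64>>,
}

impl PartitionedWriter for FanoutLike {
    fn write(&mut self, partition: &str, batch: Batch) {
        let buffer = self.open.entry(partition.to_string()).or_default();
        buffer.extend(batch.into_iter().map(|(_, value)| value));
    }

    fn rows_written(&self) -> usize {
        self.open.values().map(Vec::len).sum()
    }
}

/// Clustered-like strategy: tracks a single active partition (sorted input).
#[derive(Default)]
struct ClusteredLike {
    active: Option<String>,
    rows: usize,
}

impl PartitionedWriter for ClusteredLike {
    fn write(&mut self, partition: &str, batch: Batch) {
        // A real writer would close the previous file when the partition changes.
        if self.active.as_deref() != Some(partition) {
            self.active = Some(partition.to_string());
        }
        self.rows += batch.len();
    }

    fn rows_written(&self) -> usize {
        self.rows
    }
}

/// Hypothetical splitter: groups rows by their partition value.
struct RecordBatchSplitter;

impl RecordBatchSplitter {
    fn split(&self, batch: Batch) -> Vec<(String, Batch)> {
        let mut groups: BTreeMap<String, Batch> = BTreeMap::new();
        for row in batch {
            groups.entry(row.0.clone()).or_default().push(row);
        }
        groups.into_iter().collect()
    }
}

/// One partitioned variant instead of separate Fanout/Clustered variants.
enum WriterKind {
    Unpartitioned(Vec<i64>),
    Partitioned {
        splitter: RecordBatchSplitter,
        partitioned_writer: Box<dyn PartitionedWriter>,
    },
}

impl WriterKind {
    fn write(&mut self, batch: Batch) {
        match self {
            Self::Unpartitioned(rows) => rows.extend(batch.into_iter().map(|(_, v)| v)),
            Self::Partitioned { splitter, partitioned_writer } => {
                for (key, part) in splitter.split(batch) {
                    partitioned_writer.write(&key, part);
                }
            }
        }
    }

    fn rows_written(&self) -> usize {
        match self {
            Self::Unpartitioned(rows) => rows.len(),
            Self::Partitioned { partitioned_writer, .. } => partitioned_writer.rows_written(),
        }
    }
}
```

With this shape the choice between fanout and clustered behaviour is made once, when the boxed writer is constructed, and the `match` in `write` no longer needs a separate arm per strategy. The cost is dynamic dispatch on each write.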

…atchPartitionSplitter (#1781)

## Which issue does this PR close?

- Closes #1786
- Covered some of the changes from the previous draft: #1769

## What changes are included in this PR?

- Move PartitionValueCalculator to core/arrow so it can be reused by RecordBatchPartitionSplitter
- Allow skipping partition value calculation in partition splitter for projected batches
- Return <PartitionKey, RecordBatch> rather than <Struct, RecordBatch> pairs in RecordBatchPartitionSplitter::split

## Are these changes tested?

Added unit tests
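As a rough illustration of the splitter behaviour described in that commit message, the std-only sketch below groups rows into (key, rows) pairs and can either compute the partition value or reuse one already projected onto the row. All names here (`Splitter`, `Row`, `use_projected`, the `day=` transform) are hypothetical stand-ins chosen for the example. The real `RecordBatchPartitionSplitter` works on Arrow record batches and Iceberg partition specs, and its API is not reproduced here.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for Iceberg's PartitionKey.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct PartitionKey(String);

/// Hypothetical row; `projected_partition` models a pre-computed partition column.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    ts_day: u32,
    value: i64,
    projected_partition: Option<String>,
}

/// Hypothetical splitter with a switch for reusing projected partition values.
struct Splitter {
    use_projected: bool,
}

impl Splitter {
    /// Either reuse the projected value or compute it from the source column.
    fn partition_of(&self, row: &Row) -> PartitionKey {
        match (&row.projected_partition, self.use_projected) {
            (Some(projected), true) => PartitionKey(projected.clone()),
            // Assumed transform for the example: identity on a day number.
            _ => PartitionKey(format!("day={}", row.ts_day)),
        }
    }

    /// Returns (key, rows) pairs, ordered by key for determinism.
    fn split(&self, batch: Vec<Row>) -> Vec<(PartitionKey, Vec<Row>)> {
        let mut groups: BTreeMap<PartitionKey, Vec<Row>> = BTreeMap::new();
        for row in batch {
            groups.entry(self.partition_of(&row)).or_default().push(row);
        }
        groups.into_iter().collect()
    }
}
```

When the projected values agree with what the transform would compute, both modes yield the same grouping, and the projected mode simply skips the computation.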

liurenjie1024

Thanks @CTTY for this PR, generally LGTM!

crates/integrations/datafusion/src/task_writer.rs

liurenjie1024 · 2025-11-03T10:07:08Z

crates/integrations/datafusion/src/task_writer.rs

+
+        Self {
+            writer,
+            partition_splitter: None,


Why not init it here?

This is because partition_splitter requires the schema of the record batches, which may contain a projected column and therefore differ from the Iceberg schema. It would be safer to initialize partition_splitter directly from the schema of the record batches.
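The lazy initialization described in this reply can be sketched in a few lines of std-only Rust. This is an illustration of the pattern, not the PR's code: `Schema`, `Batch`, `PartitionSplitter`, and the `_partition` column name are hypothetical stand-ins, and the only real API used is `Option::get_or_insert_with`.

```rust
/// Hypothetical stand-in for an Arrow schema: just the column names.
#[derive(Debug, Clone, PartialEq)]
struct Schema(Vec<String>);

/// Hypothetical stand-in for a record batch.
struct Batch {
    schema: Schema,
    rows: usize,
}

/// Hypothetical splitter that must be built from the batch schema.
struct PartitionSplitter {
    schema: Schema,
}

struct TaskWriter {
    /// Stays `None` until the first batch reveals the actual schema.
    partition_splitter: Option<PartitionSplitter>,
    rows_written: usize,
}

impl TaskWriter {
    fn new() -> Self {
        Self {
            partition_splitter: None,
            rows_written: 0,
        }
    }

    fn write(&mut self, batch: &Batch) {
        // Build the splitter on first use, from the schema the batch really has.
        let splitter = self
            .partition_splitter
            .get_or_insert_with(|| PartitionSplitter {
                schema: batch.schema.clone(),
            });
        // Later batches are expected to carry the same schema.
        debug_assert_eq!(splitter.schema, batch.schema);
        self.rows_written += batch.rows;
    }
}
```

The trade-off is the one raised in the review: the constructor stays simple and never sees a mismatched schema, but the field has to be an `Option` and every write goes through the initialization check.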

I'm not convinced. In fact, this schema could be inferred from DataFusion's ExecutionPlan.

Anyway, this is a small code style problem; we can fix it later.

liurenjie1024

Thanks @CTTY for this PR!

liurenjie1024 · 2025-11-05T09:09:36Z

crates/integrations/datafusion/src/task_writer.rs

+
+        Self {
+            writer,
+            partition_splitter: None,


I'm not convinced. In fact, this schema could be inferred from DataFusion's ExecutionPlan.

CTTY commented Oct 20, 2025

crates/iceberg/src/writer/task/mod.rs (outdated, resolved)

CTTY commented Oct 20, 2025

crates/iceberg/src/writer/task/mod.rs (outdated, resolved)

CTTY commented Oct 21, 2025

crates/iceberg/src/arrow/record_batch_partition_splitter.rs (resolved)

CTTY commented Oct 21, 2025

crates/iceberg/src/arrow/record_batch_partition_splitter.rs (resolved)

CTTY commented Oct 21, 2025

crates/iceberg/src/writer/partitioning/unpartitioned_writer.rs (outdated, resolved)

liurenjie1024 reviewed Oct 22, 2025

crates/iceberg/src/writer/task/mod.rs (outdated, resolved)

crates/iceberg/src/writer/partitioning/unpartitioned_writer.rs (outdated, resolved)

CTTY force-pushed the ctty/task-writer branch from e875e8e to f4b72ef Compare October 23, 2025 03:05

liurenjie1024 reviewed Oct 23, 2025


CTTY mentioned this pull request Oct 23, 2025

refactor(arrow,datafusion): Reuse PartitionValueCalculator in RecordBatchPartitionSplitter #1781

Merged

CTTY closed this Oct 29, 2025

CTTY force-pushed the ctty/task-writer branch from c5061e2 to d3d3127 Compare October 29, 2025 09:30

Add new task writer and unpartitioned writer

116aae8

CTTY reopened this Oct 29, 2025

CTTY changed the title from "feat(io): UnpartitionedWriter + TaskWriter" to "feat(datafusion): Add TaskWriter for DataFusion" Oct 29, 2025

CTTY marked this pull request as ready for review October 29, 2025 09:37

removed allow dead code for partition splitter

a53d872

CTTY requested a review from liurenjie1024 October 29, 2025 14:12

liurenjie1024 reviewed Nov 3, 2025


CTTY added 2 commits November 4, 2025 22:02

Merge branch 'main' into ctty/task-writer

a24b51d

make task writer pub(crate)

f1c1f7a

CTTY requested a review from liurenjie1024 November 5, 2025 07:33

liurenjie1024 approved these changes Nov 5, 2025


liurenjie1024 merged commit a970a0c into apache:main Nov 5, 2025
16 checks passed
