Conversation

@Yicong-Huang (Contributor)

What changes were proposed in this pull request?

This PR introduces an iterator API for Arrow grouped aggregation UDFs in PySpark. It adds support for two new UDF patterns:

  • Iterator[pa.Array] -> Any for single column aggregations
  • Iterator[Tuple[pa.Array, ...]] -> Any for multiple column aggregations

The implementation adds a new Python eval type SQL_GROUPED_AGG_ARROW_ITER_UDF with corresponding support in type inference, worker serialization, and Scala execution planning.

Why are the changes needed?

The current Arrow grouped aggregation API requires loading all data for a group into memory at once, which can be problematic for groups with large amounts of data. The iterator API allows processing data in batches, providing:

  1. Memory Efficiency: Processes data incrementally rather than loading the entire group into memory
  2. Consistency: Aligns with existing iterator APIs (e.g., SQL_SCALAR_ARROW_ITER_UDF)
  3. Flexibility: Allows initialization of expensive state once per group while processing batches iteratively (see the sketch below)
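
For example, point 3 means a UDF can build heavyweight state (a model, a connection pool, etc.) once and then consume the group's batches lazily. A minimal sketch, where load_model() and model.score() are hypothetical placeholders for whatever expensive setup the user needs, not PySpark APIs:

import pyarrow as pa
from typing import Iterator
from pyspark.sql.functions import arrow_udf

@arrow_udf("double")
def arrow_scored_sum(it: Iterator[pa.Array]) -> float:
    # Hypothetical expensive initialization, done once per group rather than once per batch.
    model = load_model()  # placeholder for user-defined setup
    total = 0.0
    for batch in it:  # the group's values arrive incrementally as pa.Array batches
        total += float(model.score(batch.to_numpy()).sum())
    return total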

Does this PR introduce any user-facing change?

Yes. This PR adds a new API pattern for Arrow grouped aggregation UDFs:

Single column aggregation:

import pyarrow as pa
from typing import Iterator
from pyspark.sql.functions import arrow_udf

@arrow_udf("double")
def arrow_mean(it: Iterator[pa.Array]) -> float:
    sum_val = 0.0
    cnt = 0
    for v in it:
        sum_val += pa.compute.sum(v).as_py()
        cnt += len(v)
    return sum_val / cnt if cnt > 0 else 0.0

df.groupby("id").agg(arrow_mean(df['v'])).show()

Multiple column aggregation:

import pyarrow as pa
import numpy as np
from typing import Iterator, Tuple
from pyspark.sql.functions import arrow_udf

@arrow_udf("double")
def arrow_weighted_mean(it: Iterator[Tuple[pa.Array, pa.Array]]) -> float:
    weighted_sum = 0.0
    weight = 0.0
    for v, w in it:
        weighted_sum += np.dot(v.to_numpy(), w.to_numpy())
        weight += pa.compute.sum(w).as_py()
    return weighted_sum / weight if weight > 0 else 0.0

df.groupby("id").agg(arrow_weighted_mean(df["v"], df["w"])).show()

How was this patch tested?

Added comprehensive unit tests in python/pyspark/sql/tests/arrow/test_arrow_udf_grouped_agg.py:

  1. test_iterator_grouped_agg_single_column() - Tests single column iterator aggregation with Iterator[pa.Array]
  2. test_iterator_grouped_agg_multiple_columns() - Tests multiple column iterator aggregation with Iterator[Tuple[pa.Array, pa.Array]]
  3. test_iterator_grouped_agg_eval_type() - Verifies correct eval type inference from type hints
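
For the third test, a rough sketch of what the eval type check can look like. It assumes the new constant is exposed as PythonEvalType.SQL_GROUPED_AGG_ARROW_ITER_UDF, that arrow_udf exposes evalType the way other PySpark UDFs do, and that the import path for PythonEvalType is pyspark.util:

import pyarrow as pa
from typing import Iterator
from pyspark.sql.functions import arrow_udf
from pyspark.util import PythonEvalType

@arrow_udf("long")
def arrow_count(it: Iterator[pa.Array]) -> int:
    # Consume the group's batches incrementally and count rows.
    return sum(len(v) for v in it)

# The Iterator[pa.Array] -> Any type hints should infer the new eval type.
assert arrow_count.evalType == PythonEvalType.SQL_GROUPED_AGG_ARROW_ITER_UDF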

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by: Cursor with Claude Sonnet 4.5

@Yicong-Huang force-pushed the SPARK-53615/feat/arrow-grouped-agg-iterator-api branch 2 times, most recently from c09d7f2 to ade3730 on November 13, 2025 10:14
@Yicong-Huang changed the title from [SPARK-53615][PYTHON] Introduce iterator API for Arrow grouped aggregation UDF to [WIP][SPARK-53615][PYTHON] Introduce iterator API for Arrow grouped aggregation UDF on Nov 13, 2025
@Yicong-Huang force-pushed the SPARK-53615/feat/arrow-grouped-agg-iterator-api branch from ade3730 to fe79337 on November 13, 2025 17:50
@Yicong-Huang force-pushed the SPARK-53615/feat/arrow-grouped-agg-iterator-api branch from fe79337 to 9eec72f on November 13, 2025 18:09
@Yicong-Huang changed the title from [WIP][SPARK-53615][PYTHON] Introduce iterator API for Arrow grouped aggregation UDF to [SPARK-53615][PYTHON] Introduce iterator API for Arrow grouped aggregation UDF on Nov 13, 2025
iteratively, which is more memory-efficient than loading all data at once. The returned
scalar can be a python primitive type, a numpy data type, or a `pyarrow.Scalar` instance.
>>> import pandas as pd
Contributor

pandas is not used?

...
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
>>> df.groupby("id").agg(arrow_mean(df['v'])).show() # doctest: +SKIP
Contributor

do not skip the doctests

>>> df = spark.createDataFrame(
... [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)],
... ("id", "v", "w"))
>>> df.groupby("id").agg(arrow_weighted_mean(df["v"], df["w"])).show() # doctest: +SKIP
Contributor

ditto

Comment on lines 362 to 363
| 1| 1.6666666666666667|
| 2| 7.166666666666667|
Contributor

Suggested change
-| 1| 1.6666666666666667|
-| 2| 7.166666666666667|
+| 1| 1.6666666666666...|
+| 2| 7.166666666666...|

Contributor

I suggest not comparing the exact values, since they may vary due to env/version changes.



# Serializer for SQL_GROUPED_AGG_ARROW_ITER_UDF
class ArrowStreamAggArrowIterUDFSerializer(ArrowStreamArrowUDFSerializer):
Contributor

we should consolidate it with ArrowStreamAggArrowUDFSerializer: make ArrowStreamAggArrowUDFSerializer output the iterator and adjust the wrapper of SQL_GROUPED_AGG_ARROW_UDF and SQL_WINDOW_AGG_ARROW_UDF

Contributor Author

shall we do it as a follow-up?

if is_iterator_array:
    return ArrowUDFType.SCALAR_ITER

# Iterator[Tuple[pa.Array, ...]] -> Any
Contributor

let's move the new inference after pa.Array, ... -> Any

Contributor Author

moved

dataframes_in_group = read_int(stream)

if dataframes_in_group == 1:
    batches = list(ArrowStreamSerializer.load_stream(self, stream))
Contributor

we should not load all batches of a group at once; this new API is designed to process each group incrementally
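
A rough illustration of the incremental shape being asked for (not the actual patch; it assumes the surrounding load_stream generator context and that each group's iterator is fully consumed before the next group is read from the stream):

if dataframes_in_group == 1:
    # Hand back the group's batch iterator lazily instead of materializing it with list().
    yield ArrowStreamSerializer.load_stream(self, stream)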

@zhengruifeng (Contributor)

also cc @Kimahriman, I guess you might also be interested in this PR
