feat(rust/sedona-expr): Add GroupsAccumulator to framework and implementation for ST_Envelope_Agg #510

Merged
paleolimbot merged 11 commits into apache:main from paleolimbot:groups-accumulator on Jan 16, 2026
@paleolimbot (Member) commented Jan 13, 2026

This PR allows aggregate functions implemented using the SedonaAccumulator to implement their own GroupsAccumulator (which they should usually do for performance on tiny groups). This also implements one for ST_Envelope_Agg() and the appropriate extension to the AggregateUdfTester (since I couldn't seem to trigger all the features of the GroupsAccumulator reliably with simple queries).

Other aggregate functions would benefit from this as well, but each one needs its own GroupsAccumulator implementation, which might take a bit.

Closes #167.
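
To make the idea concrete, here is a minimal pure-Python sketch of what a groups accumulator for bounding boxes does. It is not the Sedona or DataFusion API: the class name `BoundsGroupsSketch` and its signatures are hypothetical, loosely modelled on the `update_batch`/`merge_batch`/`evaluate` methods of DataFusion's `GroupsAccumulator`, and geometries are simplified to `(x, y)` tuples. The point is that one object keeps flat per-group state indexed by group number, instead of one accumulator object being allocated per group.

```python
import math


class BoundsGroupsSketch:
    """Hypothetical illustration: a single object tracks the bounds of every group."""

    def __init__(self):
        # One flat list per bound, indexed by group number
        self.xmin, self.ymin, self.xmax, self.ymax = [], [], [], []

    def _resize(self, total_num_groups):
        # Grow the state so every group index seen so far has a slot
        missing = total_num_groups - len(self.xmin)
        if missing > 0:
            self.xmin += [math.inf] * missing
            self.ymin += [math.inf] * missing
            self.xmax += [-math.inf] * missing
            self.ymax += [-math.inf] * missing

    def _expand(self, g, xmin, ymin, xmax, ymax):
        # Widen the box of group g so it also covers the given box
        self.xmin[g] = min(self.xmin[g], xmin)
        self.ymin[g] = min(self.ymin[g], ymin)
        self.xmax[g] = max(self.xmax[g], xmax)
        self.ymax[g] = max(self.ymax[g], ymax)

    def update_batch(self, points, group_indices, total_num_groups):
        # points[i] belongs to group group_indices[i]; None stands in for a null geometry
        self._resize(total_num_groups)
        for point, g in zip(points, group_indices):
            if point is not None:
                x, y = point
                self._expand(g, x, y, x, y)

    def merge_batch(self, boxes, group_indices, total_num_groups):
        # boxes[i] is a partial (xmin, ymin, xmax, ymax) result from another accumulator
        self._resize(total_num_groups)
        for box, g in zip(boxes, group_indices):
            if box is not None:
                self._expand(g, *box)

    def evaluate(self):
        # One envelope per group; None for groups that never saw a non-null input
        return [
            None if x0 > x1 else (x0, y0, x1, y1)
            for x0, y0, x1, y1 in zip(self.xmin, self.ymin, self.xmax, self.ymax)
        ]


acc = BoundsGroupsSketch()
acc.update_batch([(0, 0), (5, 1), None, (2, 7)], [0, 1, 1, 0], 3)
acc.update_batch([(-1, 3)], [1], 3)
envelopes = acc.evaluate()

# Partial results can be folded into a different grouping, as in a final aggregation stage
merged = BoundsGroupsSketch()
merged.merge_batch(envelopes, [0, 0, 1], 2)
```

Exercising several update batches, null inputs, empty groups, and a merge step in one place is the kind of coverage that is hard to force from a simple query, which is presumably why a dedicated tester extension is useful.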

@paleolimbot changed the title from "feat(rust/sedona-expr): Add GroupsAccumulator to framework + ST_Envelope_Agg" to "feat(rust/sedona-expr): Add GroupsAccumulator to framework and implementation for ST_Envelope_Agg" Jan 14, 2026
@paleolimbot requested a review from Copilot January 15, 2026 05:03

Copilot AI (Contributor) left a comment

Pull request overview

This PR implements GroupsAccumulator support for aggregate functions in the Sedona Rust framework, specifically implementing it for ST_Envelope_Agg(). The GroupsAccumulator is a performance optimization in DataFusion for efficiently aggregating many small groups.

Changes:

  • Extended the SedonaAccumulator trait with optional groups_accumulator_supported() and groups_accumulator() methods
  • Implemented BoundsGroupsAccumulator2D for ST_Envelope_Agg() to optimize grouped aggregations
  • Enhanced AggregateUdfTester with aggregate_groups() method to test groups accumulator functionality
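
The opt-in shape of the first change can be sketched as follows. This is a hypothetical Python analogue, not the Rust trait itself: the class names, the string return values, and `choose_accumulator` are made up for illustration. The assumption shown is that the default implementation reports no support, so existing aggregates keep working unchanged and only implementations that override both methods take the grouped path.

```python
class AggregateSketch:
    """Hypothetical stand-in for an aggregate implementation with optional methods."""

    def accumulator(self):
        raise NotImplementedError

    def groups_accumulator_supported(self):
        # Default: opt out, so existing implementations need no changes
        return False

    def groups_accumulator(self):
        raise NotImplementedError("this aggregate has no groups accumulator")


class PlainAggregateSketch(AggregateSketch):
    # Implements only the per-group path
    def accumulator(self):
        return "per-group accumulator"


class EnvelopeAggregateSketch(AggregateSketch):
    # Implements both paths and advertises the grouped one
    def accumulator(self):
        return "per-group accumulator"

    def groups_accumulator_supported(self):
        return True

    def groups_accumulator(self):
        return "groups accumulator"


def choose_accumulator(impl):
    # The framework asks first and falls back to the per-group path otherwise
    if impl.groups_accumulator_supported():
        return impl.groups_accumulator()
    return impl.accumulator()


plain_choice = choose_accumulator(PlainAggregateSketch())
envelope_choice = choose_accumulator(EnvelopeAggregateSketch())
```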

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File | Description
rust/sedona-testing/src/testers.rs | Added aggregate_groups() method and refactored shared schema/expression initialization
rust/sedona-functions/src/st_envelope_agg.rs | Implemented BoundsGroupsAccumulator2D and added comprehensive grouped aggregation tests
rust/sedona-expr/src/aggregate_udf.rs | Added groups_accumulator_supported() and create_groups_accumulator() to the UDF framework
python/sedonadb/tests/functions/test_aggregate.py | Added integration tests for ST_Envelope_Agg including grouped aggregations

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@paleolimbot marked this pull request as ready for review January 15, 2026 05:06

@jiayuasu (Member)

Did you see any performance gain from this?

@paleolimbot (Member, Author)

For this particular example (ST_Envelope_Agg) it's hugely beneficial:

import sedona.db

sd = sedona.db.connect()

# 10 million points
sd.sql("""SELECT * FROM sd_random_geometry('{"target_rows": 10000000}')""").to_memtable().to_view("pts")

# Before (nightly build, without this PR):
# pip install --pre --force-reinstall sedonadb --extra-index-url=https://repo.fury.io/sedona-nightlies/
%time sd.sql("""SELECT ST_Envelope_Agg(geometry) as e FROM pts GROUP BY id % 1000""").execute()
%time sd.sql("""SELECT ST_Envelope_Agg(geometry) as e FROM pts GROUP BY id % 10000""").execute()
%time sd.sql("""SELECT ST_Envelope_Agg(geometry) as e FROM pts GROUP BY id % 100000""").execute()
#> CPU times: user 885 ms, sys: 29.9 ms, total: 915 ms
#> Wall time: 95.7 ms
#> CPU times: user 1.03 s, sys: 13.5 ms, total: 1.04 s
#> Wall time: 99.2 ms
#> CPU times: user 15.8 s, sys: 146 ms, total: 15.9 s
#> Wall time: 1.44 s

# After (local build of this branch):
# pip install python/sedonadb
%time sd.sql("""SELECT ST_Envelope_Agg(geometry) as e FROM pts GROUP BY id % 1000""").execute()
%time sd.sql("""SELECT ST_Envelope_Agg(geometry) as e FROM pts GROUP BY id % 10000""").execute()
%time sd.sql("""SELECT ST_Envelope_Agg(geometry) as e FROM pts GROUP BY id % 100000""").execute()
#> CPU times: user 283 ms, sys: 34.6 ms, total: 317 ms
#> Wall time: 43.1 ms
#> CPU times: user 238 ms, sys: 7.81 ms, total: 246 ms
#> Wall time: 25.7 ms
#> CPU times: user 407 ms, sys: 58 ms, total: 465 ms
#> Wall time: 44.5 ms
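
As a quick reading aid, the wall times above can be turned into speedup ratios. The numbers below are copied from the output above; the dictionary names are made up, the first set of timings is taken to be the nightly build and the second the local build, and the rows-per-group figure assumes the id values are spread evenly across groups, which the output does not confirm.

```python
# Wall times in milliseconds, copied from the timings above, keyed by number of groups
nightly_ms = {1_000: 95.7, 10_000: 99.2, 100_000: 1440.0}
local_ms = {1_000: 43.1, 10_000: 25.7, 100_000: 44.5}

total_rows = 10_000_000

speedups = {}
for groups, before in nightly_ms.items():
    after = local_ms[groups]
    speedups[groups] = before / after
    # Assumption: ids are evenly distributed, so each group has about the same size
    rows_per_group = total_rows // groups
    print(f"{groups:>7} groups (~{rows_per_group} rows each): {speedups[groups]:.1f}x faster")
```

The gain grows as groups get smaller: roughly 2x at 1,000 groups, about 4x at 10,000, and about 32x at 100,000, which matches the expectation that the benefit is largest for many tiny groups.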

@jiayuasu (Member)

Great work!

@paleolimbot merged commit fb15876 into apache:main Jan 16, 2026 (15 checks passed)
@paleolimbot deleted the groups-accumulator branch January 16, 2026 02:54
@paleolimbot added this to the 0.3.0 milestone Jan 26, 2026
Development

Successfully merging this pull request may close these issues.

rust/sedona-expr: Aggregate function framework should expose GroupsAccumulator

4 participants