fix(functions-aggregate): drain CORR state vectors for streaming aggregation #19669

geoffreyclaude · 2026-01-06T16:08:03Z

Which issue does this PR close?

N/A

Rationale for this change

This change addresses a failure in the CORR aggregate function when running in streaming mode. The CorrelationGroupsAccumulator (introduced in PR #13581) was failing to drain its state vectors during EmitTo::First calls, causing internal state to persist across emissions. This led to memory leaks, incorrect results for subsequent groups, and "length mismatch" errors because the internal vector sizes diverged from the number of emitted groups.
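To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use DataFusion itself: `EmitTo` below is a simplified stand-in for DataFusion's enum of the same name, and `ToyState`, `emit_without_drain`, and `emit_with_drain` are hypothetical names invented for illustration. It shows how a per-group state vector that is not drained on a partial emit keeps its old length, while a drained one shrinks to the groups still pending.

```rust
/// Simplified stand-in for DataFusion's `EmitTo` (assumption: only the two
/// variants relevant to this PR are modeled).
#[derive(Clone, Copy)]
pub enum EmitTo {
    All,
    First(usize),
}

impl EmitTo {
    /// Removes the emitted prefix from `v` and returns it; `v` keeps the rest.
    pub fn take_needed<T>(&self, v: &mut Vec<T>) -> Vec<T> {
        match self {
            EmitTo::All => std::mem::take(v),
            EmitTo::First(n) => {
                // `split_off` leaves [0, n) in `v` and returns [n, len).
                // It panics if n > v.len().
                let rest = v.split_off(*n);
                // Put the remainder back and hand out the emitted prefix.
                std::mem::replace(v, rest)
            }
        }
    }
}

/// Hypothetical per-group state: one slot per group.
pub struct ToyState {
    pub counts: Vec<u64>,
}

impl ToyState {
    /// Buggy shape: reads every group and leaves the state untouched.
    pub fn emit_without_drain(&mut self, _emit_to: EmitTo) -> Vec<u64> {
        self.counts.clone()
    }

    /// Fixed shape: consumes exactly the groups being emitted.
    pub fn emit_with_drain(&mut self, emit_to: EmitTo) -> Vec<u64> {
        emit_to.take_needed(&mut self.counts)
    }
}
```

With three groups and `EmitTo::First(2)`, `emit_without_drain` returns three values for two emitted group keys and leaves three slots behind, which is one way a column-length mismatch like the one reported here can arise. `emit_with_drain` returns two values and leaves one slot for the group still in flight.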

Reproducer

-- Set up data
CREATE TABLE stream_test (
    g INT,
    x DOUBLE,
    y DOUBLE
) AS VALUES
(1, 1.0, 1.0), (1, 2.0, 2.0),
(2, 1.0, 5.0), (2, 2.0, 5.0),
(3, 1.0, 1.0), (3, 2.0, 2.0);

-- Trigger streaming aggregation via sorted subquery
SELECT
  g,
  CORR(x, y)
FROM (SELECT * FROM stream_test ORDER BY g LIMIT 10000)
GROUP BY g
ORDER BY g;

Before: DataFusion error: Arrow error: Invalid argument error: all columns in a record batch must have the same length

After:

1 1
2 NULL
3 1
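The expected values can be checked by hand with the textbook Pearson formula. The sketch below is an illustration only: `pearson` is a hypothetical helper, not DataFusion's implementation, and returning `None` (standing in for SQL NULL) when either variable has zero variance is an assumption chosen to match the output shown above.

```rust
/// Textbook Pearson correlation over two equal-length slices.
/// Returns `None` when the inputs are unusable or when either variable is
/// constant (assumption: this models the NULL in the expected output).
pub fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    let n = xs.len();
    if n != ys.len() || n < 2 {
        return None;
    }
    let nf = n as f64;
    let mean_x = xs.iter().sum::<f64>() / nf;
    let mean_y = ys.iter().sum::<f64>() / nf;

    // Accumulate the co-moment and the two second central moments.
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (x, y) in xs.iter().zip(ys) {
        let dx = x - mean_x;
        let dy = y - mean_y;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }

    // A constant column has zero variance, so the ratio is undefined.
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}
```

Groups 1 and 3 hold the points (1, 1) and (2, 2), which lie on a line with positive slope, so the correlation is 1. In group 2, y is 5.0 in both rows, so its variance is zero and the result is undefined.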

What changes are included in this PR?

This PR is structured into two commits: the first adds a failing test case to demonstrate the issue, and the second implements the fix.

The accumulator now uses emit_to.take_needed() in both evaluate and state to properly consume the emitted portions of the state vectors. Additionally, the size() implementation has been updated to use vector capacity for more accurate memory accounting.
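The difference between length-based and capacity-based accounting can be sketched as follows. `ToyCorrState` and its field names are hypothetical and only illustrate the idea; they are not the accumulator's actual definition.

```rust
use std::mem::size_of;

/// Hypothetical accumulator state; field names are illustrative only.
pub struct ToyCorrState {
    pub count: Vec<u64>,
    pub sum_x: Vec<f64>,
    pub sum_y: Vec<f64>,
}

impl ToyCorrState {
    /// Counts only live elements, so spare allocated capacity is invisible.
    pub fn size_by_len(&self) -> usize {
        self.count.len() * size_of::<u64>()
            + self.sum_x.len() * size_of::<f64>()
            + self.sum_y.len() * size_of::<f64>()
    }

    /// Counts the allocated buffers, including capacity not yet in use.
    pub fn size_by_capacity(&self) -> usize {
        self.count.capacity() * size_of::<u64>()
            + self.sum_x.capacity() * size_of::<f64>()
            + self.sum_y.capacity() * size_of::<f64>()
    }
}
```

Because a `Vec` usually allocates more slots than it currently holds, the capacity-based figure is closer to the memory actually reserved, which is what a memory budget needs to track.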

Are these changes tested?

Yes, a new test case in aggregate.slt triggers streaming aggregation via an ordered subquery. This test previously crashed with an Arrow length mismatch error and now produces correct results.

Are there any user-facing changes?

Yes, SQL queries that trigger streaming aggregation using CORR (typically those with specific ordering requirements) will now succeed instead of failing with a length mismatch error.

martin-g · 2026-01-07T09:21:09Z

datafusion/sqllogictest/test_files/aggregate.slt

+2 2 NULL
+2 3 NULL
+2 4 NULL
+


It would be good to add a companion EXPLAIN query to verify that it uses the streaming path.

geoffreyclaude · 2026-01-07 (edited)

I had it at first, and removed it as I found it too verbose. Same with a dedicated unit test in correlation.rs, which seemed out of place and only served as a "demo" of the bug.

Adding just the EXPLAIN for CORR seems too specific to me here. However, I think it would make a lot of sense to actually have a dedicated .slt that runs EXPLAIN and the actual query for all aggregates.

@martin-g WDYT?

EDIT: pushed new comprehensive tests in commit d2bbcb0 ("test: add comprehensive aggregate tests for streaming aggregation").

martin-g · 2026-01-07

Either way is fine as long as there is a way to assert that it behaves the way it is supposed to.

martin-g

LGTM

geoffreyclaude added 2 commits January 6, 2026 17:05

test: demonstrate failure in CORR streaming aggregation

79767e2

fix: drain CORR state vectors on EmitTo::First in streaming aggregation

f84a390

github-actions bot added the labels sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) on Jan 6, 2026

martin-g reviewed Jan 7, 2026

test: add comprehensive aggregate tests for streaming aggregation

d2bbcb0

martin-g approved these changes Jan 7, 2026

Jefffrey approved these changes Jan 10, 2026
