Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for DISTINCT + ORDER BY in ARRAY_AGG #14413

Merged
merged 4 commits into from
Mar 26, 2025

Conversation

gabotechs
Copy link
Contributor

@gabotechs gabotechs commented Feb 2, 2025

Which issue does this PR close?

Rationale for this change

Completing ARRAY_AGG functionality as a prerequisite for adding the full functionality of STRING_AGG in #14412

What changes are included in this PR?

Adds a Postgres-style support for DISTINCT + ORDER_BY functionality, allowing users to issue statements like:

SELECT ARRAY_AGG(DISTINCT col ORDER BY col) FROM table;

Note that there's a limitation that prohibits ordering by an expression that is not the same as the ARRAY_AGG argument. For example, the following queries are invalid:

SELECT ARRAY_AGG(DISTINCT col ORDER BY other_col) FROM table; ❌
SELECT ARRAY_AGG(DISTINCT col ORDER BY concat(col, '')) FROM table; ❌

This is the same limitation that exists on Postgres, example in Postgres fiddle

Are these changes tested?

yes, both in unit tests and sqllogictests

Are there any user-facing changes?

Users will now be able to issue ARRAY_AGG calls mixing DISTINCT and ORDER_BY clauses

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 2, 2025
@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from 9400d72 to 8eaacd6 Compare February 2, 2025 17:34
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Feb 2, 2025
@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from 8eaacd6 to f23681c Compare February 2, 2025 17:37
@gabotechs gabotechs changed the title Add support for DISTINCT and ORDER BY in ARRAY_AGG Add support for DISTINCT + ORDER BY in ARRAY_AGG Feb 2, 2025
@@ -193,3 +193,149 @@ pub fn merge_ordered_arrays(

Ok((merged_values, merged_orderings))
}

#[cfg(test)]
mod tests {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this tests from

to this file, as they are just testing the merge_ordered_arrays function present in this file, and nothing related to ARRAY_AGG.

As I added there some unit tests that do test ARRAY_AGG, I though that it might be a good idea to move these ones out to a more suitable place.

Comment on lines -342 to +385
if values.len() != 1 {
return internal_err!("expects single batch");
if values.is_empty() {
return Ok(());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As now the distinct accumulator can accept more than 1 batch because of the ordering, removing this restriction was necessary.

@alamb
Copy link
Contributor

alamb commented Feb 4, 2025

Close/reopen to rerun CI

@alamb alamb closed this Feb 4, 2025
@alamb alamb reopened this Feb 4, 2025
@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from f23681c to 0d53376 Compare February 15, 2025 09:24
// ARRAY_AGG(DISTINCT col ORDER BY other_col) <- Invalid
// ARRAY_AGG(DISTINCT col ORDER BY concat(col, '')) <- Invalid
if acc_args.ordering_req.len() > 1 {
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a copy-paste typo of the error message? Shouldn't it rather be something more like:

return exec_err!("In an aggregate with DISTINCT, only one ORDER BY expression is allowed");

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restriction goes further than that: all the expressions used in the ORDER BY must appear in as arguments of the aggregation function. In this case, ARRAY_AGG accepts one argument, therefore only that 1 expression is also valid as an ORDER BY expression.

I took the error message from Postgres, but maybe it's better to customize it to the ARRAY_AGG function.

let mut sort_option: Option<SortOptions> = None;
if let Some(order) = acc_args.ordering_req.first() {
if !order.expr.eq(&acc_args.exprs[0]) {
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there can be only one ORDER BY expression, the error message should refer to it in singular. I'd suggest:

return exec_err!("In an aggregate with DISTINCT, the ORDER BY expression must match the aggregate argument");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there can be only one ORDER BY expression, the error message should refer to it in singular

This was pretty much the error message that Postgres would return, but they have this limitation as a general rule for all aggregation functions, which does not seem to be the case for DataFusion. The other option would be to fully customize the error message for the specific ARRAY_AGG case, which will result in a less standard but more informative error message. Any opinions here @alamb ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW in general I always prefer more specific error messages for better user experience even at the expense of more complicated code

@@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg {
let data_type = acc_args.exprs[0].data_type(acc_args.schema)?;

if acc_args.is_distinct {
return Ok(Box::new(DistinctArrayAggAccumulator::try_new(&data_type)?));
// Limitation similar to Postgres. The aggregation function can only mix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment is great! Would it be possible to make it more user-facing? Maybe as a rustdoc over the ArrayAgg structure? As well as in the ArrayAgg documentation if it exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to look for inspiration in other aggregation function docs, and they seem overly brief on purpose, no other doc mentions anything about DISTINCT clause behavior or anything like that. Maybe someone else can chime in and throw some light about what's the best approach here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add this description (about ORDER / DISTINCT in general) to https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#aggregate-functions

At the very least we shoudl add it to the documentation (in the user_doc macro on this UDF definition) so it appears in https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#array-agg

Copy link
Contributor Author

@gabotechs gabotechs Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the user_doc with a brief explanation about how DISTINCT + ORDER BY work with the ARRAY_AGG function, along with an example.

Do you think the "Postgres style" limitation is something that could be applied to all aggregation functions? For the ARRAY_AGG specifically it seems like it delivers a good results/complexity ratio, but I imagine that applying this limitation globally would require at least some internal consensus from PMC people.

Copy link
Contributor

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice @gabotechs
I suggests to add a few more corner and negative tests

@gabotechs
Copy link
Contributor Author

Hi @alamb, any chance that someone takes a look at this one any time soon? I promise that 80% of the code are just new tests 🙏.

@alamb alamb changed the title Add support for DISTINCT + ORDER BY in ARRAY_AGG Add support for DISTINCT + ORDER BY in ARRAY_AGG Mar 23, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gabotechs -- this is looking quite good to me. I left some comments, let me know what you think.

Sorry for the delay in reviewing -- I am in general swamped, but I do try and review as many PRs as I can. In general I have been prioritizing other things myself (see #15274 for example). I hope we can find some other committers who have time to review improvements to functionality.

I also encourage you to help review other PRs that are flowing into the repo -- this will likely make it easier for you to get people to review your PRs (as your contributions will be more visibile and you can help free up other reviewers' time for your PRs)

@@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg {
let data_type = acc_args.exprs[0].data_type(acc_args.schema)?;

if acc_args.is_distinct {
return Ok(Box::new(DistinctArrayAggAccumulator::try_new(&data_type)?));
// Limitation similar to Postgres. The aggregation function can only mix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add this description (about ORDER / DISTINCT in general) to https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#aggregate-functions

At the very least we shoudl add it to the documentation (in the user_doc macro on this UDF definition) so it appears in https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#array-agg

Ok(())
}

struct ArrayAggAccumulatorBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very similar to https://docs.rs/datafusion/latest/datafusion/physical_expr/aggregate/struct.AggregateExprBuilder.html#method.distinct

Though it seems like that structure has no good example, perhaps you could make a PR to add an doc example of how to use it

Copy link
Contributor

@alamb alamb Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I did look at the AggregateExprBuilder, but did not find it suitable for what these tests try to accomplish. These are the fundamental differences:

  • This ArrayAggAccumulatorBuilder is a builder for an accumulator, rather than an expression.
  • This ArrayAggAccumulatorBuilder is very tailored to these specific tests, automatically providing dummy arguments like a column named "col", or a schema with just one list field

I think there's an opportunity to have a generic AccumulatorBuilder though, that could be used for unit testing these aggregation functions from a more internal standpoint, increasing coverage in the aggregate-functions crate.

}
}

fn str_arr(value: ScalarValue) -> Result<Vec<Option<String>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make these tests easier to update using the pre-existing formatter: https://docs.rs/arrow/latest/arrow/util/pretty/fn.pretty_format_batches.html

Recently (after this PR was made) we have been migrating to insta which might be helpful for this kind of test -- see for example #15364

You could create a RecordBatch and then do something like

        assert_snapshot!(batches_to_sort_string(&batches), @r#"
            +----+----+----+
            | a1 | b1 | c1 |
            +----+----+----+
            | 5  | 5  | 50 |
            +----+----+----+
            "#);

Copy link
Contributor Author

@gabotechs gabotechs Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the suggestion, and run in the the following blockers:

  • These tests assert over a ScalarValue rather than a RecordBatch, so intermediate wrappings with arbitrary field names for promoting a ScalarValue to a RecordBatch are necessary (not a big deal).
  • The batches_to_sort_string function will order lines in the output, but will not order array scalar values inside single cells, which is required for this tests.
  • The more verbose assertions are suitable to be managed by insta, but will make it difficult to follow a "write the test first and the implementation later" approach, which is the approach used for writing these tests.

Do you think insta is a good tool for this kind of tests? I imagine that snapshot testing is suitable for tests that are expected to be updated often as the implementation changes, but one of the main points of these tests is that they should never change even if the implementation changes.

If you still think insta is a good tool for this unit tests, I can build some extra tooling for alleviating the friction points mentioned above.

@@ -234,6 +253,16 @@ select column1, nth_value(column3, 2 order by column2, column4 desc) from array_
b [4, 5, 6]
w [9, 5, 2]

query ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a test for

-- default ordering (and show that desc is respected)
select array_agg(DISTINCT column2 order by column2) from array_agg_order_list_table;

Also a query with a GROUP BY as it goes through a different code path

Also the negative case (no order by but with distinct) -- to show the error message is wired up correctly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the default order should be ASC right? this is the standard default in pretty much all databases, and even in DataFusion link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly

@gabotechs
Copy link
Contributor Author

I also encourage you to help review other PRs that are flowing into the repo -- this will likely make it easier for you to get people to review your PRs

👍 Sounds fair, I'm willing to invest some time in this. If you have any pointers about how to classify PRs where reviews are most appreciated please let me me know.

@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from dbc8a75 to f194b23 Compare March 25, 2025 09:41
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 25, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @gabotechs @geoffreyclaude and @NGA-TRAN

let mut sort_option: Option<SortOptions> = None;
if let Some(order) = acc_args.ordering_req.first() {
if !order.expr.eq(&acc_args.exprs[0]) {
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW in general I always prefer more specific error messages for better user experience even at the expense of more complicated code

@alamb
Copy link
Contributor

alamb commented Mar 25, 2025

👍 Sounds fair, I'm willing to invest some time in this. If you have any pointers about how to classify PRs where reviews are most appreciated please let me me know.

The ones that take the most time for me are the ones where the PR doesn't explicitly spell out what the implications (user visible behavior changes are). Maybe you can look at the open PRs waiting for review (I use this list) and if it isn't clear to you what impact they have on users, perhaps you can work with the authors to help add clarity

@alamb
Copy link
Contributor

alamb commented Mar 26, 2025

🚀

@alamb alamb merged commit a05514c into apache:main Mar 26, 2025
29 checks passed
qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 27, 2025
* Add support for DISTINCT and ORDER BY in ARRAY_AGG

* Add DISTINCT + ORDER BY docs

* Add some more sqllogictests

* Update aggregate_functions.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_agg cannot perform both distinct and order_by
4 participants