-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for DISTINCT + ORDER BY in ARRAY_AGG
#14413
Conversation
9400d72
to
8eaacd6
Compare
8eaacd6
to
f23681c
Compare
@@ -193,3 +193,149 @@ pub fn merge_ordered_arrays( | |||
|
|||
Ok((merged_values, merged_orderings)) | |||
} | |||
|
|||
#[cfg(test)] | |||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this tests from
mod tests { |
merge_ordered_arrays
function present in this file, and nothing related to ARRAY_AGG
.
As I added there some unit tests that do test ARRAY_AGG
, I though that it might be a good idea to move these ones out to a more suitable place.
if values.len() != 1 { | ||
return internal_err!("expects single batch"); | ||
if values.is_empty() { | ||
return Ok(()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As now the distinct accumulator can accept more than 1 batch because of the ordering, removing this restriction was necessary.
Close/reopen to rerun CI |
f23681c
to
0d53376
Compare
// ARRAY_AGG(DISTINCT col ORDER BY other_col) <- Invalid | ||
// ARRAY_AGG(DISTINCT col ORDER BY concat(col, '')) <- Invalid | ||
if acc_args.ordering_req.len() > 1 { | ||
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a copy-paste typo of the error message? Shouldn't it rather be something more like:
return exec_err!("In an aggregate with DISTINCT, only one ORDER BY expression is allowed");
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The restriction goes further than that: all the expressions used in the ORDER BY must appear in as arguments of the aggregation function. In this case, ARRAY_AGG accepts one argument, therefore only that 1 expression is also valid as an ORDER BY expression.
I took the error message from Postgres, but maybe it's better to customize it to the ARRAY_AGG function.
let mut sort_option: Option<SortOptions> = None; | ||
if let Some(order) = acc_args.ordering_req.first() { | ||
if !order.expr.eq(&acc_args.exprs[0]) { | ||
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there can be only one ORDER BY
expression, the error message should refer to it in singular. I'd suggest:
return exec_err!("In an aggregate with DISTINCT, the ORDER BY expression must match the aggregate argument");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there can be only one ORDER BY expression, the error message should refer to it in singular
This was pretty much the error message that Postgres would return, but they have this limitation as a general rule for all aggregation functions, which does not seem to be the case for DataFusion. The other option would be to fully customize the error message for the specific ARRAY_AGG case, which will result in a less standard but more informative error message. Any opinions here @alamb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW in general I always prefer more specific error messages for better user experience even at the expense of more complicated code
@@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg { | |||
let data_type = acc_args.exprs[0].data_type(acc_args.schema)?; | |||
|
|||
if acc_args.is_distinct { | |||
return Ok(Box::new(DistinctArrayAggAccumulator::try_new(&data_type)?)); | |||
// Limitation similar to Postgres. The aggregation function can only mix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc comment is great! Would it be possible to make it more user-facing? Maybe as a rustdoc over the ArrayAgg
structure? As well as in the ArrayAgg
documentation if it exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to look for inspiration in other aggregation function docs, and they seem overly brief on purpose, no other doc mentions anything about DISTINCT clause behavior or anything like that. Maybe someone else can chime in and throw some light about what's the best approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could add this description (about ORDER / DISTINCT in general) to https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#aggregate-functions
At the very least we shoudl add it to the documentation (in the user_doc
macro on this UDF definition) so it appears in https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#array-agg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the user_doc
with a brief explanation about how DISTINCT + ORDER BY work with the ARRAY_AGG function, along with an example.
Do you think the "Postgres style" limitation is something that could be applied to all aggregation functions? For the ARRAY_AGG specifically it seems like it delivers a good results/complexity ratio, but I imagine that applying this limitation globally would require at least some internal consensus from PMC people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice @gabotechs
I suggests to add a few more corner and negative tests
0d53376
to
dbc8a75
Compare
Hi @alamb, any chance that someone takes a look at this one any time soon? I promise that 80% of the code are just new tests 🙏. |
ARRAY_AGG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gabotechs -- this is looking quite good to me. I left some comments, let me know what you think.
Sorry for the delay in reviewing -- I am in general swamped, but I do try and review as many PRs as I can. In general I have been prioritizing other things myself (see #15274 for example). I hope we can find some other committers who have time to review improvements to functionality.
I also encourage you to help review other PRs that are flowing into the repo -- this will likely make it easier for you to get people to review your PRs (as your contributions will be more visibile and you can help free up other reviewers' time for your PRs)
@@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg { | |||
let data_type = acc_args.exprs[0].data_type(acc_args.schema)?; | |||
|
|||
if acc_args.is_distinct { | |||
return Ok(Box::new(DistinctArrayAggAccumulator::try_new(&data_type)?)); | |||
// Limitation similar to Postgres. The aggregation function can only mix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could add this description (about ORDER / DISTINCT in general) to https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#aggregate-functions
At the very least we shoudl add it to the documentation (in the user_doc
macro on this UDF definition) so it appears in https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#array-agg
Ok(()) | ||
} | ||
|
||
struct ArrayAggAccumulatorBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very similar to https://docs.rs/datafusion/latest/datafusion/physical_expr/aggregate/struct.AggregateExprBuilder.html#method.distinct
Though it seems like that structure has no good example, perhaps you could make a PR to add an doc example of how to use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed an issue to track
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I did look at the AggregateExprBuilder
, but did not find it suitable for what these tests try to accomplish. These are the fundamental differences:
- This
ArrayAggAccumulatorBuilder
is a builder for an accumulator, rather than an expression. - This
ArrayAggAccumulatorBuilder
is very tailored to these specific tests, automatically providing dummy arguments like a column named "col", or a schema with just one list field
I think there's an opportunity to have a generic AccumulatorBuilder
though, that could be used for unit testing these aggregation functions from a more internal standpoint, increasing coverage in the aggregate-functions
crate.
} | ||
} | ||
|
||
fn str_arr(value: ScalarValue) -> Result<Vec<Option<String>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could make these tests easier to update using the pre-existing formatter: https://docs.rs/arrow/latest/arrow/util/pretty/fn.pretty_format_batches.html
Recently (after this PR was made) we have been migrating to insta
which might be helpful for this kind of test -- see for example #15364
You could create a RecordBatch and then do something like
assert_snapshot!(batches_to_sort_string(&batches), @r#"
+----+----+----+
| a1 | b1 | c1 |
+----+----+----+
| 5 | 5 | 50 |
+----+----+----+
"#);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the suggestion, and run in the the following blockers:
- These tests assert over a
ScalarValue
rather than aRecordBatch
, so intermediate wrappings with arbitrary field names for promoting aScalarValue
to aRecordBatch
are necessary (not a big deal). - The
batches_to_sort_string
function will order lines in the output, but will not order array scalar values inside single cells, which is required for this tests. - The more verbose assertions are suitable to be managed by insta, but will make it difficult to follow a "write the test first and the implementation later" approach, which is the approach used for writing these tests.
Do you think insta is a good tool for this kind of tests? I imagine that snapshot testing is suitable for tests that are expected to be updated often as the implementation changes, but one of the main points of these tests is that they should never change even if the implementation changes.
If you still think insta is a good tool for this unit tests, I can build some extra tooling for alleviating the friction points mentioned above.
@@ -234,6 +253,16 @@ select column1, nth_value(column3, 2 order by column2, column4 desc) from array_ | |||
b [4, 5, 6] | |||
w [9, 5, 2] | |||
|
|||
query ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a test for
-- default ordering (and show that desc is respected)
select array_agg(DISTINCT column2 order by column2) from array_agg_order_list_table;
Also a query with a GROUP BY
as it goes through a different code path
Also the negative case (no order by but with distinct) -- to show the error message is wired up correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK the default order should be ASC
right? this is the standard default in pretty much all databases, and even in DataFusion link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes exactly
👍 Sounds fair, I'm willing to invest some time in this. If you have any pointers about how to classify PRs where reviews are most appreciated please let me me know. |
dbc8a75
to
f194b23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @gabotechs @geoffreyclaude and @NGA-TRAN
let mut sort_option: Option<SortOptions> = None; | ||
if let Some(order) = acc_args.ordering_req.first() { | ||
if !order.expr.eq(&acc_args.exprs[0]) { | ||
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW in general I always prefer more specific error messages for better user experience even at the expense of more complicated code
The ones that take the most time for me are the ones where the PR doesn't explicitly spell out what the implications (user visible behavior changes are). Maybe you can look at the open PRs waiting for review (I use this list) and if it isn't clear to you what impact they have on users, perhaps you can work with the authors to help add clarity |
🚀 |
* Add support for DISTINCT and ORDER BY in ARRAY_AGG * Add DISTINCT + ORDER BY docs * Add some more sqllogictests * Update aggregate_functions.md
Which issue does this PR close?
Rationale for this change
Completing ARRAY_AGG functionality as a prerequisite for adding the full functionality of STRING_AGG in #14412
What changes are included in this PR?
Adds a Postgres-style support for DISTINCT + ORDER_BY functionality, allowing users to issue statements like:
Note that there's a limitation that prohibits ordering by an expression that is not the same as the ARRAY_AGG argument. For example, the following queries are invalid:
This is the same limitation that exists on Postgres, example in Postgres fiddle
Are these changes tested?
yes, both in unit tests and sqllogictests
Are there any user-facing changes?
Users will now be able to issue ARRAY_AGG calls mixing DISTINCT and ORDER_BY clauses