Introduce avg_distinct() and sum_distinct() functions to DataFrame API #17536
pub fn avg_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        avg_udaf(),
        vec![expr],
        true,
        None,
        vec![],
        None,
    ))
}
This is the same as how count handles it:
datafusion/datafusion/functions-aggregate/src/count.rs, lines 71 to 80 in bfc5067
pub fn count_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        count_udaf(),
        vec![expr],
        true,
        None,
        vec![],
        None,
    ))
}
min(col("c4")).alias("min(c4)"),
max(col("c4")).alias("max(c4)"),
avg(col("c4")).alias("avg(c4)"),
avg_distinct(col("c4")).alias("avg_distinct(c4)"),
sum(col("c4")).alias("sum(c4)"),
sum_distinct(col("c4")).alias("sum_distinct(c4)"),
count(col("c4")).alias("count(c4)"),
count_distinct(col("c4")).alias("count_distinct(c4)"),
I switched from c12 to c4 because c12 had some precision variations for avg_distinct that led to inconsistent test results, and I figured it was easier to switch columns than to slap round on the outputs.
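To illustrate the kind of precision variation mentioned above, here is a minimal std-only sketch. It assumes (the thread does not confirm this) that the variation comes from distinct floating-point values being accumulated in a different order from run to run. `sum_in_order` and `round_to` are hypothetical helpers for this example, not DataFusion functions.

```rust
/// Sum f64 values strictly left to right (hypothetical helper).
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

/// Round to a fixed number of decimal places (hypothetical helper),
/// the kind of workaround that switching columns avoids.
fn round_to(x: f64, places: i32) -> f64 {
    let scale = 10f64.powi(places);
    (x * scale).round() / scale
}

fn main() {
    // The same three distinct values, visited in two different orders.
    let a = sum_in_order(&[0.1, 0.2, 0.3]);
    let b = sum_in_order(&[0.3, 0.2, 0.1]);

    // Floating-point addition is not associative, so the sums can differ
    // in the last bits even though the inputs are identical.
    println!("forward  = {a:?}");
    println!("reversed = {b:?}");
    println!("bitwise equal: {}", a == b);

    // Rounding hides the difference; an integer column has no such issue.
    println!("equal after rounding: {}", round_to(a, 6) == round_to(b, 6));
}
```

An integer column such as c4 sidesteps this because integer addition is exact and independent of order, so the expected values in the test stay stable without rounding.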
I think these new functions should be added to the default list in the
Which issue does this PR close?
sum_distinct() function to dataframe #2407
avg_distinct() function to dataframe #2409

Rationale for this change
So we can use these distinct aggregates via DataFrames

What changes are included in this PR?
Introduce avg_distinct() and sum_distinct() functions to be used in DataFrame API.

Are these changes tested?
Added DataFrame tests, also proto roundtrip test.

Are there any user-facing changes?
New functions, updated documentation for DataFrame functions.
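
As a rough illustration of what the distinct aggregates compute, the std-only sketch below models their semantics on a plain slice. The functions here are hypothetical stand-ins that share the names of the new DataFrame functions; they are not DataFusion's implementation, and returning `None` for empty input is an assumption made for this example.

```rust
use std::collections::BTreeSet;

/// Sum over the distinct values only; duplicates count once.
fn sum_distinct(values: &[i64]) -> i64 {
    let distinct: BTreeSet<i64> = values.iter().copied().collect();
    distinct.iter().sum()
}

/// Average over the distinct values only; `None` for empty input
/// (an assumption of this sketch).
fn avg_distinct(values: &[i64]) -> Option<f64> {
    let distinct: BTreeSet<i64> = values.iter().copied().collect();
    if distinct.is_empty() {
        return None;
    }
    let total: i64 = distinct.iter().sum();
    Some(total as f64 / distinct.len() as f64)
}

fn main() {
    // Made-up values standing in for a column such as c4.
    let column = [3, 3, 5, 7, 7, 7];

    let plain_sum: i64 = column.iter().sum();
    println!("sum          = {plain_sum}");
    println!("sum_distinct = {}", sum_distinct(&column));
    println!("avg_distinct = {:?}", avg_distinct(&column));
}
```

With the made-up input above, the plain sum counts every row while the distinct variants only see the values 3, 5 and 7.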