@Jefffrey Jefffrey commented Sep 12, 2025

Which issue does this PR close?

Rationale for this change

So that these distinct aggregates can be used through the DataFrame API.

What changes are included in this PR?

Introduce avg_distinct() and sum_distinct() functions to be used in DataFrame API.
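
For readers unfamiliar with distinct aggregates, here is a minimal std-only sketch of the semantics these functions are expected to have: each distinct value contributes once. The helpers `sum_distinct_i64` and `avg_distinct_i64` are hypothetical names made up for illustration; they are not DataFusion code and they ignore NULL handling. Returning `None` for empty input is an assumption that mirrors SQL aggregates returning NULL over no rows.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not DataFusion code): sum each distinct value once.
/// Returns None for empty input, mirroring SQL's NULL result.
fn sum_distinct_i64(values: &[i64]) -> Option<i64> {
    let distinct: BTreeSet<i64> = values.iter().copied().collect();
    if distinct.is_empty() {
        return None;
    }
    Some(distinct.iter().sum())
}

/// Hypothetical helper (not DataFusion code): average over distinct values.
fn avg_distinct_i64(values: &[i64]) -> Option<f64> {
    let distinct: BTreeSet<i64> = values.iter().copied().collect();
    if distinct.is_empty() {
        return None;
    }
    let sum: i64 = distinct.iter().sum();
    Some(sum as f64 / distinct.len() as f64)
}

fn main() {
    // Duplicates (3 and 10) are counted once by the distinct variants.
    let column = [3, 3, 5, 10, 10];
    println!("sum          = {}", column.iter().sum::<i64>());
    println!("sum_distinct = {:?}", sum_distinct_i64(&column));
    println!("avg_distinct = {:?}", avg_distinct_i64(&column));
}
```

With the sample column above, the plain sum is 31 while the distinct sum covers only 3, 5, and 10.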

Are these changes tested?

Added DataFrame tests, also proto roundtrip test.

Are there any user-facing changes?

New functions, updated documentation for DataFrame functions.

@github-actions bot added the documentation, core, and functions labels on Sep 12, 2025
Comment on lines +65 to +74
pub fn avg_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        avg_udaf(),
        vec![expr],
        true,
        None,
        vec![],
        None,
    ))
}
Contributor Author

Same as how count handles it:

pub fn count_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        count_udaf(),
        vec![expr],
        true,
        None,
        vec![],
        None,
    ))
}

Comment on lines +504 to +511
min(col("c4")).alias("min(c4)"),
max(col("c4")).alias("max(c4)"),
avg(col("c4")).alias("avg(c4)"),
avg_distinct(col("c4")).alias("avg_distinct(c4)"),
sum(col("c4")).alias("sum(c4)"),
sum_distinct(col("c4")).alias("sum_distinct(c4)"),
count(col("c4")).alias("count(c4)"),
count_distinct(col("c4")).alias("count_distinct(c4)"),
Contributor Author

I switched from c12 to c4 because c12 showed small precision variations for avg_distinct, which led to inconsistent test results. Switching columns seemed easier than wrapping the outputs in round().
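
A minimal sketch of one plausible cause of such variations, assuming c12 is a floating-point column and that the distinct values are not always accumulated in the same order (neither is confirmed in this thread): floating-point addition is not associative, so the same values summed in a different order can differ in the last bits. The helper `round_to` is a hypothetical stand-in for rounding the output.

```rust
/// Hypothetical helper: round to a fixed number of decimal places,
/// standing in for applying a rounding function to the aggregate output.
fn round_to(x: f64, places: i32) -> f64 {
    let scale = 10f64.powi(places);
    (x * scale).round() / scale
}

/// Sum the values left to right.
fn sum_forward(values: &[f64]) -> f64 {
    values.iter().sum()
}

/// Sum the same values right to left.
fn sum_reverse(values: &[f64]) -> f64 {
    values.iter().rev().sum()
}

fn main() {
    let values = [0.1, 0.2, 0.3];
    let forward = sum_forward(&values);
    let reverse = sum_reverse(&values);

    // The two orders disagree in the last bits.
    println!("forward = {forward:?}, reverse = {reverse:?}");
    println!("equal: {}", forward == reverse);

    // Rounding hides the difference, at the cost of extra calls in the test.
    println!(
        "equal after rounding: {}",
        round_to(forward, 6) == round_to(reverse, 6)
    );
}
```

An integer column avoids the issue because integer addition gives the same result in any order (barring overflow).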

@github-actions bot added the proto label on Sep 14, 2025
@Jefffrey removed the proto label on Sep 14, 2025
@Omega359
Contributor

I think these new functions should be added to the default list in the all_default_aggregate_functions function.

Labels
core, documentation, functions
Development

Successfully merging this pull request may close these issues.

Introduce avg_distinct() function to dataframe
Introduce sum_distinct() function to dataframe
2 participants