Simplify percentile_cont for 0/1 percentiles #18837
Conversation
```rust
}

#[cfg(test)]
mod tests {
```
I suggest adding some tests in sqllogictest: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
They should run SQL queries to which this optimization is applicable; we first ensure the result is as expected, and also run EXPLAIN to ensure the optimization is actually applied.
In fact, we can move most of the test coverage to sqllogictests instead of the unit tests here, because:
- SQL tests are simpler to maintain
- The SQL interface is more stable, while internal APIs may change frequently. As a result, good test coverage here can easily get lost during refactoring.
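For illustration, a sqllogictest case along those lines might look like the sketch below. The table name, values, and the expected EXPLAIN output are all assumptions, not taken from the PR:

```
statement ok
create table t(v int) as values (1), (2), (3);

# With percentile 0, the result should match min(v)
query R
select percentile_cont(v, 0.0) from t;
----
1

# EXPLAIN should show the aggregate rewritten to min/max
query TT
explain select percentile_cont(v, 0.0) from t;
----
```

The expected plan text after the second `----` would be filled in from the actual EXPLAIN output once the optimization is in place.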
I kept the unit tests along with the new SQL test in sqllogictest. Should I remove the unit tests, or is it okay to keep them?
We should remove the unit tests if they duplicate the sqllogictests
> We should remove the unit tests if they duplicate the sqllogictests

+1 unless there is something that can't be covered by slt tests
Co-authored-by: Martin Grigorov <[email protected]>
Jefffrey left a comment
Thanks for picking this up. I have a few suggestions to simplify the code.
```rust
}

#[cfg(test)]
mod tests {
```
We should remove the unit tests if they duplicate the sqllogictests
```rust
Expr::Alias(alias) => extract_percentile_literal(alias.expr.as_ref()),
Expr::Cast(cast) => extract_percentile_literal(cast.expr.as_ref()),
Expr::TryCast(cast) => extract_percentile_literal(cast.expr.as_ref()),
```
How strictly necessary are these other arms? Is checking only for Literal not sufficient?
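To make the question concrete, here is a self-contained mini-model (the enum and function are hypothetical stand-ins for DataFusion's `Expr` and the PR's helper, not the real API): the wrapper arms only matter if an upstream rewrite can hand the optimizer a literal wrapped in an alias or cast rather than a bare literal.

```rust
// Hypothetical mini-model of the literal-extraction question.
#[derive(Debug)]
enum Expr {
    Literal(f64),
    Alias(Box<Expr>),
    Cast(Box<Expr>),
}

fn extract_percentile_literal(e: &Expr) -> Option<f64> {
    match e {
        Expr::Literal(v) => Some(*v),
        // These arms are only reachable if earlier planning stages can
        // wrap the percentile literal; with a bare literal they never fire.
        Expr::Alias(inner) | Expr::Cast(inner) => extract_percentile_literal(inner),
    }
}

fn main() {
    let bare = Expr::Literal(0.5);
    let wrapped = Expr::Alias(Box::new(Expr::Cast(Box::new(Expr::Literal(1.0)))));
    assert_eq!(extract_percentile_literal(&bare), Some(0.5));
    assert_eq!(extract_percentile_literal(&wrapped), Some(1.0));
    println!("ok");
}
```

Whether the extra arms are needed comes down to whether such wrapped forms can reach this rewrite in practice.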
```rust
    (value - target).abs() < PERCENTILE_LITERAL_EPSILON
}

fn percentile_cont_result_type(input_type: &DataType) -> Option<DataType> {
```
We should reuse the code from return_type if possible instead of duplicating it here
datafusion/datafusion/functions-aggregate/src/percentile_cont.rs (lines 232 to 261 in f1ecacc):

```rust
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    if !arg_types[0].is_numeric() {
        return plan_err!("percentile_cont requires numeric input types");
    }
    // PERCENTILE_CONT performs linear interpolation and should return a float type
    // For integer inputs, return Float64 (matching PostgreSQL/DuckDB behavior)
    // For float inputs, preserve the float type
    match &arg_types[0] {
        DataType::Float16 | DataType::Float32 | DataType::Float64 => {
            Ok(arg_types[0].clone())
        }
        DataType::Decimal32(_, _)
        | DataType::Decimal64(_, _)
        | DataType::Decimal128(_, _)
        | DataType::Decimal256(_, _) => Ok(arg_types[0].clone()),
        DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64
        | DataType::Int8
        | DataType::Int16
        | DataType::Int32
        | DataType::Int64 => Ok(DataType::Float64),
        // Shouldn't happen due to signature check, but just in case
        dt => plan_err!(
            "percentile_cont does not support input type {}, must be numeric",
            dt
        ),
    }
}
```
```rust
fn nearly_equals_fraction(value: f64, target: f64) -> bool {
    (value - target).abs() < PERCENTILE_LITERAL_EPSILON
}
```
I'm personally of the mind to check directly against 0.0 and 1.0 instead of doing an epsilon check; I think it's more likely a user would input an expr like SELECT percentile_cont(column1, 0.0) than doing something like SELECT percentile_cont(column1, expr) where expr might be some math that could make it 0.0000001 🤔
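A minimal sketch of the exact-match alternative (the function name is hypothetical): the literals `0.0` and `1.0` are represented exactly in `f64`, so direct equality is safe for the literal-argument case this rewrite targets, and it rejects values that are merely close.

```rust
// A user-written literal 0.0 or 1.0 round-trips exactly through f64,
// so direct equality works for the case the rewrite cares about.
fn is_exact_zero_or_one(p: f64) -> bool {
    p == 0.0 || p == 1.0
}

fn main() {
    assert!(is_exact_zero_or_one(0.0));
    assert!(is_exact_zero_or_one(1.0));
    // A value only close to 1.0 is rejected here, whereas the
    // epsilon-based check would accept it and trigger the rewrite.
    assert!(!is_exact_zero_or_one(0.9999999));
    println!("ok");
}
```

The trade-off is exactly the one raised above: exact equality misses computed percentile expressions that land within epsilon of 0 or 1, but those seem unlikely in real queries.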
```rust
let mut agg_arg = value_expr;
if expected_return_type != input_type {
    agg_arg = Expr::Cast(Cast::new(Box::new(agg_arg), expected_return_type.clone()));
}
```
Can we explain why this is necessary in a comment here?
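If I read the diff correctly, the reason is a schema-preservation one, sketched below with a hypothetical two-variant type model (not the real Arrow `DataType`): `percentile_cont` returns `Float64` for integer inputs because it interpolates, while the `min`/`max` the rewrite substitutes preserve the input type, so the argument must be cast for the rewritten plan to keep the same output type.

```rust
// Hypothetical mini-model of the type mismatch the cast fixes.
#[derive(Debug, Clone, PartialEq)]
enum SimpleType {
    Int64,
    Float64,
}

// percentile_cont interpolates, so integer inputs produce Float64.
fn percentile_cont_result(_input: &SimpleType) -> SimpleType {
    SimpleType::Float64
}

// min/max preserve the input type.
fn min_max_result(input: &SimpleType) -> SimpleType {
    input.clone()
}

fn main() {
    let input = SimpleType::Int64;
    // Without a cast, the rewrite would change the output type of the plan.
    assert_ne!(min_max_result(&input), percentile_cont_result(&input));
    // Casting the argument to Float64 first makes the types line up.
    let casted = SimpleType::Float64;
    assert_eq!(min_max_result(&casted), percentile_cont_result(&casted));
    println!("ok");
}
```

A comment along those lines in the code would make the intent clear to future readers.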
```rust
let rewrite_target = match classify_rewrite_target(percentile_value, is_descending) {
    Some(target) => target,
    None => return Ok(original_expr),
};
```
I feel this should be folded directly into line 400 above, instead of splitting it like this
```rust
    }
}

fn literal_scalar_to_f64(value: &ScalarValue) -> Option<f64> {
```
Can we have percentiles that are not of type Float64? I thought the signature guarded us against this:
datafusion/datafusion/functions-aggregate/src/percentile_cont.rs (lines 142 to 154 in f1ecacc):

```rust
pub fn new() -> Self {
    let mut variants = Vec::with_capacity(NUMERICS.len());
    // Accept any numeric value paired with a float64 percentile
    for num in NUMERICS {
        variants.push(TypeSignature::Exact(vec![num.clone(), DataType::Float64]));
    }
    Self {
        signature: Signature::one_of(variants, Volatility::Immutable)
            .with_parameter_names(vec!["expr".to_string(), "percentile".to_string()])
            .expect("valid parameter names for percentile_cont"),
        aliases: vec![String::from("quantile_cont")],
    }
}
```
Co-authored-by: Jeffrey Vo <[email protected]>
Which issue does this PR close?
- Simplify `percentile_cont` to min/max when percentile is 0 or 1 #18108

Rationale for this change
Literal 0/1 percentiles don’t need percentile buffering; using min/max keeps results identical.
What changes are included in this PR?
Are these changes tested?
Added tests
Are there any user-facing changes?