Simplify percentile_cont for 0/1 percentiles #18837
Conversation
```rust
}

#[cfg(test)]
mod tests {
```
I suggest adding some tests in sqllogictest: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
They should run SQL queries to which this optimization is applicable; we first ensure the result is as expected, and also run EXPLAIN to ensure the optimization is actually applied.
In fact, we can move most of the test coverage to sqllogictests instead of the unit tests here, because:
- SQL tests are simpler to maintain
- The SQL interface is more stable, while internal APIs may change frequently. As a result, good test coverage here can easily get lost during refactoring.
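For illustration, a sqllogictest case along those lines might look like the sketch below. The table name, values, and the expected EXPLAIN output are all assumptions, not taken from the PR:

```
statement ok
create table t(v int) as values (1), (2), (3);

# With percentile 0, the result should match min(v)
query R
select percentile_cont(v, 0.0) from t;
----
1

# EXPLAIN should show the aggregate rewritten to min/max
query TT
explain select percentile_cont(v, 0.0) from t;
----
```

The expected plan text after the second `----` would be filled in from the actual EXPLAIN output once the optimization is in place.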
I kept the unit tests along with the new SQL test in sqllogictest. Should I remove the unit tests, or is it okay to keep them?
We should remove the unit tests if they duplicate the sqllogictests
> We should remove the unit tests if they duplicate the sqllogictests

+1 unless there is something that can't be covered by slt tests
Co-authored-by: Martin Grigorov <[email protected]>
Jefffrey left a comment
Thanks for picking this up. I have a few suggestions to simplify the code.
```rust
}

#[cfg(test)]
mod tests {
```
We should remove the unit tests if they duplicate the sqllogictests
```rust
Expr::Alias(alias) => extract_percentile_literal(alias.expr.as_ref()),
Expr::Cast(cast) => extract_percentile_literal(cast.expr.as_ref()),
Expr::TryCast(cast) => extract_percentile_literal(cast.expr.as_ref()),
```
How strictly necessary are these other arms? Is checking only for Literal not sufficient?
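To make the question concrete, here is a self-contained mini-model (the enum and function are hypothetical stand-ins for DataFusion's `Expr` and the PR's helper, not the real API): the wrapper arms only matter if an upstream rewrite can hand the optimizer a literal wrapped in an alias or cast rather than a bare literal.

```rust
// Hypothetical mini-model of the literal-extraction question.
#[derive(Debug)]
enum Expr {
    Literal(f64),
    Alias(Box<Expr>),
    Cast(Box<Expr>),
}

fn extract_percentile_literal(e: &Expr) -> Option<f64> {
    match e {
        Expr::Literal(v) => Some(*v),
        // These arms are only reachable if earlier planning stages can
        // wrap the percentile literal; with a bare literal they never fire.
        Expr::Alias(inner) | Expr::Cast(inner) => extract_percentile_literal(inner),
    }
}

fn main() {
    let bare = Expr::Literal(0.5);
    let wrapped = Expr::Alias(Box::new(Expr::Cast(Box::new(Expr::Literal(1.0)))));
    assert_eq!(extract_percentile_literal(&bare), Some(0.5));
    assert_eq!(extract_percentile_literal(&wrapped), Some(1.0));
    println!("ok");
}
```

Whether the extra arms are needed comes down to whether such wrapped forms can reach this rewrite in practice.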
```rust
    (value - target).abs() < PERCENTILE_LITERAL_EPSILON
}

fn percentile_cont_result_type(input_type: &DataType) -> Option<DataType> {
```
We should reuse the code from return_type if possible instead of duplicating it here
datafusion/datafusion/functions-aggregate/src/percentile_cont.rs (lines 232 to 261 in f1ecacc):

```rust
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    if !arg_types[0].is_numeric() {
        return plan_err!("percentile_cont requires numeric input types");
    }
    // PERCENTILE_CONT performs linear interpolation and should return a float type
    // For integer inputs, return Float64 (matching PostgreSQL/DuckDB behavior)
    // For float inputs, preserve the float type
    match &arg_types[0] {
        DataType::Float16 | DataType::Float32 | DataType::Float64 => {
            Ok(arg_types[0].clone())
        }
        DataType::Decimal32(_, _)
        | DataType::Decimal64(_, _)
        | DataType::Decimal128(_, _)
        | DataType::Decimal256(_, _) => Ok(arg_types[0].clone()),
        DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64
        | DataType::Int8
        | DataType::Int16
        | DataType::Int32
        | DataType::Int64 => Ok(DataType::Float64),
        // Shouldn't happen due to signature check, but just in case
        dt => plan_err!(
            "percentile_cont does not support input type {}, must be numeric",
            dt
        ),
    }
}
```
```rust
fn nearly_equals_fraction(value: f64, target: f64) -> bool {
    (value - target).abs() < PERCENTILE_LITERAL_EPSILON
}
```
I'm personally of the mind to check directly against 0.0 and 1.0 instead of doing an epsilon check; I think it's more likely a user would input an expr like SELECT percentile_cont(column1, 0.0) than doing something like SELECT percentile_cont(column1, expr) where expr might be some math that could make it 0.0000001 🤔
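A minimal sketch of the exact-match alternative (the function name is hypothetical): the literals `0.0` and `1.0` are represented exactly in `f64`, so direct equality is safe for the literal-argument case this rewrite targets, and it rejects values that are merely close.

```rust
// A user-written literal 0.0 or 1.0 round-trips exactly through f64,
// so direct equality works for the case the rewrite cares about.
fn is_exact_zero_or_one(p: f64) -> bool {
    p == 0.0 || p == 1.0
}

fn main() {
    assert!(is_exact_zero_or_one(0.0));
    assert!(is_exact_zero_or_one(1.0));
    // A value only close to 1.0 is rejected here, whereas the
    // epsilon-based check would accept it and trigger the rewrite.
    assert!(!is_exact_zero_or_one(0.9999999));
    println!("ok");
}
```

The trade-off is exactly the one raised above: exact equality misses computed percentile expressions that land within epsilon of 0 or 1, but those seem unlikely in real queries.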
```rust
let mut agg_arg = value_expr;
if expected_return_type != input_type {
    agg_arg = Expr::Cast(Cast::new(Box::new(agg_arg), expected_return_type.clone()));
}
```
Can we explain why this is necessary in a comment here?
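If I read the diff correctly, the reason is a schema-preservation one, sketched below with a hypothetical two-variant type model (not the real Arrow `DataType`): `percentile_cont` returns `Float64` for integer inputs because it interpolates, while the `min`/`max` the rewrite substitutes preserve the input type, so the argument must be cast for the rewritten plan to keep the same output type.

```rust
// Hypothetical mini-model of the type mismatch the cast fixes.
#[derive(Debug, Clone, PartialEq)]
enum SimpleType {
    Int64,
    Float64,
}

// percentile_cont interpolates, so integer inputs produce Float64.
fn percentile_cont_result(_input: &SimpleType) -> SimpleType {
    SimpleType::Float64
}

// min/max preserve the input type.
fn min_max_result(input: &SimpleType) -> SimpleType {
    input.clone()
}

fn main() {
    let input = SimpleType::Int64;
    // Without a cast, the rewrite would change the output type of the plan.
    assert_ne!(min_max_result(&input), percentile_cont_result(&input));
    // Casting the argument to Float64 first makes the types line up.
    let casted = SimpleType::Float64;
    assert_eq!(min_max_result(&casted), percentile_cont_result(&casted));
    println!("ok");
}
```

A comment along those lines in the code would make the intent clear to future readers.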
```rust
let rewrite_target = match classify_rewrite_target(percentile_value, is_descending) {
    Some(target) => target,
    None => return Ok(original_expr),
};
```
I feel this should be folded directly into line 400 above, instead of splitting it like this
```rust
    }
}

fn literal_scalar_to_f64(value: &ScalarValue) -> Option<f64> {
```
Can we have percentiles that are not of type Float64? I thought the signature guarded us against this:
datafusion/datafusion/functions-aggregate/src/percentile_cont.rs (lines 142 to 154 in f1ecacc):

```rust
pub fn new() -> Self {
    let mut variants = Vec::with_capacity(NUMERICS.len());
    // Accept any numeric value paired with a float64 percentile
    for num in NUMERICS {
        variants.push(TypeSignature::Exact(vec![num.clone(), DataType::Float64]));
    }
    Self {
        signature: Signature::one_of(variants, Volatility::Immutable)
            .with_parameter_names(vec!["expr".to_string(), "percentile".to_string()])
            .expect("valid parameter names for percentile_cont"),
        aliases: vec![String::from("quantile_cont")],
    }
}
```
Co-authored-by: Jeffrey Vo <[email protected]>
Which issue does this PR close?
- Simplify `percentile_cont` to min/max when percentile is 0 or 1 #18108

Rationale for this change
Literal 0/1 percentiles don’t need percentile buffering; using min/max keeps results identical.
What changes are included in this PR?
Are these changes tested?
Added tests
Are there any user-facing changes?