20006: feat(spark): Adds negative spark function #214
martin-augment wants to merge 17 commits into main
Conversation
Co-authored-by: Jeffrey Vo <[email protected]>
Co-authored-by: Oleks V <[email protected]>
Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new Spark-compatible `negative` function.
Walkthrough

The PR adds a new NEGATIVE unary function to DataFusion's Spark implementation. It includes a new module with a SparkNegative UDF struct implementing scalar negation across multiple data types (integers, floats, decimals, intervals, and arrays). Integer and interval types use wrapping behavior, while floating-point types use standard negation. The function returns NULL for NULL inputs and errors for unsupported types. The implementation integrates with DataFusion's ScalarUDFImpl interface. A comprehensive test suite covers various data types and edge cases, including wrap-around behavior and special floating-point values.
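To make the wrapping and float semantics described above concrete, here is a minimal standalone Rust sketch (plain standard-library calls, independent of the PR's code):

```rust
fn main() {
    // Wrapping negation: i32::MIN has no positive counterpart, so it maps to itself.
    assert_eq!(i32::MIN.wrapping_neg(), i32::MIN);
    // Ordinary values negate as expected.
    assert_eq!(5i32.wrapping_neg(), -5);
    // Standard float negation just flips the sign bit, including for infinities...
    assert_eq!(-f64::INFINITY, f64::NEG_INFINITY);
    // ...and NaN stays NaN.
    assert!((-f64::NAN).is_nan());
}
```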
Pull Request Review: feat(spark): Adds negative spark function

Thank you for implementing the `negative` function.

✅ Strengths
🐛 Issues Found

1. Missing Unsigned Integer Handling

Location:

```rust
DataType::Null
| DataType::UInt8
| DataType::UInt16
| DataType::UInt32
| DataType::UInt64 => Ok(args[0].clone()),
```

Issue: For unsigned integer types, the function simply returns the input unchanged. This is incorrect.

Expected Behavior:
Recommendation: Either implement proper unsigned integer negation with type promotion, or add a comment explaining why these types are not supported if that's intentional.
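For illustration only, a hedged sketch of the type-promotion option (hypothetical helper, not part of the PR): widen the unsigned value before negating so every input stays representable:

```rust
// Hypothetical promotion-based negation: u8 widens to i16, so even
// 255 has a representable negation (-255).
fn negate_u8_promoted(v: u8) -> i16 {
    -(v as i16)
}

fn main() {
    assert_eq!(negate_u8_promoted(0), 0);
    assert_eq!(negate_u8_promoted(255), -255);
}
```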
2. Missing Unit Tests

Location:

Issue: The implementation lacks Rust unit tests.

Recommendation: Add a unit test module covering the implementation.
3. Incomplete Argument Validation

Location:

```rust
let [_] = take_function_args("negative", args)?;
```

Issue: The underscore pattern ignores the extracted argument, but the code then accesses `args[0]` directly.

Recommendation: Either use the extracted argument or use a more explicit validation pattern like:

```rust
if args.len() != 1 {
    return exec_err!("negative expects exactly one argument");
}
```
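As a sketch of the first option (assuming `take_function_args` from `datafusion_common::utils` and `ColumnarValue` from `datafusion_expr`, as the PR's imports suggest), the extracted argument can be bound and matched on directly instead of re-reading `args[0]`:

```rust
use datafusion_common::utils::take_function_args;
use datafusion_common::Result;
use datafusion_expr::ColumnarValue;

fn invoke(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    // take_function_args enforces the arity, so bind the single argument here.
    let [arg] = take_function_args("negative", args)?;
    match arg {
        ColumnarValue::Array(_array) => todo!("negate the array"),
        ColumnarValue::Scalar(_scalar) => todo!("negate the scalar"),
    }
}
```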
Code Review
This pull request adds a Spark-compatible negative function. The implementation is mostly correct, covering various numeric, decimal, and interval types, and includes comprehensive tests. However, I've identified a significant issue with the handling of unsigned integer types, where the function either returns an incorrect value or a non-user-friendly error. My review comments focus on correcting this behavior to ensure the function is robust and provides clear feedback on unsupported operations.
```rust
DataType::Null
| DataType::UInt8
| DataType::UInt16
| DataType::UInt32
| DataType::UInt64 => Ok(args[0].clone()),
```
The current implementation for unsigned integer types (UInt*) in array form incorrectly returns the original value. For example, negative(1) would return 1. Since negating a positive unsigned integer is not a valid operation within its type, this should result in an error. This change corrects the behavior to return an execution error, which is more appropriate and aligns with expectations for such an operation.
```diff
-DataType::Null
-| DataType::UInt8
-| DataType::UInt16
-| DataType::UInt32
-| DataType::UInt64 => Ok(args[0].clone()),
+DataType::Null => Ok(args[0].clone()),
+DataType::UInt8
+| DataType::UInt16
+| DataType::UInt32
+| DataType::UInt64 => {
+    return exec_err!("The operator 'negative' is not defined for unsigned integers.");
+},
```
value:good-but-wont-fix; category:bug; feedback: The Gemini AI reviewer is correct! Doing nothing for unsigned integers would be surprising, but none of the other database engines support this either: Spark does not support unsigned integers, and Postgres/DuckDB do not provide a negative() function.
```rust
        })),
    )),

    dt => not_impl_err!("Not supported datatype for Spark NEGATIVE: {dt}"),
```
The scalar implementation for negative does not handle unsigned integer types, causing it to fall through to a not_impl_err!. This is inconsistent with the array implementation and not user-friendly. It should return a proper execution error. This change adds handling for unsigned scalar values to return an execution error, making the behavior consistent for both scalar and array inputs.
```rust
ScalarValue::UInt8(Some(_))
| ScalarValue::UInt16(Some(_))
| ScalarValue::UInt32(Some(_))
| ScalarValue::UInt64(Some(_)) => {
    return exec_err!("The operator 'negative' is not defined for unsigned integers.");
}
dt => not_impl_err!("Not supported datatype for Spark NEGATIVE: {dt}"),
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Unsigned integers are supported for arrays but not for scalars. This is inconsistent, and support for scalars should be added; that prevents returning an error when unsigned integers are used as scalar values.
```rust
use arrow::datatypes::{DataType, IntervalDayTime, IntervalMonthDayNano};
use bigdecimal::num_traits::WrappingNeg;
use datafusion_common::utils::take_function_args;
use datafusion_common::{Result, ScalarValue, not_impl_err};
```
To return a user-facing execution error for invalid inputs, exec_err should be imported from datafusion_common. This will be used to handle unsupported operations on unsigned integers.
```diff
-use datafusion_common::{Result, ScalarValue, not_impl_err};
+use datafusion_common::{exec_err, not_impl_err, Result, ScalarValue};
```
🤖 Augment PR Summary

Summary: Adds a Spark-compatible `negative` function.

Technical Notes: Integer negation uses wrapping semantics to match Spark legacy (non-ANSI) behavior; ANSI-mode overflow errors are documented as not implemented yet.
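To make the two overflow policies concrete, here is a small sketch in plain Rust, with `checked_neg` standing in for the unimplemented ANSI path:

```rust
fn main() {
    // Legacy (non-ANSI) Spark behavior: overflow wraps silently.
    assert_eq!(i64::MIN.wrapping_neg(), i64::MIN);

    // ANSI-style behavior would detect the overflow instead of wrapping.
    match i64::MIN.checked_neg() {
        Some(v) => println!("negated: {v}"),
        None => println!("overflow: -(i64::MIN) is not representable"),
    }
}
```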
```rust
match &args[0] {
    ColumnarValue::Array(array) => match array.data_type() {
        DataType::Null
```
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Unsigned integers are supported for arrays but not for scalars. This is inconsistent, and support for scalars should be added; that prevents returning an error when unsigned integers are used as scalar values.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@datafusion/spark/src/function/math/negative.rs`:
- Around line 92-96: The scalar match in negative.rs is missing arms for
unsigned integer ScalarValue variants, causing UInt scalars to fall through to
not_impl_err! while UInt arrays are returned unchanged; add match arms for
ScalarValue::UInt8, ScalarValue::UInt16, ScalarValue::UInt32, and
ScalarValue::UInt64 to the scalar branch (the same section that currently
handles Int*/Float*/Null) and return Ok(args[0].clone()) for each so scalar
UInts behave consistently with the array branch (mirror how abs.rs handles UInt
scalar cases).
🧹 Nitpick comments (1)
datafusion/sqllogictest/test_files/spark/math/negative.slt (1)
105-116: Add interval test cases to match new interval support.
Lines 105-116 extend coverage to decimals, but there are still no tests for IntervalYearMonth/IntervalDayTime/IntervalMonthDayNano even though the implementation supports them. Adding a few interval cases would prevent regressions and validate the interval branch; a sketch of the semantics such cases would pin down follows below.
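For reference, a minimal Rust sketch (hypothetical values, assuming the `arrow` crate's interval types that the PR already imports) of what those cases would assert: each component field is negated independently, with wrapping arithmetic mirroring the integer branches:

```rust
use arrow::datatypes::{IntervalDayTime, IntervalMonthDayNano};

fn main() {
    // Day-time interval: negate days and milliseconds independently.
    let dt = IntervalDayTime::new(3, 500);
    let neg = IntervalDayTime::new(dt.days.wrapping_neg(), dt.milliseconds.wrapping_neg());
    assert_eq!((neg.days, neg.milliseconds), (-3, -500));

    // Month-day-nano interval: all three fields are negated.
    let mdn = IntervalMonthDayNano::new(1, 2, 3_000);
    let neg = IntervalMonthDayNano::new(
        mdn.months.wrapping_neg(),
        mdn.days.wrapping_neg(),
        mdn.nanoseconds.wrapping_neg(),
    );
    assert_eq!((neg.months, neg.days, neg.nanoseconds), (-1, -2, -3_000));
}
```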
```rust
DataType::Null
| DataType::UInt8
| DataType::UInt16
| DataType::UInt32
| DataType::UInt64 => Ok(args[0].clone()),
```
🧩 Analysis chain

🏁 Script executed:

```bash
# First, let's see the file structure and read the negative.rs file
cd datafusion/spark/src/function/math || exit 1
wc -l negative.rs
```

🏁 Script executed:

```bash
# Read the array handling section (lines 85-105 to get context)
sed -n '85,105p' datafusion/spark/src/function/math/negative.rs
```

🏁 Script executed:

```bash
# Read the scalar handling section (lines 185-200 to see the scalar branch)
sed -n '185,200p' datafusion/spark/src/function/math/negative.rs
```

🏁 Script executed:

```bash
# Get all UInt cases in the scalar branch (lines 185-280)
sed -n '185,280p' datafusion/spark/src/function/math/negative.rs | grep -n "UInt\|DataType::" | head -30
```

🏁 Script executed:

```bash
# Check how other math UDFs handle unsigned types
ls -la datafusion/spark/src/function/math/*.rs | head -10
```

🏁 Script executed:

```bash
# Search for patterns in other math functions (e.g., abs.rs, cast behavior)
rg -A 3 "UInt8|UInt16|UInt32|UInt64" datafusion/spark/src/function/math/ -g '*.rs' | head -50
```

🏁 Script executed:

```bash
# Read the entire scalar branch section (line 192 to the end)
sed -n '192,280p' datafusion/spark/src/function/math/negative.rs
```
Add missing UInt cases to scalar branch for consistent behavior.

The array branch (lines 92-96) passes unsigned integer types through unchanged, but the scalar branch (lines 192+) lacks any ScalarValue::UInt* cases and falls through to not_impl_err!, creating inconsistent behavior: a UInt array flows through unchanged, while a UInt scalar fails. For consistency with other math functions like abs.rs, add scalar UInt handling:
```diff
 ScalarValue::Int64(Some(v)) => {
     let result = v.wrapping_neg();
     Ok(ColumnarValue::Scalar(ScalarValue::Int64(Some(result))))
 }
+// Unsigned integers - return unchanged
+ScalarValue::UInt8(_)
+| ScalarValue::UInt16(_)
+| ScalarValue::UInt32(_)
+| ScalarValue::UInt64(_) => Ok(args[0].clone()),
+
 // Floating point - simple negation
 ScalarValue::Float16(Some(v)) => {
```

🤖 Prompt for AI Agents
In `@datafusion/spark/src/function/math/negative.rs` around lines 92 - 96, The
scalar match in negative.rs is missing arms for unsigned integer ScalarValue
variants, causing UInt scalars to fall through to not_impl_err! while UInt
arrays are returned unchanged; add match arms for ScalarValue::UInt8,
ScalarValue::UInt16, ScalarValue::UInt32, and ScalarValue::UInt64 to the scalar
branch (the same section that currently handles Int*/Float*/Null) and return
Ok(args[0].clone()) for each so scalar UInts behave consistently with the array
branch (mirror how abs.rs handles UInt scalar cases).
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! Unsigned integers are supported for arrays but not for scalars. This is inconsistent, and support for scalars should be added; that prevents returning an error when unsigned integers are used as scalar values.
value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! Neither Apache Spark nor PostgreSQL/DuckDB supports negation of unsigned types in their SQL.
value:incorrect-but-reasonable; category:bug; feedback: The Claude AI reviewer is not correct! The Apache DataFusion team prefers tests implemented in SQL Logic Test (.slt) form rather than as unit tests. SLT tests do not need re-compilation, so they are faster to implement and iterate on.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Support for interval types was added after a request from a maintainer, but there are no SLT tests for these types. Adding tests would prevent future regressions that might otherwise go unnoticed.