-
Notifications
You must be signed in to change notification settings - Fork 0
20079: Use return_field_from_args in information schema and date_trunc #222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant improvement in how DataFusion determines the return types of User-Defined Functions (UDFs), Aggregate User-Defined Functions (UDAFs), and Window User-Defined Functions (UDWFs) within its information schema. By transitioning from the simpler Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis pull request refactors UDF return type computation in DataFusion. Changes to the information schema update how scalar, aggregate, and window UDFs compute return types by constructing FieldRef arguments and deriving types from WindowUDFFieldArgs and ReturnFieldArgs instead of relying on return_type/return_field methods. The date_trunc function is updated to enforce the new pattern: its return_type method now signals an internal error directing usage toward return_field_from_args, which derives the return type from the second argument's field properties. This represents a shift from legacy type computation to field-based type derivation across the codebase. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Review: Use return_field_from_args in information schema and date_truncSummaryThis PR updates the information schema and date_trunc function to use the newer Code Quality & Best Practices ✅Strengths:
Minor observations:
Information Schema Changes (information_schema.rs)Scalar UDF (get_udf_args_and_return_types)Before: Called // Good: Properly constructs arg_fields from arg_types
let arg_fields: Vec<FieldRef> = arg_types
.iter()
.enumerate()
.map(|(i, t)| Arc::new(Field::new(format!("arg_{i}"), t.clone(), true)))
.collect();✅ Correct implementation - extracts data_type from the returned Field Aggregate UDF (get_udaf_args_and_return_types)Change: Now uses ✅ Consistent pattern - mirrors the scalar UDF approach Window UDF (get_udwf_args_and_return_types)Before: Returned 🎯 This is a bug fix! The old code always returned date_trunc Changes (date_trunc.rs)Before: fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
if arg_types[1].is_null() {
Ok(Timestamp(Nanosecond, None))
} else {
Ok(arg_types[1].clone())
}
}After: fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
internal_err!("return_field_from_args should be called instead")
}
fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
let field = &args.arg_fields[1];
let return_type = if field.data_type().is_null() {
Timestamp(Nanosecond, None)
} else {
field.data_type().clone()
};
Ok(Arc::new(Field::new(
self.name(),
return_type,
field.is_nullable(),
)))
}✅ Good migration pattern:
Potential Issues & Concerns1. Array Indexing Without Bounds Check
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly refactors information_schema and date_trunc to use the more modern return_field_from_args and related APIs. The changes are well-implemented and improve correctness, especially the fix for determining window function return types in the information schema.
However, there is a critical issue: the tests have not been updated to reflect these changes. The change to get_udwf_args_and_return_types in information_schema.rs is a fix that now allows correctly reporting return types for window functions like rank(). The existing sqllogictest in datafusion/sqllogictest/test_files/information_schema.slt still asserts the old, incorrect behavior (e.g., a NULL return type for rank()). Consequently, the CI tests will likely fail.
To resolve this, the test outputs in datafusion/sqllogictest/test_files/information_schema.slt must be updated to match the new, correct behavior. Specifically:
- The test for
information_schema.routinesshould be updated to showUInt64as thedata_typeforrank. - The test for
information_schema.parametersshould be updated to include anOUTparameter of typeUInt64forrank.
Without these test updates, the correctness of this important fix is not verified, and the PR cannot be merged.
🤖 Augment PR SummarySummary: Updates return-type introspection to use field-based APIs instead of only Changes:
🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Ok(arg_types[1].clone()) | ||
| } | ||
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | ||
| internal_err!("return_field_from_args should be called instead") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since DateTruncFunc::return_type now always errors, datafusion/functions/benches/date_trunc.rs (around lines 60–63) will now panic when building return_field via udf.return_type(...). Consider updating the benchmark to derive the return field via return_field_from_args (or otherwise avoid calling return_type).
Other Locations
datafusion/functions/benches/date_trunc.rs:60
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:delightful; category:bug; feedback: The Augment AI reviewer is correct! The Augment AI reviewer is the only one that checked this unchanged but related file in the repository and found the issue! The benchmark tests still compile because the old method is still there but since they are not executed by the CI the runtime error is not detected by the Pull Request author. Prevents detecting that the benchmark test is broken in a later stage, possible by another developer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@datafusion/functions/src/datetime/date_trunc.rs`:
- Around line 226-241: The method return_field_from_args currently indexes
args.arg_fields[1] directly and can panic if fewer than two args are passed;
update return_field_from_args to first validate args.arg_fields has at least 2
elements (or use args.arg_fields.get(1)) and return a Result::Err (use the
existing internal_err! or another appropriate error helper) when missing, then
proceed to derive return_type from the second arg as before (referencing
return_field_from_args, args.arg_fields, and self.name()) so the function
returns a proper error instead of panicking.
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | ||
| internal_err!("return_field_from_args should be called instead") | ||
| } | ||
|
|
||
| fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> { | ||
| let data_types = args | ||
| .arg_fields | ||
| .iter() | ||
| .map(|f| f.data_type()) | ||
| .cloned() | ||
| .collect::<Vec<_>>(); | ||
| let return_type = self.return_type(&data_types)?; | ||
| let field = &args.arg_fields[1]; | ||
| let return_type = if field.data_type().is_null() { | ||
| Timestamp(Nanosecond, None) | ||
| } else { | ||
| field.data_type().clone() | ||
| }; | ||
| Ok(Arc::new(Field::new( | ||
| self.name(), | ||
| return_type, | ||
| args.arg_fields[1].is_nullable(), | ||
| field.is_nullable(), | ||
| ))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against missing args to avoid panic.
Line 231 indexes args.arg_fields[1] directly. If this method is ever invoked with fewer than two args (planner bug or misuse), it will panic instead of returning a user-facing error.
🛠️ Proposed fix
- let field = &args.arg_fields[1];
+ let field = args
+ .arg_fields
+ .get(1)
+ .ok_or_else(|| exec_err!("{} expects 2 arguments", self.name()))?;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | |
| internal_err!("return_field_from_args should be called instead") | |
| } | |
| fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> { | |
| let data_types = args | |
| .arg_fields | |
| .iter() | |
| .map(|f| f.data_type()) | |
| .cloned() | |
| .collect::<Vec<_>>(); | |
| let return_type = self.return_type(&data_types)?; | |
| let field = &args.arg_fields[1]; | |
| let return_type = if field.data_type().is_null() { | |
| Timestamp(Nanosecond, None) | |
| } else { | |
| field.data_type().clone() | |
| }; | |
| Ok(Arc::new(Field::new( | |
| self.name(), | |
| return_type, | |
| args.arg_fields[1].is_nullable(), | |
| field.is_nullable(), | |
| ))) | |
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | |
| internal_err!("return_field_from_args should be called instead") | |
| } | |
| fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> { | |
| let field = args | |
| .arg_fields | |
| .get(1) | |
| .ok_or_else(|| exec_err!("{} expects 2 arguments", self.name()))?; | |
| let return_type = if field.data_type().is_null() { | |
| Timestamp(Nanosecond, None) | |
| } else { | |
| field.data_type().clone() | |
| }; | |
| Ok(Arc::new(Field::new( | |
| self.name(), | |
| return_type, | |
| field.is_nullable(), | |
| ))) |
🤖 Prompt for AI Agents
In `@datafusion/functions/src/datetime/date_trunc.rs` around lines 226 - 241, The
method return_field_from_args currently indexes args.arg_fields[1] directly and
can panic if fewer than two args are passed; update return_field_from_args to
first validate args.arg_fields has at least 2 elements (or use
args.arg_fields.get(1)) and return a Result::Err (use the existing internal_err!
or another appropriate error helper) when missing, then proceed to derive
return_type from the second arg as before (referencing return_field_from_args,
args.arg_fields, and self.name()) so the function returns a proper error instead
of panicking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:incorrect-but-reasonable; category:bug; feedback: The CodeRabbit AI reviewer is correct to mark this as a possible issue but it is actually safe to do it because the method is called after an earlier check that all required function arguments are provided and they are from the allowed types (signature coercion check),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| .map(|(i, t)| { | ||
| Arc::new(Field::new(format!("arg_{i}"), t.clone(), true)) | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated arg_fields construction logic across three functions
Low Severity
The same code block to build arg_fields from arg_types is duplicated verbatim in get_udf_args_and_return_types, get_udaf_args_and_return_types, and get_udwf_args_and_return_types. This could be extracted into a helper function like fn build_arg_fields(arg_types: &[DataType]) -> Vec<FieldRef>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:good-to-have; category:bug; feedback: The Bugbot AI reviewer is correct! There is some duplicated code that would be good to be extracted to a helper function and reused. Prevents double maintenance and possible regressions if a fix is applied at only one of the copies.
value:incorrect-but-reasonable; category:bug; feedback: The Claude AI reviewer is correct to mark this as a possible issue but it is actually safe to do it because the method is called after an earlier check that all required function arguments are provided and they are from the allowed types (signature coercion check), |


20079: To review by AI