@martin-augment
Owner

20079: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant improvement in how DataFusion determines the return types of User-Defined Functions (UDFs), Aggregate User-Defined Functions (UDAFs), and Window User-Defined Functions (UDWFs) within its information schema. By transitioning from the simpler return_type method to the more context-aware return_field_from_args, the system can now infer return types with greater precision, taking into account the full field properties of input arguments. This change is also consistently applied to the date_trunc function, ensuring that its return type is derived using the new, more robust mechanism.

Highlights

  • Information Schema Refactoring: Refactored the information schema generation for User-Defined Functions (UDFs), Aggregate User-Defined Functions (UDAFs), and Window User-Defined Functions (UDWFs) to utilize the more robust return_field_from_args method for accurate return type inference.
  • Date_trunc Function Update: Updated the date_trunc function to implement return_field_from_args and explicitly deprecate its return_type method, ensuring consistency with the new type inference mechanism.

@coderabbitai

coderabbitai bot commented Jan 31, 2026

Walkthrough

This pull request refactors UDF return type computation in DataFusion. Changes to the information schema update how scalar, aggregate, and window UDFs compute return types by constructing FieldRef arguments and deriving types from WindowUDFFieldArgs and ReturnFieldArgs instead of relying on return_type/return_field methods. The date_trunc function is updated to enforce the new pattern: its return_type method now signals an internal error directing usage toward return_field_from_args, which derives the return type from the second argument's field properties. This represents a shift from legacy type computation to field-based type derivation across the codebase.


@claude

claude bot commented Jan 31, 2026

PR Review: Use return_field_from_args in information schema and date_trunc

Summary

This PR updates the information schema and date_trunc function to use the newer return_field_from_args API instead of the older return_type method. This is a good refactoring that aligns with the modern DataFusion UDF API.


Code Quality & Best Practices ✅

Strengths:

  • Consistent API migration: The changes correctly migrate from the deprecated return_type() to return_field_from_args() across scalar, aggregate, and window UDFs
  • Code consistency: All three helper functions (get_udf_args_and_return_types, get_udaf_args_and_return_types, get_udwf_args_and_return_types) now follow the same pattern
  • Proper field construction: Creates field references with appropriate names (arg_{i}) and metadata
  • Clean imports: Added necessary imports (FieldRef, ReturnFieldArgs, WindowUDFFieldArgs) without clutter

Minor observations:

  • The field naming format!("arg_{i}") is consistent but generic. This is acceptable for internal use in information schema
  • scalar_arguments: vec![None; arg_fields.len()] indicates no constant folding at schema generation time, which is appropriate

Information Schema Changes (information_schema.rs)

Scalar UDF (get_udf_args_and_return_types)

Before: Called udf.return_type(&arg_types) - only received DataTypes
After: Calls udf.return_field_from_args(...) - receives full Field with metadata

// Good: Properly constructs arg_fields from arg_types
let arg_fields: Vec<FieldRef> = arg_types
    .iter()
    .enumerate()
    .map(|(i, t)| Arc::new(Field::new(format!("arg_{i}"), t.clone(), true)))
    .collect();

Correct implementation - extracts data_type from the returned Field

Aggregate UDF (get_udaf_args_and_return_types)

Change: Now uses udaf.return_field(&arg_fields) instead of udaf.return_type(&arg_types)

Consistent pattern - mirrors the scalar UDF approach

Window UDF (get_udwf_args_and_return_types)

Before: Returned (arg_types, None) - window function return types were not populated!
After: Uses udwf.field(WindowUDFFieldArgs::new(&arg_fields, udwf.name())) - now returns actual return type

🎯 This is a bug fix! The old code always returned None for window function return types in the information schema.


date_trunc Changes (date_trunc.rs)

Before:

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    if arg_types[1].is_null() {
        Ok(Timestamp(Nanosecond, None))
    } else {
        Ok(arg_types[1].clone())
    }
}

After:

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    internal_err!("return_field_from_args should be called instead")
}

fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
    let field = &args.arg_fields[1];
    let return_type = if field.data_type().is_null() {
        Timestamp(Nanosecond, None)
    } else {
        field.data_type().clone()
    };
    Ok(Arc::new(Field::new(
        self.name(),
        return_type,
        field.is_nullable(),
    )))
}

Good migration pattern:

  • Deprecated return_type() now errors with helpful message
  • Logic moved to return_field_from_args() with identical semantics
  • Preserves nullability from input field
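
The rule described above can be sketched without any Arrow or DataFusion dependency. The `DataType` and `Field` types below are simplified, hypothetical stand-ins (not the real Arrow types), and `date_trunc_return_field` is an illustrative name; the sketch only mirrors the logic quoted in the "After" snippet.

```rust
// Simplified stand-ins for Arrow's DataType and Field (hypothetical, std-only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Utf8,
    // Nanosecond timestamp with an optional time zone.
    TimestampNanos(Option<String>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Mirrors the rule from the PR: the result takes the type of the second
/// argument, except that an untyped NULL becomes a nanosecond timestamp
/// without a time zone. Nullability is copied from that argument.
fn date_trunc_return_field(arg_fields: &[Field]) -> Field {
    // Direct indexing assumes the argument count was validated earlier.
    let field = &arg_fields[1];
    let data_type = if field.data_type == DataType::Null {
        DataType::TimestampNanos(None)
    } else {
        field.data_type.clone()
    };
    Field {
        name: "date_trunc".to_string(),
        data_type,
        nullable: field.nullable,
    }
}
```

With a `Null` second argument the sketch yields a zone-less nanosecond timestamp; with a typed timestamp it passes the type (including its time zone) and the nullability flag through unchanged.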

Potential Issues & Concerns

1. Array Indexing Without Bounds Check ⚠️

Location: date_trunc.rs:231

let field = &args.arg_fields[1];

Risk: If return_field_from_args is called with fewer than 2 arguments, this will panic.

Mitigation: The function signature requires 2 arguments (granularity and timestamp), so this should be validated by the signature checker before this method is called. However, defensive programming suggests:

let field = args.arg_fields.get(1).ok_or_else(|| {
    internal_datafusion_err!("date_trunc requires at least 2 arguments")
})?;

Severity: Low - unlikely to occur in practice due to signature validation
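
To make the difference between indexing and checked access concrete, here is a minimal std-only sketch. `second_arg` is a hypothetical helper, type names are plain strings, and `String` stands in for DataFusion's error type.

```rust
/// Returns the second argument's type name, or an error (rather than a panic)
/// when fewer than two arguments were supplied.
fn second_arg<'a>(arg_types: &'a [&'a str], func: &str) -> Result<&'a str, String> {
    arg_types
        .get(1) // `get` returns None instead of panicking when out of bounds
        .copied()
        .ok_or_else(|| format!("{func} expects 2 arguments, got {}", arg_types.len()))
}
```

Calling it with a single-element slice produces an `Err` value that the caller can propagate with `?`, whereas `arg_types[1]` on the same slice would abort with an index-out-of-bounds panic.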

2. Backwards Compatibility ⚠️

Location: date_trunc.rs:226-228

The return_type() method now returns internal_err!() instead of the actual type. This could break:

  • External code that directly calls return_type()
  • Reflection/introspection tools
  • The information schema code before this PR (but this PR updates it)

Question: Are there any external callers of return_type() outside this PR? If so, this could be a breaking change.

Severity: Medium - depends on API guarantees
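
The compatibility concern can be illustrated with a simplified model. The trait below is a hypothetical stand-in, not the real DataFusion trait; it assumes a default `return_field_from_args` that delegates to `return_type`, which is what the lines removed from date_trunc in this PR did by hand.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Field {
    data_type: String,
    nullable: bool,
}

/// Simplified model of a scalar UDF trait (hypothetical).
trait ScalarFn {
    /// Legacy entry point: sees only the argument types.
    fn return_type(&self, arg_types: &[String]) -> Result<String, String>;

    /// Newer entry point: sees whole fields. Defaults to the legacy method.
    fn return_field_from_args(&self, arg_fields: &[Field]) -> Result<Field, String> {
        let types: Vec<String> = arg_fields.iter().map(|f| f.data_type.clone()).collect();
        Ok(Field { data_type: self.return_type(&types)?, nullable: true })
    }
}

/// A function that only implements the legacy method.
struct Legacy;
impl ScalarFn for Legacy {
    fn return_type(&self, arg_types: &[String]) -> Result<String, String> {
        Ok(arg_types[0].clone())
    }
}

/// A function migrated the way date_trunc is in this PR.
struct Migrated;
impl ScalarFn for Migrated {
    fn return_type(&self, _arg_types: &[String]) -> Result<String, String> {
        Err("return_field_from_args should be called instead".to_string())
    }
    fn return_field_from_args(&self, arg_fields: &[Field]) -> Result<Field, String> {
        let field = arg_fields
            .get(1)
            .ok_or_else(|| "expected 2 arguments".to_string())?;
        Ok(field.clone())
    }
}

/// A caller that works for both kinds of function.
fn resolve(f: &dyn ScalarFn, arg_fields: &[Field]) -> Result<String, String> {
    Ok(f.return_field_from_args(arg_fields)?.data_type)
}
```

In this model a caller that goes through `return_field_from_args` keeps working for both `Legacy` and `Migrated`, while a caller that still invokes `return_type` directly on `Migrated` gets an error. That is the class of external caller the question above is asking about.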


Performance Considerations

No performance regressions expected:

  • Information schema generation is typically not performance-critical
  • Field allocation (Arc::new(Field::new(...))) has minimal overhead; the information schema helpers now allocate one Field per argument
  • In date_trunc, return_field_from_args no longer collects the argument data types into a temporary Vec just to call return_type

Potential improvement: The date_trunc change removes a redundant field-to-type conversion on that code path


Security Concerns

No security issues identified:

  • No user input is directly processed
  • No SQL injection vectors
  • No unsafe code
  • Proper use of Arc for thread safety

Test Coverage

Questions:

  1. Are there existing tests for information schema that verify window function return types are populated?
  2. Are there tests for date_trunc that validate the return_field_from_args path?
  3. Do integration tests verify the information schema shows correct return types for all UDF types?

Recommendation: Verify that existing tests cover:

  • Information schema returns non-None for window functions (this was a bug)
  • date_trunc with null timestamp arguments
  • All three UDF types (scalar, aggregate, window) in information schema queries

Recommendations

  1. Add bounds checking in date_trunc.rs:231 for defensive programming
  2. Verify no external callers of return_type() will break
  3. Confirm test coverage for window function return types in information schema
  4. Consider adding a comment explaining why scalar_arguments is all None (because we're generating schema, not evaluating)

Overall Assessment

✅ Approve with minor suggestions

This is a solid refactoring that:

  • Modernizes the API usage
  • Fixes a bug where window function return types weren't populated in information schema
  • Maintains consistent patterns across all UDF types
  • Has clear, readable code

The changes are well-structured and the code quality is good. The only concerns are minor (array indexing) and should be addressed for defensive programming.

Nice work! 🚀

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly refactors information_schema and date_trunc to use the more modern return_field_from_args and related APIs. The changes are well-implemented and improve correctness, especially the fix for determining window function return types in the information schema.

However, there is a critical issue: the tests have not been updated to reflect these changes. The change to get_udwf_args_and_return_types in information_schema.rs is a fix that now allows correctly reporting return types for window functions like rank(). The existing sqllogictest in datafusion/sqllogictest/test_files/information_schema.slt still asserts the old, incorrect behavior (e.g., a NULL return type for rank()). Consequently, the CI tests will likely fail.

To resolve this, the test outputs in datafusion/sqllogictest/test_files/information_schema.slt must be updated to match the new, correct behavior. Specifically:

  • The test for information_schema.routines should be updated to show UInt64 as the data_type for rank.
  • The test for information_schema.parameters should be updated to include an OUT parameter of type UInt64 for rank.

Without these test updates, the correctness of this important fix is not verified, and the PR cannot be merged.
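
Why the expected test output changes can be sketched with a toy model of the parameters view. The row layout and the `parameter_rows` helper are hypothetical simplifications, not the actual information schema code; the only behavior taken from this thread is that a known return type produces an OUT row, and that rank previously had none.

```rust
#[derive(Debug, PartialEq)]
struct ParameterRow {
    routine: String,
    mode: &'static str, // "IN" for arguments, "OUT" for the return value
    data_type: String,
}

/// Emits one IN row per argument, plus an OUT row when a return type is known.
fn parameter_rows(
    routine: &str,
    arg_types: &[&str],
    return_type: Option<&str>,
) -> Vec<ParameterRow> {
    let mut rows: Vec<ParameterRow> = arg_types
        .iter()
        .map(|t| ParameterRow {
            routine: routine.to_string(),
            mode: "IN",
            data_type: t.to_string(),
        })
        .collect();
    // Before the fix the window-function path always passed None here,
    // so no OUT row was produced.
    if let Some(t) = return_type {
        rows.push(ParameterRow {
            routine: routine.to_string(),
            mode: "OUT",
            data_type: t.to_string(),
        });
    }
    rows
}
```

Under this model, `parameter_rows("rank", &[], None)` yields no rows, while passing `Some("UInt64")` yields a single OUT row, which corresponds to the extra line the updated .slt expectations would need.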

@augmentcode

augmentcode bot commented Jan 31, 2026

🤖 Augment PR Summary

Summary: Updates return-type introspection to use field-based APIs instead of only DataType.

Changes:

  • Information schema now synthesizes `FieldRef` arguments and uses `return_field_from_args` / `return_field` / `WindowUDFFieldArgs` to derive UDF/UDAF/UDWF output types.
  • `date_trunc` makes `return_field_from_args` the source of truth and turns `return_type` into an error to prevent accidental use.

Technical Notes: Improves compatibility with functions whose return types depend on argument fields/metadata and aligns with the newer return-field APIs.

@augmentcode augmentcode bot left a comment

Review completed. 1 suggestion posted.

Ok(arg_types[1].clone())
}
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
internal_err!("return_field_from_args should be called instead")

Since DateTruncFunc::return_type now always errors, datafusion/functions/benches/date_trunc.rs (around lines 60–63) will now panic when building return_field via udf.return_type(...). Consider updating the benchmark to derive the return field via return_field_from_args (or otherwise avoid calling return_type).

Other Locations
  • datafusion/functions/benches/date_trunc.rs:60

Owner Author

value:delightful; category:bug; feedback: The Augment AI reviewer is correct! It was the only reviewer that checked this unchanged but related file in the repository and found the issue. The benchmarks still compile because the old method still exists, but since CI does not execute them, the pull request author would not see the runtime error. Flagging this now avoids the broken benchmark being discovered only at a later stage, possibly by another developer.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@datafusion/functions/src/datetime/date_trunc.rs`:
- Around line 226-241: The method return_field_from_args currently indexes
args.arg_fields[1] directly and can panic if fewer than two args are passed;
update return_field_from_args to first validate args.arg_fields has at least 2
elements (or use args.arg_fields.get(1)) and return a Result::Err (use the
existing internal_err! or another appropriate error helper) when missing, then
proceed to derive return_type from the second arg as before (referencing
return_field_from_args, args.arg_fields, and self.name()) so the function
returns a proper error instead of panicking.

Comment on lines +226 to 241
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        internal_err!("return_field_from_args should be called instead")
     }

     fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
-        let data_types = args
-            .arg_fields
-            .iter()
-            .map(|f| f.data_type())
-            .cloned()
-            .collect::<Vec<_>>();
-        let return_type = self.return_type(&data_types)?;
+        let field = &args.arg_fields[1];
+        let return_type = if field.data_type().is_null() {
+            Timestamp(Nanosecond, None)
+        } else {
+            field.data_type().clone()
+        };
         Ok(Arc::new(Field::new(
             self.name(),
             return_type,
-            args.arg_fields[1].is_nullable(),
+            field.is_nullable(),
         )))

⚠️ Potential issue | 🟡 Minor

Guard against missing args to avoid panic.

Line 231 indexes args.arg_fields[1] directly. If this method is ever invoked with fewer than two args (planner bug or misuse), it will panic instead of returning a user-facing error.

🛠️ Proposed fix
-        let field = &args.arg_fields[1];
+        let field = args
+            .arg_fields
+            .get(1)
+            .ok_or_else(|| exec_datafusion_err!("{} expects 2 arguments", self.name()))?;
Suggested change

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    internal_err!("return_field_from_args should be called instead")
}
fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
    let field = args
        .arg_fields
        .get(1)
        .ok_or_else(|| exec_datafusion_err!("{} expects 2 arguments", self.name()))?;
    let return_type = if field.data_type().is_null() {
        Timestamp(Nanosecond, None)
    } else {
        field.data_type().clone()
    };
    Ok(Arc::new(Field::new(
        self.name(),
        return_type,
        field.is_nullable(),
    )))

Owner Author

value:incorrect-but-reasonable; category:bug; feedback: The CodeRabbit AI reviewer is right to flag this as a possible issue, but the indexing is actually safe: the method is called only after an earlier check (signature coercion) has verified that all required function arguments are provided and that they are of the allowed types.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

    .map(|(i, t)| {
        Arc::new(Field::new(format!("arg_{i}"), t.clone(), true))
    })
    .collect();

Duplicated arg_fields construction logic across three functions

Low Severity

The same code block to build arg_fields from arg_types is duplicated verbatim in get_udf_args_and_return_types, get_udaf_args_and_return_types, and get_udwf_args_and_return_types. This could be extracted into a helper function like fn build_arg_fields(arg_types: &[DataType]) -> Vec<FieldRef>.
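
A minimal sketch of such a helper, using simplified std-only stand-ins: the `Field` struct and `FieldRef` alias below are hypothetical and only imitate the shape of the Arrow types, and type names are plain strings. The `arg_{i}` naming and the `true` nullability flag follow the snippet quoted above.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for Arrow's Field and FieldRef.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}
type FieldRef = Arc<Field>;

/// Builds one synthetic, nullable field per argument type, named by position.
fn build_arg_fields(arg_types: &[String]) -> Vec<FieldRef> {
    arg_types
        .iter()
        .enumerate()
        .map(|(i, t)| {
            Arc::new(Field {
                name: format!("arg_{i}"),
                data_type: t.clone(),
                nullable: true,
            })
        })
        .collect()
}
```

Each of the three get_*_args_and_return_types functions could then call the shared helper once instead of repeating the iterator chain, so a later change to the naming or nullability convention would only need to be made in one place.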

Owner Author

value:good-to-have; category:bug; feedback: The Bugbot AI reviewer is correct! The duplicated code should be extracted into a helper function and reused. This prevents double maintenance and possible regressions if a fix is applied to only one of the copies.

@martin-augment
Owner Author

1. Array Indexing Without Bounds Check ⚠️

Location: date_trunc.rs:231

let field = &args.arg_fields[1];

Risk: If return_field_from_args is called with fewer than 2 arguments, this will panic.

Mitigation: The function signature requires 2 arguments (granularity and timestamp), so this should be validated by the signature checker before this method is called. However, defensive programming suggests:

let field = args.arg_fields.get(1).ok_or_else(|| {
    internal_datafusion_err!("date_trunc requires at least 2 arguments")
})?;

Severity: Low - unlikely to occur in practice due to signature validation

value:incorrect-but-reasonable; category:bug; feedback: The Claude AI reviewer is right to flag this as a possible issue, but the indexing is actually safe: the method is called only after an earlier check (signature coercion) has verified that all required function arguments are provided and that they are of the allowed types.
