20079: Use return_field_from_args in information schema and date_trunc #222

martin-augment · 2026-01-31T11:57:55Z

20079: To review by AI

…dow UDFs

gemini-code-assist · 2026-01-31T11:58:11Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant improvement in how DataFusion determines the return types of User-Defined Functions (UDFs), Aggregate User-Defined Functions (UDAFs), and Window User-Defined Functions (UDWFs) within its information schema. By transitioning from the simpler return_type method to the more context-aware return_field_from_args, the system can now infer return types with greater precision, taking into account the full field properties of input arguments. This change is also consistently applied to the date_trunc function, ensuring that its return type is derived using the new, more robust mechanism.

Highlights

Information Schema Refactoring: Refactored the information schema generation for User-Defined Functions (UDFs), Aggregate User-Defined Functions (UDAFs), and Window User-Defined Functions (UDWFs) to utilize the more robust return_field_from_args method for accurate return type inference.
Date_trunc Function Update: Updated the date_trunc function to implement return_field_from_args and explicitly deprecate its return_type method, ensuring consistency with the new type inference mechanism.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-01-31T11:58:20Z

Walkthrough

This pull request refactors UDF return type computation in DataFusion. Changes to the information schema update how scalar, aggregate, and window UDFs compute return types by constructing FieldRef arguments and deriving types from WindowUDFFieldArgs and ReturnFieldArgs instead of relying on return_type/return_field methods. The date_trunc function is updated to enforce the new pattern: its return_type method now signals an internal error directing usage toward return_field_from_args, which derives the return type from the second argument's field properties. This represents a shift from legacy type computation to field-based type derivation across the codebase.

claude · 2026-01-31T12:00:02Z

PR Review: Use return_field_from_args in information schema and date_trunc

Summary

This PR updates the information schema and date_trunc function to use the newer return_field_from_args API instead of the older return_type method. This is a good refactoring that aligns with the modern DataFusion UDF API.

Code Quality & Best Practices ✅

Strengths:

Consistent API migration: The changes correctly migrate from the deprecated return_type() to return_field_from_args() across scalar, aggregate, and window UDFs
Code consistency: All three helper functions (get_udf_args_and_return_types, get_udaf_args_and_return_types, get_udwf_args_and_return_types) now follow the same pattern
Proper field construction: Creates field references with appropriate names (arg_{i}) and metadata
Clean imports: Added necessary imports (FieldRef, ReturnFieldArgs, WindowUDFFieldArgs) without clutter

Minor observations:

The field naming format!("arg_{i}") is consistent but generic. This is acceptable for internal use in information schema
scalar_arguments: vec![None; arg_fields.len()] indicates no constant folding at schema generation time, which is appropriate

Information Schema Changes (information_schema.rs)

Scalar UDF (get_udf_args_and_return_types)

Before: Called udf.return_type(&arg_types) - only received DataTypes
After: Calls udf.return_field_from_args(...) - receives full Field with metadata

// Good: Properly constructs arg_fields from arg_types
let arg_fields: Vec<FieldRef> = arg_types
    .iter()
    .enumerate()
    .map(|(i, t)| Arc::new(Field::new(format!("arg_{i}"), t.clone(), true)))
    .collect();

✅ Correct implementation - extracts data_type from the returned Field

Aggregate UDF (get_udaf_args_and_return_types)

Change: Now uses udaf.return_field(&arg_fields) instead of udaf.return_type(&arg_types)

✅ Consistent pattern - mirrors the scalar UDF approach

Window UDF (get_udwf_args_and_return_types)

Before: Returned (arg_types, None) - window function return types were not populated!
After: Uses udwf.field(WindowUDFFieldArgs::new(&arg_fields, udwf.name())) - now returns actual return type

🎯 This is a bug fix! The old code always returned None for window function return types in the information schema.

date_trunc Changes (date_trunc.rs)

Before:

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    if arg_types[1].is_null() {
        Ok(Timestamp(Nanosecond, None))
    } else {
        Ok(arg_types[1].clone())
    }
}

After:

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    internal_err!("return_field_from_args should be called instead")
}

fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
    let field = &args.arg_fields[1];
    let return_type = if field.data_type().is_null() {
        Timestamp(Nanosecond, None)
    } else {
        field.data_type().clone()
    };
    Ok(Arc::new(Field::new(
        self.name(),
        return_type,
        field.is_nullable(),
    )))
}

✅ Good migration pattern:

Deprecated return_type() now errors with helpful message
Logic moved to return_field_from_args() with identical semantics
Preserves nullability from input field
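The migration described above can be illustrated with a minimal, dependency-free sketch. The `DataType`, `Field`, and `FieldRef` definitions below are simplified stand-ins rather than the Arrow or DataFusion types, and `date_trunc_return_field` and `make_field` are hypothetical names introduced only for this example. Like the reviewed code, the sketch assumes the caller has already validated that two arguments are present.

```rust
use std::sync::Arc;

// Stand-in types so the sketch compiles without the arrow or datafusion
// crates; they are NOT the real `DataType` / `Field` definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Utf8,
    TimestampNanos,  // stands in for Timestamp(Nanosecond, None)
    TimestampMillis, // stands in for Timestamp(Millisecond, None)
}

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

type FieldRef = Arc<Field>;

/// Hypothetical free function mirroring the logic of the new
/// `return_field_from_args`: the result takes its type and nullability
/// from the second argument (the timestamp), falling back to a nanosecond
/// timestamp when that argument is typed as null.
/// Assumes the caller has already validated that two arguments exist.
fn date_trunc_return_field(arg_fields: &[FieldRef]) -> FieldRef {
    let field = &arg_fields[1];
    let return_type = if field.data_type == DataType::Null {
        DataType::TimestampNanos
    } else {
        field.data_type.clone()
    };
    Arc::new(Field {
        name: "date_trunc".to_string(),
        data_type: return_type,
        nullable: field.nullable,
    })
}

/// Small constructor for building example argument fields.
fn make_field(name: &str, data_type: DataType, nullable: bool) -> FieldRef {
    Arc::new(Field {
        name: name.to_string(),
        data_type,
        nullable,
    })
}
```

Calling `date_trunc_return_field` with a non-nullable `TimestampMillis` second argument yields a non-nullable `TimestampMillis` field, while a nullable `Null` second argument yields a nullable `TimestampNanos` field.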

Potential Issues & Concerns

1. Array Indexing Without Bounds Check ⚠️

Location: date_trunc.rs:231

let field = &args.arg_fields[1];

Risk: If return_field_from_args is called with fewer than 2 arguments, this will panic.

Mitigation: The function signature requires 2 arguments (granularity and timestamp), so this should be validated by the signature checker before this method is called. However, defensive programming suggests:

let field = args.arg_fields.get(1).ok_or_else(|| {
    internal_datafusion_err!("date_trunc requires at least 2 arguments")
})?;

Severity: Low - unlikely to occur in practice due to signature validation
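The bounds-checked access can be sketched in isolation as follows. `SketchError` and `second_arg` are hypothetical names used only for illustration, and the error type is a stand-in rather than DataFusion's. Note that the closure passed to `ok_or_else` has to produce an error value, not a `Result`, for the trailing `?` (or a direct return) to type-check.

```rust
// Stand-in error type for illustration; not DataFusion's error enum.
#[derive(Debug, PartialEq)]
enum SketchError {
    Internal(String),
}

/// Hypothetical helper: return the second argument, or an error value
/// instead of panicking when fewer than two arguments were supplied.
fn second_arg<'a, T>(args: &'a [T], fn_name: &str) -> Result<&'a T, SketchError> {
    // `get(1)` yields `Option<&T>`; `ok_or_else` converts `None` into an
    // error value that is built lazily, only on the failure path.
    args.get(1).ok_or_else(|| {
        SketchError::Internal(format!("{fn_name} requires at least 2 arguments"))
    })
}
```

With two arguments the helper returns a reference to the second one; with a single argument it returns `SketchError::Internal` carrying the message, and no panic occurs.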

2. Backwards Compatibility ⚠️

Location: date_trunc.rs:226-228

The return_type() method now returns internal_err!() instead of the actual type. This could break:

External code that directly calls return_type()
Reflection/introspection tools
The information schema code before this PR (but this PR updates it)

Question: Are there any external callers of return_type() outside this PR? If so, this could be a breaking change.

Severity: Medium - depends on API guarantees

Performance Considerations

✅ No performance regressions expected:

Information schema generation is typically not performance-critical
Field allocation (Arc::new(Field::new(...))) has minimal overhead
The new approach is actually more efficient as it avoids reconstructing fields that already exist

⚡ Potential improvement: The changes eliminate redundant type-to-field conversions in some code paths

Security Concerns

✅ No security issues identified:

No user input is directly processed
No SQL injection vectors
No unsafe code
Proper use of Arc for thread safety

Test Coverage

❓ Questions:

Are there existing tests for information schema that verify window function return types are populated?
Are there tests for date_trunc that validate the return_field_from_args path?
Do integration tests verify the information schema shows correct return types for all UDF types?

Recommendation: Verify that existing tests cover:

Information schema returns non-None for window functions (this was a bug)
date_trunc with null timestamp arguments
All three UDF types (scalar, aggregate, window) in information schema queries

Recommendations

Add bounds checking in date_trunc.rs:231 for defensive programming
Verify no external callers of return_type() will break
Confirm test coverage for window function return types in information schema
Consider adding a comment explaining why scalar_arguments is all None (because we're generating schema, not evaluating)

Overall Assessment

✅ Approve with minor suggestions

This is a solid refactoring that:

Modernizes the API usage
Fixes a bug where window function return types weren't populated in information schema
Maintains consistent patterns across all UDF types
Has clear, readable code

The changes are well-structured and the code quality is good. The only concerns are minor (array indexing) and should be addressed for defensive programming.

Nice work! 🚀

gemini-code-assist

Code Review

This pull request correctly refactors information_schema and date_trunc to use the more modern return_field_from_args and related APIs. The changes are well-implemented and improve correctness, especially the fix for determining window function return types in the information schema.

However, there is a critical issue: the tests have not been updated to reflect these changes. The change to get_udwf_args_and_return_types in information_schema.rs is a fix that now allows correctly reporting return types for window functions like rank(). The existing sqllogictest in datafusion/sqllogictest/test_files/information_schema.slt still asserts the old, incorrect behavior (e.g., a NULL return type for rank()). Consequently, the CI tests will likely fail.

To resolve this, the test outputs in datafusion/sqllogictest/test_files/information_schema.slt must be updated to match the new, correct behavior. Specifically:

The test for information_schema.routines should be updated to show UInt64 as the data_type for rank.
The test for information_schema.parameters should be updated to include an OUT parameter of type UInt64 for rank.

Without these test updates, the correctness of this important fix is not verified, and the PR cannot be merged.

augmentcode · 2026-01-31T12:05:16Z

🤖 Augment PR Summary

Summary: Updates return-type introspection to use field-based APIs instead of only DataType.

Changes:

Information schema now synthesizes `FieldRef` arguments and uses `return_field_from_args` / `return_field` / `WindowUDFFieldArgs` to derive UDF/UDAF/UDWF output types.
`date_trunc` makes `return_field_from_args` the source of truth and turns `return_type` into an error to prevent accidental use.

Technical Notes: Improves compatibility with functions whose return types depend on argument fields/metadata and aligns with the newer return-field APIs.

augmentcode

Review completed. 1 suggestion posted.

augmentcode · 2026-01-31T12:05:17Z

datafusion/functions/src/datetime/date_trunc.rs

-            Ok(arg_types[1].clone())
-        }
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        internal_err!("return_field_from_args should be called instead")


Since DateTruncFunc::return_type now always errors, datafusion/functions/benches/date_trunc.rs (around lines 60–63) will now panic when building return_field via udf.return_type(...). Consider updating the benchmark to derive the return field via return_field_from_args (or otherwise avoid calling return_type).

Other Locations

datafusion/functions/benches/date_trunc.rs:60

value:delightful; category:bug; feedback: The Augment AI reviewer is correct, and it was the only reviewer that checked this unchanged but related file in the repository and found the issue. The benchmarks still compile because the old method still exists, but since CI does not execute them, the runtime error went undetected by the pull request author. Catching it now prevents the broken benchmark from being discovered only at a later stage, possibly by another developer.
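The caller-side problem can be sketched without the DataFusion crates. Everything below is a stand-in: `DateTruncLike`, `bench_return_field`, and the simplified `DataType` and `Field` types are hypothetical and only model the situation where `return_type` always errors while `return_field_from_args` carries the real logic. The actual benchmark code is not reproduced here.

```rust
// Simplified stand-ins; not the arrow or datafusion definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    TimestampNanos,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Stand-in for a UDF whose `return_type` has been turned into an error.
struct DateTruncLike;

impl DateTruncLike {
    /// Always fails, so any caller that unwraps the result will panic.
    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType, String> {
        Err("return_field_from_args should be called instead".to_string())
    }

    /// Field-based path: derives the result from the second argument.
    fn return_field_from_args(&self, arg_fields: &[Field]) -> Result<Field, String> {
        let field = arg_fields
            .get(1)
            .ok_or_else(|| "expected 2 arguments".to_string())?;
        Ok(Field {
            name: "date_trunc".to_string(),
            data_type: field.data_type.clone(),
            nullable: field.nullable,
        })
    }
}

/// Hypothetical benchmark setup step: obtain the return field through the
/// field-based method rather than unwrapping `return_type`.
fn bench_return_field(udf: &DateTruncLike, arg_fields: &[Field]) -> Field {
    udf.return_field_from_args(arg_fields)
        .expect("argument fields should be valid for the benchmark")
}
```

In this model, a setup step that still called `return_type(...).unwrap()` would compile but panic at run time, which is the failure mode the comment above describes.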

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@datafusion/functions/src/datetime/date_trunc.rs`:
- Around line 226-241: The method return_field_from_args currently indexes
args.arg_fields[1] directly and can panic if fewer than two args are passed;
update return_field_from_args to first validate args.arg_fields has at least 2
elements (or use args.arg_fields.get(1)) and return a Result::Err (use the
existing internal_err! or another appropriate error helper) when missing, then
proceed to derive return_type from the second arg as before (referencing
return_field_from_args, args.arg_fields, and self.name()) so the function
returns a proper error instead of panicking.

coderabbitai · 2026-01-31T12:06:23Z

datafusion/functions/src/datetime/date_trunc.rs

+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        internal_err!("return_field_from_args should be called instead")
    }

    fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
-        let data_types = args
-            .arg_fields
-            .iter()
-            .map(|f| f.data_type())
-            .cloned()
-            .collect::<Vec<_>>();
-        let return_type = self.return_type(&data_types)?;
+        let field = &args.arg_fields[1];
+        let return_type = if field.data_type().is_null() {
+            Timestamp(Nanosecond, None)
+        } else {
+            field.data_type().clone()
+        };
        Ok(Arc::new(Field::new(
            self.name(),
            return_type,
-            args.arg_fields[1].is_nullable(),
+            field.is_nullable(),
        )))


⚠️ Potential issue | 🟡 Minor

Guard against missing args to avoid panic.

Line 231 indexes args.arg_fields[1] directly. If this method is ever invoked with fewer than two args (planner bug or misuse), it will panic instead of returning a user-facing error.

🛠️ Proposed fix

-        let field = &args.arg_fields[1];
+        let field = args
+            .arg_fields
+            .get(1)
+            .ok_or_else(|| exec_datafusion_err!("{} expects 2 arguments", self.name()))?;

Suggested change

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        internal_err!("return_field_from_args should be called instead")
    }

    fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
        let field = args
            .arg_fields
            .get(1)
            .ok_or_else(|| exec_datafusion_err!("{} expects 2 arguments", self.name()))?;
        let return_type = if field.data_type().is_null() {
            Timestamp(Nanosecond, None)
        } else {
            field.data_type().clone()
        };
        Ok(Arc::new(Field::new(
            self.name(),
            return_type,
            field.is_nullable(),
        )))

value:incorrect-but-reasonable; category:bug; feedback: The CodeRabbit AI reviewer is right to flag this as a possible issue, but the direct indexing is actually safe: the method is only called after an earlier check (the signature coercion check) has verified that all required function arguments are provided and are of the allowed types.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-31T12:07:08Z

datafusion/catalog/src/information_schema.rs

+                    .map(|(i, t)| {
+                        Arc::new(Field::new(format!("arg_{i}"), t.clone(), true))
+                    })
+                    .collect();


Duplicated arg_fields construction logic across three functions

Low Severity

The same code block to build arg_fields from arg_types is duplicated verbatim in get_udf_args_and_return_types, get_udaf_args_and_return_types, and get_udwf_args_and_return_types. This could be extracted into a helper function like fn build_arg_fields(arg_types: &[DataType]) -> Vec<FieldRef>.

value:good-to-have; category:bug; feedback: The Bugbot AI reviewer is correct. The duplicated code would be better extracted into a helper function and reused. This prevents double maintenance and possible regressions if a fix is applied to only one of the copies.
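A minimal sketch of such a helper is shown below. The name and signature `build_arg_fields(arg_types: &[DataType]) -> Vec<FieldRef>` come from the review suggestion; the body is an assumption modeled on the diff, and `DataType`, `Field`, and `FieldRef` are simplified stand-ins so the example compiles without the Arrow crates.

```rust
use std::sync::Arc;

// Stand-in types: simplified substitutes for Arrow's `DataType` and `Field`,
// defined here only so the sketch compiles without external crates.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

type FieldRef = Arc<Field>;

/// Shared helper as proposed in the review: synthesize one nullable field
/// per argument type, named `arg_0`, `arg_1`, ... as in the reviewed diff.
fn build_arg_fields(arg_types: &[DataType]) -> Vec<FieldRef> {
    arg_types
        .iter()
        .enumerate()
        .map(|(i, t)| {
            Arc::new(Field {
                name: format!("arg_{i}"),
                data_type: t.clone(),
                nullable: true,
            })
        })
        .collect()
}
```

Each of the three `get_*_args_and_return_types` functions could then call the helper once instead of repeating the iterator chain, so a later change to the naming or nullability convention would only need to be made in one place.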

martin-augment · 2026-01-31T13:57:23Z

1. Array Indexing Without Bounds Check ⚠️

Location: date_trunc.rs:231

let field = &args.arg_fields[1];

Risk: If return_field_from_args is called with fewer than 2 arguments, this will panic.

Mitigation: The function signature requires 2 arguments (granularity and timestamp), so this should be validated by the signature checker before this method is called. However, defensive programming suggests:

let field = args.arg_fields.get(1).ok_or_else(|| {
    internal_datafusion_err!("date_trunc requires at least 2 arguments")
})?;

Severity: Low - unlikely to occur in practice due to signature validation

value:incorrect-but-reasonable; category:bug; feedback: The Claude AI reviewer is right to flag this as a possible issue, but the direct indexing is actually safe: the method is only called after an earlier check (the signature coercion check) has verified that all required function arguments are provided and are of the allowed types.

AndreaBozzo added 3 commits January 30, 2026 18:14

update return type

4eb8af2

update error message in return_type method for consistency

9d706cd

enhance get_udwf_args_and_return_types to include return type for win…

e8d78e8

…dow UDFs
