Conversation

@martin-augment
Owner

20069: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance enhancements to the right string function in DataFusion. By refactoring its implementation to leverage byte-indexed slicing and optimizing operations for StringViewArray, the changes aim to reduce computation overhead and improve execution speed for string manipulation tasks. The addition of new benchmarks will help monitor and validate these performance gains.

Highlights

  • Performance Optimization for right function: The right string function has been significantly optimized for performance, particularly for StringViewArray and byte-level access, by leveraging more efficient string slicing techniques.
  • New Benchmarks Added: Dedicated Criterion benchmarks for the right function have been introduced to measure and track performance improvements across different string types (StringArray, StringViewArray) and various input scenarios (positive and negative n values).
  • Refactored right Implementation: The internal implementation of the right function was refactored to use byte-indexed slicing for GenericStringArray and StringViewArray, which avoids costly character-by-character iteration for UTF-8 strings.
  • Enhanced StringViewArray Support: The right function now includes a specialized and optimized implementation for StringViewArray, which reuses existing data buffers and adjusts views to minimize data copies.
  • Improved Unicode and Edge Case Handling: The logic for handling negative n values and complex unicode characters has been refined, and new test cases were added to ensure correctness and robustness across various inputs, including i64::MIN.

@coderabbitai

coderabbitai bot commented Jan 30, 2026

Walkthrough

The changes extend DataFusion's right function to support additional string data types (LargeUtf8 and Utf8View). The implementation is refactored from a generic, type-parameterized approach to a non-generic entry point with type-specific internal handlers. New internal helper functions process standard string arrays and StringViewArray types. The RightFunc UDF struct is updated to advertise support for the new types. A new benchmark file is added to measure performance across different data sizes and input styles. The Cargo.toml is updated to register the new benchmark. Tests are added to verify Unicode correctness and edge cases across all supported string types.


@claude

claude bot commented Jan 30, 2026

Code Review for PR #220: Optimize right function for byte access and StringView

Summary

This PR optimizes the right function to use byte-level indexing instead of character iteration, and adds specialized StringView support. Overall, this is a well-implemented performance optimization with good test coverage and benchmarks.

Strengths

  1. Performance Optimization: The byte-level indexing approach using char_indices() with nth() and nth_back() is significantly more efficient than the previous character collection approach
  2. Comprehensive Benchmarking: The new benchmark file (benches/right.rs) properly tests both positive and negative n values for both String and StringView arrays
  3. StringView Optimization: The right_impl_view function intelligently reuses existing buffer views when possible (for strings > 12 bytes), avoiding unnecessary allocations
  4. Good Test Coverage: Tests cover edge cases including unicode characters, negative values, boundary conditions, and null handling
  5. Code Consistency: The implementation closely follows the existing left function pattern, maintaining consistency across the codebase
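
The byte-indexing approach named in the first strength can be sketched as follows. This is a minimal standalone illustration, not the PR's code: `right_start` and `right` are hypothetical names, the sketch works on a single `&str` rather than an Arrow array, and saturating the conversion of `n` is a choice made here for the example.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: byte offset at which the result of `right(s, n)` starts.
fn right_start(s: &str, n: i64) -> usize {
    // Assumption of this sketch: saturate if |n| does not fit in usize.
    let count = usize::try_from(n.unsigned_abs()).unwrap_or(usize::MAX);
    match n.cmp(&0) {
        // Negative n: drop the first |n| characters.
        Ordering::Less => s.char_indices().nth(count).map_or(s.len(), |(i, _)| i),
        // Zero: the result is empty.
        Ordering::Equal => s.len(),
        // Positive n: keep the last n characters (count >= 1 here).
        Ordering::Greater => s.char_indices().nth_back(count - 1).map_or(0, |(i, _)| i),
    }
}

/// Slice the original string by byte offset; no new String is allocated.
fn right(s: &str, n: i64) -> &str {
    &s[right_start(s, n)..]
}

fn main() {
    println!("{}", right("joséésoj", 3)); // prints "soj"
    println!("{}", right("joséésoj", -3)); // prints "éésoj"
}
```

Because every offset comes from `char_indices()`, the slice always starts on a character boundary, which is why the multi-byte `é` characters survive intact.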

Issues Found

Issue 1: Debug println statements left in code

Location: datafusion/functions/src/unicode/right.rs:154-159

Remove the commented-out debug code before merging.

Issue 2: Missing helper function for StringView

Location: datafusion/functions/src/unicode/right.rs:138-160

The right_impl_view function manually constructs views, but the left implementation has a dedicated shrink_string_view_array_view helper function. Consider extracting a similar helper function for better maintainability and consistency.

Issue 3: Inconsistent test data

The test strings were changed from joséésoj to joséérend without clear rationale. The left function uses joséésoj which makes conceptual sense for testing both operations. Consider keeping test data consistent.

Suggestions

  1. Benchmark variations: Consider adding benchmark dimensions for various string lengths (8, 16, 32, 128) especially around the 12-byte StringView inline threshold
  2. Documentation: Add a brief comment explaining the byte-indexing optimization rationale
  3. Test coverage: Add explicit LargeUtf8 test cases
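
As a rough illustration of the first suggestion, the sketch below generates inputs on both sides of the 12-byte inline threshold. `make_input` and `is_inline` are hypothetical helpers invented for this example; they are not part of the benchmark file or of the Arrow API.

```rust
/// Hypothetical helper: build an ASCII string of exactly `len` bytes.
fn make_input(len: usize) -> String {
    "abcdefghij".chars().cycle().take(len).collect()
}

/// Values of at most 12 bytes are stored inline in a string view;
/// longer values live in a separate data buffer.
fn is_inline(s: &str) -> bool {
    s.len() <= 12
}

fn main() {
    // Lengths chosen to straddle the threshold (12 vs 13) and to go well past it.
    for len in [8, 12, 13, 16, 32, 128] {
        let s = make_input(len);
        println!("len={len:>3} inline={}", is_inline(&s));
    }
}
```

Benchmarking lengths 12 and 13 side by side would show whether the buffer-reuse path behaves differently from the inline path.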

Security Assessment

No security concerns identified:

  • Proper bounds checking via char_indices() and nth_back()
  • Safe byte slicing respects UTF-8 character boundaries
  • Null handling is correct
  • No unsafe code blocks
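
The boundary claim above can be checked with standard-library calls alone. The sketch below is an illustration only; `suffix_at` is a hypothetical helper, not a function from the PR.

```rust
/// Hypothetical helper: suffix starting at byte `start`, or None if `start`
/// is out of range or falls inside a multi-byte character.
fn suffix_at(s: &str, start: usize) -> Option<&str> {
    s.get(start..)
}

fn main() {
    let s = "jos\u{e9}"; // "josé": 'é' occupies bytes 3 and 4

    // Byte 3 starts 'é'; byte 4 is in the middle of it.
    assert!(s.is_char_boundary(3));
    assert!(!s.is_char_boundary(4));

    // `get` refuses to split a character instead of panicking.
    assert_eq!(suffix_at(s, 3), Some("\u{e9}"));
    assert_eq!(suffix_at(s, 4), None);

    // Every offset produced by `char_indices` is a valid boundary.
    assert!(s.char_indices().all(|(i, _)| s.is_char_boundary(i)));
}
```

Direct indexing such as `&s[4..]` would panic on the same input, so offsets derived from `char_indices()` are what make the plain slice safe.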

Performance Considerations

Positive aspects:

  • Byte-level indexing eliminates the allocations caused by collecting characters into a new String
  • StringView buffer reuse minimizes copying for strings > 12 bytes
  • Both positive and negative n cases are benchmarked

Test Coverage

Current coverage is good with tests for:

  • Positive, negative, zero, and boundary values (including i64::MIN)
  • Unicode character handling with multi-byte UTF-8 sequences
  • Null value handling
  • Both Utf8 and Utf8View data types
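
The `i64::MIN` case deserves a test because its magnitude is not representable as an `i64`. The sketch below uses only standard-library integer methods; `magnitude` is a hypothetical wrapper added for illustration.

```rust
/// Hypothetical wrapper: magnitude of `n` as an unsigned value.
fn magnitude(n: i64) -> u64 {
    n.unsigned_abs()
}

fn main() {
    // The signed absolute value of i64::MIN overflows, so checked_abs is None.
    assert_eq!(i64::MIN.checked_abs(), None);

    // unsigned_abs returns the magnitude as u64 and cannot overflow.
    assert_eq!(magnitude(i64::MIN), 9_223_372_036_854_775_808u64);
    assert_eq!(magnitude(-3), 3);
}
```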

Final Recommendation

Approve with minor changes

This is a well-executed performance optimization. The core implementation is sound and test coverage is comprehensive.

Action Items:

MUST fix before merge:

  1. Remove commented-out println debug code (lines 154-159)

SHOULD consider:

  2. Extract StringView helper function for maintainability
  3. Add explicit LargeUtf8 test cases

OPTIONAL:

  4. Document the byte-indexing optimization rationale
  5. Add benchmark dimensions for varying string lengths

Great work on this optimization!

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly optimizes the right string function for StringArray, LargeStringArray, and StringViewArray by avoiding new string allocations and introducing a comprehensive benchmark suite. A security review of datafusion/functions/benches/right.rs and datafusion/functions/src/unicode/right.rs found no vulnerabilities, specifically regarding data flow from n and potential overflow in right_impl_view. I've provided a couple of suggestions for minor improvements in the specific comments.

Comment on lines +63 to +137
fn criterion_benchmark(c: &mut Criterion) {
    for is_string_view in [false, true] {
        for size in [1024, 4096] {
            let mut group = c.benchmark_group(format!("right size={size}"));

            // Benchmark with positive n (no optimization needed)
            let mut function_name = if is_string_view {
                "string_view_array positive n"
            } else {
                "string_array positive n"
            };
            let args = create_args(size, 32, false, is_string_view);
            group.bench_function(BenchmarkId::new(function_name, size), |b| {
                let arg_fields = args
                    .iter()
                    .enumerate()
                    .map(|(idx, arg)| {
                        Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
                    })
                    .collect::<Vec<_>>();
                let config_options = Arc::new(ConfigOptions::default());

                b.iter(|| {
                    black_box(
                        right()
                            .invoke_with_args(ScalarFunctionArgs {
                                args: args.clone(),
                                arg_fields: arg_fields.clone(),
                                number_rows: size,
                                return_field: Field::new("f", DataType::Utf8, true)
                                    .into(),
                                config_options: Arc::clone(&config_options),
                            })
                            .expect("right should work"),
                    )
                })
            });

            // Benchmark with negative n (triggers optimization)
            function_name = if is_string_view {
                "string_view_array negative n"
            } else {
                "string_array negative n"
            };
            let args = create_args(size, 32, true, is_string_view);
            group.bench_function(BenchmarkId::new(function_name, size), |b| {
                let arg_fields = args
                    .iter()
                    .enumerate()
                    .map(|(idx, arg)| {
                        Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
                    })
                    .collect::<Vec<_>>();
                let config_options = Arc::new(ConfigOptions::default());

                b.iter(|| {
                    black_box(
                        right()
                            .invoke_with_args(ScalarFunctionArgs {
                                args: args.clone(),
                                arg_fields: arg_fields.clone(),
                                number_rows: size,
                                return_field: Field::new("f", DataType::Utf8, true)
                                    .into(),
                                config_options: Arc::clone(&config_options),
                            })
                            .expect("right should work"),
                    )
                })
            });

            group.finish();
        }
    }
}

medium

This benchmark file has some opportunities for improvement:

  1. Incorrect Return Type: The return_field in ScalarFunctionArgs is hardcoded to DataType::Utf8. This is incorrect when benchmarking StringViewArray, as the right function will return a StringViewArray (with DataType::Utf8View). This will cause an assertion failure in debug builds. The return type should be dynamically set based on the input arguments' type.

  2. Code Duplication: The benchmarking logic for positive and negative n is almost identical. This could be refactored into a helper function to reduce duplication and improve maintainability.

Here's a suggested refactoring that addresses both points:

fn bench_right(
    group: &mut criterion::BenchmarkGroup<criterion::measurement::WallTime>,
    function_name: &str,
    size: usize,
    args: Vec<ColumnarValue>,
) {
    group.bench_function(BenchmarkId::new(function_name, size), |b| {
        let arg_fields = args
            .iter()
            .enumerate()
            .map(|(idx, arg)| {
                Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
            })
            .collect::<Vec<_>>();
        let config_options = Arc::new(ConfigOptions::default());
        let return_type = args[0].data_type();

        b.iter(|| {
            black_box(
                right()
                    .invoke_with_args(ScalarFunctionArgs {
                        args: args.clone(),
                        arg_fields: arg_fields.clone(),
                        number_rows: size,
                        return_field: Field::new("f", return_type.clone(), true)
                            .into(),
                        config_options: Arc::clone(&config_options),
                    })
                    .expect("right should work"),
            )
        })
    });
}

fn criterion_benchmark(c: &mut Criterion) {
    for is_string_view in [false, true] {
        for size in [1024, 4096] {
            let mut group = c.benchmark_group(format!("right size={size}"));

            // Benchmark with positive n
            let function_name = if is_string_view {
                "string_view_array positive n"
            } else {
                "string_array positive n"
            };
            let args = create_args(size, 32, false, is_string_view);
            bench_right(&mut group, function_name, size, args);

            // Benchmark with negative n
            let function_name = if is_string_view {
                "string_view_array negative n"
            } else {
                "string_array negative n"
            };
            let args = create_args(size, 32, true, is_string_view);
            bench_right(&mut group, function_name, size, args);

            group.finish();
        }
    }
}

Owner Author

value:delightful; category:bug; feedback: The Gemini AI reviewer is correct! The return field's type should depend on the is_string_view flag and be Utf8View when the flag is true. This prevents the benchmark from using the wrong type for the returned field when string views are tested.

Comment on lines +157 to +163
// println!(
// "Input string: {}, n: {} -> byte_length: {} -> {}",
// &string,
// n,
// byte_length,
// &string[byte_length..]
// );

medium

It looks like there's some leftover commented-out debugging code here. It would be best to remove it to keep the codebase clean.

Owner Author

value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The Pull Request author left some commented-out debug code behind, and it should be removed. Because it is commented out, it does no harm.

@augmentcode

augmentcode bot commented Jan 30, 2026

🤖 Augment PR Summary

Summary: Optimizes the Unicode right string function to avoid extra work by slicing via byte offsets and adds first-class StringViewArray handling.

Changes:

  • Registers a new Criterion benchmark target for `right` under the `unicode_expressions` feature.
  • Adds a dedicated `right` benchmark covering StringArray vs StringViewArray and positive vs negative `n`.
  • Refactors the core implementation to compute the start position via `char_indices` and slice the original `&str` by byte offset.
  • Adds a `StringViewArray` implementation that rebuilds views to reuse underlying buffers when possible.
  • Updates the UDF return type to match the input string type (`Utf8`, `LargeUtf8`, `Utf8View`).
  • Extends tests for negative `n`, extreme values, and additional Unicode / Utf8View cases.

Technical Notes: For Utf8View inputs, the function now returns a StringViewArray with recomputed views and a merged null buffer.
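
To make "recomputed views" concrete, here is a simplified sketch of how a view for a suffix could be derived. It is not the PR's implementation and not the arrow-rs API: the `View` enum and `suffix_view` are hypothetical stand-ins for the 16-byte Arrow view layout (4-byte length, then either 12 inline bytes or a 4-byte prefix, buffer index, and offset), and the sketch assumes the source value already lives in a data buffer.

```rust
/// Hypothetical, simplified stand-in for a 16-byte Arrow string view.
#[derive(Debug, PartialEq)]
enum View {
    /// Values of at most 12 bytes are stored inline, zero-padded.
    Inline { len: u32, data: [u8; 12] },
    /// Longer values point into a shared data buffer.
    Buffer { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Build the view for the suffix that starts `start` bytes into a value
/// stored at `buffer[offset .. offset + len]`. Assumes `start <= len`.
fn suffix_view(buffer: &[u8], buffer_index: u32, offset: u32, len: u32, start: u32) -> View {
    let new_offset = offset + start;
    let new_len = len - start;
    let bytes = &buffer[new_offset as usize..(new_offset + new_len) as usize];
    if new_len <= 12 {
        // A short suffix must be copied into the view itself.
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len: new_len, data }
    } else {
        // A long suffix reuses the buffer: only offset, length, and prefix change.
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Buffer { len: new_len, prefix, buffer_index, offset: new_offset }
    }
}

fn main() {
    let buf = b"hello, string view world"; // 24 bytes
    println!("{:?}", suffix_view(buf, 0, 0, 24, 7)); // 17-byte suffix, stays in the buffer
    println!("{:?}", suffix_view(buf, 0, 0, 24, 19)); // 5-byte suffix, becomes inline
}
```

The point of the design is that the long-suffix case copies no string data at all; the short-suffix case copies at most 12 bytes.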

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

args: args.clone(),
arg_fields: arg_fields.clone(),
number_rows: size,
return_field: Field::new("f", DataType::Utf8, true)

return_field is always set to DataType::Utf8 here, but the right function now returns Utf8View when the input is Utf8View; this can make the benchmark either fail or measure an unintended coercion path.

Other Locations
  • datafusion/functions/benches/right.rs:125

Owner Author

value:delightful; category:bug; feedback: The Augment AI reviewer is correct! The return field's type should depend on the is_string_view flag and be Utf8View when the flag is true. This prevents the benchmark from using the wrong type for the returned field when string views are tested.

match n.cmp(&0) {
    Ordering::Less => string
        .char_indices()
        .nth(n.unsigned_abs() as usize)

Casting n.unsigned_abs() to usize can truncate on 32-bit targets (or for extremely large |n|), which could yield incorrect slicing results compared to the intended “consume whole string” behavior.

Other Locations
  • datafusion/functions/src/unicode/right.rs:246

Owner Author

value:delightful; category:bug; feedback: The Augment AI reviewer is correct! On 32-bit targets usize is 32 bits wide, so if |n| is larger than u32::MAX the cast truncates it, which leads to a wrong index and a wrong result.
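
The truncation can be reproduced on any machine by casting to `u32`, which is what `as usize` amounts to on a 32-bit target. The sketch below is illustrative only: `truncating_count` and `saturating_count` are hypothetical names, and saturating via `try_from` is one possible fix, not necessarily the one the PR adopts.

```rust
/// Emulates `n.unsigned_abs() as usize` on a 32-bit target: high bits are dropped.
fn truncating_count(n: i64) -> u32 {
    n.unsigned_abs() as u32
}

/// Saturating alternative: magnitudes that do not fit become u32::MAX.
fn saturating_count(n: i64) -> u32 {
    u32::try_from(n.unsigned_abs()).unwrap_or(u32::MAX)
}

fn main() {
    let n: i64 = -((1i64 << 32) + 1); // |n| = 4_294_967_297

    // Truncation leaves 1, so only one character would be skipped.
    println!("{}", truncating_count(n)); // prints 1

    // Saturation keeps the count huge, so the whole string is consumed.
    println!("{}", saturating_count(n)); // prints 4294967295
}
```

With the saturated count, `char_indices().nth(count)` runs off the end of any realistic string and returns `None`, which matches the intended "consume the whole string" behavior.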

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

// n,
// byte_length,
// &string[byte_length..]
// );

Commented-out debug code left in production code

Low Severity

Commented-out println! debugging statements were left in the code. While this doesn't affect functionality, it clutters the codebase and suggests incomplete cleanup before submission. The debug statements in right_impl at lines 157-163 appear to be development artifacts that weren't removed.

Owner Author

value:good-to-have; category:documentation; feedback: The Bugbot AI reviewer is correct! The Pull Request author left some commented-out debug code behind, and it should be removed. Because it is commented out, it does no harm.

@martin-augment
Owner Author

Issue 1: Debug println statements left in code

Location: datafusion/functions/src/unicode/right.rs:154-159

Remove the commented-out debug code before merging.

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The Pull Request author left some commented-out debug code behind, and it should be removed. Because it is commented out, it does no harm.
