20069: perf: optimise right for byte access and StringView #220

martin-augment · 2026-01-30T12:40:29Z

20069: To review by AI

gemini-code-assist · 2026-01-30T12:40:52Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance enhancements to the right string function in DataFusion. By refactoring its implementation to leverage byte-indexed slicing and optimizing operations for StringViewArray, the changes aim to reduce computation overhead and improve execution speed for string manipulation tasks. The addition of new benchmarks will help monitor and validate these performance gains.

Highlights

Performance Optimization for right function: The right string function has been significantly optimized for performance, particularly for StringViewArray and byte-level access, by leveraging more efficient string slicing techniques.
New Benchmarks Added: Dedicated Criterion benchmarks for the right function have been introduced to measure and track performance improvements across different string types (StringArray, StringViewArray) and various input scenarios (positive and negative n values).
Refactored right Implementation: The internal implementation of the right function was refactored to use byte-indexed slicing for GenericStringArray and StringViewArray, which avoids costly character-by-character iteration for UTF-8 strings.
Enhanced StringViewArray Support: The right function now includes a specialized and optimized implementation for StringViewArray, which reuses existing data buffers and adjusts views to minimize data copies.
Improved Unicode and Edge Case Handling: The logic for handling negative n values and complex unicode characters has been refined, and new test cases were added to ensure correctness and robustness across various inputs, including i64::MIN.
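
The byte-indexed slicing described in the highlights can be sketched roughly as below. This is an illustration and not the code from the pull request: `right_slice` is a hypothetical name, and the sketch assumes PostgreSQL-style semantics (positive n keeps the last n characters, negative n drops the first |n| characters). It only walks `char_indices()` to find one byte offset and then borrows the tail of the input, so no new `String` is allocated.

```rust
use std::cmp::Ordering;

/// Hypothetical helper (not the PR's function): returns the suffix of `s`
/// selected by `n` as a borrowed slice.
fn right_slice(s: &str, n: i64) -> &str {
    // Saturate instead of truncating if |n| does not fit in usize.
    let count = usize::try_from(n.unsigned_abs()).unwrap_or(usize::MAX);
    let start = match n.cmp(&0) {
        // Negative n: skip the first |n| characters; if the string is
        // shorter than that, the result is empty.
        Ordering::Less => s.char_indices().nth(count).map_or(s.len(), |(i, _)| i),
        // Zero: nothing is kept.
        Ordering::Equal => s.len(),
        // Positive n: find the n-th character from the end; if the string
        // is shorter than that, the whole string is kept.
        Ordering::Greater => s
            .char_indices()
            .nth_back(count - 1)
            .map_or(0, |(i, _)| i),
    };
    // `start` always comes from `char_indices` (or is 0 / s.len()),
    // so it is a valid UTF-8 boundary.
    &s[start..]
}
```

Under these assumptions, `right_slice("joséérend", 3)` yields `"end"` and `right_slice("joséérend", -3)` yields `"éérend"`.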


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-01-30T12:41:10Z

Walkthrough

The changes extend DataFusion's right function to support additional string data types (LargeUtf8 and Utf8View). The implementation is refactored from a generic, type-parameterized approach to a non-generic entry point with type-specific internal handlers. New internal helper functions process standard string arrays and StringViewArray types. The RightFunc UDF struct is updated to advertise support for the new types. A new benchmark file is added to measure performance across different data sizes and input styles. The Cargo.toml is updated to register the new benchmark. Tests are added to verify Unicode correctness and edge cases across all supported string types.


claude · 2026-01-30T12:43:11Z

Code Review for PR #220: Optimize right function for byte access and StringView

Summary

This PR optimizes the right function to use byte-level indexing instead of character iteration, and adds specialized StringView support. Overall, this is a well-implemented performance optimization with good test coverage and benchmarks.

Strengths

Performance Optimization: The byte-level indexing approach using char_indices() with nth() and nth_back() is significantly more efficient than the previous character collection approach
Comprehensive Benchmarking: The new benchmark file (benches/right.rs) properly tests both positive and negative n values for both String and StringView arrays
StringView Optimization: The right_impl_view function intelligently reuses existing buffer views when possible (for strings > 12 bytes), avoiding unnecessary allocations
Good Test Coverage: Tests cover edge cases including unicode characters, negative values, boundary conditions, and null handling
Code Consistency: The implementation closely follows the existing left function pattern, maintaining consistency across the codebase
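
The buffer reuse mentioned under StringView Optimization relies on the Arrow view layout: a 16-byte view either stores up to 12 bytes inline or stores a length, a 4-byte prefix, a buffer index and an offset. The sketch below shows how a view for a suffix could be recomputed without copying the string data. It is a standalone illustration with no Arrow dependency; `suffix_view` and its parameters are hypothetical and are not the PR's `right_impl_view`.

```rust
/// Hypothetical helper: build the 128-bit view for the suffix that starts
/// `start` bytes into a string stored at `buffer[offset..offset + len]`.
/// Assumes `start <= len` and that `start` is a UTF-8 boundary.
fn suffix_view(buffer: &[u8], buffer_index: u32, offset: u32, len: u32, start: u32) -> u128 {
    let new_offset = offset + start;
    let new_len = len - start;
    let bytes = &buffer[new_offset as usize..(new_offset + new_len) as usize];
    if new_len <= 12 {
        // Short result: length in the first 4 bytes, data inlined after it.
        let mut raw = [0u8; 16];
        raw[..4].copy_from_slice(&new_len.to_le_bytes());
        raw[4..4 + bytes.len()].copy_from_slice(bytes);
        u128::from_le_bytes(raw)
    } else {
        // Long result: keep pointing at the same buffer, shifted by `start`.
        let prefix = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        (new_len as u128)
            | ((prefix as u128) << 32)
            | ((buffer_index as u128) << 64)
            | ((new_offset as u128) << 96)
    }
}
```

Only the 16-byte view changes in the long case; the underlying data buffer is shared between input and output, which is where the saving for strings longer than 12 bytes would come from.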

Issues Found

Issue 1: Debug println statements left in code

Location: datafusion/functions/src/unicode/right.rs:154-159

Remove the commented-out debug code before merging.

Issue 2: Missing helper function for StringView

Location: datafusion/functions/src/unicode/right.rs:138-160

The right_impl_view function manually constructs views, but the left implementation has a dedicated shrink_string_view_array_view helper function. Consider extracting a similar helper function for better maintainability and consistency.

Issue 3: Inconsistent test data

The test strings were changed from joséésoj to joséérend without clear rationale. The left function uses joséésoj which makes conceptual sense for testing both operations. Consider keeping test data consistent.

Suggestions

Benchmark variations: Consider adding benchmark dimensions for various string lengths (8, 16, 32, 128) especially around the 12-byte StringView inline threshold
Documentation: Add a brief comment explaining the byte-indexing optimization rationale
Test coverage: Add explicit LargeUtf8 test cases

Security Assessment

No security concerns identified:

Proper bounds checking via char_indices() and nth_back()
Safe byte slicing respects UTF-8 character boundaries
Null handling is correct
No unsafe code blocks

Performance Considerations

Positive aspects:

Byte-level indexing eliminates the allocation previously incurred by collecting characters into a new String
StringView buffer reuse minimizes copying for strings > 12 bytes
Both positive and negative n cases are benchmarked

Test Coverage

Current coverage is good with tests for:

Positive, negative, zero, and boundary values (including i64::MIN)
Unicode character handling with multi-byte UTF-8 sequences
Null value handling
Both Utf8 and Utf8View data types

Final Recommendation

Approve with minor changes

This is a well-executed performance optimization. The core implementation is sound and test coverage is comprehensive.

Action Items:

MUST fix before merge:

1. Remove commented-out println debug code (lines 154-159)

SHOULD consider:
2. Extract StringView helper function for maintainability
3. Add explicit LargeUtf8 test cases

OPTIONAL:
4. Document the byte-indexing optimization rationale
5. Add benchmark dimensions for varying string lengths

Great work on this optimization!

gemini-code-assist

Code Review

This pull request significantly optimizes the right string function for StringArray, LargeStringArray, and StringViewArray by avoiding new string allocations and introducing a comprehensive benchmark suite. A security review of datafusion/functions/benches/right.rs and datafusion/functions/src/unicode/right.rs found no vulnerabilities, specifically regarding data flow from n and potential overflow in right_impl_view. I've provided a couple of suggestions for minor improvements in the specific comments.

gemini-code-assist · 2026-01-30T12:46:38Z

datafusion/functions/benches/right.rs

+fn criterion_benchmark(c: &mut Criterion) {
+    for is_string_view in [false, true] {
+        for size in [1024, 4096] {
+            let mut group = c.benchmark_group(format!("right size={size}"));
+
+            // Benchmark with positive n (no optimization needed)
+            let mut function_name = if is_string_view {
+                "string_view_array positive n"
+            } else {
+                "string_array positive n"
+            };
+            let args = create_args(size, 32, false, is_string_view);
+            group.bench_function(BenchmarkId::new(function_name, size), |b| {
+                let arg_fields = args
+                    .iter()
+                    .enumerate()
+                    .map(|(idx, arg)| {
+                        Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
+                    })
+                    .collect::<Vec<_>>();
+                let config_options = Arc::new(ConfigOptions::default());
+
+                b.iter(|| {
+                    black_box(
+                        right()
+                            .invoke_with_args(ScalarFunctionArgs {
+                                args: args.clone(),
+                                arg_fields: arg_fields.clone(),
+                                number_rows: size,
+                                return_field: Field::new("f", DataType::Utf8, true)
+                                    .into(),
+                                config_options: Arc::clone(&config_options),
+                            })
+                            .expect("right should work"),
+                    )
+                })
+            });
+
+            // Benchmark with negative n (triggers optimization)
+            function_name = if is_string_view {
+                "string_view_array negative n"
+            } else {
+                "string_array negative n"
+            };
+            let args = create_args(size, 32, true, is_string_view);
+            group.bench_function(BenchmarkId::new(function_name, size), |b| {
+                let arg_fields = args
+                    .iter()
+                    .enumerate()
+                    .map(|(idx, arg)| {
+                        Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
+                    })
+                    .collect::<Vec<_>>();
+                let config_options = Arc::new(ConfigOptions::default());
+
+                b.iter(|| {
+                    black_box(
+                        right()
+                            .invoke_with_args(ScalarFunctionArgs {
+                                args: args.clone(),
+                                arg_fields: arg_fields.clone(),
+                                number_rows: size,
+                                return_field: Field::new("f", DataType::Utf8, true)
+                                    .into(),
+                                config_options: Arc::clone(&config_options),
+                            })
+                            .expect("right should work"),
+                    )
+                })
+            });
+
+            group.finish();
+        }
+    }
+}


This benchmark file has some opportunities for improvement:

Incorrect Return Type: The return_field in ScalarFunctionArgs is hardcoded to DataType::Utf8. This is incorrect when benchmarking StringViewArray, as the right function will return a StringViewArray (with DataType::Utf8View). This will cause an assertion failure in debug builds. The return type should be dynamically set based on the input arguments' type.

Code Duplication: The benchmarking logic for positive and negative n is almost identical. This could be refactored into a helper function to reduce duplication and improve maintainability.

Here's a suggested refactoring that addresses both points:

fn bench_right(
    group: &mut criterion::BenchmarkGroup<criterion::measurement::WallTime>,
    function_name: &str,
    size: usize,
    args: Vec<ColumnarValue>,
) {
    group.bench_function(BenchmarkId::new(function_name, size), |b| {
        let arg_fields = args
            .iter()
            .enumerate()
            .map(|(idx, arg)| {
                Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
            })
            .collect::<Vec<_>>();
        let config_options = Arc::new(ConfigOptions::default());
        let return_type = args[0].data_type();

        b.iter(|| {
            black_box(
                right()
                    .invoke_with_args(ScalarFunctionArgs {
                        args: args.clone(),
                        arg_fields: arg_fields.clone(),
                        number_rows: size,
                        return_field: Field::new("f", return_type.clone(), true)
                            .into(),
                        config_options: Arc::clone(&config_options),
                    })
                    .expect("right should work"),
            )
        })
    });
}

fn criterion_benchmark(c: &mut Criterion) {
    for is_string_view in [false, true] {
        for size in [1024, 4096] {
            let mut group = c.benchmark_group(format!("right size={size}"));

            // Benchmark with positive n
            let function_name = if is_string_view {
                "string_view_array positive n"
            } else {
                "string_array positive n"
            };
            let args = create_args(size, 32, false, is_string_view);
            bench_right(&mut group, function_name, size, args);

            // Benchmark with negative n
            let function_name = if is_string_view {
                "string_view_array negative n"
            } else {
                "string_array negative n"
            };
            let args = create_args(size, 32, true, is_string_view);
            bench_right(&mut group, function_name, size, args);

            group.finish();
        }
    }
}

value:delightful; category:bug; feedback: The Gemini AI reviewer is correct! The return field's type should depend on the is_string_view flag and be Utf8View when the flag is true. This prevents declaring the wrong type for the returned field when string views are benchmarked.

gemini-code-assist · 2026-01-30T12:46:38Z

datafusion/functions/src/unicode/right.rs

+                // println!(
+                //     "Input string: {}, n: {} -> byte_length: {} -> {}",
+                //     &string,
+                //     n,
+                //     byte_length,
+                //     &string[byte_length..]
+                // );


It looks like there's some leftover commented-out debugging code here. It would be best to remove it to keep the codebase clean.

value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The pull request author left some commented-out debug code that should be removed. Because it is commented out, it does no harm.

augmentcode · 2026-01-30T12:46:48Z

🤖 Augment PR Summary

Summary: Optimizes the Unicode right string function to avoid extra work by slicing via byte offsets and adds first-class StringViewArray handling.

Changes:

Registers a new Criterion benchmark target for `right` under the `unicode_expressions` feature.
Adds a dedicated `right` benchmark covering StringArray vs StringViewArray and positive vs negative `n`.
Refactors the core implementation to compute the start position via `char_indices` and slice the original `&str` by byte offset.
Adds a `StringViewArray` implementation that rebuilds views to reuse underlying buffers when possible.
Updates the UDF return type to match the input string type (`Utf8`, `LargeUtf8`, `Utf8View`).
Extends tests for negative `n`, extreme values, and additional Unicode / Utf8View cases.

Technical Notes: For Utf8View inputs, the function now returns a StringViewArray with recomputed views and a merged null buffer.


augmentcode

Review completed. 2 suggestions posted.


augmentcode · 2026-01-30T12:46:49Z

datafusion/functions/benches/right.rs

+                                args: args.clone(),
+                                arg_fields: arg_fields.clone(),
+                                number_rows: size,
+                                return_field: Field::new("f", DataType::Utf8, true)


return_field is always set to DataType::Utf8 here, but the right function now returns Utf8View when the input is Utf8View; this can make the benchmark either fail or measure an unintended coercion path.

Other Locations

datafusion/functions/benches/right.rs:125


value:delightful; category:bug; feedback: The Augment AI reviewer is correct! The return field's type should depend on the is_string_view flag and be Utf8View when the flag is true. This prevents declaring the wrong type for the returned field when string views are benchmarked.

augmentcode · 2026-01-30T12:46:49Z

datafusion/functions/src/unicode/right.rs

+    match n.cmp(&0) {
+        Ordering::Less => string
+            .char_indices()
+            .nth(n.unsigned_abs() as usize)


Casting n.unsigned_abs() to usize can truncate on 32-bit targets (or for extremely large |n|), which could yield incorrect slicing results compared to the intended “consume whole string” behavior.

Other Locations

datafusion/functions/src/unicode/right.rs:246


value:delightful; category:bug; feedback: The Augment AI reviewer is correct! On 32-bit systems usize is 32 bits wide, so if |n| is larger than u32::MAX the cast truncates it, which leads to a wrong index and a wrong final result.
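
The truncation concern can be reproduced on any platform by using `u32` as a stand-in for a 32-bit `usize`. The sketch below contrasts an `as` cast with a saturating conversion; both function names are hypothetical, and saturating is only one possible fix, not necessarily the one the PR should adopt.

```rust
/// Mirrors `n.unsigned_abs() as usize` on a 32-bit target:
/// the upper 32 bits are silently discarded.
fn truncating_count(n: i64) -> u32 {
    n.unsigned_abs() as u32
}

/// Saturating alternative: values that do not fit become u32::MAX,
/// which preserves the "consume the whole string" behaviour.
fn saturating_count(n: i64) -> u32 {
    u32::try_from(n.unsigned_abs()).unwrap_or(u32::MAX)
}
```

With `n = -((1 << 32) + 2)`, the truncating version yields 2, so only two characters would be skipped, while the saturating version yields `u32::MAX`. For `i64::MIN` the truncating version yields 0.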

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-01-30T12:50:22Z

datafusion/functions/src/unicode/right.rs

+                //     n,
+                //     byte_length,
+                //     &string[byte_length..]
+                // );


Commented-out debug code left in production code

Low Severity

Commented-out println! debugging statements were left in the code. While this doesn't affect functionality, it clutters the codebase and suggests incomplete cleanup before submission. The debug statements in right_impl at lines 157-163 appear to be development artifacts that weren't removed.

value:good-to-have; category:documentation; feedback: The Bugbot AI reviewer is correct! The pull request author left some commented-out debug code that should be removed. Because it is commented out, it does no harm.

martin-augment · 2026-01-30T14:04:54Z

Issue 1: Debug println statements left in code

Location: datafusion/functions/src/unicode/right.rs:154-159

Remove the commented-out debug code before merging.

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The pull request author left some commented-out debug code that should be removed. Because it is commented out, it does no harm.

theirix added 3 commits January 29, 2026 20:57

Add test for right with StringView

14056a9

Add bench for right

44235e0

Optimise right with zero-copy and StringViw specialisation

6ecde7f
