-
Notifications
You must be signed in to change notification settings - Fork 0
20069: perf: optimise right for byte access and StringView #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces significant performance enhancements to the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThe changes extend DataFusion's ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review for PR #220: Optimize right function for byte access and StringViewSummaryThis PR optimizes the right function to use byte-level indexing instead of character iteration, and adds specialized StringView support. Overall, this is a well-implemented performance optimization with good test coverage and benchmarks. Strengths
Issues FoundIssue 1: Debug println statements left in codeLocation: datafusion/functions/src/unicode/right.rs:154-159 Remove the commented-out debug code before merging. Issue 2: Missing helper function for StringViewLocation: datafusion/functions/src/unicode/right.rs:138-160 The right_impl_view function manually constructs views, but the left implementation has a dedicated shrink_string_view_array_view helper function. Consider extracting a similar helper function for better maintainability and consistency. Issue 3: Inconsistent test dataThe test strings were changed from joséésoj to joséérend without clear rationale. The left function uses joséésoj which makes conceptual sense for testing both operations. Consider keeping test data consistent. Suggestions
Security AssessmentNo security concerns identified:
Performance ConsiderationsPositive aspects:
Test CoverageCurrent coverage is good with tests for:
Final RecommendationApprove with minor changes This is a well-executed performance optimization. The core implementation is sound and test coverage is comprehensive. Action Items:MUST fix before merge:
SHOULD consider: OPTIONAL: Great work on this optimization! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly optimizes the right string function for StringArray, LargeStringArray, and StringViewArray by avoiding new string allocations and introducing a comprehensive benchmark suite. A security review of datafusion/functions/benches/right.rs and datafusion/functions/src/unicode/right.rs found no vulnerabilities, specifically regarding data flow from n and potential overflow in right_impl_view. I've provided a couple of suggestions for minor improvements in the specific comments.
| fn criterion_benchmark(c: &mut Criterion) { | ||
| for is_string_view in [false, true] { | ||
| for size in [1024, 4096] { | ||
| let mut group = c.benchmark_group(format!("right size={size}")); | ||
|
|
||
| // Benchmark with positive n (no optimization needed) | ||
| let mut function_name = if is_string_view { | ||
| "string_view_array positive n" | ||
| } else { | ||
| "string_array positive n" | ||
| }; | ||
| let args = create_args(size, 32, false, is_string_view); | ||
| group.bench_function(BenchmarkId::new(function_name, size), |b| { | ||
| let arg_fields = args | ||
| .iter() | ||
| .enumerate() | ||
| .map(|(idx, arg)| { | ||
| Field::new(format!("arg_{idx}"), arg.data_type(), true).into() | ||
| }) | ||
| .collect::<Vec<_>>(); | ||
| let config_options = Arc::new(ConfigOptions::default()); | ||
|
|
||
| b.iter(|| { | ||
| black_box( | ||
| right() | ||
| .invoke_with_args(ScalarFunctionArgs { | ||
| args: args.clone(), | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows: size, | ||
| return_field: Field::new("f", DataType::Utf8, true) | ||
| .into(), | ||
| config_options: Arc::clone(&config_options), | ||
| }) | ||
| .expect("right should work"), | ||
| ) | ||
| }) | ||
| }); | ||
|
|
||
| // Benchmark with negative n (triggers optimization) | ||
| function_name = if is_string_view { | ||
| "string_view_array negative n" | ||
| } else { | ||
| "string_array negative n" | ||
| }; | ||
| let args = create_args(size, 32, true, is_string_view); | ||
| group.bench_function(BenchmarkId::new(function_name, size), |b| { | ||
| let arg_fields = args | ||
| .iter() | ||
| .enumerate() | ||
| .map(|(idx, arg)| { | ||
| Field::new(format!("arg_{idx}"), arg.data_type(), true).into() | ||
| }) | ||
| .collect::<Vec<_>>(); | ||
| let config_options = Arc::new(ConfigOptions::default()); | ||
|
|
||
| b.iter(|| { | ||
| black_box( | ||
| right() | ||
| .invoke_with_args(ScalarFunctionArgs { | ||
| args: args.clone(), | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows: size, | ||
| return_field: Field::new("f", DataType::Utf8, true) | ||
| .into(), | ||
| config_options: Arc::clone(&config_options), | ||
| }) | ||
| .expect("right should work"), | ||
| ) | ||
| }) | ||
| }); | ||
|
|
||
| group.finish(); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This benchmark file has some opportunities for improvement:
-
Incorrect Return Type: The
return_fieldinScalarFunctionArgsis hardcoded toDataType::Utf8. This is incorrect when benchmarkingStringViewArray, as therightfunction will return aStringViewArray(withDataType::Utf8View). This will cause an assertion failure in debug builds. The return type should be dynamically set based on the input arguments' type. -
Code Duplication: The benchmarking logic for positive and negative
nis almost identical. This could be refactored into a helper function to reduce duplication and improve maintainability.
Here's a suggested refactoring that addresses both points:
fn bench_right(
group: &mut criterion::BenchmarkGroup<criterion::measurement::WallTime>,
function_name: &str,
size: usize,
args: Vec<ColumnarValue>,
) {
group.bench_function(BenchmarkId::new(function_name, size), |b| {
let arg_fields = args
.iter()
.enumerate()
.map(|(idx, arg)| {
Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
})
.collect::<Vec<_>>();
let config_options = Arc::new(ConfigOptions::default());
let return_type = args[0].data_type();
b.iter(|| {
black_box(
right()
.invoke_with_args(ScalarFunctionArgs {
args: args.clone(),
arg_fields: arg_fields.clone(),
number_rows: size,
return_field: Field::new("f", return_type.clone(), true)
.into(),
config_options: Arc::clone(&config_options),
})
.expect("right should work"),
)
})
});
}
fn criterion_benchmark(c: &mut Criterion) {
for is_string_view in [false, true] {
for size in [1024, 4096] {
let mut group = c.benchmark_group(format!("right size={size}"));
// Benchmark with positive n
let function_name = if is_string_view {
"string_view_array positive n"
} else {
"string_array positive n"
};
let args = create_args(size, 32, false, is_string_view);
bench_right(&mut group, function_name, size, args);
// Benchmark with negative n
let function_name = if is_string_view {
"string_view_array negative n"
} else {
"string_array negative n"
};
let args = create_args(size, 32, true, is_string_view);
bench_right(&mut group, function_name, size, args);
group.finish();
}
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:delightful; category:bug; feedback: The Gemini AI reviewer is correct! The return field's type should depend on the is_string_view flag and return Utf8View when this flag is true. Prevents using a wrong type for the returned field when string views are tested
| // println!( | ||
| // "Input string: {}, n: {} -> byte_length: {} -> {}", | ||
| // &string, | ||
| // n, | ||
| // byte_length, | ||
| // &string[byte_length..] | ||
| // ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The Pull Request author has forgotten some debug leftover that should be removed. It is commented out, so it does not cause any harm
🤖 Augment PR SummarySummary: Optimizes the Unicode Changes:
Technical Notes: For 🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| args: args.clone(), | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows: size, | ||
| return_field: Field::new("f", DataType::Utf8, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return_field is always set to DataType::Utf8 here, but right now returns Utf8View when the input is Utf8View; this can make the benchmark either fail or measure an unintended coercion path.
Other Locations
datafusion/functions/benches/right.rs:125
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:delightful; category:bug; feedback: The Augment AI reviewer is correct! The return field's type should depend on the is_string_view flag and return Utf8View when this flag is true. Prevents using a wrong type for the returned field when string views are tested
| match n.cmp(&0) { | ||
| Ordering::Less => string | ||
| .char_indices() | ||
| .nth(n.unsigned_abs() as usize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Casting n.unsigned_abs() to usize can truncate on 32-bit targets (or for extremely large |n|), which could yield incorrect slicing results compared to the intended “consume whole string” behavior.
Other Locations
datafusion/functions/src/unicode/right.rs:246
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:delightful; category:bug; feedback: The Augment AI reviewer is correct! On 32-bit systems the usize type is u32 and if the value of n is bigger than u32::MAX then it will be truncated/overflown and it will lead to wrong index and final outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| // n, | ||
| // byte_length, | ||
| // &string[byte_length..] | ||
| // ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out debug code left in production code
Low Severity
Commented-out println! debugging statements were left in the code. While this doesn't affect functionality, it clutters the codebase and suggests incomplete cleanup before submission. The debug statements in right_impl at lines 157-163 appear to be development artifacts that weren't removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:good-to-have; category:documentation; feedback: The Bugbot AI reviewer is correct! The Pull Request author has forgotten some debug leftover that should be removed. It is commented out, so it does not cause any harm
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The Pull Request author has forgotten some debug leftover that should be removed. It is commented out, so it does not cause any harm |


20069: To review by AI