-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: Improve performance of split_part
#19570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
split_part
| .try_for_each(|((string, delimiter), n)| -> Result<(), DataFusionError> { | ||
| match (string, delimiter, n) { | ||
| (Some(string), Some(delimiter), Some(n)) => { | ||
| let split_string: Vec<&str> = string.split(delimiter).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was allocating strings for all parts even if only some parts were needed
|
46x faster 👍 |
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove the early return makes much more sense than eagerly calculating all the parts
| std::cmp::Ordering::Greater => { | ||
| // Positive index: use nth() to avoid collecting all parts | ||
| // This stops iteration as soon as we find the nth element | ||
| string.split(delimiter).nth((n - 1) as usize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are 32-bit systems supported ?
n is Int64, so it is possible that this cast may lead to a truncation or even a crash in debug build
| std::cmp::Ordering::Less => { | ||
| // Negative index: use rsplit().nth() to efficiently get from the end | ||
| // rsplit iterates in reverse, so -1 means first from rsplit (index 0) | ||
| string.rsplit(delimiter).nth((-n - 1) as usize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another corner case: -n will fail for i64::MIN
Which issue does this PR close?
Rationale for this change
I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DF was very slow for
split_part.This PR fixes some obvious performance issues. Speedups are:
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?