@andygrove
Member

@andygrove andygrove commented Dec 30, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DF was very slow for split_part.

This PR fixes some obvious performance issues. Speedups are:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| single_char_delim/pos_first | 1.27ms | 140µs | 9.1x faster |
| single_char_delim/pos_middle | 1.39ms | 396µs | 3.5x faster |
| single_char_delim/pos_last | 1.47ms | 738µs | 2.0x faster |
| single_char_delim/pos_negative | 1.35ms | 148µs | 9.1x faster |
| multi_char_delim/pos_first | 1.22ms | 174µs | 7.0x faster |
| multi_char_delim/pos_middle | 1.22ms | 407µs | 3.0x faster |
| string_view_single_char/pos_first | 1.42ms | 139µs | 10.2x faster |
| many_parts_20/pos_second | 2.48ms | 201µs | 12.3x faster |
| long_strings_50_parts/pos_first | 8.18ms | 178µs | 46x faster |

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Dec 30, 2025
@andygrove andygrove changed the title from "optimize split_part" to "perf: Improve performance of split_part" Dec 30, 2025
.try_for_each(|((string, delimiter), n)| -> Result<(), DataFusionError> {
    match (string, delimiter, n) {
        (Some(string), Some(delimiter), Some(n)) => {
            let split_string: Vec<&str> = string.split(delimiter).collect();
Member Author

This was collecting all parts into a Vec (a heap allocation for every row) even if only one part was needed.
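
To illustrate the comment above, here is a minimal sketch of the two shapes using only the standard library. The helper names `part_eager` and `part_lazy` are hypothetical and do not come from the PR; the real kernel operates on Arrow arrays rather than a single `&str`, and this sketch only covers positive, 1-based positions.

```rust
/// Old shape (hypothetical helper): materialize every part, then index.
/// `n` is 1-based; `n == 0` yields `None` in this sketch.
fn part_eager<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    // Scans the whole string and heap-allocates a Vec on every call.
    let parts: Vec<&str> = s.split(delim).collect();
    parts.get(n.checked_sub(1)?).copied()
}

/// New shape (hypothetical helper): let the iterator stop at the requested part.
fn part_lazy<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    // `nth` consumes only as many parts as needed; nothing is collected.
    s.split(delim).nth(n.checked_sub(1)?)
}
```

Both helpers return the same part for the same input; the lazy one simply stops scanning once the requested part is found, which is consistent with the largest speedups in the table appearing for long strings where an early part is requested.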

@andygrove andygrove marked this pull request as ready for review December 30, 2025 22:06
@andygrove andygrove added the performance Make DataFusion faster label Dec 30, 2025
@comphead
Contributor

46x faster 👍

Contributor

@comphead comphead left a comment

Thanks @andygrove, the early return makes much more sense than eagerly computing all the parts.

std::cmp::Ordering::Greater => {
    // Positive index: use nth() to avoid collecting all parts
    // This stops iteration as soon as we find the nth element
    string.split(delimiter).nth((n - 1) as usize)
Member

Are 32-bit systems supported?
n is Int64, so it is possible that this cast may lead to a truncation or even a crash in a debug build.
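
One way to address this concern is a checked conversion. In Rust, an integer `as` cast does not panic, even in debug builds; it silently truncates when the target type is narrower, so on a 32-bit target a large `n` could select the wrong part. The sketch below uses `usize::try_from` instead. The helper name `nth_part` is hypothetical, and returning `None` for an index that is out of range is an assumption of this sketch, not necessarily what the PR does.

```rust
/// Hypothetical helper: 1-based positive `n` selects a part, without an `as` cast.
fn nth_part<'a>(s: &'a str, delim: &str, n: i64) -> Option<&'a str> {
    // checked_sub guards the subtraction; try_from fails (rather than
    // truncating) when the index is negative or does not fit in usize.
    let idx = usize::try_from(n.checked_sub(1)?).ok()?;
    s.split(delim).nth(idx)
}
```

With this shape, a position that cannot be represented as `usize` is treated the same as a position past the last part, regardless of pointer width.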

std::cmp::Ordering::Less => {
    // Negative index: use rsplit().nth() to efficiently get from the end
    // rsplit iterates in reverse, so -1 means first from rsplit (index 0)
    string.rsplit(delimiter).nth((-n - 1) as usize)
Member

Another corner case: -n overflows for i64::MIN.
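
A possible way to sidestep the negation is `i64::unsigned_abs`, which is defined for every `i64` value, including `i64::MIN`. The sketch below is only an illustration: the helper name `nth_part_from_end` is hypothetical, and returning `None` for non-negative or out-of-range positions is an assumption made here.

```rust
/// Hypothetical helper: negative `n` counts parts from the end (-1 is the last part).
fn nth_part_from_end<'a>(s: &'a str, delim: &str, n: i64) -> Option<&'a str> {
    if n >= 0 {
        return None; // this sketch handles negative positions only
    }
    // unsigned_abs() returns a u64 and cannot overflow, unlike `-n`.
    // The result is at least 1 here, so subtracting 1 cannot underflow.
    let idx = usize::try_from(n.unsigned_abs() - 1).ok()?;
    s.rsplit(delim).nth(idx)
}
```

For `i64::MIN` the computed index is far beyond any realistic number of parts, so the iterator is exhausted and the helper returns `None` instead of overflowing.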
