perf: Improve performance of `split_part` #19570

andygrove · 2025-12-30T22:02:03Z

Which issue does this PR close?

Closes #.

Rationale for this change

I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DataFusion was very slow for `split_part`.

This PR fixes some obvious performance issues. Speedups are:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| single_char_delim/pos_first | 1.27ms | 140µs | 9.1x faster |
| single_char_delim/pos_middle | 1.39ms | 396µs | 3.5x faster |
| single_char_delim/pos_last | 1.47ms | 738µs | 2.0x faster |
| single_char_delim/pos_negative | 1.35ms | 148µs | 9.1x faster |
| multi_char_delim/pos_first | 1.22ms | 174µs | 7.0x faster |
| multi_char_delim/pos_middle | 1.22ms | 407µs | 3.0x faster |
| string_view_single_char/pos_first | 1.42ms | 139µs | 10.2x faster |
| many_parts_20/pos_second | 2.48ms | 201µs | 12.3x faster |
| long_strings_50_parts/pos_first | 8.18ms | 178µs | 46x faster |

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

andygrove · 2025-12-30T22:04:13Z

datafusion/functions/src/string/split_part.rs

        .try_for_each(|((string, delimiter), n)| -> Result<(), DataFusionError> {
            match (string, delimiter, n) {
                (Some(string), Some(delimiter), Some(n)) => {
-                    let split_string: Vec<&str> = string.split(delimiter).collect();


This was splitting the whole string and collecting every part into a `Vec`, even though only one part is needed.
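
To illustrate the difference, here is a minimal sketch rather than the PR's actual code. The helper names `part_eager` and `part_lazy` are hypothetical, and a 0-based `usize` index is assumed to keep the example short.

```rust
/// Eager variant (hypothetical helper): splits the whole input and
/// collects every part into a `Vec` before indexing into it.
fn part_eager<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    let parts: Vec<&str> = s.split(delim).collect();
    parts.get(n).copied()
}

/// Lazy variant (hypothetical helper): `Iterator::nth` consumes the
/// split iterator only up to the requested part, so no `Vec` is built
/// and the rest of the input is never scanned.
fn part_lazy<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    s.split(delim).nth(n)
}
```

Both helpers return the same result for any input. The lazy one does less work the earlier the requested part appears, which would be consistent with the larger speedups reported for the `pos_first` benchmarks.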

comphead · 2025-12-30T22:25:31Z

46x faster 👍

comphead

Thanks @andygrove, the early return makes much more sense than eagerly calculating all the parts.

martin-g · 2025-12-31T05:37:23Z

datafusion/functions/src/string/split_part.rs

+                        std::cmp::Ordering::Greater => {
+                            // Positive index: use nth() to avoid collecting all parts
+                            // This stops iteration as soon as we find the nth element
+                            string.split(delimiter).nth((n - 1) as usize)


Are 32-bit systems supported?
`n` is Int64, so this cast may lead to truncation or even a crash in a debug build.
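
One possible way to address this concern is a checked conversion instead of an `as` cast. The sketch below is only an illustration: `positive_index` and `nth_part` are hypothetical names, and returning `None` for an out-of-range position is an assumption, not necessarily the behavior `split_part` should have. For context, an integer `as` cast in Rust does not panic; it truncates silently, which is the case `usize::try_from` makes explicit.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: converts a 1-based positive position into a
/// 0-based `usize` index. Returns `None` if `n` is not positive or if
/// `n - 1` does not fit in `usize` (possible on 32-bit targets).
fn positive_index(n: i64) -> Option<usize> {
    if n <= 0 {
        return None;
    }
    // `n - 1` cannot overflow here because `n >= 1`.
    usize::try_from(n - 1).ok()
}

/// Hypothetical helper: returns the `n`-th part (1-based) without
/// collecting all parts, or `None` if the position is out of range.
fn nth_part<'a>(s: &'a str, delim: &str, n: i64) -> Option<&'a str> {
    let idx = positive_index(n)?;
    s.split(delim).nth(idx)
}
```

How a `None` result is mapped (empty string, null, or error) is left to the caller in this sketch.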

martin-g · 2025-12-31T05:41:21Z

datafusion/functions/src/string/split_part.rs

+                        std::cmp::Ordering::Less => {
+                            // Negative index: use rsplit().nth() to efficiently get from the end
+                            // rsplit iterates in reverse, so -1 means first from rsplit (index 0)
+                            string.rsplit(delimiter).nth((-n - 1) as usize)


Another corner case: `-n` will overflow for `i64::MIN`.
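
A sketch of one way to sidestep the negation, assuming it is acceptable to treat an unreachable position as "no such part". The names `negative_index` and `nth_part_from_end` are hypothetical. `i64::unsigned_abs` returns the magnitude as a `u64`, so it is well defined even for `i64::MIN`.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: converts a negative position (-1 = last part)
/// into a 0-based index for a reversed iterator. Returns `None` if `n`
/// is not negative or if the index does not fit in `usize`.
fn negative_index(n: i64) -> Option<usize> {
    if n >= 0 {
        return None;
    }
    // `unsigned_abs` never overflows, and the result is >= 1 here,
    // so subtracting 1 cannot underflow.
    usize::try_from(n.unsigned_abs() - 1).ok()
}

/// Hypothetical helper: returns the part counted from the end using
/// `rsplit().nth()`, or `None` if there are not enough parts.
fn nth_part_from_end<'a>(s: &'a str, delim: &str, n: i64) -> Option<&'a str> {
    let idx = negative_index(n)?;
    s.rsplit(delim).nth(idx)
}
```

With this shape, `i64::MIN` simply yields `None`, because no string has that many parts.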

andygrove added 2 commits December 30, 2025 15:01

optimize split_part

386176a

optimize split_part

36ea121

github-actions bot added the functions Changes to functions implementation label Dec 30, 2025

andygrove changed the title ~~optimize split_part~~ perf: Improve performance of split_part Dec 30, 2025

andygrove commented Dec 30, 2025

cargo fmt

04fe9b4

andygrove marked this pull request as ready for review December 30, 2025 22:06

andygrove added the performance Make DataFusion faster label Dec 30, 2025

andygrove requested review from Jefffrey, comphead and viirya December 30, 2025 22:09

comphead approved these changes Dec 30, 2025

viirya approved these changes Dec 30, 2025

Jefffrey approved these changes Dec 31, 2025

martin-g reviewed Dec 31, 2025
