perf: optimize strpos by eliminating double iteration for UTF-8 #19572

viirya · 2025-12-31T00:29:45Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

For non-ASCII strings, the original implementation used string.find() to get the byte index, then counted characters up to that byte index. This required two passes through the string.

This optimization uses char_indices() to find the substring while simultaneously tracking character positions, completing the search in a single pass.

Benchmark results (UTF-8 strings):

str_len_8: 188.98 µs → 140.54 µs (25.4% faster)
str_len_32: 615.69 µs → 294.15 µs (52.2% faster)
str_len_128: 2.2707 ms → 1.2462 ms (45.1% faster)
str_len_4096: 74.328 ms → 36.538 ms (50.9% faster)

ASCII performance unchanged (already optimized with fast path).

Are these changes tested?

Are there any user-facing changes?

For non-ASCII strings, the original implementation used string.find() to get the byte index, then counted characters up to that byte index. This required two passes through the string. This optimization uses char_indices() to find the substring while simultaneously tracking character positions, completing the search in a single pass. Benchmark results (UTF-8 strings): - str_len_8: 188.98 µs → 140.54 µs (25.4% faster) - str_len_32: 615.69 µs → 294.15 µs (52.2% faster) - str_len_128: 2.2707 ms → 1.2462 ms (45.1% faster) - str_len_4096: 74.328 ms → 36.538 ms (50.9% faster) ASCII performance unchanged (already optimized with fast path).

adriangb · 2025-12-31T03:07:11Z

run benchmark character_length

alamb · 2025-12-31T12:38:23Z

run benchmark character_length

(I just added this to the whitelist and will run now)

alamb · 2025-12-31T12:38:38Z

run benchmark character_length

alamb-ghbot · 2025-12-31T12:38:46Z

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing strpos_optimize (5ed05f7) to 7c50448 diff
BENCH_NAME=character_length
BENCH_COMMAND=cargo bench --features=parquet --bench character_length
BENCH_FILTER=
BENCH_BRANCH_NAME=strpos_optimize
Results will be posted here when complete

alamb · 2025-12-31T12:40:37Z

run benchmark strpos

alamb-ghbot · 2025-12-31T12:50:14Z

🤖: Benchmark completed

Details

group                                                  main                                   strpos_optimize
-----                                                  ----                                   ---------------
character_length_StringArray_ascii_str_len_128         1.00     27.4±0.84µs        ? ?/sec    1.05     28.8±1.72µs        ? ?/sec
character_length_StringArray_ascii_str_len_32          1.00      9.7±0.07µs        ? ?/sec    1.02      9.9±0.16µs        ? ?/sec
character_length_StringArray_ascii_str_len_4096        1.00      2.2±0.05ms        ? ?/sec    1.00      2.2±0.06ms        ? ?/sec
character_length_StringArray_ascii_str_len_8           1.00      6.0±0.13µs        ? ?/sec    1.00      6.0±0.07µs        ? ?/sec
character_length_StringArray_utf8_str_len_128          1.00    246.6±2.47µs        ? ?/sec    1.00    247.3±3.28µs        ? ?/sec
character_length_StringArray_utf8_str_len_32           1.00    182.2±2.39µs        ? ?/sec    1.00    181.4±1.28µs        ? ?/sec
character_length_StringArray_utf8_str_len_4096         1.00      5.3±0.07ms        ? ?/sec    1.00      5.3±0.09ms        ? ?/sec
character_length_StringArray_utf8_str_len_8            1.00    152.3±0.78µs        ? ?/sec    1.00    152.3±0.50µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_128     1.00     63.7±1.60µs        ? ?/sec    1.00     63.6±1.28µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_32      1.00     44.0±0.24µs        ? ?/sec    1.00     44.1±0.84µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_4096    1.02      2.1±0.05ms        ? ?/sec    1.00      2.1±0.06ms        ? ?/sec
character_length_StringViewArray_ascii_str_len_8       1.00     56.4±0.20µs        ? ?/sec    1.00     56.6±0.97µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_128      1.00    259.0±7.54µs        ? ?/sec    1.00    257.8±1.03µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_32       1.00    189.5±0.76µs        ? ?/sec    1.01    191.7±4.86µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_4096     1.01      5.4±0.08ms        ? ?/sec    1.00      5.4±0.10ms        ? ?/sec
character_length_StringViewArray_utf8_str_len_8        1.00    160.6±1.91µs        ? ?/sec    1.00    160.3±1.35µs        ? ?/sec

alamb-ghbot · 2025-12-31T12:50:17Z

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing strpos_optimize (5ed05f7) to 7c50448 diff
BENCH_NAME=strpos
BENCH_COMMAND=cargo bench --features=parquet --bench strpos
BENCH_FILTER=
BENCH_BRANCH_NAME=strpos_optimize
Results will be posted here when complete

alamb-ghbot · 2025-12-31T13:00:14Z

🤖: Benchmark completed

Details

group                                        main                                   strpos_optimize
-----                                        ----                                   ---------------
strpos_StringArray_ascii_str_len_128         1.05   1137.5±6.77µs        ? ?/sec    1.00  1087.5±20.26µs        ? ?/sec
strpos_StringArray_ascii_str_len_32          1.00   327.9±11.29µs        ? ?/sec    1.02    333.7±6.68µs        ? ?/sec
strpos_StringArray_ascii_str_len_4096        1.00     32.4±0.95ms        ? ?/sec    1.22     39.4±0.34ms        ? ?/sec
strpos_StringArray_ascii_str_len_8           1.00    149.4±1.76µs        ? ?/sec    1.08    162.1±3.34µs        ? ?/sec
strpos_StringArray_utf8_str_len_128          1.07      3.3±0.01ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
strpos_StringArray_utf8_str_len_32           1.59  1371.2±11.86µs        ? ?/sec    1.00    863.3±7.31µs        ? ?/sec
strpos_StringArray_utf8_str_len_4096         1.09    102.0±1.46ms        ? ?/sec    1.00     93.6±0.54ms        ? ?/sec
strpos_StringArray_utf8_str_len_8            1.67   497.8±10.91µs        ? ?/sec    1.00    298.2±3.10µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_128     1.00  1259.8±11.10µs        ? ?/sec    1.01  1274.8±17.36µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_32      1.00    388.4±5.48µs        ? ?/sec    1.03    398.7±5.21µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_4096    1.00     38.8±0.46ms        ? ?/sec    1.02     39.5±0.93ms        ? ?/sec
strpos_StringViewArray_ascii_str_len_8       1.00    221.0±1.26µs        ? ?/sec    1.02    224.7±6.92µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_128      1.06      3.3±0.06ms        ? ?/sec    1.00      3.1±0.04ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_32       1.58   1398.0±4.49µs        ? ?/sec    1.00    886.5±8.76µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_4096     1.07    102.0±1.22ms        ? ?/sec    1.00     95.2±0.53ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_8        1.62    513.8±8.19µs        ? ?/sec    1.00    318.1±1.53µs        ? ?/sec

alamb · 2026-01-01T14:06:16Z

run benchmark strpos

alamb-ghbot · 2026-01-01T14:06:22Z

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing strpos_optimize (5ed05f7) to 7c50448 diff
BENCH_NAME=strpos
BENCH_COMMAND=cargo bench --features=parquet --bench strpos
BENCH_FILTER=
BENCH_BRANCH_NAME=strpos_optimize
Results will be posted here when complete

alamb

Thanks @viirya -- this makes sense to me

alamb · 2026-01-01T14:09:16Z

datafusion/functions/src/unicode/strpos.rs

+                    for (byte_idx, _) in string.char_indices() {
+                        char_pos += 1;
+                        if byte_idx + substring_bytes.len() <= string_bytes.len()
+                            && &string_bytes[byte_idx..byte_idx + substring_bytes.len()]


you could potentially use string_bytes.get_unchecked here if it makes any difference as you validate the bounds check immediately above.

github-actions bot added the functions Changes to functions implementation label Dec 31, 2025

viirya force-pushed the strpos_optimize branch from d190f85 to 5ed05f7 Compare December 31, 2025 00:33

viirya requested a review from andygrove December 31, 2025 00:49

This comment was marked as outdated.

Sign in to view

alamb approved these changes Jan 1, 2026

View reviewed changes

alamb added the performance Make DataFusion faster label Jan 1, 2026

perf: optimize strpos by eliminating double iteration for UTF-8 #19572

Are you sure you want to change the base?

perf: optimize strpos by eliminating double iteration for UTF-8 #19572

Conversation

viirya commented Dec 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

adriangb commented Dec 31, 2025

Uh oh!

This comment was marked as outdated.

alamb commented Dec 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Dec 31, 2025

Uh oh!

alamb-ghbot commented Dec 31, 2025

Uh oh!

alamb commented Dec 31, 2025

Uh oh!

alamb-ghbot commented Dec 31, 2025

Uh oh!

alamb-ghbot commented Dec 31, 2025

Uh oh!

alamb-ghbot commented Dec 31, 2025

Uh oh!

alamb commented Jan 1, 2026

Uh oh!

alamb-ghbot commented Jan 1, 2026

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

viirya commented Dec 31, 2025 •

edited

Loading

alamb commented Dec 31, 2025 •

edited

Loading