Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Dec 31, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

For non-ASCII strings, the original implementation used string.find() to get the byte index, then counted characters up to that byte index. This required two passes through the string.

This optimization uses char_indices() to find the substring while simultaneously tracking character positions, completing the search in a single pass.

Benchmark results (UTF-8 strings):

  • str_len_8: 188.98 µs → 140.54 µs (25.4% faster)
  • str_len_32: 615.69 µs → 294.15 µs (52.2% faster)
  • str_len_128: 2.2707 ms → 1.2462 ms (45.1% faster)
  • str_len_4096: 74.328 ms → 36.538 ms (50.9% faster)

ASCII performance unchanged (already optimized with fast path).

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Dec 31, 2025
For non-ASCII strings, the original implementation used string.find()
to get the byte index, then counted characters up to that byte index.
This required two passes through the string.

This optimization uses char_indices() to find the substring while
simultaneously tracking character positions, completing the search
in a single pass.

Benchmark results (UTF-8 strings):
- str_len_8:    188.98 µs → 140.54 µs  (25.4% faster)
- str_len_32:   615.69 µs → 294.15 µs  (52.2% faster)
- str_len_128:  2.2707 ms → 1.2462 ms  (45.1% faster)
- str_len_4096: 74.328 ms → 36.538 ms  (50.9% faster)

ASCII performance unchanged (already optimized with fast path).
@adriangb
Copy link
Contributor

run benchmark character_length

@alamb-ghbot

This comment was marked as outdated.

@alamb
Copy link
Contributor

alamb commented Dec 31, 2025

run benchmark character_length

(I just added this to the whitelist and will run now)

@alamb
Copy link
Contributor

alamb commented Dec 31, 2025

run benchmark character_length

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing strpos_optimize (5ed05f7) to 7c50448 diff
BENCH_NAME=character_length
BENCH_COMMAND=cargo bench --features=parquet --bench character_length
BENCH_FILTER=
BENCH_BRANCH_NAME=strpos_optimize
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Dec 31, 2025

run benchmark strpos

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                  main                                   strpos_optimize
-----                                                  ----                                   ---------------
character_length_StringArray_ascii_str_len_128         1.00     27.4±0.84µs        ? ?/sec    1.05     28.8±1.72µs        ? ?/sec
character_length_StringArray_ascii_str_len_32          1.00      9.7±0.07µs        ? ?/sec    1.02      9.9±0.16µs        ? ?/sec
character_length_StringArray_ascii_str_len_4096        1.00      2.2±0.05ms        ? ?/sec    1.00      2.2±0.06ms        ? ?/sec
character_length_StringArray_ascii_str_len_8           1.00      6.0±0.13µs        ? ?/sec    1.00      6.0±0.07µs        ? ?/sec
character_length_StringArray_utf8_str_len_128          1.00    246.6±2.47µs        ? ?/sec    1.00    247.3±3.28µs        ? ?/sec
character_length_StringArray_utf8_str_len_32           1.00    182.2±2.39µs        ? ?/sec    1.00    181.4±1.28µs        ? ?/sec
character_length_StringArray_utf8_str_len_4096         1.00      5.3±0.07ms        ? ?/sec    1.00      5.3±0.09ms        ? ?/sec
character_length_StringArray_utf8_str_len_8            1.00    152.3±0.78µs        ? ?/sec    1.00    152.3±0.50µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_128     1.00     63.7±1.60µs        ? ?/sec    1.00     63.6±1.28µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_32      1.00     44.0±0.24µs        ? ?/sec    1.00     44.1±0.84µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_4096    1.02      2.1±0.05ms        ? ?/sec    1.00      2.1±0.06ms        ? ?/sec
character_length_StringViewArray_ascii_str_len_8       1.00     56.4±0.20µs        ? ?/sec    1.00     56.6±0.97µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_128      1.00    259.0±7.54µs        ? ?/sec    1.00    257.8±1.03µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_32       1.00    189.5±0.76µs        ? ?/sec    1.01    191.7±4.86µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_4096     1.01      5.4±0.08ms        ? ?/sec    1.00      5.4±0.10ms        ? ?/sec
character_length_StringViewArray_utf8_str_len_8        1.00    160.6±1.91µs        ? ?/sec    1.00    160.3±1.35µs        ? ?/sec

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing strpos_optimize (5ed05f7) to 7c50448 diff
BENCH_NAME=strpos
BENCH_COMMAND=cargo bench --features=parquet --bench strpos
BENCH_FILTER=
BENCH_BRANCH_NAME=strpos_optimize
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                        main                                   strpos_optimize
-----                                        ----                                   ---------------
strpos_StringArray_ascii_str_len_128         1.05   1137.5±6.77µs        ? ?/sec    1.00  1087.5±20.26µs        ? ?/sec
strpos_StringArray_ascii_str_len_32          1.00   327.9±11.29µs        ? ?/sec    1.02    333.7±6.68µs        ? ?/sec
strpos_StringArray_ascii_str_len_4096        1.00     32.4±0.95ms        ? ?/sec    1.22     39.4±0.34ms        ? ?/sec
strpos_StringArray_ascii_str_len_8           1.00    149.4±1.76µs        ? ?/sec    1.08    162.1±3.34µs        ? ?/sec
strpos_StringArray_utf8_str_len_128          1.07      3.3±0.01ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
strpos_StringArray_utf8_str_len_32           1.59  1371.2±11.86µs        ? ?/sec    1.00    863.3±7.31µs        ? ?/sec
strpos_StringArray_utf8_str_len_4096         1.09    102.0±1.46ms        ? ?/sec    1.00     93.6±0.54ms        ? ?/sec
strpos_StringArray_utf8_str_len_8            1.67   497.8±10.91µs        ? ?/sec    1.00    298.2±3.10µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_128     1.00  1259.8±11.10µs        ? ?/sec    1.01  1274.8±17.36µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_32      1.00    388.4±5.48µs        ? ?/sec    1.03    398.7±5.21µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_4096    1.00     38.8±0.46ms        ? ?/sec    1.02     39.5±0.93ms        ? ?/sec
strpos_StringViewArray_ascii_str_len_8       1.00    221.0±1.26µs        ? ?/sec    1.02    224.7±6.92µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_128      1.06      3.3±0.06ms        ? ?/sec    1.00      3.1±0.04ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_32       1.58   1398.0±4.49µs        ? ?/sec    1.00    886.5±8.76µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_4096     1.07    102.0±1.22ms        ? ?/sec    1.00     95.2±0.53ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_8        1.62    513.8±8.19µs        ? ?/sec    1.00    318.1±1.53µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jan 1, 2026

run benchmark strpos

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing strpos_optimize (5ed05f7) to 7c50448 diff
BENCH_NAME=strpos
BENCH_COMMAND=cargo bench --features=parquet --bench strpos
BENCH_FILTER=
BENCH_BRANCH_NAME=strpos_optimize
Results will be posted here when complete

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya -- this makes sense to me

for (byte_idx, _) in string.char_indices() {
char_pos += 1;
if byte_idx + substring_bytes.len() <= string_bytes.len()
&& &string_bytes[byte_idx..byte_idx + substring_bytes.len()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could potentially use string_bytes.get_unchecked here if it makes any difference as you validate the bounds check immediately above.

@alamb alamb added the performance Make DataFusion faster label Jan 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation performance Make DataFusion faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants