perf: optimize octet_length for string arrays #19581

Brijesh-Thakkar · 2025-12-31T13:39:39Z

Which issue does this PR close?

Addresses [EPIC] Optimize performance for slow expressions datafusion-comet#2986

Rationale for this change

The octet_length scalar function showed significant performance degradation in
Spark workloads when executed via Comet, as reported in the Comet performance EPIC.

The existing implementation relied on the generic Arrow length kernel for array
inputs, which introduces unnecessary overhead in vectorized execution. Since
octet_length semantics require computing the number of bytes in UTF-8 strings,
this can be implemented more efficiently using Arrow’s concrete string array APIs.
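To make the "number of bytes" semantics concrete, here is a minimal standard-library sketch contrasting byte length with character count. The helper names `octet_len` and `char_len` are hypothetical and are not taken from the PR.

```rust
/// Byte length of a UTF-8 string, which is the quantity `octet_length` reports.
fn octet_len(s: &str) -> usize {
    s.len() // `str::len` counts bytes, not characters
}

/// Number of Unicode scalar values, shown only for contrast.
fn char_len(s: &str) -> usize {
    s.chars().count()
}
```

For example, `octet_len("héllo")` is 6 while `char_len("héllo")` is 5, because `é` occupies two bytes in UTF-8.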

Optimizing this function in DataFusion improves performance for downstream projects
such as Comet and Spark without changing behavior or semantics.

What changes are included in this PR?

- Replaced the use of the generic Arrow length kernel for array inputs in
  octet_length
- Added a specialized implementation for:
  - StringArray
  - LargeStringArray
  - StringViewArray
- Computed byte lengths directly using value_length, avoiding unnecessary
  indirection and overhead
- Left the scalar execution path unchanged
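The idea behind reading lengths directly can be sketched without the Arrow crate. The toy type below is a hypothetical stand-in (not the PR's code and not the real `StringArray`) that assumes the usual offsets-plus-values layout, where a row's byte length is the difference of two adjacent offsets.

```rust
/// Hypothetical stand-in for an Arrow-style string array: all values are
/// concatenated in `values`, and `offsets[i]..offsets[i + 1]` delimits row i.
struct ToyStringArray {
    offsets: Vec<i32>,
    values: Vec<u8>,     // kept for completeness; never read when computing lengths
    validity: Vec<bool>, // false marks a null row
}

impl ToyStringArray {
    fn from_options(rows: &[Option<&str>]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        let mut validity = Vec::with_capacity(rows.len());
        for row in rows {
            if let Some(s) = row {
                values.extend_from_slice(s.as_bytes());
            }
            // A null row contributes no bytes, so its two offsets are equal.
            offsets.push(values.len() as i32);
            validity.push(row.is_some());
        }
        Self { offsets, values, validity }
    }

    /// Byte length of row `i`, computed from two offsets without touching `values`.
    fn value_length(&self, i: usize) -> i32 {
        self.offsets[i + 1] - self.offsets[i]
    }

    /// Per-row byte lengths, with nulls propagated as `None`.
    fn octet_lengths(&self) -> Vec<Option<i32>> {
        (0..self.validity.len())
            .map(|i| self.validity[i].then(|| self.value_length(i)))
            .collect()
    }
}
```

Building the toy array from `[Some("a"), None, Some("héllo"), Some("")]` and calling `octet_lengths()` yields `[Some(1), None, Some(6), Some(0)]`. The string bytes are never decoded or copied, which is the kind of shortcut the list above describes.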

Are these changes tested?

Yes.

- Existing unit tests for octet_length were executed and pass successfully
- Core integration tests exercising octet_length also pass
- No new tests were required, as existing coverage already validates correctness
  across scalar and array inputs, including UTF-8 and null handling

Are there any user-facing changes?

No.

This change is purely a performance optimization and does not affect:

- SQL syntax
- Function semantics
- Return types
- Error behavior

Copilot

Pull request overview

This PR optimizes the octet_length function for string arrays by replacing the generic Arrow length kernel with specialized implementations that directly compute byte lengths using Arrow's concrete string array APIs.

Key changes:

- Removed dependency on Arrow's generic length kernel for array inputs
- Added specialized manual loop implementations for StringArray, LargeStringArray, and StringViewArray
- Used value_length() method for StringArray/LargeStringArray and value().len() for StringViewArray
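For context on the StringViewArray case, the sketch below models the Arrow view layout, in which each row is a 16-byte view whose first four bytes hold the byte length. This is an illustration under that layout assumption only: the function names are hypothetical, it handles just the inline case (strings of at most 12 bytes), and it is not the `value().len()` approach the review describes.

```rust
/// Pack a short string (at most 12 bytes) into a 16-byte view, following the
/// Arrow view layout: bytes 0..4 hold the length, bytes 4..16 hold inline data.
fn make_inline_view(s: &str) -> u128 {
    let bytes = s.as_bytes();
    assert!(bytes.len() <= 12, "longer strings live in a separate data buffer");
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    u128::from_le_bytes(raw)
}

/// Read the byte length back out of a view without looking at the string data.
fn view_byte_len(view: u128) -> u32 {
    view as u32 // truncation keeps the low 32 bits, i.e. the length field
}
```

With this layout, `view_byte_len(make_inline_view("héllo"))` returns 6, the same value that `"héllo".len()` gives.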

datafusion/functions/src/string/octet_length.rs

Brijesh-Thakkar · 2025-12-31T14:35:21Z

@andygrove Please review this PR

Brijesh-Thakkar · 2025-12-31T18:15:31Z

@andygrove Hi Andy, I’ve addressed the CI failures (rustfmt + clippy warnings) and pushed the fixes.
This PR optimizes octet_length by avoiding generic kernels for string arrays and using direct length access.
All tests are passing locally. Would appreciate your review when you have time.

perf: optimize octet_length for string arrays

ae3ebcb

Copilot AI review requested due to automatic review settings December 31, 2025 13:39

Merge branch 'main' into perf-octet-length

0201565

github-actions bot added the functions Changes to functions implementation label Dec 31, 2025

Copilot started reviewing on behalf of Brijesh-Thakkar December 31, 2025 13:40

Brijesh-Thakkar mentioned this pull request Dec 31, 2025

[EPIC] Optimize performance for slow expressions apache/datafusion-comet#2986

Open

Copilot AI reviewed Dec 31, 2025

Brijesh-Thakkar added 2 commits December 31, 2025 19:31

refactor: simplify StringViewArray handling in octet_length

63bc5fd

perf: avoid unnecessary ArrayRef clone in octet_length

7c89fa0

Brijesh-Thakkar added 2 commits December 31, 2025 23:21

chore: rustfmt octet_length

a54b5d1

fix: clippy warnings and rustfmt for octet_length

7540b63
