-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: optimize octet_length for string arrays #19581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the octet_length function for string arrays by replacing the generic Arrow length kernel with specialized implementations that directly compute byte lengths using Arrow's concrete string array APIs.
Key changes:
- Removed dependency on Arrow's generic
lengthkernel for array inputs - Added specialized manual loop implementations for StringArray, LargeStringArray, and StringViewArray
- Used
value_length()method for StringArray/LargeStringArray andvalue().len()for StringViewArray
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@andygrove Please review this PR |
|
@andygrove Hi Andy, I’ve addressed the CI failures (rustfmt + clippy warnings) and pushed the fixes. |
Which issue does this PR close?
Rationale for this change
The
octet_lengthscalar function showed significant performance degradation inSpark workloads when executed via Comet, as reported in the Comet performance EPIC.
The existing implementation relied on the generic Arrow
lengthkernel for arrayinputs, which introduces unnecessary overhead in vectorized execution. Since
octet_lengthsemantics require computing the number of bytes in UTF-8 strings,this can be implemented more efficiently using Arrow’s concrete string array APIs.
Optimizing this function in DataFusion improves performance for downstream projects
such as Comet and Spark without changing behavior or semantics.
What changes are included in this PR?
lengthkernel for array inputs inoctet_lengthStringArrayLargeStringArrayStringViewArrayvalue_length, avoiding unnecessaryindirection and overhead
Are these changes tested?
Yes.
octet_lengthwere executed and pass successfullyoctet_lengthalso passacross scalar and array inputs, including UTF-8 and null handling
Are there any user-facing changes?
No.
This change is purely a performance optimization and does not affect: