Skip to content

Conversation

@Brijesh-Thakkar
Copy link

Which issue does this PR close?

Rationale for this change

The octet_length scalar function showed significant performance degradation in
Spark workloads when executed via Comet, as reported in the Comet performance EPIC.

The existing implementation relied on the generic Arrow length kernel for array
inputs, which introduces unnecessary overhead in vectorized execution. Since
octet_length semantics require computing the number of bytes in UTF-8 strings,
this can be implemented more efficiently using Arrow’s concrete string array APIs.

Optimizing this function in DataFusion improves performance for downstream projects
such as Comet and Spark without changing behavior or semantics.

What changes are included in this PR?

  • Replaced the use of the generic Arrow length kernel for array inputs in
    octet_length
  • Added a specialized implementation for:
    • StringArray
    • LargeStringArray
    • StringViewArray
  • Computed byte lengths directly using value_length, avoiding unnecessary
    indirection and overhead
  • Left the scalar execution path unchanged

Are these changes tested?

Yes.

  • Existing unit tests for octet_length were executed and pass successfully
  • Core integration tests exercising octet_length also pass
  • No new tests were required, as existing coverage already validates correctness
    across scalar and array inputs, including UTF-8 and null handling

Are there any user-facing changes?

No.

This change is purely a performance optimization and does not affect:

  • SQL syntax
  • Function semantics
  • Return types
  • Error behavior

Copilot AI review requested due to automatic review settings December 31, 2025 13:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the octet_length function for string arrays by replacing the generic Arrow length kernel with specialized implementations that directly compute byte lengths using Arrow's concrete string array APIs.

Key changes:

  • Removed dependency on Arrow's generic length kernel for array inputs
  • Added specialized manual loop implementations for StringArray, LargeStringArray, and StringViewArray
  • Used value_length() method for StringArray/LargeStringArray and value().len() for StringViewArray

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Brijesh-Thakkar
Copy link
Author

@andygrove Please review this PR

@Brijesh-Thakkar
Copy link
Author

@andygrove Hi Andy, I’ve addressed the CI failures (rustfmt + clippy warnings) and pushed the fixes.
This PR optimizes octet_length by avoiding generic kernels for string arrays and using direct length access.
All tests are passing locally. Would appreciate your review when you have time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant