
Conversation

@martin-augment
Owner

20116: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the sha2 function within the DataFusion Spark compatibility module. It achieves this by introducing a specialized fast path for scalar inputs and optimizing the underlying hex encoding process. These changes are complemented by the addition of new benchmarks, which will help in validating and continuously monitoring the performance improvements.

Highlights

  • Optimized Scalar Fast Path for SHA2: The invoke_with_args method for the SparkSha2 function now includes a dedicated, optimized path for scalar inputs, handling various binary and UTF8 scalar types directly. This avoids the overhead of array expansion for single values, improving performance for common scalar operations.
  • Optimized Hex Encoding: The hex_encode function, crucial for formatting SHA2 outputs, has been rewritten for performance. It now uses a manual byte-to-hex conversion with a lookup table and unsafe { String::from_utf8_unchecked(out) }, replacing the slower write! macro approach.
  • New Benchmarks for SHA2: A new Criterion benchmark suite (sha2.rs) has been added to measure the performance of the SparkSha2 function. These benchmarks cover both scalar and array inputs, allowing for validation and monitoring of the performance gains from the introduced optimizations.
  • Refactored SHA2 Binary Processing: The internal logic for processing binary arrays with SHA2 has been refactored into a new generic helper function, sha2_binary_bitlen_iter. This function is now utilized by both array and scalar bit length cases, enhancing code reuse and clarity within the SHA2 implementation.
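The lookup-table hex encoding highlighted above can be sketched in isolation. This is a minimal standalone version for illustration; the actual function in sha2.rs is generic over `AsRef<[u8]>`:

```rust
const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

/// Encode bytes as lowercase hex via a lookup table instead of `write!`.
fn hex_encode(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    // SAFETY: `out` contains only ASCII bytes drawn from HEX_CHARS.
    unsafe { String::from_utf8_unchecked(out) }
}

fn main() {
    assert_eq!(hex_encode(&[0x00, 0xff, 0xab]), "00ffab");
    println!("{}", hex_encode(b"\x01\x2a"));
}
```

Because each output byte is a table lookup and a push into a pre-sized `Vec`, this avoids the per-byte formatting machinery that `write!("{b:02x}")` goes through.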
Changelog
  • datafusion/spark/Cargo.toml
    • Added a new [[bench]] entry for sha2 benchmarks, enabling performance testing for the SHA2 function.
  • datafusion/spark/benches/sha2.rs
    • New file added, implementing Criterion benchmarks for the SparkSha2 function.
    • Includes benchmarks for scalar inputs and array inputs with a fixed bit length (256), using randomly generated binary data.
  • datafusion/spark/src/function/hash/sha2.rs
    • Modified invoke_with_args to include an optimized fast path for scalar inputs and a specific path for array values with scalar bit lengths.
    • Removed the std::fmt::Write import, reflecting the change in hex encoding strategy.
    • Added new_null_array and ScalarValue imports for new functionality.
    • Introduced sha2_binary_scalar_bitlen and sha2_binary_bitlen_iter to refactor and optimize binary array processing.
    • Rewrote the hex_encode function for improved performance, utilizing a manual byte-to-hex conversion and String::from_utf8_unchecked.
Activity
  • The pull request description indicates that the author expects an AI review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Feb 3, 2026

Walkthrough

This pull request introduces SparkSha2 hashing benchmarks to the datafusion-spark crate. Changes include: configuring sha2 as a benchmark target in Cargo.toml, adding a new benchmark file at datafusion/spark/benches/sha2.rs that measures performance of SparkSha2 operations using Criterion, and extending the SHA-2 function implementation in datafusion/spark/src/function/hash/sha2.rs to support hex-encoded outputs with explicit bit-length handling. The function updates include new array-bit-length processing helpers, per-element hex encoding using a lookup table, and extended scalar input handling for various binary and UTF-8 types.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Feb 3, 2026

Code Review

Summary

This PR optimizes the sha2 function by adding a scalar fast path and replacing fmt::Write with direct byte encoding for hex conversion. The changes include performance benchmarks and smart handling of the common case where the bit length is scalar.


Strengths

  1. Performance Optimization: The custom hex_encode function using byte manipulation is a solid optimization over fmt::Write. This should provide measurable performance improvements.

  2. Smart Fast Path: The scalar + scalar case optimization (lines 95-143) avoids unnecessary array expansion, which is excellent for query planning scenarios.

  3. Good Benchmark Addition: The benchmark file provides meaningful performance testing with various sizes and scenarios.

  4. Code Reuse: The refactoring to use sha2_binary_bitlen_iter (line 221) reduces duplication and makes the code more maintainable.


Issues and Concerns

1. Safety Concern - CRITICAL
Location: datafusion/spark/src/function/hash/sha2.rs:273

The unsafe String::from_utf8_unchecked(out) block is technically correct but risky:

  • The safety invariant isn't enforced by the type system
  • If HEX_CHARS is ever modified incorrectly, this becomes undefined behavior
  • Recommend adding a debug assertion or using String::from_utf8(out).expect() instead
  • The performance difference is negligible for this case
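The two safer variants the reviewer suggests look roughly like this (a hedged sketch with hypothetical helper names, not code from the PR):

```rust
const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

/// Variant 1: keep the unsafe conversion but guard the invariant in debug builds.
fn hex_encode_debug_checked(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    debug_assert!(out.is_ascii(), "hex output must be ASCII");
    // SAFETY: asserted above in debug builds; HEX_CHARS is all ASCII.
    unsafe { String::from_utf8_unchecked(out) }
}

/// Variant 2: fully safe; the error branch is unreachable for ASCII output.
fn hex_encode_safe(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    String::from_utf8(out).expect("hex output is always ASCII")
}

fn main() {
    assert_eq!(hex_encode_debug_checked(&[0xde, 0xad]), "dead");
    assert_eq!(hex_encode_safe(&[0xbe, 0xef]), "beef");
}
```

Variant 2's `from_utf8` does a single ASCII validation pass over the buffer, which is typically cheap relative to the hashing itself.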

2. Missing Test Coverage
The PR adds performance benchmarks but no unit tests for the new fast paths:

  • Scalar + scalar path (lines 95-143)
  • Array + scalar bit length path (lines 146-165)
  • Null handling in fast paths (lines 169-177)

Recommend adding unit tests to verify:

  • Correctness of scalar fast path vs. fallback path
  • All supported binary types (Binary, LargeBinary, BinaryView, Utf8, etc.)
  • Edge cases: empty inputs, null values, invalid bit lengths
  • Equivalence between optimized and fallback paths
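As a starting point for the last item, equivalence between the manual encoder and the formatting-based approach it replaced can be checked exhaustively over all byte values (standalone sketch; real coverage would also exercise invoke_with_args):

```rust
use std::fmt::Write;

const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

/// Lookup-table encoder, mirroring the PR's fast path.
fn hex_encode_fast(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    // SAFETY: only ASCII bytes are pushed.
    unsafe { String::from_utf8_unchecked(out) }
}

/// Formatting-based encoder, mirroring the pre-PR implementation.
fn hex_encode_fmt(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        write!(&mut s, "{b:02x}").unwrap();
    }
    s
}

fn main() {
    // Exhaustively compare both encoders over every byte value.
    let all: Vec<u8> = (0u8..=255).collect();
    assert_eq!(hex_encode_fast(&all), hex_encode_fmt(&all));
    assert_eq!(hex_encode_fast(&[]), "");
    println!("encoders agree");
}
```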

3. Code Duplication
Lines 119-141 and 233-252 duplicate the digest computation logic between the scalar fast path and the iterator-based path. Consider extracting a helper function to eliminate duplication.


Performance Considerations

Positive:

  • Hex encoding optimization should provide 2-3x speedup
  • Scalar fast path avoids array allocation overhead
  • Using std::iter::repeat for scalar bit length is zero-cost
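The repeated-bit-length point can be illustrated in isolation: zipping the values with `std::iter::repeat` yields one (value, bit_len) pair per row without materializing a bit-length array. The stand-in types below are hypothetical; the PR operates on Arrow arrays:

```rust
/// Pair each optional value with a constant bit length using a repeated
/// iterator; `repeat` allocates nothing per element.
fn pair_with_bitlen(
    values: Vec<Option<Vec<u8>>>,
    bit_len: i32,
) -> Vec<(Option<Vec<u8>>, i32)> {
    values.into_iter().zip(std::iter::repeat(bit_len)).collect()
}

fn main() {
    let values = vec![Some(b"a".to_vec()), None, Some(b"bc".to_vec())];
    let pairs = pair_with_bitlen(values, 256);
    assert_eq!(pairs.len(), 3);
    assert_eq!(pairs[1], (None, 256)); // nulls pass through with the bit length
    println!("{:?}", pairs[2]);
}
```

This is why a single generic iterator-driven helper can serve both the array-bit-length and scalar-bit-length cases.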

Suggestion: Consider using #[cold] attribute on the fallback path (line 186) to hint branch prediction.


Security Considerations

  • No new security vulnerabilities introduced
  • Hash function behavior unchanged
  • The unsafe block should have additional safety documentation or be replaced with safe code

Recommendations Summary

Must Fix:

  1. Add safety assertion or use safe UTF-8 conversion in hex_encode
  2. Add unit tests for new fast paths

Should Fix:
3. Extract digest computation to eliminate code duplication
4. Consider adding #[cold] to fallback path

Nice to Have:
5. Document performance characteristics in code comments
6. Add benchmark results in PR description


Verdict

This is a solid performance optimization with good engineering. However, it needs safety improvements for the unsafe block and test coverage for the new code paths before merging.

Recommendation: Request changes for safety and test coverage.


Review generated by Claude Code

@gemini-code-assist bot left a comment


Code Review

This pull request introduces performance optimizations for the sha2 function by adding a fast path for scalar arguments and improving the hex encoding implementation. The changes are well-structured and include corresponding benchmarks. My review includes suggestions to further improve maintainability by reducing code duplication, enhancing the hex encoding by using a standard crate, and expanding the benchmarks to cover all new fast paths.

Comment on lines +88 to +96
    for &size in &sizes {
        let values = generate_binary_data(size, null_density);
        let bit_lengths = Int32Array::from(vec![256; size]);
        let array_args = vec![
            ColumnarValue::Array(Arc::new(values)),
            ColumnarValue::Array(Arc::new(bit_lengths)),
        ];
        run_benchmark(c, "sha2/array_binary_256", size, &array_args);
    }


medium

The new fast path for (Array, Scalar) arguments is not benchmarked here. It would be beneficial to add a benchmark case for it to ensure its performance is as expected and to catch potential regressions in the future.

    for &size in &sizes {
        let values = generate_binary_data(size, null_density);
        let bit_lengths = Int32Array::from(vec![256; size]);
        let array_args = vec![
            ColumnarValue::Array(Arc::new(values.clone())),
            ColumnarValue::Array(Arc::new(bit_lengths)),
        ];
        run_benchmark(c, "sha2/array_binary_256", size, &array_args);

        let array_scalar_args = vec![
            ColumnarValue::Array(Arc::new(values)),
            ColumnarValue::Scalar(ScalarValue::Int32(Some(256))),
        ];
        run_benchmark(c, "sha2/array_scalar_binary_256", size, &array_scalar_args);
    }

Comment on lines +260 to 274
const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

#[inline]
fn hex_encode<T: AsRef<[u8]>>(data: T) -> String {
-    let mut s = String::with_capacity(data.as_ref().len() * 2);
-    for b in data.as_ref() {
-        // Writing to a string never errors, so we can unwrap here.
-        write!(&mut s, "{b:02x}").unwrap();
-    }
-    s
+    let bytes = data.as_ref();
+    let mut out = Vec::with_capacity(bytes.len() * 2);
+    for &b in bytes {
+        let hi = b >> 4;
+        let lo = b & 0x0F;
+        out.push(HEX_CHARS[hi as usize]);
+        out.push(HEX_CHARS[lo as usize]);
+    }
+    // SAFETY: out contains only ASCII
+    unsafe { String::from_utf8_unchecked(out) }
}


medium

This manual implementation of hex encoding is a good performance improvement over write!. However, for even better performance, safety, and maintainability, consider using the hex crate, which is already available as a dependency through datafusion-functions.

The hex crate is highly optimized (often using SIMD) and would avoid the need for unsafe code here. You would need to add use hex; at the top of the file. This change also removes the need for the HEX_CHARS constant.

#[inline]
fn hex_encode<T: AsRef<[u8]>>(data: T) -> String {
    hex::encode(data)
}

@augmentcode

augmentcode bot commented Feb 3, 2026

🤖 Augment PR Summary

Summary: This PR optimizes the Spark-compatible sha2 scalar UDF implementation to reduce overhead for common invocation patterns.

Changes:

  • Adds a specialized fast path for sha2 when both arguments are scalars (avoids array expansion).
  • Adds a fast path for array values with a constant bit-length (e.g. sha2(col, 256)), using a repeated bit-length iterator.
  • Reworks hex encoding to a manual byte-to-ASCII conversion to reduce formatting overhead.
  • Adds a Criterion benchmark (datafusion/spark/benches/sha2.rs) and wires it into Cargo.toml.

Technical Notes: The implementation preserves Spark semantics for supported bit-lengths (224/256/384/512 and 0→256) and returns NULL for unknown bit-lengths; the benchmark covers both scalar and array cases.
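The bit-length semantics described here (0 treated as 256, unknown widths yielding NULL) can be sketched as a small dispatch helper. This is illustrative only; the real implementation hashes with concrete SHA-2 digest types:

```rust
/// Map a Spark `sha2` bit-length argument to the effective digest width,
/// or `None` (rendered as SQL NULL) for unsupported widths.
fn effective_bit_len(bit_len: i32) -> Option<u32> {
    match bit_len {
        0 | 256 => Some(256), // Spark treats 0 as SHA-256
        224 => Some(224),
        384 => Some(384),
        512 => Some(512),
        _ => None, // unknown bit length -> NULL result
    }
}

fn main() {
    assert_eq!(effective_bit_len(0), Some(256));
    assert_eq!(effective_bit_len(512), Some(512));
    assert_eq!(effective_bit_len(128), None);
}
```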


@augmentcode bot left a comment


Review completed. 1 suggestion posted.


Comment augment review to trigger a new review at any time.

@@ -87,7 +88,98 @@ impl ScalarUDFImpl for SparkSha2 {
}



args.args[0] / [1] will panic if invoke_with_args is ever called with the wrong arity; using take_function_args (as sha2_impl does) would keep this as a recoverable execution error instead of crashing.

Severity: medium
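The pattern the reviewer describes, validating arity up front and returning an execution error instead of indexing and panicking, looks roughly like this (a generic sketch; DataFusion's take_function_args performs this destructuring for real argument types):

```rust
/// Destructure exactly two arguments, or return a recoverable error
/// instead of panicking on out-of-bounds indexing.
fn take_two_args<T>(name: &str, args: Vec<T>) -> Result<(T, T), String> {
    let mut it = args.into_iter();
    match (it.next(), it.next(), it.next()) {
        (Some(a), Some(b), None) => Ok((a, b)),
        _ => Err(format!("{name} expects exactly 2 arguments")),
    }
}

fn main() {
    assert_eq!(take_two_args("sha2", vec![1, 2]), Ok((1, 2)));
    assert!(take_two_args("sha2", vec![1]).is_err());
    assert!(take_two_args("sha2", vec![1, 2, 3]).is_err());
}
```

The caller can then propagate the `Err` as a query execution error rather than crashing the process.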



@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


    let array_args = vec![
        ColumnarValue::Array(Arc::new(values)),
        ColumnarValue::Array(Arc::new(bit_lengths)),
    ];

Benchmark tests fallback path, not the optimized path

Low Severity

The array benchmark creates bit_lengths as a ColumnarValue::Array, but the new optimization added in invoke_with_args specifically handles the case where values is an array and bit_length is a ColumnarValue::Scalar (commented as "common case: sha2(col, 256)"). With the current benchmark setup, the array benchmark falls through to the fallback _ case rather than testing the new optimized sha2_binary_scalar_bitlen path. To properly benchmark the optimization, bit_lengths should be wrapped as ColumnarValue::Scalar(ScalarValue::Int32(Some(256))) instead of as an array.

