
Conversation

@martin-augment
Owner

20116: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the sha2 function within the DataFusion Spark compatibility module. It achieves this by introducing a specialized fast path for scalar inputs and optimizing the underlying hex encoding process. These changes are complemented by the addition of new benchmarks, which will help in validating and continuously monitoring the performance improvements.

Highlights

  • Optimized Scalar Fast Path for SHA2: The invoke_with_args method for the SparkSha2 function now includes a dedicated, optimized path for scalar inputs, handling various binary and UTF8 scalar types directly. This avoids the overhead of array expansion for single values, improving performance for common scalar operations.
  • Optimized Hex Encoding: The hex_encode function, crucial for formatting SHA2 outputs, has been rewritten for performance. It now uses a manual byte-to-hex conversion with a lookup table and unsafe { String::from_utf8_unchecked(out) }, replacing the slower write! macro approach.
  • New Benchmarks for SHA2: A new Criterion benchmark suite (sha2.rs) has been added to measure the performance of the SparkSha2 function. These benchmarks cover both scalar and array inputs, allowing for validation and monitoring of the performance gains from the introduced optimizations.
  • Refactored SHA2 Binary Processing: The internal logic for processing binary arrays with SHA2 has been refactored into a new generic helper function, sha2_binary_bitlen_iter. This function is now utilized by both array and scalar bit length cases, enhancing code reuse and clarity within the SHA2 implementation.
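The lookup-table hex encoding highlighted above can be sketched in isolation. This is a minimal standalone version for illustration; the actual function in sha2.rs is generic over `AsRef<[u8]>`:

```rust
const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

/// Encode bytes as lowercase hex via a lookup table instead of `write!`.
fn hex_encode(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    // SAFETY: `out` contains only ASCII bytes drawn from HEX_CHARS.
    unsafe { String::from_utf8_unchecked(out) }
}

fn main() {
    assert_eq!(hex_encode(&[0x00, 0xff, 0xab]), "00ffab");
    println!("{}", hex_encode(b"\x01\x2a"));
}
```

Because each output byte is a table lookup and a push into a pre-sized `Vec`, this avoids the per-byte formatting machinery that `write!("{b:02x}")` goes through.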
Changelog
  • datafusion/spark/Cargo.toml
    • Added a new [[bench]] entry for sha2 benchmarks, enabling performance testing for the SHA2 function.
  • datafusion/spark/benches/sha2.rs
    • New file added, implementing Criterion benchmarks for the SparkSha2 function.
    • Includes benchmarks for scalar inputs and array inputs with a fixed bit length (256), using randomly generated binary data.
  • datafusion/spark/src/function/hash/sha2.rs
    • Modified invoke_with_args to include an optimized fast path for scalar inputs and a specific path for array values with scalar bit lengths.
    • Removed the std::fmt::Write import, reflecting the change in hex encoding strategy.
    • Added new_null_array and ScalarValue imports for new functionality.
    • Introduced sha2_binary_scalar_bitlen and sha2_binary_bitlen_iter to refactor and optimize binary array processing.
    • Rewrote the hex_encode function for improved performance, utilizing a manual byte-to-hex conversion and String::from_utf8_unchecked.
Activity
  • The pull request description indicates that the author expects an AI review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Feb 3, 2026

Walkthrough

This pull request introduces SparkSha2 hashing benchmarks to the datafusion-spark crate. Changes include: configuring sha2 as a benchmark target in Cargo.toml, adding a new benchmark file at datafusion/spark/benches/sha2.rs that measures performance of SparkSha2 operations using Criterion, and extending the SHA-2 function implementation in datafusion/spark/src/function/hash/sha2.rs to support hex-encoded outputs with explicit bit-length handling. The function updates include new array-bit-length processing helpers, per-element hex encoding using a lookup table, and extended scalar input handling for various binary and UTF-8 types.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Feb 3, 2026

Code Review

Summary

This PR optimizes the sha2 function by adding a scalar fast path and replacing fmt::Write with direct byte encoding for hex conversion. The changes include performance benchmarks and smart handling of the common case where the bit length is scalar.


Strengths

  1. Performance Optimization: The custom hex_encode function using byte manipulation is a solid optimization over fmt::Write. This should provide measurable performance improvements.

  2. Smart Fast Path: The scalar + scalar case optimization (lines 95-143) avoids unnecessary array expansion, which is excellent for query planning scenarios.

  3. Good Benchmark Addition: The benchmark file provides meaningful performance testing with various sizes and scenarios.

  4. Code Reuse: The refactoring to use sha2_binary_bitlen_iter (line 221) reduces duplication and makes the code more maintainable.


Issues and Concerns

1. Safety Concern - CRITICAL
Location: datafusion/spark/src/function/hash/sha2.rs:273

The unsafe String::from_utf8_unchecked(out) block is technically correct but risky:

  • The safety invariant isn't enforced by the type system
  • If HEX_CHARS is ever modified incorrectly, this becomes undefined behavior
  • Recommend adding a debug assertion or using String::from_utf8(out).expect() instead
  • The performance difference is negligible for this case
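The two safer variants the reviewer suggests look roughly like this (a hedged sketch with hypothetical helper names, not code from the PR):

```rust
const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

/// Variant 1: keep the unsafe conversion but guard the invariant in debug builds.
fn hex_encode_debug_checked(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    debug_assert!(out.is_ascii(), "hex output must be ASCII");
    // SAFETY: asserted above in debug builds; HEX_CHARS is all ASCII.
    unsafe { String::from_utf8_unchecked(out) }
}

/// Variant 2: fully safe; the error branch is unreachable for ASCII output.
fn hex_encode_safe(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    String::from_utf8(out).expect("hex output is always ASCII")
}

fn main() {
    assert_eq!(hex_encode_debug_checked(&[0xde, 0xad]), "dead");
    assert_eq!(hex_encode_safe(&[0xbe, 0xef]), "beef");
}
```

Variant 2's `from_utf8` does a single ASCII validation pass over the buffer, which is typically cheap relative to the hashing itself.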

2. Missing Test Coverage
The PR adds performance benchmarks but no unit tests for the new fast paths:

  • Scalar + scalar path (lines 95-143)
  • Array + scalar bit length path (lines 146-165)
  • Null handling in fast paths (lines 169-177)

Recommend adding unit tests to verify:

  • Correctness of scalar fast path vs. fallback path
  • All supported binary types (Binary, LargeBinary, BinaryView, Utf8, etc.)
  • Edge cases: empty inputs, null values, invalid bit lengths
  • Equivalence between optimized and fallback paths
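As a starting point for the last item, equivalence between the manual encoder and the formatting-based approach it replaced can be checked exhaustively over all byte values (standalone sketch; real coverage would also exercise invoke_with_args):

```rust
use std::fmt::Write;

const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

/// Lookup-table encoder, mirroring the PR's fast path.
fn hex_encode_fast(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS[(b >> 4) as usize]);
        out.push(HEX_CHARS[(b & 0x0f) as usize]);
    }
    // SAFETY: only ASCII bytes are pushed.
    unsafe { String::from_utf8_unchecked(out) }
}

/// Formatting-based encoder, mirroring the pre-PR implementation.
fn hex_encode_fmt(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        write!(&mut s, "{b:02x}").unwrap();
    }
    s
}

fn main() {
    // Exhaustively compare both encoders over every byte value.
    let all: Vec<u8> = (0u8..=255).collect();
    assert_eq!(hex_encode_fast(&all), hex_encode_fmt(&all));
    assert_eq!(hex_encode_fast(&[]), "");
    println!("encoders agree");
}
```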

3. Code Duplication
Lines 119-141 and 233-252 duplicate the digest computation logic between the scalar fast path and the iterator-based path. Consider extracting a helper function to eliminate duplication.


Performance Considerations

Positive:

  • Hex encoding optimization should provide 2-3x speedup
  • Scalar fast path avoids array allocation overhead
  • Using std::iter::repeat for scalar bit length is zero-cost
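The repeated-bit-length point can be illustrated in isolation: zipping the values with `std::iter::repeat` yields one (value, bit_len) pair per row without materializing a bit-length array. The stand-in types below are hypothetical; the PR operates on Arrow arrays:

```rust
/// Pair each optional value with a constant bit length using a repeated
/// iterator; `repeat` allocates nothing per element.
fn pair_with_bitlen(
    values: Vec<Option<Vec<u8>>>,
    bit_len: i32,
) -> Vec<(Option<Vec<u8>>, i32)> {
    values.into_iter().zip(std::iter::repeat(bit_len)).collect()
}

fn main() {
    let values = vec![Some(b"a".to_vec()), None, Some(b"bc".to_vec())];
    let pairs = pair_with_bitlen(values, 256);
    assert_eq!(pairs.len(), 3);
    assert_eq!(pairs[1], (None, 256)); // nulls pass through with the bit length
    println!("{:?}", pairs[2]);
}
```

This is why a single generic iterator-driven helper can serve both the array-bit-length and scalar-bit-length cases.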

Suggestion: Consider using #[cold] attribute on the fallback path (line 186) to hint branch prediction.


Security Considerations

  • No new security vulnerabilities introduced
  • Hash function behavior unchanged
  • The unsafe block should have additional safety documentation or be replaced with safe code

Recommendations Summary

Must Fix:

  1. Add safety assertion or use safe UTF-8 conversion in hex_encode
  2. Add unit tests for new fast paths

Should Fix:
3. Extract digest computation to eliminate code duplication
4. Consider adding #[cold] to fallback path

Nice to Have:
5. Document performance characteristics in code comments
6. Add benchmark results in PR description


Verdict

This is a solid performance optimization with good engineering. However, it needs safety improvements for the unsafe block and test coverage for the new code paths before merging.

Recommendation: Request changes for safety and test coverage.


Review generated by Claude Code

@gemini-code-assist bot left a comment


Code Review

This pull request introduces performance optimizations for the sha2 function by adding a fast path for scalar arguments and improving the hex encoding implementation. The changes are well-structured and include corresponding benchmarks. My review includes suggestions to further improve maintainability by reducing code duplication, enhancing the hex encoding by using a standard crate, and expanding the benchmarks to cover all new fast paths.

Comment on lines +88 to +96
    for &size in &sizes {
        let values = generate_binary_data(size, null_density);
        let bit_lengths = Int32Array::from(vec![256; size]);
        let array_args = vec![
            ColumnarValue::Array(Arc::new(values)),
            ColumnarValue::Array(Arc::new(bit_lengths)),
        ];
        run_benchmark(c, "sha2/array_binary_256", size, &array_args);
    }


medium

The new fast path for (Array, Scalar) arguments is not benchmarked here. It would be beneficial to add a benchmark case for it to ensure its performance is as expected and to catch potential regressions in the future.

    for &size in &sizes {
        let values = generate_binary_data(size, null_density);
        let bit_lengths = Int32Array::from(vec![256; size]);
        let array_args = vec![
            ColumnarValue::Array(Arc::new(values.clone())),
            ColumnarValue::Array(Arc::new(bit_lengths)),
        ];
        run_benchmark(c, "sha2/array_binary_256", size, &array_args);

        let array_scalar_args = vec![
            ColumnarValue::Array(Arc::new(values)),
            ColumnarValue::Scalar(ScalarValue::Int32(Some(256))),
        ];
        run_benchmark(c, "sha2/array_scalar_binary_256", size, &array_scalar_args);
    }

Comment on lines +260 to 274
const HEX_CHARS: [u8; 16] = *b"0123456789abcdef";

#[inline]
fn hex_encode<T: AsRef<[u8]>>(data: T) -> String {
-    let mut s = String::with_capacity(data.as_ref().len() * 2);
-    for b in data.as_ref() {
-        // Writing to a string never errors, so we can unwrap here.
-        write!(&mut s, "{b:02x}").unwrap();
-    }
-    s
+    let bytes = data.as_ref();
+    let mut out = Vec::with_capacity(bytes.len() * 2);
+    for &b in bytes {
+        let hi = b >> 4;
+        let lo = b & 0x0F;
+        out.push(HEX_CHARS[hi as usize]);
+        out.push(HEX_CHARS[lo as usize]);
+    }
+    // SAFETY: out contains only ASCII
+    unsafe { String::from_utf8_unchecked(out) }
}


medium

This manual implementation of hex encoding is a good performance improvement over write!. However, for even better performance, safety, and maintainability, consider using the hex crate, which is already available as a dependency through datafusion-functions.

The hex crate is highly optimized (often using SIMD) and would avoid the need for unsafe code here. You would need to add use hex; at the top of the file. This change also removes the need for the HEX_CHARS constant.

#[inline]
fn hex_encode<T: AsRef<[u8]>>(data: T) -> String {
    hex::encode(data)
}

@augmentcode

augmentcode bot commented Feb 3, 2026

🤖 Augment PR Summary

Summary: This PR optimizes the Spark-compatible sha2 scalar UDF implementation to reduce overhead for common invocation patterns.

Changes:

  • Adds a specialized fast path for sha2 when both arguments are scalars (avoids array expansion).
  • Adds a fast path for array values with a constant bit-length (e.g. sha2(col, 256)), using a repeated bit-length iterator.
  • Reworks hex encoding to a manual byte-to-ASCII conversion to reduce formatting overhead.
  • Adds a Criterion benchmark (datafusion/spark/benches/sha2.rs) and wires it into Cargo.toml.

Technical Notes: The implementation preserves Spark semantics for supported bit-lengths (224/256/384/512 and 0→256) and returns NULL for unknown bit-lengths; the benchmark covers both scalar and array cases.
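The bit-length semantics described here (0 treated as 256, unknown widths yielding NULL) can be sketched as a small dispatch helper. This is illustrative only; the real implementation hashes with concrete SHA-2 digest types:

```rust
/// Map a Spark `sha2` bit-length argument to the effective digest width,
/// or `None` (rendered as SQL NULL) for unsupported widths.
fn effective_bit_len(bit_len: i32) -> Option<u32> {
    match bit_len {
        0 | 256 => Some(256), // Spark treats 0 as SHA-256
        224 => Some(224),
        384 => Some(384),
        512 => Some(512),
        _ => None, // unknown bit length -> NULL result
    }
}

fn main() {
    assert_eq!(effective_bit_len(0), Some(256));
    assert_eq!(effective_bit_len(512), Some(512));
    assert_eq!(effective_bit_len(128), None);
}
```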


@augmentcode bot left a comment


Review completed. 1 suggestion posted.


Comment augment review to trigger a new review at any time.

@@ -87,7 +88,98 @@ impl ScalarUDFImpl for SparkSha2 {
}



args.args[0] / [1] will panic if invoke_with_args is ever called with the wrong arity; using take_function_args (as sha2_impl does) would keep this as a recoverable execution error instead of crashing.

Severity: medium
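The pattern the reviewer describes, validating arity up front and returning an execution error instead of indexing and panicking, looks roughly like this (a generic sketch; DataFusion's take_function_args performs this destructuring for real argument types):

```rust
/// Destructure exactly two arguments, or return a recoverable error
/// instead of panicking on out-of-bounds indexing.
fn take_two_args<T>(name: &str, args: Vec<T>) -> Result<(T, T), String> {
    let mut it = args.into_iter();
    match (it.next(), it.next(), it.next()) {
        (Some(a), Some(b), None) => Ok((a, b)),
        _ => Err(format!("{name} expects exactly 2 arguments")),
    }
}

fn main() {
    assert_eq!(take_two_args("sha2", vec![1, 2]), Ok((1, 2)));
    assert!(take_two_args("sha2", vec![1]).is_err());
    assert!(take_two_args("sha2", vec![1, 2, 3]).is_err());
}
```

The caller can then propagate the `Err` as a query execution error rather than crashing the process.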



@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


    let array_args = vec![
        ColumnarValue::Array(Arc::new(values)),
        ColumnarValue::Array(Arc::new(bit_lengths)),
    ];

Benchmark tests fallback path, not the optimized path

Low Severity

The array benchmark creates bit_lengths as a ColumnarValue::Array, but the new optimization added in invoke_with_args specifically handles the case where values is an array and bit_length is a ColumnarValue::Scalar (commented as "common case: sha2(col, 256)"). With the current benchmark setup, the array benchmark falls through to the fallback _ case rather than testing the new optimized sha2_binary_scalar_bitlen path. To properly benchmark the optimization, bit_lengths should be wrapped as ColumnarValue::Scalar(ScalarValue::Int32(Some(256))) instead of as an array.

