-
Notifications
You must be signed in to change notification settings - Fork 0
20116: perf: Optimize scalar fast path & write() encoding for sha2 #227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the performance of the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis pull request introduces SparkSha2 hashing benchmarks to the datafusion-spark crate. Changes include: configuring sha2 as a benchmark target in Cargo.toml, adding a new benchmark file at ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code ReviewSummaryThis PR optimizes the sha2 function by adding a scalar fast path and replacing fmt::Write with direct byte encoding for hex conversion. The changes include performance benchmarks and smart handling of the common case where the bit length is scalar. Strengths
Issues and Concerns1. Safety Concern - CRITICAL The unsafe String::from_utf8_unchecked(out) block is technically correct but risky:
2. Missing Test Coverage
Recommend adding unit tests to verify:
3. Code Duplication Performance ConsiderationsPositive:
Suggestion: Consider using #[cold] attribute on the fallback path (line 186) to hint branch prediction. Security Considerations
Recommendations SummaryMust Fix:
Should Fix: Nice to Have: VerdictThis is a solid performance optimization with good engineering. However, it needs safety improvements for the unsafe block and test coverage for the new code paths before merging. Recommendation: Request changes for safety and test coverage. Review generated by Claude Code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces performance optimizations for the sha2 function by adding a fast path for scalar arguments and improving the hex encoding implementation. The changes are well-structured and include corresponding benchmarks. My review includes suggestions to further improve maintainability by reducing code duplication, enhancing the hex encoding by using a standard crate, and expanding the benchmarks to cover all new fast paths.
| for &size in &sizes { | ||
| let values = generate_binary_data(size, null_density); | ||
| let bit_lengths = Int32Array::from(vec![256; size]); | ||
| let array_args = vec![ | ||
| ColumnarValue::Array(Arc::new(values)), | ||
| ColumnarValue::Array(Arc::new(bit_lengths)), | ||
| ]; | ||
| run_benchmark(c, "sha2/array_binary_256", size, &array_args); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new fast path for (Array, Scalar) arguments is not benchmarked here. It would be beneficial to add a benchmark case for it to ensure its performance is as expected and to catch potential regressions in the future.
for &size in &sizes {
let values = generate_binary_data(size, null_density);
let bit_lengths = Int32Array::from(vec![256; size]);
let array_args = vec![
ColumnarValue::Array(Arc::new(values.clone())),
ColumnarValue::Array(Arc::new(bit_lengths)),
];
run_benchmark(c, "sha2/array_binary_256", size, &array_args);
let array_scalar_args = vec![
ColumnarValue::Array(Arc::new(values)),
ColumnarValue::Scalar(ScalarValue::Int32(Some(256))),
];
run_benchmark(c, "sha2/array_scalar_binary_256", size, &array_scalar_args);
}| const HEX_CHARS: [u8; 16] = *b"0123456789abcdef"; | ||
|
|
||
| #[inline] | ||
| fn hex_encode<T: AsRef<[u8]>>(data: T) -> String { | ||
| let mut s = String::with_capacity(data.as_ref().len() * 2); | ||
| for b in data.as_ref() { | ||
| // Writing to a string never errors, so we can unwrap here. | ||
| write!(&mut s, "{b:02x}").unwrap(); | ||
| let bytes = data.as_ref(); | ||
| let mut out = Vec::with_capacity(bytes.len() * 2); | ||
| for &b in bytes { | ||
| let hi = b >> 4; | ||
| let lo = b & 0x0F; | ||
| out.push(HEX_CHARS[hi as usize]); | ||
| out.push(HEX_CHARS[lo as usize]); | ||
| } | ||
| s | ||
| // SAFETY: out contains only ASCII | ||
| unsafe { String::from_utf8_unchecked(out) } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This manual implementation of hex encoding is a good performance improvement over write!. However, for even better performance, safety, and maintainability, consider using the hex crate, which is already available as a dependency through datafusion-functions.
The hex crate is highly optimized (often using SIMD) and would avoid the need for unsafe code here. You would need to add use hex; at the top of the file. This change also removes the need for the HEX_CHARS constant.
#[inline]
fn hex_encode<T: AsRef<[u8]>>(data: T) -> String {
hex::encode(data)
}
🤖 Augment PR SummarySummary: This PR optimizes the Spark-compatible Changes:
Technical Notes: The implementation preserves Spark semantics for supported bit-lengths (224/256/384/512 and 0→256) and returns NULL for unknown bit-lengths; the benchmark covers both scalar and array cases. 🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @@ -87,7 +88,98 @@ impl ScalarUDFImpl for SparkSha2 { | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| let array_args = vec![ | ||
| ColumnarValue::Array(Arc::new(values)), | ||
| ColumnarValue::Array(Arc::new(bit_lengths)), | ||
| ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmark tests fallback path, not the optimized path
Low Severity
The array benchmark creates bit_lengths as a ColumnarValue::Array, but the new optimization added in invoke_with_args specifically handles the case where values is an array and bit_length is a ColumnarValue::Scalar (commented as "common case: sha2(col, 256)"). With the current benchmark setup, the array benchmark falls through to the fallback _ case rather than testing the new optimized sha2_binary_scalar_bitlen path. To properly benchmark the optimization, bit_lengths would be wrapped as ColumnarValue::Scalar(ScalarValue::Int32(Some(256))) instead of as an array.


20116: To review by AI