
Conversation

@andygrove andygrove commented Jan 3, 2026

Which issue does this PR close?

Rationale for this change

Donate some hash functions from Comet so that other projects can benefit from them.

The functions were initially implemented in Comet by @advancedxy

What changes are included in this PR?

I used Claude Code to copy the code from Comet and add slt tests. I manually verified that the expected values match Spark for a few cases just to be sure that the code is correct.

Are these changes tested?

Yes, tests are part of the PR.

Are there any user-facing changes?

No

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Jan 3, 2026
Comment on lines +39 to +41
SELECT xxhash64('hello');
----
-4367754540140381902
Member Author
@andygrove andygrove Jan 3, 2026

scala> spark.sql("SELECT xxhash64('hello')").show()
+--------------------+
|     xxhash64(hello)|
+--------------------+
|-4367754540140381902|
+--------------------+

Comment on lines +23 to +25
SELECT xxhash64(1);
----
-7001672635703045582
Member Author

scala> spark.sql("SELECT xxhash64(cast(1 as long))").show()
+---------------------------+
|xxhash64(CAST(1 AS BIGINT))|
+---------------------------+
|       -7001672635703045582|
+---------------------------+

Comment on lines +44 to +46
SELECT hash('hello');
----
-1008564952
Member Author

scala> spark.sql("SELECT hash('hello')").show()
+-----------+
|hash(hello)|
+-----------+
|-1008564952|
+-----------+

@andygrove andygrove marked this pull request as ready for review January 3, 2026 21:19
@andygrove andygrove requested a review from comphead January 3, 2026 21:34
@andygrove
Member Author

@shehabgamin fyi

}
}

fn hash_column_murmur3(col: &ArrayRef, hashes: &mut [u32]) -> Result<()> {
Contributor

It looks like support for DataType::Dictionary may be missing. In the Sail codebase, we copied the logic from Comet, where the Dictionary type is handled. However, I’m not sure whether Comet’s implementation has changed since we copied it.

In Sail, the relevant logic can be found here:

Based on the attribution comments in those files, the corresponding Comet sources appear to come from the following commit:

Contributor

Besides the dictionary type, it seems that FixedSizeBinary is not handled either.

Member Author

Thanks, I'll address this in the next few days. Moving to draft for now.

@advancedxy
Contributor

@andygrove Thanks for the ping and the mention. The PR is in good shape, and I'm glad it's being contributed back upstream.

Comment on lines +80 to +88
// Determine number of rows from the first array argument
let num_rows = args
.args
.iter()
.find_map(|arg| match arg {
ColumnarValue::Array(array) => Some(array.len()),
ColumnarValue::Scalar(_) => None,
})
.unwrap_or(1);
Contributor

Suggested change
// Determine number of rows from the first array argument
let num_rows = args
.args
.iter()
.find_map(|arg| match arg {
ColumnarValue::Array(array) => Some(array.len()),
ColumnarValue::Scalar(_) => None,
})
.unwrap_or(1);
let num_rows = args.number_rows;

Comment on lines +93 to +103
// Convert all arguments to arrays
let arrays: Vec<ArrayRef> = args
.args
.iter()
.map(|arg| match arg {
ColumnarValue::Array(array) => Arc::clone(array),
ColumnarValue::Scalar(scalar) => scalar
.to_array_of_size(num_rows)
.expect("Failed to convert scalar to array"),
})
.collect();
Contributor

Suggested change
// Convert all arguments to arrays
let arrays: Vec<ArrayRef> = args
.args
.iter()
.map(|arg| match arg {
ColumnarValue::Array(array) => Arc::clone(array),
ColumnarValue::Scalar(scalar) => scalar
.to_array_of_size(num_rows)
.expect("Failed to convert scalar to array"),
})
.collect();
let arrays = ColumnarValue::values_to_arrays(&args.args)?;

Comment on lines +124 to +127
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: u64) -> u64 {
XxHash64::oneshot(seed, data.as_ref())
}
Contributor

Suggested change
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: u64) -> u64 {
XxHash64::oneshot(seed, data.as_ref())
}
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: i64) -> i64 {
XxHash64::oneshot(seed as u64, data.as_ref()) as i64
}

I wonder if it's worth doing this to make it easier to compute the resulting i64 array, without needing to convert the u64 hash vec to an i64 vec (unless compiler optimizes this away anyway 🤔 )

for i in (0..data.len()).step_by(4) {
let ints = data.as_ptr().add(i) as *const i32;
let mut half_word = ints.read_unaligned();
if cfg!(target_endian = "big") {
Contributor

I remember a previous PR for this same functionality; I'll copy a previous comment: #17093 (comment)

I don't think big endian should be considered; here's a comment from arrow-rs about how that doesn't target big endian: apache/arrow-rs#6917 (comment)

So for simplicity we could just remove this cfg?

Comment on lines 178 to 179
// SAFETY: all operations are guaranteed to be safe
unsafe {
Contributor

I'm a bit curious about these unsafe blocks; a safety comment like all operations are guaranteed to be safe isn't exactly reassuring 😅

Member Author

The original comments were not copied over in full for some reason. I updated this.

Comment on lines +132 to +138
#[inline]
fn mix_k1(mut k1: i32) -> i32 {
k1 = k1.mul_wrapping(0xcc9e2d51u32 as i32);
k1 = k1.rotate_left(15);
k1.mul_wrapping(0x1b873593u32 as i32)
}

Contributor

Do we need to provide a link to where this source code was extracted from? I'm assuming it was ported from some other implementation?

@andygrove andygrove marked this pull request as draft January 4, 2026 18:37
