feat: Add Spark-compatible xxhash64 and murmur3 hash functions
#19627
Conversation
```
SELECT xxhash64('hello');
----
-4367754540140381902
```
```
scala> spark.sql("SELECT xxhash64('hello')").show()
+--------------------+
|     xxhash64(hello)|
+--------------------+
|-4367754540140381902|
+--------------------+
```
```
SELECT xxhash64(1);
----
-7001672635703045582
```
```
scala> spark.sql("SELECT xxhash64(cast(1 as long))").show()
+---------------------------+
|xxhash64(CAST(1 AS BIGINT))|
+---------------------------+
|       -7001672635703045582|
+---------------------------+
```
```
SELECT hash('hello');
----
-1008564952
```
```
scala> spark.sql("SELECT hash('hello')").show()
+-----------+
|hash(hello)|
+-----------+
|-1008564952|
+-----------+
```
@shehabgamin fyi
```rust
    }
}

fn hash_column_murmur3(col: &ArrayRef, hashes: &mut [u32]) -> Result<()> {
```
It looks like support for DataType::Dictionary may be missing. In the Sail codebase, we copied the logic from Comet, where the Dictionary type is handled. However, I’m not sure whether Comet’s implementation has changed since we copied it.
In Sail, the relevant logic can be found here:
- https://github.com/lakehq/sail/blob/540fb8350ab676dfd0c302fafb4176b11fb0ee84/crates/sail-function/src/scalar/hash/spark_murmur3_hash.rs#L68
- https://github.com/lakehq/sail/blob/540fb8350ab676dfd0c302fafb4176b11fb0ee84/crates/sail-function/src/scalar/hash/utils.rs#L12
Based on the attribution comments in those files, the corresponding Comet sources appear to come from the following commit:
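For illustration, here is a minimal sketch (hypothetical helper name, Int32 keys assumed for brevity; not necessarily Comet's or Sail's approach) of one way a dictionary arm could delegate to this PR's existing `hash_column_murmur3` logic. Because each row carries its own running seed, the distinct dictionary values cannot simply be hashed once and reused, so the sketch materializes the column first:
```rust
use arrow::array::{Array, ArrayRef, DictionaryArray};
use arrow::compute::take;
use arrow::datatypes::Int32Type;
use datafusion_common::Result;

/// Sketch: flatten the dictionary into a plain array, then reuse the
/// existing value-type hashing. Trades memory for simplicity.
fn hash_dictionary_murmur3(col: &ArrayRef, hashes: &mut [u32]) -> Result<()> {
    let dict = col
        .as_any()
        .downcast_ref::<DictionaryArray<Int32Type>>()
        .expect("Int32 dictionary keys assumed for this sketch");
    // `take` produces values[keys[i]] for every row and propagates null keys.
    let flattened: ArrayRef = take(dict.values().as_ref(), dict.keys(), None)?;
    // `hash_column_murmur3` is the helper from this PR.
    hash_column_murmur3(&flattened, hashes)
}
```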
Besides the dictionary type, it seems that FixedSizeBinary is not handled either.
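A hypothetical sketch of what that arm could look like, assuming a byte-slice hasher like Comet's `spark_compatible_murmur3_hash(bytes, seed)` (the helper name is Comet's; adjust to whatever this PR ends up using):
```rust
use arrow::array::{Array, ArrayRef, FixedSizeBinaryArray};

/// Sketch: hash each row's fixed-width byte slice with the row's
/// running hash as the seed, as Spark does for binary values.
fn hash_fixed_size_binary_murmur3(col: &ArrayRef, hashes: &mut [u32]) {
    let array = col
        .as_any()
        .downcast_ref::<FixedSizeBinaryArray>()
        .expect("FixedSizeBinary array");
    for (i, hash) in hashes.iter_mut().enumerate() {
        // Spark skips nulls, leaving the row's running hash unchanged.
        if !array.is_null(i) {
            *hash = spark_compatible_murmur3_hash(array.value(i), *hash);
        }
    }
}
```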
Thanks, I'll address this in the next few days. Moving to draft for now.
@andygrove Thanks for pinging and mentioning. The PR is in good shape, and I'm glad it's being contributed back upstream.
```rust
// Determine number of rows from the first array argument
let num_rows = args
    .args
    .iter()
    .find_map(|arg| match arg {
        ColumnarValue::Array(array) => Some(array.len()),
        ColumnarValue::Scalar(_) => None,
    })
    .unwrap_or(1);
```
Suggested change — replace the block above with:
```rust
let num_rows = args.number_rows;
```
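`ScalarFunctionArgs::number_rows` already carries the row count of the batch being evaluated, so the function doesn't need to infer it from the first array argument.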
```rust
// Convert all arguments to arrays
let arrays: Vec<ArrayRef> = args
    .args
    .iter()
    .map(|arg| match arg {
        ColumnarValue::Array(array) => Arc::clone(array),
        ColumnarValue::Scalar(scalar) => scalar
            .to_array_of_size(num_rows)
            .expect("Failed to convert scalar to array"),
    })
    .collect();
```
Suggested change — replace the block above with:
```rust
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
```
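`ColumnarValue::values_to_arrays` also broadcasts any scalar arguments to the common row count and returns a proper error instead of panicking, so it replaces both this block and the manual `num_rows` handling above.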
```rust
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: u64) -> u64 {
    XxHash64::oneshot(seed, data.as_ref())
}
```
Suggested change:
```rust
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: i64) -> i64 {
    XxHash64::oneshot(seed as u64, data.as_ref()) as i64
}
```
I wonder if it's worth doing this to make it easier to compute the resulting i64 array, without needing to convert the u64 hash vec to an i64 vec (unless the compiler optimizes this away anyway 🤔).
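For what it's worth, a tiny std-only sketch of why the conversion should be cheap either way: `u64 as i64` is a bit-for-bit reinterpretation, and `collect` can reuse the allocation for same-sized element types via std's in-place collect specialization.
```rust
fn main() {
    // `u64 as i64` reinterprets the bits; the per-element map compiles
    // down to (at most) a copy of the buffer.
    let unsigned: Vec<u64> = vec![0x8000_0000_0000_0000, 42];
    let signed: Vec<i64> = unsigned.into_iter().map(|h| h as i64).collect();
    assert_eq!(signed, vec![i64::MIN, 42]); // same bits, reinterpreted
}
```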
```rust
for i in (0..data.len()).step_by(4) {
    let ints = data.as_ptr().add(i) as *const i32;
    let mut half_word = ints.read_unaligned();
    if cfg!(target_endian = "big") {
```
I remember a previous PR for this same functionality; I'll copy a previous comment: #17093 (comment)
I don't think big endian should be considered; here's a comment from arrow-rs noting that it doesn't target big-endian platforms: apache/arrow-rs#6917 (comment)
So for simplicity we could just remove this cfg?
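If the cfg goes away, this loop could also be written without `unsafe` at all; a sketch assuming this PR's `mix_k1` and its usual `mix_h1` companion:
```rust
// Sketch: `chunks_exact(4)` only yields complete 4-byte chunks, and
// `i32::from_le_bytes` pins the byte order on any target, so neither
// `unsafe` nor the big-endian cfg is needed.
fn hash_full_words(data: &[u8], mut h1: i32) -> i32 {
    for chunk in data.chunks_exact(4) {
        let half_word = i32::from_le_bytes(chunk.try_into().unwrap());
        h1 = mix_h1(h1, mix_k1(half_word));
    }
    h1
}
```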
```rust
// SAFETY: all operations are guaranteed to be safe
unsafe {
```
I'm a bit curious about these unsafe blocks; a safety comment like "all operations are guaranteed to be safe" isn't exactly reassuring 😅
The original comments were not copied over in full for some reason. I updated this.
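For future readers, a hypothetical illustration (names and bounds are made up, not this PR's code) of the shape a SAFETY comment can take for this kind of unaligned read:
```rust
fn read_words(data: &[u8]) -> Vec<i32> {
    let full_words = data.len() / 4;
    let mut words = Vec::with_capacity(full_words);
    for i in 0..full_words {
        // SAFETY: `i * 4 + 4 <= data.len()` holds by construction of
        // `full_words`, so the read stays in bounds; `read_unaligned`
        // is used because `data` carries no alignment guarantee.
        let w = unsafe { (data.as_ptr().add(i * 4) as *const i32).read_unaligned() };
        words.push(w);
    }
    words
}
```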
```rust
#[inline]
fn mix_k1(mut k1: i32) -> i32 {
    k1 = k1.mul_wrapping(0xcc9e2d51u32 as i32);
    k1 = k1.rotate_left(15);
    k1.mul_wrapping(0x1b873593u32 as i32)
}
```
Do we need to provide a link to where this source code was extracted from? I'm assuming it was ported from some other implementation?
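(For context: `0xcc9e2d51` and `0x1b873593` are the standard MurmurHash3 x86_32 constants from Austin Appleby's public-domain reference implementation, which is also the algorithm Spark implements as `Murmur3_x86_32`. A sketch of the usual companion mixing step, shown only to make the provenance recognizable; written here with std's wrapping ops rather than this PR's helpers:)
```rust
// Standard MurmurHash3 x86_32 h1 mix; 0xe6546b64 comes from the same
// reference implementation as the mix_k1 constants above.
#[inline]
fn mix_h1(mut h1: i32, k1: i32) -> i32 {
    h1 ^= k1;
    h1 = h1.rotate_left(13);
    h1.wrapping_mul(5).wrapping_add(0xe6546b64u32 as i32)
}
```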
Which issue does this PR close?
Rationale for this change
Donate some hash functions from Comet so that other projects can benefit from them.
The functions were initially implemented in Comet by @advancedxy.
What changes are included in this PR?
I used Claude Code to copy the code from Comet and add slt tests. I manually verified that the expected values match Spark for a few cases just to be sure that the code is correct.
Are these changes tested?
Yes, tests are part of the PR.
Are there any user-facing changes?
No