Remove unnecessary bit counting code from spark bit_count
#18841
Conversation
alamb
left a comment
Looks like a nice improvement to me -- thanks @pepijnve
comphead
left a comment
Thanks @pepijnve. Between Spark/JVM and Rust there are sometimes discrepancies, e.g. in how decimals, regexps, etc. are treated.
Please add tests for booleans: true/false/null.
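The expected boolean behaviour, assuming the input is widened to a 64-bit value first (Spark's semantics as described below), could be sketched like this. `bit_count_bool` is an illustrative helper, not the actual `datafusion_spark` API:

```rust
// Hypothetical helper illustrating bit_count over booleans, assuming
// the value is widened to i64 first. NULL propagates as NULL.
fn bit_count_bool(v: Option<bool>) -> Option<u32> {
    // true widens to 1i64 (one bit set), false to 0i64 (no bits set)
    v.map(|b| (b as i64).count_ones())
}

fn main() {
    assert_eq!(bit_count_bool(Some(true)), Some(1)); // one bit set
    assert_eq!(bit_count_bool(Some(false)), Some(0)); // no bits set
    assert_eq!(bit_count_bool(None), None); // NULL in, NULL out
    println!("ok");
}
```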
Yep, I understand that. What was a bit puzzling initially was that there was no description of what was actually different and why the port of the Java "count ones" implementation was being added. The difference was that the original DataFusion implementation was operating on the native size of the signed integer input values, while Spark always operates on a Java long (i.e. i64). For unsigned and non-negative signed integers that's not an issue since the answer is the same. For negative integers, though, you get a different result since those are padded with 1 bits by sign extension. There's absolutely no need for a custom popcount implementation to fix this. Just widen to i64 and use count_ones.
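The divergence for negative inputs can be seen directly in plain Rust. This is just a sketch of the two behaviours, not the actual `datafusion_spark` code:

```rust
fn main() {
    let v: i32 = -1;

    // Counting on the native width: -1i32 is 32 one-bits.
    assert_eq!(v.count_ones(), 32);

    // Spark semantics: widen to i64 first. Sign extension pads the
    // upper 32 bits with ones, so -1 has 64 one-bits.
    assert_eq!((v as i64).count_ones(), 64);

    // Non-negative values are unaffected: widening pads with zero bits.
    let p: i32 = 5;
    assert_eq!(p.count_ones(), (p as i64).count_ones());
    println!("ok");
}
```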
That code path was not touched in this PR at all. Not sure why I should add tests for code that's not being added or modified.
FYI, this algorithm is a SWAR hamming weight implementation. Per the code comments in java.lang.Long, this comes from Hacker's Delight.
What's interesting is that the Rust compiler generates something very similar when calling count_ones().
With a sufficiently recent target architecture though you get popcnt instead.
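For reference, the SWAR (SIMD-within-a-register) hamming weight trick from Hacker's Delight, which `java.lang.Long.bitCount` also uses, can be sketched in Rust like this. It is illustrative only, since `count_ones` already compiles to equivalent code, or to a single `popcnt` on recent targets:

```rust
// SWAR popcount as in Hacker's Delight / java.lang.Long.bitCount.
// Each step sums neighbouring bit counts into progressively wider fields.
fn swar_popcount(mut x: u64) -> u32 {
    // Sum adjacent bits into 2-bit fields.
    x -= (x >> 1) & 0x5555_5555_5555_5555;
    // Sum 2-bit fields into 4-bit fields.
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // Sum 4-bit fields into 8-bit fields.
    x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Multiply-and-shift sums all eight bytes into the top byte.
    (x.wrapping_mul(0x0101_0101_0101_0101) >> 56) as u32
}

fn main() {
    for &v in &[0u64, 1, 0xff, 0x8000_0000_0000_0001, u64::MAX] {
        assert_eq!(swar_popcount(v), v.count_ones());
    }
    println!("ok");
}
```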
The popcnt instruction is crazy fast -- we tested it in one example where we had a special codepath for no nulls, and I was worried that computing the check `nulls.count_ones() == 0` would overwhelm the improvement.
Nowhere close. 🚀
Hackers Delight is a classic -- I am not at all surprised that the Rust compiler includes all those tricks (and then some!)
Jefffrey
left a comment
Nice refactor 👍
Good practice is to keep the test set as complete as possible before refactoring; that was the reason for asking to add the missing bool tests. I added them in #18871.
I understand what you're saying, but I don't agree with placing that burden on people when making unrelated changes. "You touched a file, so please increase the code coverage for unmodified code paths first" seems a bit contributor-hostile. If I was changing the boolean code path I would fully agree that you write tests first, then refactor, to make sure you're not changing behaviour.
Yeah, totally agree on isolated changes.
Sorry @comphead I missed this
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Follow up on #18841

## What changes are included in this PR?

Adding missing bool tests for bit_count

## Are these changes tested?

## Are there any user-facing changes?
Which issue does this PR close?
Rationale for this change
Spark's `bit_count` function always operates on 64-bit values, while the original `bit_count` implementation in `datafusion_spark` operated on the native size of the input value. To fix this, a custom bit counting implementation was ported over from the Java Spark implementation. This isn't really necessary though: widening signed integers to `i64` and then using `i64::count_ones` yields the exact same result and is less obscure.

What changes are included in this PR?
Remove the custom `bit_count` logic and use `i64::count_ones` instead.

Are these changes tested?
Covered by existing tests that were added for #18225
Are there any user-facing changes?
No