Skip to content

Conversation

@pepijnve
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Spark's bit_count function always operators on 64-bit values, while the original bit_count implementation in datafusion_spark operated on the native size of the input value.
In order to fix this a custom bit counting implementation was ported over from the Java Spark implementation. This isn't really necessary though. Widening signed integers to i64 and then using i64::count_ones will get you the exact same result and is less obscure.

What changes are included in this PR?

Remove custom bitcount logic and use i64::count_ones instead.

Are these changes tested?

Covered by existing tests that were added for #18225

Are there any user-facing changes?

No

@github-actions github-actions bot added the spark label Nov 20, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice improvement to me -- thanks @pepijnve

@alamb alamb changed the title Remove unnecessary bit counting code Remove unnecessary bit counting code from spark bit_count Nov 20, 2025
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pepijnve in Spark/JVM and Rust sometimes there are discrepancies, like treating decimals, regexp, etc.

please add tests for booleans T/F/null

@pepijnve
Copy link
Contributor Author

Thanks @pepijnve in Spark/JVM and Rust sometimes there are discrepancies, like treating decimals, regexp, etc.

Yep, I understand that. What was a bit puzzling initially was that there was no escription of what was actually different and why the port of the Java “count ones” implementation was being added.

The difference was that the original DataFusion implementation was operating on the native size of the signed integer input values, while Spark always operates on Java long (i.e. i64). For unsigned and non negative signed integers that not an issue since the answer is the same. For negative integers though you get a different result since those are padded with 1s when widened.

There’s absolutely no need for a custom popcount implementation to fix this. Just widen to i64 and use count_ones.

please add tests for booleans T/F/null

That code path was not touched in this PR at all. Not sure why I should add tests for code that’s not being added or modified.

}
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this algorithm is a SWAR hamming weight implementation. Per the code comments in java.lang.Long, this comes from Hacker's Delight.

What's interesting is that the Rust compiler generates something very similar when calling count_ones().

With a sufficiently recent target architecture though you get popcnt instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The popcnt instruction is crazy fast -- we tested it in one example where we had a special codepath for no nulls, and I was worried that calculating the test if nulls.count_ones() == 0 would overwhelm the improvement

Nowhere close. 🚀

Hackers Delight is a classic -- I am not at all surprised that the Rust compiler includes all those tricks (and then some!)

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor 👍

@alamb alamb added this pull request to the merge queue Nov 21, 2025
Merged via the queue into apache:main with commit d65fb86 Nov 21, 2025
29 checks passed
@comphead
Copy link
Contributor

That code path was not touched in this PR at all. Not sure why I should add tests for code that’s not being added or modified.

The good practice is to keep test set as complete as possible before refactoring, that was the reason to ask add missing bool tests.

I added them in #18871

@pepijnve
Copy link
Contributor Author

pepijnve commented Nov 21, 2025

The good practice is to keep test set as complete as possible before refactoring

I understand what you're saying, but I don't agree with placing that burden on people when making unrelated changes. "You touched a file, so please increase the code coverage for unmodified code paths first" seems a bit contributor hostile. If I was changing the boolean code path I would fully agree that you make test first, then refactor, to make sure you're not changing behaviour.

@comphead
Copy link
Contributor

Yeah, totally agree on isolated changes.

@alamb
Copy link
Contributor

alamb commented Nov 21, 2025

please add tests for booleans T/F/null

Sorry @comphead I missed this

github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change
Follow up on #18841

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

Adding missing bool tests for bit_count

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants