Remove unnecessary bit counting code from spark bit_count
#18841
Conversation
alamb
left a comment
Looks like a nice improvement to me -- thanks @pepijnve
comphead
left a comment
Thanks @pepijnve. Between Spark/JVM and Rust there are sometimes discrepancies, e.g. in how decimals, regexps, etc. are treated.
Please add tests for booleans: true/false/null.
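The expected boolean behaviour, assuming the input is widened to a 64-bit value first (Spark's semantics as described below), could be sketched like this. `bit_count_bool` is an illustrative helper, not the actual `datafusion_spark` API:

```rust
// Hypothetical helper illustrating bit_count over booleans, assuming
// the value is widened to i64 first. NULL propagates as NULL.
fn bit_count_bool(v: Option<bool>) -> Option<u32> {
    // true widens to 1i64 (one bit set), false to 0i64 (no bits set)
    v.map(|b| (b as i64).count_ones())
}

fn main() {
    assert_eq!(bit_count_bool(Some(true)), Some(1)); // one bit set
    assert_eq!(bit_count_bool(Some(false)), Some(0)); // no bits set
    assert_eq!(bit_count_bool(None), None); // NULL in, NULL out
    println!("ok");
}
```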
Yep, I understand that. What was a bit puzzling initially was that there was no description of what was actually different and why the port of the Java "count ones" implementation was being added. The difference was that the original DataFusion implementation was operating on the native size of the signed integer input values, while Spark always operates on a Java long (i.e. i64). For unsigned and non-negative signed integers that's not an issue since the answer is the same. For negative integers, though, you get a different result since those are padded with 1 bits by sign extension. There's absolutely no need for a custom popcount implementation to fix this. Just widen to i64 and use count_ones.
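The divergence for negative inputs can be seen directly in plain Rust. This is just a sketch of the two behaviours, not the actual `datafusion_spark` code:

```rust
fn main() {
    let v: i32 = -1;

    // Counting on the native width: -1i32 is 32 one-bits.
    assert_eq!(v.count_ones(), 32);

    // Spark semantics: widen to i64 first. Sign extension pads the
    // upper 32 bits with ones, so -1 has 64 one-bits.
    assert_eq!((v as i64).count_ones(), 64);

    // Non-negative values are unaffected: widening pads with zero bits.
    let p: i32 = 5;
    assert_eq!(p.count_ones(), (p as i64).count_ones());
    println!("ok");
}
```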
That code path was not touched in this PR at all. Not sure why I should add tests for code that's not being added or modified.
FYI, this algorithm is a SWAR hamming weight implementation. Per the code comments in java.lang.Long, this comes from Hacker's Delight.
What's interesting is that the Rust compiler generates something very similar when calling count_ones().
With a sufficiently recent target architecture though you get popcnt instead.
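For reference, the SWAR (SIMD-within-a-register) hamming weight trick from Hacker's Delight, which `java.lang.Long.bitCount` also uses, can be sketched in Rust like this. It is illustrative only, since `count_ones` already compiles to equivalent code, or to a single `popcnt` on recent targets:

```rust
// SWAR popcount as in Hacker's Delight / java.lang.Long.bitCount.
// Each step sums neighbouring bit counts into progressively wider fields.
fn swar_popcount(mut x: u64) -> u32 {
    // Sum adjacent bits into 2-bit fields.
    x -= (x >> 1) & 0x5555_5555_5555_5555;
    // Sum 2-bit fields into 4-bit fields.
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // Sum 4-bit fields into 8-bit fields.
    x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Multiply-and-shift sums all eight bytes into the top byte.
    (x.wrapping_mul(0x0101_0101_0101_0101) >> 56) as u32
}

fn main() {
    for &v in &[0u64, 1, 0xff, 0x8000_0000_0000_0001, u64::MAX] {
        assert_eq!(swar_popcount(v), v.count_ones());
    }
    println!("ok");
}
```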
The popcnt instruction is crazy fast -- we tested it in one example where we had a special codepath for no nulls, and I was worried that computing the check `nulls.count_ones() == 0` would overwhelm the improvement.
Nowhere close. 🚀
Hackers Delight is a classic -- I am not at all surprised that the Rust compiler includes all those tricks (and then some!)
Jefffrey
left a comment
Nice refactor 👍
Good practice is to keep the test set as complete as possible before refactoring; that was the reason for asking to add the missing bool tests. I added them in #18871.
I understand what you're saying, but I don't agree with placing that burden on people when making unrelated changes. "You touched a file, so please increase the code coverage for unmodified code paths first" seems a bit contributor-hostile. If I was changing the boolean code path I would fully agree that you write tests first, then refactor, to make sure you're not changing behaviour.
Yeah, totally agree on isolated changes.
Sorry @comphead I missed this
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Follow up on #18841

## What changes are included in this PR?

Adding missing bool tests for bit_count

## Are these changes tested?

## Are there any user-facing changes?
Which issue does this PR close?
Rationale for this change
Spark's `bit_count` function always operates on 64-bit values, while the original `bit_count` implementation in `datafusion_spark` operated on the native size of the input value. To fix this, a custom bit counting implementation was ported over from the Java Spark implementation. This isn't really necessary though: widening signed integers to `i64` and then using `i64::count_ones` yields the exact same result and is less obscure.

What changes are included in this PR?
Remove the custom `bit_count` logic and use `i64::count_ones` instead.

Are these changes tested?
Covered by existing tests that were added for #18225
Are there any user-facing changes?
No