Unexpected degradation of FPR when pattern_bits=4
#682
Comments
Hi @kevkrist, thanks for raising this issue. I suspect there are two possible places where this bug could originate:
I'll send an update as soon as I find the culprit.
The pattern bits look correctly set to me. I do have one suspicion: unless I'm mistaken, this filter configuration only uses the lowest 24 bits of the hash (4 bits per word, with the hash shifted by 6 bits at each bit-setting iteration). These are also the hash bits that predominantly determine the block index into which the pattern is inserted. So it's likely that keys that get hashed to the same block also have the same bit pattern. This suspicion is partly confirmed by the following change to the bit-setting loop:
for (int32_t bit = 0; bit < bits_per_word; ++bit) {
  hash = cuda::std::rotl(hash, (int)bit_index_width);
  word |= word_type{1} << (hash & bit_index_mask);
  // hash >>= bit_index_width;
}
Here I'm simply using the upper 24 bits of the hash to set the bits for the pattern. The output is now:
which is much better. Not sure yet if this is ideal. Let me know what you think.
EDIT: It might be a good idea, in general, to decouple (as much as possible) the hash bits used for block indexing from those used for bit pattern selection.
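To make the suspected correlation concrete, here is a minimal host-side C++ sketch of the two pattern derivations discussed above. It is not the cuco implementation: the constants mirror the configuration described in the comment (one 64-bit word, 4 bits per key, 6 index bits), and the names `rotl64`, `pattern_from_low_bits`, and `pattern_from_high_bits` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed parameters: one 64-bit word per block, 4 bits set per key,
// and 6 hash bits needed to address one bit inside the word.
constexpr int bits_per_word = 4;
constexpr int bit_index_width = 6;
constexpr std::uint64_t bit_index_mask =
    (std::uint64_t{1} << bit_index_width) - 1;

// Portable 64-bit left rotation (a stand-in for cuda::std::rotl).
inline std::uint64_t rotl64(std::uint64_t x, int r) {
  assert(r > 0 && r < 64);
  return (x << r) | (x >> (64 - r));
}

// Shift variant: set a bit, then shift right, so only the lowest
// 4 * 6 = 24 bits of the hash ever influence the pattern.
inline std::uint64_t pattern_from_low_bits(std::uint64_t hash) {
  std::uint64_t word = 0;
  for (int bit = 0; bit < bits_per_word; ++bit) {
    word |= std::uint64_t{1} << (hash & bit_index_mask);
    hash >>= bit_index_width;
  }
  return word;
}

// Rotate variant: rotate left, then set a bit, so the highest
// 24 bits of the hash are consumed instead.
inline std::uint64_t pattern_from_high_bits(std::uint64_t hash) {
  std::uint64_t word = 0;
  for (int bit = 0; bit < bits_per_word; ++bit) {
    hash = rotl64(hash, bit_index_width);
    word |= std::uint64_t{1} << (hash & bit_index_mask);
  }
  return word;
}
```

For example, the hashes `0x0123456789ABCDEF` and `0xFEDCBA9876ABCDEF` share their lowest 24 bits, so the shift variant gives both the same pattern, while the rotate variant separates them. If the block index is also taken from the low bits, such keys would collide in both block and pattern under the shift variant.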
Thanks @kevkrist! Yeah, for this specific parameter combination this approach brings down the FPR a lot. However, it seems that it performs worse in other cases.
I totally agree, and I've been aware of this concept for a long time. This is where I messed up in the implementation :) I tried a few other things such as double hashing with a cheap secondary hash function such as a Murmur integer finalizer, but I'm not happy with the results yet. Maybe I should give multiplicative hashing a try 🤔 Also, I'm working on #685 for sanity checking the FPR with our test suite.
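The two ideas mentioned here can be sketched roughly as follows. This is an illustrative assumption, not what cuco does or will do: `fmix64` is the 64-bit finalizer from MurmurHash3, the multiplier in `pattern_multiplicative` is the commonly used 64-bit golden-ratio constant, and `make_pattern`, `pattern_murmur`, and `pattern_multiplicative` are hypothetical names. The point is only that the pattern is derived from a remixed hash rather than from the raw bits that also pick the block.

```cpp
#include <cstdint>

// Assumed parameters: 4 bits set per key inside one 64-bit word.
constexpr int bits_per_word = 4;
constexpr int bit_index_width = 6;

// MurmurHash3 64-bit finalizer: a cheap bijective integer mixer.
inline std::uint64_t fmix64(std::uint64_t h) {
  h ^= h >> 33;
  h *= 0xff51afd7ed558ccdULL;
  h ^= h >> 33;
  h *= 0xc4ceb9fe1a85ec53ULL;
  h ^= h >> 33;
  return h;
}

// Build a pattern from the top bits of an already-mixed value:
// take the highest 6 bits as a bit index, then shift them out.
inline std::uint64_t make_pattern(std::uint64_t h2) {
  std::uint64_t word = 0;
  for (int bit = 0; bit < bits_per_word; ++bit) {
    word |= std::uint64_t{1} << (h2 >> (64 - bit_index_width));
    h2 <<= bit_index_width;
  }
  return word;
}

// Double-hashing flavor: remix the primary hash with the finalizer.
inline std::uint64_t pattern_murmur(std::uint64_t hash) {
  return make_pattern(fmix64(hash));
}

// Multiplicative flavor: multiply by an odd constant; the top bits
// of the product depend on all lower input bits.
inline std::uint64_t pattern_multiplicative(std::uint64_t hash) {
  return make_pattern(hash * 0x9E3779B97F4A7C15ULL);
}
```

Whether either variant actually improves the FPR across parameter combinations would have to be measured; the comment above reports that the finalizer approach was not yet satisfactory.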
Sounds great! Let me know what you end up deciding to do.
I am encountering unexpected behavior when using cuco::bloom_filter with pattern_bits = 4. The false positive rate (FPR) degrades too dramatically when changing from pattern_bits = 8 with a constant 'load factor' (i.e., the fraction of bits set in the filter). The issue may be related to the bit pattern selection.
The following code demonstrates the issue:
Observed Behavior:
Expected Behavior:
The FPR should increase more smoothly with decreasing pattern_bits / filter size. This configuration of 8B blocks with 4 bits being set per key is common (arrow/acero) and is not expected to produce such a high FPR with a 'load factor' of 0.5.
Environment:
Would appreciate any insights into what might be causing this! Or, if I'm missing something. Thanks!
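As a rough baseline for the expectation stated in this report, one can use the idealized estimate that a negative query is a false positive only if all of its probed bits are set. Assuming each probed bit is set independently with probability equal to the load factor, the FPR is about the load factor raised to the power of pattern_bits. The helper `ideal_fpr` below is a hypothetical name, and the estimate ignores repeated bit indices within a pattern and load variance between blocks, so a real blocked filter would be expected to sit somewhat above it.

```cpp
#include <cassert>
#include <cmath>

// Idealized FPR estimate: assumes every probed bit is set independently
// with probability `load` (the fraction of bits set in the filter).
inline double ideal_fpr(double load, int pattern_bits) {
  assert(load >= 0.0 && load <= 1.0 && pattern_bits > 0);
  return std::pow(load, pattern_bits);
}
```

Under these assumptions, a load factor of 0.5 gives 0.5^8 ≈ 0.39% for pattern_bits = 8 and 0.5^4 = 6.25% for pattern_bits = 4.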