
Unexpected degradation of FPR when pattern_bits=4 #682

Open
kevkrist opened this issue Feb 26, 2025 · 5 comments
Labels: topic: bloom_filter (Issues related to bloom_filter), type: bug (Something isn't working)

Comments

@kevkrist

I am encountering unexpected behavior when using cuco::bloom_filter with pattern_bits = 4. The false positive rate (FPR) degrades far more dramatically than expected when moving from pattern_bits = 8 at a constant 'load factor' (i.e., the fraction of bits set in the filter). The issue may be related to the bit-pattern selection.

The following code demonstrates the issue:

#include <cuco/bloom_filter.cuh>
#include <iostream>
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

// 'Blocked' filter policy with 8B blocks
using policy_t = cuco::default_filter_policy<cuco::xxhash_64<uint32_t>, uint64_t, 1>;
using bf_t =
  cuco::bloom_filter<uint32_t, cuco::extent<std::size_t>, cuda::thread_scope_device, policy_t>;
constexpr size_t bits_per_block   = 64;
constexpr uint32_t pattern_bits_A = 4;
constexpr uint32_t pattern_bits_B = 8;
constexpr size_t bits_per_key_A   = 2 * pattern_bits_A;
constexpr size_t bits_per_key_B   = 2 * pattern_bits_B;

int main()
{
  // Initialize non-overlapping build and probe key sets
  thrust::device_vector<uint32_t> build_keys(1U << 20U);
  thrust::device_vector<uint32_t> probe_keys(1U << 25U);
  thrust::device_vector<bool> flags_A(1U << 25U, false);
  thrust::device_vector<bool> flags_B(1U << 25U, false);
  thrust::sequence(build_keys.begin(), build_keys.end(), 0, 2);
  thrust::sequence(probe_keys.begin(), probe_keys.end(), 1, 2);

  // Specify pattern bits for the policy
  policy_t policy_A(pattern_bits_A);
  bf_t filter_A(cuda::ceil_div(bits_per_key_A * build_keys.size(), bits_per_block), {}, policy_A);
  filter_A.add(build_keys.begin(), build_keys.end());
  filter_A.contains(probe_keys.begin(), probe_keys.end(), flags_A.begin());
  size_t fps_A   = thrust::count(flags_A.begin(), flags_A.end(), true);
  double fpr_A   = 100.0 * fps_A / flags_A.size();
  std::cout << "FPR A: " << fpr_A << "\n";

  policy_t policy_B(pattern_bits_B);
  bf_t filter_B(cuda::ceil_div(bits_per_key_B * build_keys.size(), bits_per_block), {}, policy_B);
  filter_B.add(build_keys.begin(), build_keys.end());
  filter_B.contains(probe_keys.begin(), probe_keys.end(), flags_B.begin());
  size_t fps_B   = thrust::count(flags_B.begin(), flags_B.end(), true);
  double fpr_B   = 100.0 * fps_B / flags_B.size();
  std::cout << "FPR B: " << fpr_B << "\n";

  return 0;
}

Observed Behavior:

FPR A: 16.9311
FPR B: 0.611573

Expected Behavior:
The FPR should increase more smoothly as pattern_bits / filter size decreases. This configuration, 8B blocks with 4 bits set per key, is common (e.g., in arrow/acero) and is not expected to produce such a high FPR at a 'load factor' of 0.5.

Environment:

  • Cuco version: 0.0.1
  • CUDA version: 12.2
  • Compiler: gcc 11.4.0
  • GPU: L4
  • OS: Ubuntu

Would appreciate any insights into what might be causing this, or whether I'm missing something. Thanks!

@PointKernel PointKernel added the topic: bloom_filter Issues related to bloom_filter label Feb 26, 2025
@sleeepyjack
Collaborator

Hi @kevkrist,

Thanks for raising this issue.
I'm looking into it.

I suspect two possible places this bug could originate from:

  1. I made some recent changes to the bulk add kernel in #672 (Bloom filter optimizations (4/5): Implement device bulk add) that aren't perfectly covered by our unit tests. I opened #676 ([ENHANCEMENT]: Better tests for operations that allow varying CG sizes) and am currently working to get these tests upstreamed.
  2. It could be that I messed up the logic that generates the bit pattern for a key here. I could use a second pair of eyes to verify that the logic is actually correct.

I'll send an update as soon as I find the culprit.

@sleeepyjack sleeepyjack added the type: bug Something isn't working label Feb 27, 2025
@sleeepyjack sleeepyjack self-assigned this Feb 27, 2025
@sleeepyjack
Collaborator

sleeepyjack commented Feb 27, 2025

I opened #683 and it seems that option 1 from my above post (regression after #672) is out of the race.

@kevkrist
Author

kevkrist commented Feb 27, 2025

The pattern bits look correctly set to me. I do have one suspicion: unless I'm mistaken, this filter configuration only uses the last 24 bits of the hash (4 bits set per 64-bit word, with the hash shifted by 6 bits at each bit-setting iteration). These are also the hash bits predominantly used to determine the block index into which to insert the pattern. So, keys that get hashed to the same block likely also share the same bit pattern.

This suspicion is partly confirmed with the following change to code in word_pattern (line 100 ff.):

    for (int32_t bit = 0; bit < bits_per_word; ++bit) {
      // Rotate so that the *upper* hash bits select the pattern bits,
      // leaving the lower bits for block indexing
      hash = cuda::std::rotl(hash, (int)bit_index_width);
      word |= word_type{1} << (hash & bit_index_mask);
      // hash >>= bit_index_width;  // original: consumed the lower hash bits
    }

Here I'm simply using the upper 24 bits of the hash to set the bits for the pattern. The output is now:

FPR A: 3.36147
FPR B: 0.49789

which is much better. Not sure yet if this is ideal. Let me know what you think.

EDIT: It might be a good idea, in general, to decouple (as much as possible) the hash bits for block indexing from those for bit pattern selection.

@sleeepyjack
Collaborator

Thanks @kevkrist!

Yeah, for this specific parameter combination this approach brings the FPR down a lot. However, it seems to perform worse in other cases.

> It might be a good idea, in general, to decouple (as much as possible) the hash bits for block indexing from those for bit pattern selection.

I totally agree, and I've been aware of this concept for a long time. This is where I messed up in the implementation :)

I tried a few other things, such as double hashing with a cheap secondary hash function (e.g., a Murmur integer finalizer), but I'm not happy with the results yet. Maybe I should give multiplicative hashing a try 🤔

Also, I'm working on #685 for sanity checking the FPR with our test suite.

@kevkrist
Author

Sounds great! Let me know what you end up deciding to do.
