-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cuckoo filter #665
base: develop
Are you sure you want to change the base?
Cuckoo filter #665
Conversation
…with monoid operator
Thank you! I'll post a review in the next day or two. |
Codecov Report
@@ Coverage Diff @@
## develop #665 +/- ##
===========================================
- Coverage 89.31% 89.25% -0.06%
===========================================
Files 113 114 +1
Lines 8944 9090 +146
Branches 490 519 +29
===========================================
+ Hits 7988 8113 +125
- Misses 956 977 +21
Continue to review full report at Codecov.
|
…efinition of monoid + add Aggregator for CuckooFilter The properties are checked for all instance of the CuckooFilter : * Dense * Item * Zero A test for having same example as BloomFilter.
The cuckoo filter can be used as BloomFilter val cfMonoid1 = new CuckooFilterMonoid[String](32, 256)
val cf1 = cfMonoid1.create("1", "2", "3", "4", "100")
val lookup = cf1.lookup("1")
assert(lookup) There is not enough resources around cuckoo filter for : approximation size, optimal value, ... But theoretically, it's better than Bloomfilter. perhaps a mapping between bloom filter estimation and cuckoo filter ? |
This is now ready for review :) @johnynek Thanks ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the impl well enough but it seems quite different from the paper and some other impls I've seen. Also not sure if a Semigroup
approach makes sense given the non-deterministic nature of CF. I managed to get 100M insertions in ~2min with this Java impl and we're most likely going with that approach in our code.
* TODO : Lookup method have to return a Approximate number like the size method (sometimes you can't insert an element). | ||
**/ | ||
object CuckooFilter { | ||
def apply[A](fingerprintPerBucket: Int, buckets: Int = 256)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original paper, as well as this Java impl uses 2/4/8 fingerprints per bucket. Looks like you're using 10 & 50 in tests, any particular reason?
* - https://github.com/irfansharif/cfilter | ||
* By nature, this filter isn't commutative | ||
* From the inital paper, there is no problem to consider || fingerprint|| = ln(N) where N = fingerprintPerBucket * totalBucket | ||
* "" as long as we use reasonably sized buckets, the fingerprint size can remain small. "", we'll use a a 32 bits fingerprint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original paper suggests that fingerprint is best below 8 bits. Any reason you're using 32?
override def sumOption(iter: TraversableOnce[CF[A]]): Option[CF[A]] = | ||
if (iter.isEmpty) None | ||
else { | ||
val buckets = Array.fill[CBitSet](totalBuckets)(new CBitSet(fingerprintBucket)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more efficient to use 1 bitset of numBuckets * numEntriesPerBucket * numBitsPerFingerprint
? IIRC JVM pointers are 8 bytes, so there's totalBuckets * 8
overhead here, plus those from CBitSet
and memory lookups.
According to the paper, each bucket should have numEntriesPerBucket
slots, each with numBitsPerFingerprint
. how does fingerprintBucket
fit in here?
var sets = 0 | ||
|
||
@inline def setFingerprint(index: Int, fp: Int): Unit = { | ||
buckets(index).set(fp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setting a single bit? Looking at def fingerprint()
, it's in [0, Int.MaxValue]
and >>> fingerprintBucket
, so this would overflow?
From the issue #560
A first iteration of a CuckooFilter.
I use all the "majors" tests from BloomFilter.
you can use like :
I have seen complicate project around the cuckoo-filter but seem like the asymptotic behavior of the cuckoo filter allow to simplify the code.
If it's seems ok for you I can keep adding new features.