Cuckoo filter #665

Open: wants to merge 20 commits into base: develop

cesarcolle

This addresses issue #560.
A first iteration of a CuckooFilter.

I reuse all the major tests from BloomFilter:

 property("CuckooFilter  is a Monoid") {
    commutativeMonoidLaws[CF[String]]
  }

  property("++ is the same as plus") {
    forAll { (a: CF[String], b: CF[String]) =>
      Equiv[CF[String]].equiv(a ++ b, cfMonoid.plus(a, b))
    }
  }

  property("+ is the same as adding with create") {
    forAll { (a: CF[String], b: String) =>
      Equiv[CF[String]].equiv(a + b, cfMonoid.plus(a, cfMonoid.create(b)))
    }
  }
  property("a ++ a = a for CF") {
    forAll { (a: CF[String]) =>
      Equiv[CF[String]].equiv(a ++ a, a)
    }
  }

You can use it like this:

      val cfMonoid = new CuckooFilterMonoid[String](254)
      val cuckoo = cfMonoid.create("Aline", "Aline", "pour", "qu'elle", "revienne" )
      cuckoo.lookup("Aline")

I have seen complicated projects built around the cuckoo filter, but it seems that the asymptotic behavior of the cuckoo filter allows the code to be simplified.

If this seems OK to you, I can keep adding new features.

@cesarcolle cesarcolle closed this Aug 28, 2018
@cesarcolle cesarcolle reopened this Aug 28, 2018
@johnynek
Collaborator

Thank you!

I'll post a review in the next day or two.

@cesarcolle cesarcolle closed this Aug 29, 2018
@cesarcolle cesarcolle reopened this Aug 29, 2018
@codecov-io

codecov-io commented Aug 29, 2018

Codecov Report

Merging #665 into develop will decrease coverage by 0.05%.
The diff coverage is 78.08%.

@@             Coverage Diff             @@
##           develop     #665      +/-   ##
===========================================
- Coverage    89.31%   89.25%   -0.06%     
===========================================
  Files          113      114       +1     
  Lines         8944     9090     +146     
  Branches       490      519      +29     
===========================================
+ Hits          7988     8113     +125     
- Misses         956      977      +21

| Impacted Files | Coverage Δ |
|---|---|
| ...main/scala/com/twitter/algebird/CuckooFilter.scala | 78.08% <78.08%> (ø) |
| ...om/twitter/algebird/util/summer/AsyncListSum.scala | 95.45% <0%> (-2.28%) ⬇️ |
| .../main/scala/com/twitter/algebird/HyperLogLog.scala | 92.21% <0%> (-0.78%) ⬇️ |
| .../main/scala/com/twitter/algebird/Applicative.scala | 58.82% <0%> (ø) ⬆️ |
| .../main/scala/com/twitter/algebird/BloomFilter.scala | 94.69% <0%> (+0.44%) ⬆️ |
| ...src/main/scala/com/twitter/algebird/Interval.scala | 80% <0%> (+3.47%) ⬆️ |
| .../main/scala/com/twitter/algebird/Successible.scala | 91.66% <0%> (+4.16%) ⬆️ |
| ...ala/com/twitter/algebird/ApproximateProperty.scala | 82% <0%> (+10%) ⬆️ |
| ...scala/com/twitter/algebird/PredecessibleLaws.scala | 86.66% <0%> (+20%) ⬆️ |

Powered by Codecov. Last update 5fdb079...ee53492. Read the comment docs.

…efinition of monoid + add Aggregator for CuckooFilter

The properties are checked for all instances of the CuckooFilter:

* Dense
* Item
* Zero

Added a test that uses the same example as BloomFilter.
@cesarcolle cesarcolle changed the title from "Cuckoo filter" to "WIP : Cuckoo filter" Sep 2, 2018
@cesarcolle
Author

The cuckoo filter can be used like a BloomFilter, e.g.:

      val cfMonoid1 = new CuckooFilterMonoid[String](32, 256)
      val cf1 = cfMonoid1.create("1", "2", "3", "4", "100")
      val lookup = cf1.lookup("1")
      assert(lookup)

There are not enough resources on cuckoo filters covering size approximation, optimal parameter values, and so on. Theoretically, though, it is better than a BloomFilter. Perhaps a mapping between bloom filter estimation and cuckoo filter parameters would work?
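
On the sizing question, the original cuckoo filter paper gives an upper bound on the false positive rate: a lookup probes 2 buckets of b entries each, and each entry matches a random f-bit fingerprint with probability 1/2^f, so the rate is at most 1 - (1 - 1/2^f)^(2b), roughly 2b / 2^f. A minimal sketch of that arithmetic follows; `CuckooSizing` and its method names are hypothetical and are not part of this PR or of Algebird.

```scala
// Hypothetical helper, not part of this PR: sizing arithmetic from the paper's bound.
object CuckooSizing {
  // Upper bound on the false positive rate for b = entriesPerBucket slots
  // per bucket and f = fingerprintBits bits per fingerprint.
  def falsePositiveBound(entriesPerBucket: Int, fingerprintBits: Int): Double =
    1.0 - math.pow(1.0 - math.pow(2.0, -fingerprintBits), 2.0 * entriesPerBucket)

  // Smallest f satisfying 2b / 2^f <= epsilon, i.e. f >= log2(2b / epsilon).
  def minFingerprintBits(entriesPerBucket: Int, epsilon: Double): Int =
    math.ceil(math.log(2.0 * entriesPerBucket / epsilon) / math.log(2.0)).toInt
}
```

For example, with 4 entries per bucket and a 1% target, `CuckooSizing.minFingerprintBits(4, 0.01)` asks for the smallest f with 8 / 2^f <= 0.01.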

@cesarcolle cesarcolle changed the title from "WIP : Cuckoo filter" to "Cuckoo filter" Sep 4, 2018
@cesarcolle
Author

This is now ready for review :) @johnynek

Thanks !

Collaborator

@nevillelyh nevillelyh left a comment

I'm not sure I understand the impl well enough but it seems quite different from the paper and some other impls I've seen. Also not sure if a Semigroup approach makes sense given the non-deterministic nature of CF. I managed to get 100M insertions in ~2min with this Java impl and we're most likely going with that approach in our code.
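
For comparison, here is a minimal sketch of the scheme described in the paper (partial-key cuckoo hashing): each item has two candidate buckets, the second derived from the first by XOR with a hash of the fingerprint, and a full bucket triggers random evictions. All names (`SketchCuckooFilter`, `maxKicks`, the multiplier constant) are assumptions made for illustration; this is not the implementation in this PR nor the Java impl mentioned above. Because evictions are random and inserts can fail, the final table depends on insertion order, which is the concern with a Semigroup-style merge.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Hypothetical sketch of a paper-style cuckoo filter with 8-bit fingerprints.
final class SketchCuckooFilter(numBuckets: Int,
                               entriesPerBucket: Int = 4,
                               maxKicks: Int = 500,
                               seed: Long = 42L) {
  require(numBuckets > 0 && (numBuckets & (numBuckets - 1)) == 0,
    "numBuckets must be a power of two")

  // 0 marks an empty slot, so stored fingerprints are never 0.
  private val table = Array.ofDim[Byte](numBuckets, entriesPerBucket)
  private val rng = new Random(seed)
  private val mask = numBuckets - 1

  private def fingerprint(item: String): Byte =
    (MurmurHash3.stringHash(item, 0x9747b28c) & 0xff).max(1).toByte

  private def index1(item: String): Int = MurmurHash3.stringHash(item) & mask

  // XOR with a hash of the fingerprint; applying it twice returns the original index.
  private def altIndex(i: Int, fp: Byte): Int = (i ^ (fp * 0x5bd1e995)) & mask

  private def tryPut(i: Int, fp: Byte): Boolean = {
    val slot = table(i).indexOf(0.toByte)
    if (slot >= 0) { table(i)(slot) = fp; true } else false
  }

  def insert(item: String): Boolean = {
    var fp = fingerprint(item)
    val i1 = index1(item)
    val i2 = altIndex(i1, fp)
    if (tryPut(i1, fp) || tryPut(i2, fp)) return true
    // Both buckets are full: evict a random entry and relocate it, repeatedly.
    var i = if (rng.nextBoolean()) i1 else i2
    var kicks = 0
    while (kicks < maxKicks) {
      val slot = rng.nextInt(entriesPerBucket)
      val evicted = table(i)(slot)
      table(i)(slot) = fp
      fp = evicted
      i = altIndex(i, fp)
      if (tryPut(i, fp)) return true
      kicks += 1
    }
    false // table considered full; this sketch drops the last evicted fingerprint
  }

  def lookup(item: String): Boolean = {
    val fp = fingerprint(item)
    val i1 = index1(item)
    table(i1).contains(fp) || table(altIndex(i1, fp)).contains(fp)
  }
}
```

A lookup only ever inspects two buckets, so its cost is independent of how many items were inserted.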

* TODO : Lookup method have to return a Approximate number like the size method (sometimes you can't insert an element).
**/
object CuckooFilter {
def apply[A](fingerprintPerBucket: Int, buckets: Int = 256)(
Collaborator

The original paper, as well as this Java impl uses 2/4/8 fingerprints per bucket. Looks like you're using 10 & 50 in tests, any particular reason?

* - https://github.com/irfansharif/cfilter
* By nature, this filter isn't commutative
* From the inital paper, there is no problem to consider || fingerprint|| = ln(N) where N = fingerprintPerBucket * totalBucket
* "" as long as we use reasonably sized buckets, the fingerprint size can remain small. "", we'll use a a 32 bits fingerprint.
Collaborator

Original paper suggests that fingerprint is best below 8 bits. Any reason you're using 32?

override def sumOption(iter: TraversableOnce[CF[A]]): Option[CF[A]] =
if (iter.isEmpty) None
else {
val buckets = Array.fill[CBitSet](totalBuckets)(new CBitSet(fingerprintBucket))
Collaborator

It's more efficient to use 1 bitset of numBuckets * numEntriesPerBucket * numBitsPerFingerprint? IIRC JVM pointers are 8 bytes, so there's totalBuckets * 8 overhead here, plus those from CBitSet and memory lookups.

According to the paper, each bucket should have numEntriesPerBucket slots, each holding numBitsPerFingerprint bits. How does fingerprintBucket fit in here?
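
The single-bitset layout suggested here could look roughly like the sketch below, where slot (bucket, entry) occupies bits [base, base + numBitsPerFingerprint). It uses `java.util.BitSet` as a stand-in, and `PackedTable`, `write` and `read` are hypothetical names, not anything in this PR.

```scala
import java.util.BitSet

// Hypothetical sketch: one flat bitset of
// numBuckets * entriesPerBucket * bitsPerFingerprint bits.
final class PackedTable(numBuckets: Int, entriesPerBucket: Int, bitsPerFingerprint: Int) {
  require(bitsPerFingerprint > 0 && bitsPerFingerprint <= 32)
  require(numBuckets.toLong * entriesPerBucket * bitsPerFingerprint <= Int.MaxValue)

  private val bits = new BitSet(numBuckets * entriesPerBucket * bitsPerFingerprint)

  // First bit of the slot holding entry `entry` of bucket `bucket`.
  private def base(bucket: Int, entry: Int): Int =
    (bucket * entriesPerBucket + entry) * bitsPerFingerprint

  // Store the low bitsPerFingerprint bits of fp, overwriting the slot.
  def write(bucket: Int, entry: Int, fp: Int): Unit = {
    val b = base(bucket, entry)
    var k = 0
    while (k < bitsPerFingerprint) {
      bits.set(b + k, ((fp >>> k) & 1) == 1)
      k += 1
    }
  }

  // Reassemble the fingerprint stored in the slot (0 if never written).
  def read(bucket: Int, entry: Int): Int = {
    val b = base(bucket, entry)
    var fp = 0
    var k = 0
    while (k < bitsPerFingerprint) {
      if (bits.get(b + k)) fp |= (1 << k)
      k += 1
    }
    fp
  }
}
```

With this layout there is one object for the whole table instead of one per bucket, at the cost of bit-level packing on every read and write.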

var sets = 0

@inline def setFingerprint(index: Int, fp: Int): Unit = {
buckets(index).set(fp)
Collaborator

Is this setting a single bit? Looking at def fingerprint(), it's in [0, Int.MaxValue] and >>> fingerprintBucket, so this would overflow?
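
To illustrate the question, the sketch below uses `java.util.BitSet` as a stand-in (an assumption; the `CBitSet` type in the diff may behave differently). Calling `set(fp)` turns on exactly one bit at index `fp`, so a bucket sized for a few entries silently grows to cover the whole fingerprint range instead of storing the fingerprint's bits in a slot.

```scala
import java.util.BitSet

// Hypothetical demo, not code from this PR.
object SingleBitDemo {
  // Returns (number of bits set, logical length of the bitset) after set(fp).
  def run(fp: Int): (Int, Int) = {
    val bucket = new BitSet(8) // nominally sized for 8 bits
    bucket.set(fp)             // sets one bit at index fp, growing the set if needed
    (bucket.cardinality, bucket.length)
  }
}
```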

@CLAassistant

CLAassistant commented Nov 16, 2019

CLA assistant check
All committers have signed the CLA.

5 participants