Cuckoo filter #665

cesarcolle · 2018-08-28T20:50:51Z

Following issue #560, this is a first iteration of a CuckooFilter.

I reuse all the major tests from BloomFilter:

  property("CuckooFilter is a Monoid") {
    commutativeMonoidLaws[CF[String]]
  }

  property("++ is the same as plus") {
    forAll { (a: CF[String], b: CF[String]) =>
      Equiv[CF[String]].equiv(a ++ b, cfMonoid.plus(a, b))
    }
  }

  property("+ is the same as adding with create") {
    forAll { (a: CF[String], b: String) =>
      Equiv[CF[String]].equiv(a + b, cfMonoid.plus(a, cfMonoid.create(b)))
    }
  }

  property("a ++ a = a for CF") {
    forAll { (a: CF[String]) =>
      Equiv[CF[String]].equiv(a ++ a, a)
    }
  }

You can use it like this:

      val cfMonoid = new CuckooFilterMonoid[String](254)
      val cuckoo = cfMonoid.create("Aline", "Aline", "pour", "qu'elle", "revienne")
      cuckoo.lookup("Aline")

I have seen complicated projects built around the cuckoo filter, but it seems that the asymptotic behavior of the cuckoo filter allows the code to be simplified.

If this seems OK to you, I can keep adding new features.

…with monoid operator

johnynek · 2018-08-28T21:29:15Z

Thank you!

I'll post a review in the next day or two.

codecov-io · 2018-08-29T08:50:26Z

Codecov Report

Merging #665 into develop will decrease coverage by 0.05%.
The diff coverage is 78.08%.

@@             Coverage Diff             @@
##           develop     #665      +/-   ##
===========================================
- Coverage    89.31%   89.25%   -0.06%     
===========================================
  Files          113      114       +1     
  Lines         8944     9090     +146     
  Branches       490      519      +29     
===========================================
+ Hits          7988     8113     +125     
- Misses         956      977      +21

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...main/scala/com/twitter/algebird/CuckooFilter.scala | `78.08% <78.08%> (ø)` | |
| ...om/twitter/algebird/util/summer/AsyncListSum.scala | `95.45% <0%> (-2.28%)` | ⬇️ |
| .../main/scala/com/twitter/algebird/HyperLogLog.scala | `92.21% <0%> (-0.78%)` | ⬇️ |
| .../main/scala/com/twitter/algebird/Applicative.scala | `58.82% <0%> (ø)` | ⬆️ |
| .../main/scala/com/twitter/algebird/BloomFilter.scala | `94.69% <0%> (+0.44%)` | ⬆️ |
| ...src/main/scala/com/twitter/algebird/Interval.scala | `80% <0%> (+3.47%)` | ⬆️ |
| .../main/scala/com/twitter/algebird/Successible.scala | `91.66% <0%> (+4.16%)` | ⬆️ |
| ...ala/com/twitter/algebird/ApproximateProperty.scala | `82% <0%> (+10%)` | ⬆️ |
| ...scala/com/twitter/algebird/PredecessibleLaws.scala | `86.66% <0%> (+20%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5fdb079...ee53492.

cesarcolle · 2018-09-02T19:12:04Z

The cuckoo filter can be used like a BloomFilter, e.g.:

      val cfMonoid1 = new CuckooFilterMonoid[String](32, 256)
      val cf1 = cfMonoid1.create("1", "2", "3", "4", "100")
      val lookup = cf1.lookup("1")
      assert(lookup)

There are not enough resources on cuckoo filters covering size approximation, optimal parameter values, and so on. In theory, though, a cuckoo filter is better than a BloomFilter. Perhaps we could map the bloom filter estimates onto the cuckoo filter?
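On the sizing question, the cuckoo filter paper (Fan et al., "Cuckoo Filter: Practically Better Than Bloom", 2014) bounds the false-positive rate by roughly 2b / 2^f for b entries per bucket and f-bit fingerprints, while a space-optimal Bloom filter needs about 1.44 * log2(1/ε) bits per item. The sketch below turns those formulas into a small calculator. The object and method names are hypothetical and not part of this PR or of Algebird, and the load factor is an input you would have to measure or take from the paper.

```scala
// Hypothetical helper, not part of Algebird: sizing formulas from the cuckoo filter paper.
object CuckooSizing {
  // Upper bound on the false-positive rate: a lookup probes 2 buckets of b entries,
  // and each stored fingerprint matches by chance with probability 2^-f.
  def falsePositiveBound(entriesPerBucket: Int, fingerprintBits: Int): Double =
    1.0 - math.pow(1.0 - math.pow(2.0, -fingerprintBits), 2.0 * entriesPerBucket)

  // Smallest fingerprint width f such that 2b / 2^f <= epsilon.
  def minFingerprintBits(entriesPerBucket: Int, epsilon: Double): Int = {
    require(epsilon > 0.0 && epsilon < 1.0, "epsilon must be in (0, 1)")
    Iterator.from(1).find(f => 2.0 * entriesPerBucket / math.pow(2.0, f) <= epsilon).get
  }

  // Bits per stored item when a fraction loadFactor of the slots is occupied.
  def cuckooBitsPerItem(fingerprintBits: Int, loadFactor: Double): Double =
    fingerprintBits / loadFactor

  // Space-optimal Bloom filter: ln(1/epsilon) / (ln 2)^2 bits per item.
  def bloomBitsPerItem(epsilon: Double): Double =
    math.log(1.0 / epsilon) / (math.log(2.0) * math.log(2.0))
}
```

For a 1% target and 4 entries per bucket, `CuckooSizing.minFingerprintBits(4, 0.01)` gives 10-bit fingerprints; comparing `cuckooBitsPerItem` with `bloomBitsPerItem` at the same target is one way to map a Bloom filter estimate onto cuckoo filter parameters.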

cesarcolle · 2018-09-04T13:28:28Z

This is now ready for review :) @johnynek

Thanks!

nevillelyh

I'm not sure I understand the impl well enough but it seems quite different from the paper and some other impls I've seen. Also not sure if a Semigroup approach makes sense given the non-deterministic nature of CF. I managed to get 100M insertions in ~2min with this Java impl and we're most likely going with that approach in our code.
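For comparison, here is a minimal sketch of the scheme as the paper describes it: fixed-size buckets of short fingerprints, with the alternate bucket derived from the current bucket index and the fingerprint alone (partial-key cuckoo hashing). It is not the implementation in this PR nor the Java one mentioned above; the class name, the defaults, and the choice of hash functions are all assumptions made for illustration.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Hypothetical sketch of a paper-style cuckoo filter; not Algebird's API.
final class SimpleCuckooFilter(numBuckets: Int,
                               entriesPerBucket: Int = 4,
                               fingerprintBits: Int = 8,
                               maxKicks: Int = 500,
                               seed: Long = 42L) {
  require(numBuckets > 0 && (numBuckets & (numBuckets - 1)) == 0, "numBuckets must be a power of two")
  require(fingerprintBits >= 1 && fingerprintBits <= 16)

  // 0 marks an empty slot, so fingerprints are forced to be non-zero.
  private val table = Array.fill(numBuckets, entriesPerBucket)(0)
  private val rng = new Random(seed)
  private val mask = numBuckets - 1

  private def fingerprint(item: String): Int = {
    val fp = MurmurHash3.stringHash(item, 0x5bd1e995) & ((1 << fingerprintBits) - 1)
    if (fp == 0) 1 else fp
  }

  private def index1(item: String): Int = MurmurHash3.stringHash(item) & mask

  // XOR with a hash of the fingerprint; applying it twice returns the original index.
  private def altIndex(index: Int, fp: Int): Int = (index ^ (fp * 0x5bd1e995)) & mask

  private def tryPlace(index: Int, fp: Int): Boolean = {
    val slot = table(index).indexOf(0)
    if (slot >= 0) { table(index)(slot) = fp; true } else false
  }

  // Returns false when no free slot is found within maxKicks relocations.
  // In that case the fingerprint still being carried is dropped (this sketch has no victim cache).
  def insert(item: String): Boolean = {
    var fp = fingerprint(item)
    val i1 = index1(item)
    val i2 = altIndex(i1, fp)
    if (tryPlace(i1, fp) || tryPlace(i2, fp)) return true
    var i = if (rng.nextBoolean()) i1 else i2
    var kicks = 0
    while (kicks < maxKicks) {
      // Evict a random resident fingerprint and move it to its alternate bucket.
      val slot = rng.nextInt(entriesPerBucket)
      val evicted = table(i)(slot)
      table(i)(slot) = fp
      fp = evicted
      i = altIndex(i, fp)
      if (tryPlace(i, fp)) return true
      kicks += 1
    }
    false
  }

  def lookup(item: String): Boolean = {
    val fp = fingerprint(item)
    val i1 = index1(item)
    table(i1).contains(fp) || table(altIndex(i1, fp)).contains(fp)
  }
}
```

Because evictions pick random slots, the final table layout depends on insertion order and on the random seed, which is the kind of non-determinism that makes an associative `plus` hard to define.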

nevillelyh · 2019-07-02T14:10:42Z

algebird-core/src/main/scala/com/twitter/algebird/CuckooFilter.scala

+ * TODO : Lookup method have to return a Approximate number like the size method (sometimes you can't insert an element).
+  **/
+object CuckooFilter {
+  def apply[A](fingerprintPerBucket: Int, buckets: Int = 256)(


The original paper, as well as this Java impl, uses 2/4/8 fingerprints per bucket. It looks like you're using 10 and 50 in tests; any particular reason?

nevillelyh · 2019-07-02T14:11:34Z

algebird-core/src/main/scala/com/twitter/algebird/CuckooFilter.scala

+ *  - https://github.com/irfansharif/cfilter
+ * By nature, this filter isn't commutative
+ * From the inital paper, there is no problem to consider || fingerprint|| = ln(N) where N = fingerprintPerBucket * totalBucket
+ * "" as long as we use reasonably sized buckets, the fingerprint size can remain small. "", we'll use a a 32 bits fingerprint.


Original paper suggests that fingerprint is best below 8 bits. Any reason you're using 32?

nevillelyh · 2019-07-02T14:13:51Z

algebird-core/src/main/scala/com/twitter/algebird/CuckooFilter.scala

+  override def sumOption(iter: TraversableOnce[CF[A]]): Option[CF[A]] =
+    if (iter.isEmpty) None
+    else {
+      val buckets = Array.fill[CBitSet](totalBuckets)(new CBitSet(fingerprintBucket))


Wouldn't it be more efficient to use a single bitset of numBuckets * numEntriesPerBucket * numBitsPerFingerprint bits? IIRC JVM pointers are 8 bytes, so there's totalBuckets * 8 bytes of overhead here, plus the overhead of each CBitSet and the extra memory lookups.

According to the paper, each bucket should have numEntriesPerBucket slots, each holding numBitsPerFingerprint bits. How does fingerprintBucket fit in here?
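The flat layout suggested here can be sketched as follows: every fingerprint lives in one `java.util.BitSet`, and the bit offset of a slot is computed from its bucket and slot indices. The class name `PackedTable` and its methods are hypothetical, and the bit-by-bit loops favor clarity over speed.

```scala
import java.util.BitSet

// Hypothetical flat table: one BitSet holds all buckets, slots and fingerprint bits.
final class PackedTable(numBuckets: Int, entriesPerBucket: Int, bitsPerFingerprint: Int) {
  require(bitsPerFingerprint >= 1 && bitsPerFingerprint <= 31)
  require(numBuckets.toLong * entriesPerBucket * bitsPerFingerprint <= Int.MaxValue)

  private val bits = new BitSet(numBuckets * entriesPerBucket * bitsPerFingerprint)

  // First bit of the slot: slots are laid out bucket after bucket.
  private def offset(bucket: Int, slot: Int): Int =
    (bucket * entriesPerBucket + slot) * bitsPerFingerprint

  // Read the fingerprint stored in (bucket, slot); 0 is what an untouched slot holds.
  def get(bucket: Int, slot: Int): Int = {
    val base = offset(bucket, slot)
    var fp = 0
    var i = 0
    while (i < bitsPerFingerprint) {
      if (bits.get(base + i)) fp |= (1 << i)
      i += 1
    }
    fp
  }

  // Overwrite the slot with fp, which must fit in bitsPerFingerprint bits.
  def set(bucket: Int, slot: Int, fp: Int): Unit = {
    require(fp >= 0 && fp < (1L << bitsPerFingerprint))
    val base = offset(bucket, slot)
    var i = 0
    while (i < bitsPerFingerprint) {
      bits.set(base + i, ((fp >>> i) & 1) == 1)
      i += 1
    }
  }
}
```

With this layout there is a single backing object regardless of the number of buckets, so the per-bucket pointer and object-header overhead disappears.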

nevillelyh · 2019-07-02T14:27:41Z

algebird-core/src/main/scala/com/twitter/algebird/CuckooFilter.scala

+      var sets = 0
+
+      @inline def setFingerprint(index: Int, fp: Int): Unit = {
+        buckets(index).set(fp)


Is this setting a single bit? Looking at def fingerprint(), the value is in [0, Int.MaxValue] and is then shifted with >>> fingerprintBucket, so wouldn't this overflow?

CLAassistant · 2019-11-16T00:33:27Z

All committers have signed the CLA.

cesarcolle added 16 commits August 22, 2018 01:36

Init cuckoo filter in a monoid way

87ff879

Add test skeleton + zero, sparse, dense cuckoo filter implementation …

73a74c2

…with monoid operator

Add fingerprint generation + hash value with XOR

b8074cf

swap data + insert data + kick fingerprint

39d5f06

reformat fingerprint

bc40180

Add test on kick + add create Monoid CuckooFilter

1b2e05d

replace fingerprintbits with default value

7cb46c9

Add kickOff elem + delete elem

2a35049

reformat code

563f41b

Add the delete et - operator for BloomFilter

b070abd

refactor

6ccd5b7

Add the monoid sum of element to create monoid

6a20edd

Add the generator for CFInstance

e68a031

CuckooFilter IS a monoid

a6c7248

Refactor + add simple test on cuckoofilter operator

ab2c8f8

Add information doc + todo list

6ca46a5

cesarcolle closed this Aug 28, 2018

clean code

ea1722a

cesarcolle reopened this Aug 28, 2018

Reformat code with algebird's scalafmt

332095a

cesarcolle closed this Aug 29, 2018

cesarcolle reopened this Aug 29, 2018

Add the CuckooFilter API like bloomFilter with more test + refactor d…

b8d4162

…efinition of monoid + add Aggregator for CuckooFilter
The properties are checked for all instance of the CuckooFilter :
* Dense
* Item
* Zero
A test for having same example as BloomFilter.

cesarcolle changed the title ~~Cuckoo filter~~ WIP : Cuckoo filter Sep 2, 2018

cesarcolle changed the title ~~WIP : Cuckoo filter~~ Cuckoo filter Sep 4, 2018

Add benchmark for cuckoo filter on create + querying the cuckoo filter

ee53492

nevillelyh reviewed Jul 2, 2019
