
Miru

Why we built it

There are several well-known open source projects that provide sharding and replication on top of Lucene. Jive has used or is currently using SenseiDB, Solr, Katta, and ElasticSearch. They work, but none of them is a good fit for multi-tenant datasets. For example, adding multi-tenancy by filtering every query on a "tenant ID" field has inherent performance implications. Our tenants range in size from tens to hundreds of thousands of users, and because they share the same indexes, searching the smallest tenant incurs the same cost as searching the largest. Including a tenant field forces every search to traverse to the end of each field index, because the tenant's posting list is always the longest.

Put simply, current open source search solutions lack any notion of tenancy. This shortcoming adds to latency because a query is only complete when the last item in the shortest inverted index has been found in the longest inverted index.

[Figure: forced to the end of the posting list]


Here is a short list of features that Miru provides and existing technologies do not:

  • Handle out-of-order updates (e.g. a deletion received before its creation) via per-document version ordering
  • Handle partial document updates
  • Isolate tenants and scale them individually
  • Expand or contract capacity to support additional or fewer queries per second on a per-tenant basis
  • Expand or contract partitions for a given tenant
  • Move a specific tenant off of an overloaded node
  • Remove a tenant without having to rebuild the entire index
  • Merge and split a given tenant's partitions to adjust to node size
  • Track index traits per tenant
  • Support field by field updates
  • Support document term frequency per tenant (Miru roadmap)

While digging through Lucene and its index data structures, it became immediately apparent how much of Lucene and other Lucene-backed technologies is built around scoring. Scoring makes a lot of sense when you have a large number of items that are intrinsically unordered. However, this is not the case with streams, where documents are always viewed in time-descending order.

If you remove all of the scoring concerns from Lucene you are left with bitsets and the ability to combine them using boolean operations (AND, OR, AND NOT, etc.). To mirror this capability, we found and focused our efforts on a pair of compressed bitmap libraries: JavaEWAH and Roaring Bitmap. With these libraries at our disposal, we had the necessary foundation to build a bit collider for inverted indexes.
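
To make the bit collider concrete, here is a minimal sketch using RoaringBitmap's boolean operations; the terms and activity IDs are made-up illustrative data, not Miru's real index layout:

```java
import org.roaringbitmap.RoaringBitmap;

public class BitColliderSketch {
    public static void main(String[] args) {
        // Hypothetical inverted indexes: one bitmap per term, where each
        // set bit is the ID of an activity whose document contains the term.
        RoaringBitmap mentionsAlice = RoaringBitmap.bitmapOf(1, 4, 7, 9);
        RoaringBitmap taggedUrgent = RoaringBitmap.bitmapOf(4, 5, 9, 12);
        RoaringBitmap deleted = RoaringBitmap.bitmapOf(9);

        // Boolean composition of inverted indexes:
        RoaringBitmap both = RoaringBitmap.and(mentionsAlice, taggedUrgent);  // AND
        RoaringBitmap either = RoaringBitmap.or(mentionsAlice, taggedUrgent); // OR
        RoaringBitmap live = RoaringBitmap.andNot(both, deleted);             // AND NOT

        System.out.println("AND:     " + both);   // {4,9}
        System.out.println("OR:      " + either); // {1,4,5,7,9,12}
        System.out.println("AND NOT: " + live);   // {4}
    }
}
```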

Here is a breakdown of the various bitset distinctions we maintain and the questions they allow us to answer:

  • Terms or Followables: Each term is a followable entity that gets its own bitset. The bitset is a mapping of the activities which contained the followable entity. In search parlance these bitsets are typically called inverted indexes. Using these bitsets we can build custom streams via boolean composition (AND, OR, AND NOT) of a large number of followable bitsets, as shown in the sketch after this list. When walked backwards, the resulting aggregate gives us the activity IDs with which we can compose a dynamic activity stream.
  • AuthZ: These entities associate a specific authz "tag" (generally an ACL qualifier) with the privileged activities. Combining these bitsets with followable bitsets allows us to do permission checks at query time. This also means counts are based on permissions as well. The authz index is designed for efficient mutation in response to permission changes.
  • Time Range: These are created on the fly by creating a mask based on the time ordering of the index. They give us the ability to build streams or search for results within a specific time range.
  • Unread: These are used to track activities that an individual user has not read. This is a huge win over late-bound forward-lookup strategies. Ultimately this means that we can track what a user has or has not read using very little space. To store the fact that a user has not read 1,000,000 activities in an uncompressed form costs about 125 KB versus a minimum of 8 MB via forward lookups for 8-byte identifiers. Relative to the total size of an index, we consider this affordable.
  • Inbox: These keep track of which activities contribute to a user's inbox. Similar to the "unread" index, these are maintained per user. Because an individual user's inbox is relatively sparse against the total size of the index, these bitsets are extremely affordable.
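
To illustrate how these distinctions compose at query time, here is a minimal sketch (the followable, authz, and time-range bitmaps are hypothetical inputs rather than Miru's actual internal API): OR together everything the user follows, AND in the permission check and the time window, then walk the result backwards to get a time-descending stream.

```java
import java.util.List;
import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

public class StreamComposer {

    // Hypothetical inputs: the bitsets of the terms a user follows, the
    // union of the authz bitsets the user is allowed to see, and a mask
    // of all activity indexed within the requested time range.
    public static RoaringBitmap composeStream(List<RoaringBitmap> followables,
                                              RoaringBitmap authz,
                                              RoaringBitmap timeRangeMask) {
        RoaringBitmap stream = new RoaringBitmap();
        for (RoaringBitmap followable : followables) {
            stream.or(followable); // OR: anything the user follows
        }
        stream.and(authz);         // AND: permission check at query time
        stream.and(timeRangeMask); // AND: restrict to the time range
        return stream;
    }

    // Activity IDs are assigned in time order, so iterating the result in
    // reverse yields the stream in time-descending order.
    public static void printTimeDescending(RoaringBitmap stream) {
        IntIterator it = stream.getReverseIntIterator();
        while (it.hasNext()) {
            System.out.println("activity id: " + it.next());
        }
    }
}
```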

Using these indexes, we know that we can take approximately 10,000 bitsets and join them in about 50 milliseconds, and that the result will be precisely the stream of activities that we care about. This is a huge improvement! Because this is so fast, we can solve a problem that plagues most persistent per-user subsets of larger streams.

The dreaded question: Do you update each user's subset (i.e. inbox) at write time or at read time?

Miru allows us to address this problem at read time. We accomplish this by keeping track of the last indexed activity in a given user's inbox (a high water mark). Joining a time-range bitset covering all of the activity added since the user's last inbox index with all of the user's inbox followables yields a bitset that can be appended directly to the user's inbox, which means we can bring a user's inbox up to date in about 50 milliseconds for every million unseen activities. In parallel, the user's unread bitset can be brought up to date by replaying read/unread events for the corresponding time period. We have effectively built a system that intelligently separates what is required at write time from what can be deferred until read time.
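
A minimal sketch of that read-time catch-up follows; the stored inbox, the since-high-water-mark time mask, and the followable bitsets are assumed inputs for illustration, not Miru's actual API:

```java
import java.util.List;
import org.roaringbitmap.RoaringBitmap;

public class InboxCatchUp {

    // Hypothetical inputs: the user's persisted inbox bitset, a time-range
    // mask covering everything indexed since the inbox's high water mark,
    // and the bitsets of the followables that feed this inbox.
    public static RoaringBitmap catchUp(RoaringBitmap inbox,
                                        RoaringBitmap sinceHighWaterMark,
                                        List<RoaringBitmap> inboxFollowables) {
        // Everything the inbox follows...
        RoaringBitmap fresh = new RoaringBitmap();
        for (RoaringBitmap followable : inboxFollowables) {
            fresh.or(followable);
        }
        // ...restricted to activity newer than the high water mark.
        fresh.and(sinceHighWaterMark);

        // The delta appends directly onto the stored inbox, bringing it
        // up to date without any write-time fan-out.
        inbox.or(fresh);
        return inbox;
    }
}
```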


Now all that remains is to write something that manages partitioning and replication of all of these bitsets. The sum of these parts is what became Miru. When developing the system, we kept in mind a number of critical requirements:

  • The ability to dynamically follow hundreds of thousands of people, places, and/or content (i.e. followables). A collection of followable things is called a stream. Streams have a further distinction of being either a “Custom Stream” or an “Inbox.”
  • The ability for each user to have an “Inbox” stream which durably records what has happened to its followables (that is, changing the followables for an inbox should not change the history of the inbox).
  • The ability to maintain a per-user record of what has or has not been read within a given user's “Inbox” stream.
  • The ability to aggregate all activities around a given followable down to one visible entity (a "parent" or "roll-up").
  • The ability to filter streams by tags, author, type, etc.
  • The ability to count the number of aggregate activities on a followable that have occurred since an arbitrary point in time within a given stream.
  • The ability to return only those results which a particular user is authorized to see.
  • The ability to perform full-text searches of streams.
  • The ability to generate a waveform from the result set of any query, looking back across dynamic time ranges.