Introduce RFC for RDB format #28
base: main
Conversation
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
rfc/rdb-format.md
Outdated
## Motivation

Our existing RDB format is a good start, but it is also fairly rigid: it does not support storing new types of data in the RDB beyond index definitions and index contents.
I think we need to articulate the real use cases here. Being rigid on its own is neutral IMO - we do want the car's frame to be rigid :)
There are items that have been discussed that could all lead to changes in the RDB format.
- Save operations that retain the inflight-state of the ingestion pipeline.
- Support for sharing of data between the search module and core (initially Vector data, but potentially other data) -- aka "dedup".
Unlike a car frame which is replaced when you buy a new car - we will need to support the same RDB format for the foreseeable future. It is more like if the same car frame for the 2000 Honda Civic had to work for the 2001, 2002, 2003, etc :)
But let me add some examples here. Allen has a good start
In general I'd like to see this approached more abstractly. I think, as outlined here, it's relatively easy to add new one-dimensional sections. But I believe we might need the ability to add multi-dimensional sections in the future; for example, suppose I wanted to insert a two-dimensional table into the RDB. Am I forced to manually serialize this into a series of strings, or can we have a format that's more extensible? Examples include RESP, JSON, etc.
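To make the concern concrete, here is a rough sketch of what manually flattening a two-dimensional table into a series of RDB strings would look like. The layout and helper names are assumptions for illustration only, relying on the module's raw RDB save helpers (ValkeyModule_SaveUnsigned / ValkeyModule_SaveStringBuffer):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: row-major flattening of a 2-D table into raw RDB
// primitives -- dimensions first, then every cell as its own string.
void SaveTable(ValkeyModuleIO *io,
               const std::vector<std::vector<std::string>> &table) {
  ValkeyModule_SaveUnsigned(io, table.size());                         // rows
  ValkeyModule_SaveUnsigned(io, table.empty() ? 0 : table[0].size());  // cols
  for (const auto &row : table) {
    for (const auto &cell : row) {
      ValkeyModule_SaveStringBuffer(io, cell.data(), cell.size());
    }
  }
}
```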
rfc/rdb-format.md
Outdated
message RDBSection {
  RDBSectionType type = 1;
  bool required = 4;
  uint32 encoding_version = 5;
  oneof contents {
    IndexSchema index_schema_contents = 2;
    ...
  }
  uint32 supplemental_count = 3;
I think the per-section definition should be more generic than a protobuf. For example, this could make streaming of a new section essentially impossible. I think it makes more sense to generically design a new section as something easily extensible like an array of RDB strings. This is easily skipped by any code that doesn't understand the leading enum type-marker. It easily supports streaming and nothing prevents a new section from being 1 string that's interpreted as a protobuf when that's the right encoding.
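A minimal sketch of that idea, under stated assumptions (all names are hypothetical, and the framing assumes each section carries a leading type-marker plus a string count): code that does not recognize the marker can skip the whole section without understanding its contents.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: [type-marker][string count][string ...]. Unknown sections
// are skipped by discarding the counted strings.
void LoadSection(ValkeyModuleIO *io) {
  uint64_t type_marker = ValkeyModule_LoadUnsigned(io);
  uint64_t num_strings = ValkeyModule_LoadUnsigned(io);
  if (!IsKnownSectionType(type_marker)) {            // hypothetical predicate
    for (uint64_t i = 0; i < num_strings; ++i) {
      size_t len;
      char *blob = ValkeyModule_LoadStringBuffer(io, &len);
      ValkeyModule_Free(blob);                       // discard unrecognized payload
    }
    return;
  }
  ParseKnownSection(type_marker, io, num_strings);   // hypothetical dispatcher
}
```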
I'll start by saying I am not opposed to dropping protobuf from the top level. But for this proposal my goal was to encourage use of protocol buffers whenever possible, to force contributors to think about backwards and forwards compatibility as we add to these RDB contents. This design is based on the assumption that having it all wrapped in a protocol buffer will make using protocol buffers easier than not using them. If the top-level concept is a binary blob, the path of least resistance (and thus the de facto approach) for new content will be to just dump some strings into the RDB. If the top-level concept is a protocol buffer, the path of least resistance will be including the contents in that protocol buffer, and adding binary blobs will be the exceptional case for when that doesn't make sense. For the reasons previously stated, I would prefer the de facto approach to be protocol buffers, but I am open to counterpoints.
this could make streaming of a new section essentially impossible
The supplemental section is supposed to address the concerns about streaming of new sections. It is essentially what you are proposing with the array of RDB strings (although the design uses an EOF marker for simplicity, it is conceptually similar).
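For comparison with the count-based sketch above, this is roughly what skipping a supplemental section could look like under the EOF-marker framing. The framing and names are assumptions for illustration, not the final design:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed framing: an unsigned "continuation" flag precedes each blob, and 0
// serves as the EOF marker that terminates the supplemental section.
void SkipSupplementalSection(ValkeyModuleIO *io) {
  while (ValkeyModule_LoadUnsigned(io) != 0) {
    size_t len;
    char *blob = ValkeyModule_LoadStringBuffer(io, &len);
    ValkeyModule_Free(blob);  // unrecognized supplemental payload is discarded
  }
}
```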
First, I'm fully in favor of having protobufs in certain places -- they are a good solution for a wide range of problems. I'm objecting to having them be required everywhere, because there are known places where this will result in hundreds of millions of useless data copies, serializations, etc., which I think will ultimately have a material impact on load times -- which is bad for availability for all of us.
For example, the HNSW key fixup table will have one or two small values for each HNSW slot. Doing a protobuf for each and every slot will cost substantial performance. This is a case where the flexibility of protobufs is unlikely to be advantageous. With the current proposal, the only way to recover this performance would be to batch up blocks of slots into a single blob, which complicates code unnecessarily, when the simplicity of using the raw RDB I/O functions is much faster and clearer.
If we found that in the future we wanted to add to this struct (maybe by pre-computing 1/sqrt per slot -- just an example) we could easily just use a different supplemental section opcode and write the table out twice.
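To illustrate the performance argument (the one-uint64-per-slot layout is an assumption for the example, not the actual fixup table format), writing such a table with the raw RDB helpers avoids building, serializing, and copying a message per slot:

```cpp
#include <cstdint>
#include <vector>

// Write one integer per HNSW slot directly with the raw RDB save helper;
// no per-slot message construction or extra copies.
void SaveFixupTable(ValkeyModuleIO *io, const std::vector<uint64_t> &key_ids) {
  ValkeyModule_SaveUnsigned(io, key_ids.size());
  for (uint64_t id : key_ids) {
    ValkeyModule_SaveUnsigned(io, id);
  }
}
```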
rfc/rdb-format.md
Outdated
`message IndexSchema` follows from the existing definition in src/index_schema.proto.

The goal of breaking into sections is to support skipping optional sections if they are not understood. New sections should be introduced in a manner where failure to understand the new section will generally load fine without loss. Any time that failure to load the section would result in some form of lossiness or inconsistency, we will mark `required` as true and it will result in a downgrade failure. This would only be desired in cases where operators have used new features, and need to think about the downgrade more critically, potentially removing (or, once supported, altering) indexes that will not downgrade gracefully. Similarly, for section types that would like to introduce new required fields, we will include an encoding version, which is conditionally bumped when these new fields are written.
Conceptually this is nice. However, at the implementation level I am concerned that as the number of "required" fields multiplies, the testing matrix becomes unaffordable. IMO, semantic versioning works well enough here.
Can you explain how you would remove "required" by introducing semantic versioning? I would also like to reduce the section header bloat if possible (for test matrix and for complexity reduction)
Here's one thought. Suppose that we tagged ALL headers with a minimum semver, under the assumption that all headers -- even headers for sections that are skipped because they are unknown -- are visited. Then we can take the global MAX over all of the minimum semvers that we encounter. If the current code base is below that semver, then it's not able to process this RDB.
I do dislike the section header bloat BUT it collapses the testing matrix for backward compatibility.
It also solves another problem. Imagine a section that has a feature added that, when you use it, forces the 'required' bit for that section. So you have versions A and B. What if you add even MORE functionality to that section, i.e., version C? The old code that supports A and B can't tell that it doesn't support C. Replacing the required bit with a semver solves that problem.
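A minimal sketch of that load-side check, assuming a simple Semver type (all names hypothetical): the reader tracks the max of the per-section minimum versions, including for sections it skips, and rejects the RDB if the current server version falls below that.

```cpp
#include <tuple>
#include <vector>

struct Semver {
  int major = 0, minor = 0, patch = 0;
  bool operator<(const Semver &o) const {
    return std::tie(major, minor, patch) < std::tie(o.major, o.minor, o.patch);
  }
};

// Returns true if this server version can load an RDB whose section headers
// declared the given per-section minimum versions.
bool CanLoadRdb(const std::vector<Semver> &section_min_versions,
                const Semver &current_version) {
  Semver required{};  // max of all per-section minimums
  for (const auto &v : section_min_versions) {
    if (required < v) required = v;
  }
  return !(current_version < required);
}
```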
Yeah I think this is a good simplification. Basically, synthesize all the "Required" and "encoding version" headers into a single "minimum semantic version". Can we do the "max of all mins" on the writer side though? I think it should be pretty easy to synthesize it once per dump, rather than once per section. WDYT?
In practice, it probably makes sense to just have a ComputeMinimumSemanticVersion function that iterates all indexes and computes the minimum semantic version based on feature usage. This centralization of the logic probably makes it easier to understand than having to venture into each section's logic and understand how it computes the semantic version.
I think the "save" framework should force the individual sections to declare a semver. Then the framework can take care of doing the "max of all mins" and dump a value at the end. Similarly, the restore framework should tell each section the version it found as well as doing a global check at the end.
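One possible shape for that framework, sketched under stated assumptions (the SectionSerializer interface is hypothetical, the ComputeMinimumSemanticVersion name is borrowed from the comment above, and the Semver type is reused from the earlier sketch): each section declares its own minimum, and the save path synthesizes the max-of-mins once per dump.

```cpp
#include <memory>
#include <vector>

// Each section declares the minimum server version able to load what it is
// about to write, based on which features are actually in use.
class SectionSerializer {
 public:
  virtual ~SectionSerializer() = default;
  virtual Semver MinimumSemver() const = 0;
  virtual void Save(ValkeyModuleIO *io) const = 0;
};

// Synthesized once per dump by the save framework; the restore framework does
// the mirror-image global check against the value it reads back.
Semver ComputeMinimumSemanticVersion(
    const std::vector<std::unique_ptr<SectionSerializer>> &sections) {
  Semver required{};
  for (const auto &s : sections) {
    Semver v = s->MinimumSemver();
    if (required < v) required = v;
  }
  return required;
}
```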
`message IndexSchema` follows from the existing definition in src/index_schema.proto.

The goal of breaking into sections is to support skipping optional sections if they are not understood. New sections should be introduced in a manner where failure to understand the new section will generally load fine without loss. Any time that failure to load the section would result in some form of lossiness or inconsistency, we will mark `required` as true and it will result in a downgrade failure. This would only be desired in cases where operators have used new features, and need to think about the downgrade more critically, potentially removing (or, once supported, altering) indexes that will not downgrade gracefully. Similarly, for section types that would like to introduce new required fields, we will include an encoding version, which is conditionally bumped when these new fields are written.
I don't see the advantage in differentiating between protobuf content and supplemental content when it's trivially easy to have one uber-format that easily supports both types of data.
We could flatten the two - but my assumption here is that 90% of our data will belong to the index schema. What I like about the current proposal is that adding new sections to the index schema is supported natively by adding supplemental content sections. If we flatten this and have everything at the top level, we need some way to map payloads to index schemas. It's not hard to just dump the index schema name and DB number (which together compose a unique key for us to look up the index schema), but given that the majority of changes are likely to be part of the index schema contents, I feel that having first-class support for composition in the RDB format through supplemental sections will reduce the complexity.
An example might help. To me, the nested example seems less complex. But it may be a matter of preference:
Nested | Flattened
A few weeks back, we synced offline. We recognized there are various ways to implement this, and that we were okay with going in either direction. Given that, I went ahead and proceeded with my plan in PR #68.
For the concerns on sizing, I do have some data from the new format. I loaded 1000 random vectors of size 100 with a tag index, with all keys having "my_tag" as the tag field. I did this for both the new and old format and compared the size:
- New: 1,343,045 Bytes Avg (3 trials [1350309, 1320849, 1357979])
- Old: 1,320,786 Bytes Avg (3 trials [1306821, 1321885, 1333654])
So it seems it is roughly a 1.7% increase, which I think is acceptable given the benefits.
rfc/rdb-format.md
Outdated
#### Example: Adding Vector Quantization

With the above design, suppose that we are substantially changing the index to support a vector quantization option on `FT.CREATE`. For simplicity, suppose this is just a boolean "on" or "off" flag.
I think if we do VQ, it'll present itself as a new algo, i.e., HNSW, FLAT, and VQ (or something like that). That new vector index sub-type will likely have lots of blobs and special data structures. I doubt we can accurately predict that. Rather, we should focus on the more generic ability to add sections on a per-field-index basis.
re:VQ - The goal was just to demonstrate what happens when the index format may change. But it is a simplified example. I am not proposing this is how quantization would be added.
If you have a better example, happy to change it :)
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
No description provided.