Introduce RFC for RDB format #28

Open · wants to merge 7 commits into base: main

Conversation

murphyjacob4 (Collaborator)

No description provided.


## Motivation

Our existing RDB format is a good start, but it is fairly rigid: it does not support storing new types of data in the RDB beyond index definitions and index contents.

Member:

I think we need to articulate the real use cases here. Being rigid on its own is neutral IMO - we do want the car's frame to be rigid :)

Member (@allenss-amazon), Jan 27, 2025:

There are items that have been discussed that could all lead to changes in the RDB format.

  1. Save operations that retain the in-flight state of the ingestion pipeline.
  2. Support for sharing of data between the search module and core (initially Vector data, but potentially other data) -- aka "dedup".

Collaborator Author (murphyjacob4):

Unlike a car frame which is replaced when you buy a new car - we will need to support the same RDB format for the foreseeable future. It is more like if the same car frame for the 2000 Honda Civic had to work for the 2001, 2002, 2003, etc :)

But let me add some examples here. Allen has a good start.

Member (@allenss-amazon) left a comment:

In general I'd like to see this approached more abstractly. As outlined here, I think it's relatively easy to add new one-dimensional sections. But I believe we might need the ability to add multi-dimensional sections in the future; for example, suppose I wanted to insert a two-dimensional table into the RDB. Am I forced to manually serialize this into a series of strings, or can we have a format that's more extensible? Examples include RESP, JSON, etc.

Comment on lines 88 to 96
message RDBSection {
  RDBSectionType type = 1;
  bool required;
  uint32 encoding_version;
  oneof contents {
    IndexSchema index_schema_contents = 2;
    ...
  };
  uint32 supplemental_count = 3;

Member:

I think the per-section definition should be more generic than a protobuf. For example, this could make streaming of a new section essentially impossible. I think it makes more sense to generically design a new section as something easily extensible like an array of RDB strings. This is easily skipped by any code that doesn't understand the leading enum type-marker. It easily supports streaming and nothing prevents a new section from being 1 string that's interpreted as a protobuf when that's the right encoding.
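As a rough C++ sketch of this "array of RDB strings" idea (the I/O callables and type handling below are assumptions for illustration, not the RFC's API):

```cpp
// A guess at the "array of RDB strings" section framing -- not the RFC's
// actual API. The callables stand in for the module's raw RDB I/O helpers.
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct RawSection {
  uint32_t type = 0;                 // leading enum type-marker
  std::vector<std::string> strings;  // opaque payload; one entry could itself be a serialized protobuf
};

// Reads one section. A loader that does not recognize `type` still knows how
// many strings to consume, so it can skip the section without understanding it.
std::optional<RawSection> ReadSection(
    const std::function<uint64_t()>& load_unsigned,
    const std::function<std::string()>& load_string,
    uint32_t highest_known_type) {
  RawSection section;
  section.type = static_cast<uint32_t>(load_unsigned());
  const uint64_t count = load_unsigned();
  const bool known = section.type <= highest_known_type;
  for (uint64_t i = 0; i < count; ++i) {
    std::string s = load_string();
    if (known) section.strings.push_back(std::move(s));  // unknown sections: read and discard
  }
  return known ? std::optional<RawSection>(std::move(section)) : std::nullopt;
}
```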

Collaborator Author (murphyjacob4):

I'll start by saying I am not opposed to dropping protobuf from the top level. But for this proposal my goal was to encourage use of protocol buffers whenever possible, to force contributors to think about backwards and forwards compatibility as we add to these RDB contents. This design is based on the assumption that having it all wrapped in a protocol buffer will make using protocol buffers easier than not using them. If the top-level concept is a binary blob, the path of least resistance (and thus the de facto approach) for new content will be to just dump some strings into the RDB. If the top-level concept is a protocol buffer, the path of least resistance will be to include the contents in that protocol buffer, and adding binary blobs will be the exceptional case for when that doesn't make sense. For the reasons previously stated, I would prefer that the de facto approach be protocol buffers, but I am open to counterpoints.

> this could make streaming of a new section essentially impossible

The supplemental section is supposed to address the concerns about streaming of new sections. The supplemental section is essentially what you are proposing with the array of RDB strings (although the design uses an EOF marker for simplicity, it is conceptually similar).
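A rough sketch of that chunk streaming, assuming the empty-string EOF framing shown in the nested example later in this thread (the load callable stands in for the module's raw RDB string loader):

```cpp
// Illustrative only: reads the chunks of one supplemental section.
#include <functional>
#include <string>
#include <vector>

// Stream chunks until the empty-string EOF marker. A loader that does not
// understand the payload can run the same loop and simply discard each chunk,
// which keeps unknown supplemental content skippable.
std::vector<std::string> ReadSupplementalChunks(
    const std::function<std::string()>& load_string) {
  std::vector<std::string> chunks;
  for (;;) {
    std::string chunk = load_string();
    if (chunk.empty()) break;  // empty chunk == EOF marker
    chunks.push_back(std::move(chunk));
  }
  return chunks;
}
```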

Member (@allenss-amazon), Jan 29, 2025:

First, I'm fully in favor of having protobufs in certain places -- they are a good solution for a wide range of problems. I'm objecting to having them be required everywhere, because there are known places where this will result in hundreds of millions of useless data copies, serializations, etc., which I think will ultimately have a material impact on load times -- which is bad for availability for all of us.

For example, the HNSW key fixup table will have one or two small values for each HNSW slot. Doing a protobuf for each and every slot will cost substantial performance. This is a case where the flexibility of protobufs is unlikely to be advantageous. With the current proposal, the only way to recover this performance would be to batch up blocks of slots into a single blob, which complicates code unnecessarily, when the simplicity of using the raw RDB I/O functions is much faster and clearer.

If we found out that in the future we wanted to add to this struct (maybe by pre-computing 1/sqrt per slot -- just an example) we could easily just use a different supplement section opcode and write the table out twice.
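As a rough illustration of the raw per-slot I/O being described (the SlotEntry layout and writer callable below are assumptions, not the module's actual fixup-table format):

```cpp
// Illustrative sketch only.
#include <cstdint>
#include <functional>
#include <vector>

struct SlotEntry {
  uint64_t internal_id;  // HNSW slot
  uint64_t key_ref;      // the "one or two small values" per slot
};

// Raw form: two integer writes per slot -- no per-slot message construction,
// serialization buffer, or length prefix. A per-slot protobuf would instead
// allocate, populate, and serialize a message for every slot, which is the
// overhead being objected to; batching slots into blobs recovers the speed
// but adds framing code.
void SaveFixupTableRaw(const std::function<void(uint64_t)>& save_unsigned,
                       const std::vector<SlotEntry>& table) {
  save_unsigned(table.size());
  for (const SlotEntry& e : table) {
    save_unsigned(e.internal_id);
    save_unsigned(e.key_ref);
  }
}
```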


`message IndexSchema` follows from the existing definition in src/index_schema.proto.

The goal of breaking the format into sections is to support skipping optional sections when they are not understood. New sections should be introduced in a manner where a reader that does not understand the new section will generally still load fine without loss. Any time that failure to load a section would result in some form of lossiness or inconsistency, we will mark `required` as true, and it will result in a downgrade failure. This is only desirable in cases where operators have used new features and need to think about the downgrade more critically, potentially removing (or, once supported, altering) indexes that will not downgrade gracefully. Similarly, for section types that would like to introduce new required fields, we will include an encoding version, which is conditionally bumped when these new fields are written.
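A minimal sketch of the skip-versus-fail rule this paragraph describes, assuming the RDBSection fields quoted above (the constants and outcome type are illustrative, not part of the RFC):

```cpp
#include <cstdint>

enum class LoadOutcome { kLoaded, kSkipped, kDowngradeFailure };

struct SectionHeader {
  uint32_t type;
  bool required;
  uint32_t encoding_version;
};

constexpr uint32_t kHighestKnownSectionType = 3;   // hypothetical
constexpr uint32_t kSupportedEncodingVersion = 1;  // hypothetical

LoadOutcome HandleSectionHeader(const SectionHeader& h) {
  const bool understood = h.type <= kHighestKnownSectionType &&
                          h.encoding_version <= kSupportedEncodingVersion;
  if (understood) return LoadOutcome::kLoaded;
  // Optional sections we don't understand are skipped without loss; required
  // ones force a downgrade failure so the operator can intervene.
  return h.required ? LoadOutcome::kDowngradeFailure : LoadOutcome::kSkipped;
}
```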

Member:

Conceptually this is nice. However, at the implementation level I am concerned that as the number of "required" fields multiplies, the testing matrix becomes unaffordable. IMO, semantic versioning works well enough here.

Collaborator Author (murphyjacob4):

Can you explain how you would remove "required" by introducing semantic versioning? I would also like to reduce the section header bloat if possible (for the test matrix and for complexity reduction).

Member:

Here's one thought. Suppose that we tagged ALL headers with a minimum semver, under the assumption that all headers -- even headers for sections that are skipped because they are unknown -- are visited. Then we can take the global MAX of all of the minimum semvers that we encounter; if the current code base is below that semver, then it's not able to process this RDB.

I do dislike the section header bloat BUT it collapses the testing matrix for backward compatibility.

It also solves another problem. Imagine a section that has a feature added which, when used, forces the 'required' bit for that section. So you have versions A and B. What if you add even MORE functionality to that section, i.e., version C? The old code that supports A and B can't tell that it doesn't support C. Replacing the required bit with a semver solves that problem.

Collaborator Author (murphyjacob4):

Yeah I think this is a good simplification. Basically, synthesize all the "Required" and "encoding version" headers into a single "minimum semantic version". Can we do the "max of all mins" on the writer side though? I think it should be pretty easy to synthesize it once per dump, rather than once per section. WDYT?

In practice, it probably makes sense to just have a ComputeMinimumSemanticVersion function that iterates all indexes and computes the minimum semantic version based on feature usage. This centralization of the logic probably makes it easier to understand than having to venture into each section's logic and understand how it computes the semantic version.

Member:

I think the "save" framework should force the individual sections to declare a semver. Then the framework can take care of doing the "max of all mins" and dump a value at the end. Similarly, the restore framework should tell each section the version it found as well as doing a global check at the end.

Member:

I don't see the advantage in differentiating between protobuf content and supplemental content when it's trivially easy to have one uber-format that easily supports both types of data.

Collaborator Author (murphyjacob4):

We could flatten the two - but my assumption here is that 90% of our data will belong to the index schema. What I like about the current proposal is that adding new sections to the index schema is supported natively by adding supplemental content sections. If we flatten this and have everything at the top level, we need some way to map payloads to index schemas. It's not hard to just dump the index schema name and DB number (which together compose a unique key for us to look up the index schema), but given that the majority of changes are likely to be part of the index schema contents, I feel that having first-class support for composition in the RDB format through supplemental sections will reduce the complexity.

An example might help. To me, the nested example seems less complex. But it may be a matter of preference:

Nested:

RDBSection {
   type: RDB_SECTION_INDEX_SCHEMA
   required: true
   encoding_version: 1
   index_schema_definition: {
      name: "my_index",
      db_num: 0,
      ...
   }
   supplemental_count: 1
}
SupplementalContentHeader {
   type: SUPPLEMENTAL_CONTENT_MY_NEW_PAYLOAD
   required: ...
   encoding_version: 1
   my_new_payload_header: {}
}
SupplementalContentChunk {
   contents: "[my_new_payload_chunk_1]"
}
SupplementalContentChunk {
   contents: "[my_new_payload_chunk_2]"
}
SupplementalContentChunk {
   contents: "[my_new_payload_chunk_3]"
}
...
SupplementalContentChunk {
   contents: "" (EOF)
}

Flattened:

RDBSectionHeader {
   type: RDB_SECTION_INDEX_SCHEMA
   required: true
   encoding_version: 1
}
/* Count of RDB strings */
1
/* Serialized index schema proto */
"IndexSchema{name:\"my_index\", db_num: 0, ...}"
RDBSectionHeader {
   type: RDB_SECTION_MY_NEW_PAYLOAD
   required: ...
   encoding_version: 1
}
/* Count of RDB strings */
10
/* Reference to what index this belongs to */
"my_index"  /* name */
"0"         /* db_num */
/* Now the contents begin */
"[my_new_payload_chunk_1]"
"[my_new_payload_chunk_2]"
"[my_new_payload_chunk_3]"
...

Collaborator Author (murphyjacob4):

A few weeks back we synced offline. We recognized there are various ways to implement this, and that we were okay with going in either direction. Given that, I went ahead and proceeded with my plan in PR #68.

For the concerns on sizing, I do have some data from the new format. I loaded 1000 random vectors of size 100 with a tag index, with all keys having "my_tag" as the tag field. I did this for both the new and old format and compared the size:

  • New: 1,343,045 Bytes Avg (3 trials [1350309, 1320849, 1357979])
  • Old: 1,320,786 Bytes Avg (3 trials [1306821, 1321885, 1333654])

So it seems it is roughly a 1.6% increase, which I think is acceptable given the benefits.


#### Example: Adding Vector Quantization

With the above design, suppose that we are substantially changing the index to support a vector quantization option on `FT.CREATE`. For simplicity, suppose this is just a boolean "on" or "off" flag.
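A minimal sketch of how the writer side might gate the version bump on feature usage, assuming this boolean flag (the names and version numbers below are made up for illustration):

```cpp
#include <cstdint>

struct VectorIndexOptions {
  bool quantization_enabled = false;  // the hypothetical FT.CREATE flag
};

constexpr uint32_t kBaseEncodingVersion = 1;
constexpr uint32_t kQuantizationEncodingVersion = 2;

// The writer only bumps the section's encoding version (or, in the semver
// variant discussed earlier, the dump's minimum version) when the new feature
// is actually in use, so RDBs from operators who never enable quantization
// keep loading on older builds.
uint32_t EncodingVersionFor(const VectorIndexOptions& opts) {
  return opts.quantization_enabled ? kQuantizationEncodingVersion
                                   : kBaseEncodingVersion;
}
```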

Member:

I think if we do VQ, it'll present itself as a new algo, i.e., HNSW, FLAT, and VQ (or something like that). That new vector index sub-type will likely have lots of blobs and special data structures. I doubt we can accurately predict that. Rather, we should focus on the more generic ability to add sections on a per-field-index basis.

Collaborator Author (murphyjacob4):

re:VQ - The goal was just to demonstrate what happens when the index format may change. But it is a simplified example. I am not proposing this is how quantization would be added.

If you have a better example, happy to change it :)
