Introduce RFC for RDB format #28

Open · wants to merge 7 commits into base: main

Conversation

murphyjacob4 (Collaborator)

No description provided.


## Motivation

Our existing RDB format is a good start, but it is fairly rigid: it does not support storing new types of data in the RDB beyond index definitions and index contents.

Member:

I think we need to articulate the real use cases here. Being rigid on its own is neutral IMO - we do want the car's frame to be rigid :)

Member (@allenss-amazon), Jan 27, 2025:

There are items that have been discussed that could all lead to changes in the RDB format.

  1. Save operations that retain the in-flight state of the ingestion pipeline.
  2. Support for sharing of data between the search module and core (initially Vector data, but potentially other data) -- aka "dedup".

Collaborator Author (murphyjacob4):

Unlike a car frame which is replaced when you buy a new car - we will need to support the same RDB format for the foreseeable future. It is more like if the same car frame for the 2000 Honda Civic had to work for the 2001, 2002, 2003, etc :)

But let me add some examples here. Allen has a good start.

Member (@allenss-amazon) left a comment:

In general I'd like to see this approached more abstractly. As outlined here, I think it's relatively easy to add new one-dimensional sections. But I believe we might need the ability to add multi-dimensional sections in the future; for example, suppose I wanted to insert a two-dimensional table into the RDB. Am I forced to manually serialize this into a series of strings, or can we have a format that's more extensible? Examples include RESP, JSON, etc.

Comment on lines 88 to 96
message RDBSection {
  RDBSectionType type = 1;
  bool required;
  uint32 encoding_version;
  oneof contents {
    IndexSchema index_schema_contents = 2;
    ...
  };
  uint32 supplemental_count = 3;

Member:

I think the per-section definition should be more generic than a protobuf. For example, this could make streaming of a new section essentially impossible. I think it makes more sense to generically design a new section as something easily extensible like an array of RDB strings. This is easily skipped by any code that doesn't understand the leading enum type-marker. It easily supports streaming and nothing prevents a new section from being 1 string that's interpreted as a protobuf when that's the right encoding.
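As a rough C++ sketch of this "array of RDB strings" idea (the I/O callables and type handling below are assumptions for illustration, not the RFC's API):

```cpp
// A guess at the "array of RDB strings" section framing -- not the RFC's
// actual API. The callables stand in for the module's raw RDB I/O helpers.
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct RawSection {
  uint32_t type = 0;                 // leading enum type-marker
  std::vector<std::string> strings;  // opaque payload; one entry could itself be a serialized protobuf
};

// Reads one section. A loader that does not recognize `type` still knows how
// many strings to consume, so it can skip the section without understanding it.
std::optional<RawSection> ReadSection(
    const std::function<uint64_t()>& load_unsigned,
    const std::function<std::string()>& load_string,
    uint32_t highest_known_type) {
  RawSection section;
  section.type = static_cast<uint32_t>(load_unsigned());
  const uint64_t count = load_unsigned();
  const bool known = section.type <= highest_known_type;
  for (uint64_t i = 0; i < count; ++i) {
    std::string s = load_string();
    if (known) section.strings.push_back(std::move(s));  // unknown sections: read and discard
  }
  return known ? std::optional<RawSection>(std::move(section)) : std::nullopt;
}
```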

Collaborator Author (murphyjacob4):

I'll start by saying I am not opposed to dropping protobuf from the top level. But for this proposal my goal was to encourage use of protocol buffers whenever possible, to force contributors to think about backwards and forwards compatibility as we add to these RDB contents. This design is based on the assumption that having it all wrapped in a protocol buffer will make using protocol buffers easier than not using them. If the top-level concept is a binary blob, the path of least resistance (and thus the de facto approach) for new content will be to just dump some strings into the RDB. If the top-level concept is a protocol buffer, the path of least resistance will be to include the contents in that protocol buffer, and adding binary blobs will be the exceptional case for when that doesn't make sense. For the reasons previously stated, I would prefer that the de facto approach be protocol buffers, but I am open to counterpoints.

> this could make streaming of a new section essentially impossible

The supplemental section is supposed to address the concerns about streaming of new sections. The supplemental section is essentially what you are proposing with the array of RDB strings (although the design uses an EOF marker for simplicity, it is conceptually similar).
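A rough sketch of that chunk streaming, assuming the empty-string EOF framing shown in the nested example later in this thread (the load callable stands in for the module's raw RDB string loader):

```cpp
// Illustrative only: reads the chunks of one supplemental section.
#include <functional>
#include <string>
#include <vector>

// Stream chunks until the empty-string EOF marker. A loader that does not
// understand the payload can run the same loop and simply discard each chunk,
// which keeps unknown supplemental content skippable.
std::vector<std::string> ReadSupplementalChunks(
    const std::function<std::string()>& load_string) {
  std::vector<std::string> chunks;
  for (;;) {
    std::string chunk = load_string();
    if (chunk.empty()) break;  // empty chunk == EOF marker
    chunks.push_back(std::move(chunk));
  }
  return chunks;
}
```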

Member (@allenss-amazon), Jan 29, 2025:

First, I'm fully in favor of having protobufs in certain places -- they are a good solution for a wide range of problems. I'm objecting to having them be required everywhere, because there are known places where this will result in hundreds of millions of useless data copies, serializations, etc., which I think will ultimately have a material impact on load times -- which is bad for availability for all of us.

For example, the HNSW key fixup table will have one or two small values for each HNSW slot. Doing a protobuf for each and every slot will cost substantial performance. This is a case where the flexibility of protobufs is unlikely to be advantageous. With the current proposal, the only way to recover this performance would be to batch up blocks of slots into a single blob, which complicates code unnecessarily, when the simplicity of using the raw RDB I/O functions is much faster and clearer.

If we found out that in the future we wanted to add to this struct (maybe by pre-computing 1/sqrt per slot -- just an example) we could easily just use a different supplement section opcode and write the table out twice.
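As a rough illustration of the raw per-slot I/O being described (the SlotEntry layout and writer callable below are assumptions, not the module's actual fixup-table format):

```cpp
// Illustrative sketch only.
#include <cstdint>
#include <functional>
#include <vector>

struct SlotEntry {
  uint64_t internal_id;  // HNSW slot
  uint64_t key_ref;      // the "one or two small values" per slot
};

// Raw form: two integer writes per slot -- no per-slot message construction,
// serialization buffer, or length prefix. A per-slot protobuf would instead
// allocate, populate, and serialize a message for every slot, which is the
// overhead being objected to; batching slots into blobs recovers the speed
// but adds framing code.
void SaveFixupTableRaw(const std::function<void(uint64_t)>& save_unsigned,
                       const std::vector<SlotEntry>& table) {
  save_unsigned(table.size());
  for (const SlotEntry& e : table) {
    save_unsigned(e.internal_id);
    save_unsigned(e.key_ref);
  }
}
```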


`message IndexSchema` follows from the existing definition in src/index_schema.proto.

The goal of breaking the format into sections is to support skipping optional sections when they are not understood. New sections should be introduced in a manner where a reader that does not understand the new section will generally still load fine without loss. Any time that failure to load a section would result in some form of lossiness or inconsistency, we will mark `required` as true, and it will result in a downgrade failure. This is only desirable in cases where operators have used new features and need to think about the downgrade more critically, potentially removing (or, once supported, altering) indexes that will not downgrade gracefully. Similarly, for section types that would like to introduce new required fields, we will include an encoding version, which is conditionally bumped when these new fields are written.
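A minimal sketch of the skip-versus-fail rule this paragraph describes, assuming the RDBSection fields quoted above (the constants and outcome type are illustrative, not part of the RFC):

```cpp
#include <cstdint>

enum class LoadOutcome { kLoaded, kSkipped, kDowngradeFailure };

struct SectionHeader {
  uint32_t type;
  bool required;
  uint32_t encoding_version;
};

constexpr uint32_t kHighestKnownSectionType = 3;   // hypothetical
constexpr uint32_t kSupportedEncodingVersion = 1;  // hypothetical

LoadOutcome HandleSectionHeader(const SectionHeader& h) {
  const bool understood = h.type <= kHighestKnownSectionType &&
                          h.encoding_version <= kSupportedEncodingVersion;
  if (understood) return LoadOutcome::kLoaded;
  // Optional sections we don't understand are skipped without loss; required
  // ones force a downgrade failure so the operator can intervene.
  return h.required ? LoadOutcome::kDowngradeFailure : LoadOutcome::kSkipped;
}
```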

Member:

Conceptually this is nice. However, at the implementation level I am concerned that as the number of "required" fields multiplies, the testing matrix becomes unaffordable. IMO, semantic versioning works well enough here.

Collaborator Author (murphyjacob4):

Can you explain how you would remove "required" by introducing semantic versioning? I would also like to reduce the section header bloat if possible (for the test matrix and for complexity reduction).

Member:

Here's one thought. Suppose that we tagged ALL headers with a minimum semver, under the assumption that all headers -- even headers for sections that are skipped because they are unknown -- are visited. Then we can take the global MAX of all of the minimum semvers that we encounter; if the current code base is below that semver, then it's not able to process this RDB.

I do dislike the section header bloat BUT it collapses the testing matrix for backward compatibility.

It also solves another problem. Imagine a section that has a feature added which, when used, forces the 'required' bit for that section. So you have versions A and B. What if you add even MORE functionality to that section, i.e., version C? The old code that supports A and B can't tell that it doesn't support C. Replacing the required bit with a semver solves that problem.

Collaborator Author (murphyjacob4):

Yeah I think this is a good simplification. Basically, synthesize all the "Required" and "encoding version" headers into a single "minimum semantic version". Can we do the "max of all mins" on the writer side though? I think it should be pretty easy to synthesize it once per dump, rather than once per section. WDYT?

In practice, it probably makes sense to just have a ComputeMinimumSemanticVersion function that iterates all indexes and computes the minimum semantic version based on feature usage. This centralization of the logic probably makes it easier to understand than having to venture into each section's logic and understand how it computes the semantic version.

Member:

I think the "save" framework should force the individual sections to declare a semver. Then the framework can take care of doing the "max of all mins" and dump a value at the end. Similarly, the restore framework should tell each section the version it found as well as doing a global check at the end.

Member:

I don't see the advantage in differentiating between protobuf content and supplemental content when it's trivially easy to have one uber-format that easily supports both types of data.

Collaborator Author (murphyjacob4):

We could flatten the two - but my assumption here is that 90% of our data will belong to the index schema. What I like about the current proposal is that adding new sections to the index schema is supported natively by adding supplemental content sections. If we flatten this and have everything at the top level, we need some way to map payloads to index schemas. It's not hard to just dump the index schema name and DB number (which together compose a unique key for us to look up the index schema), but given that the majority of changes are likely to be part of the index schema contents, I feel that having first-class support for composition in the RDB format through supplemental sections will reduce the complexity.

An example might help. To me, the nested example seems less complex. But it may be a matter of preference:

Nested:

RDBSection {
   type: RDB_SECTION_INDEX_SCHEMA
   required: true
   encoding_version: 1
   index_schema_definition: {
      name: "my_index",
      db_num: 0,
      ...
   }
   supplemental_count: 1
}
SupplementalContentHeader {
   type: SUPPLEMENTAL_CONTENT_MY_NEW_PAYLOAD
   required: ...
   encoding_version: 1
   my_new_payload_header: {}
}
SupplementalContentChunk {
   contents: "[my_new_payload_chunk_1]"
}
SupplementalContentChunk {
   contents: "[my_new_payload_chunk_2]"
}
SupplementalContentChunk {
   contents: "[my_new_payload_chunk_3]"
}
...
SupplementalContentChunk {
   contents: "" (EOF)
}

Flattened:

RDBSectionHeader {
   type: RDB_SECTION_INDEX_SCHEMA
   required: true
   encoding_version: 1
}
/* Count of RDB strings */
1
/* Serialized index schema proto */
"IndexSchema{name:\"my_index\", db_num: 0, ...}"
RDBSectionHeader {
   type: RDB_SECTION_MY_NEW_PAYLOAD
   required: ...
   encoding_version: 1
}
/* Count of RDB strings */
10
/* Reference to what index this belongs to */
"my_index"  /* name */
"0"         /* db_num */
/* Now the contents begin */
"[my_new_payload_chunk_1]"
"[my_new_payload_chunk_2]"
"[my_new_payload_chunk_3]"
...

Collaborator Author (murphyjacob4):

A few weeks back we synced offline. We recognized there are various ways to implement this, and that we were okay with going in either direction. Given that, I went ahead and proceeded with my plan in PR #68.

For the concerns on sizing, I do have some data from the new format. I loaded 1000 random vectors of size 100 with a tag index, with all keys having "my_tag" as the tag field. I did this for both the new and old format and compared the size:

  • New: 1,343,045 Bytes Avg (3 trials [1350309, 1320849, 1357979])
  • Old: 1,320,786 Bytes Avg (3 trials [1306821, 1321885, 1333654])

So it seems it is roughly a 1.6% increase, which I think is acceptable given the benefits.


#### Example: Adding Vector Quantization

With the above design, suppose that we are substantially changing the index to support a vector quantization option on `FT.CREATE`. For simplicity, suppose this is just a boolean "on" or "off" flag.
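A minimal sketch of how the writer side might gate the version bump on feature usage, assuming this boolean flag (the names and version numbers below are made up for illustration):

```cpp
#include <cstdint>

struct VectorIndexOptions {
  bool quantization_enabled = false;  // the hypothetical FT.CREATE flag
};

constexpr uint32_t kBaseEncodingVersion = 1;
constexpr uint32_t kQuantizationEncodingVersion = 2;

// The writer only bumps the section's encoding version (or, in the semver
// variant discussed earlier, the dump's minimum version) when the new feature
// is actually in use, so RDBs from operators who never enable quantization
// keep loading on older builds.
uint32_t EncodingVersionFor(const VectorIndexOptions& opts) {
  return opts.quantization_enabled ? kQuantizationEncodingVersion
                                   : kBaseEncodingVersion;
}
```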

Member:

I think if we do VQ, it'll present itself as a new algo, i.e., HNSW, FLAT, and VQ (or something like that). That new vector index sub-type will likely have lots of blobs and special data structures. I doubt we can accurately predict that. Rather, we should focus on the more generic ability to add sections on a per-field-index basis.

Collaborator Author (murphyjacob4):

re:VQ - The goal was just to demonstrate what happens when the index format may change. But it is a simplified example. I am not proposing this is how quantization would be added.

If you have a better example, happy to change it :)
