HNSW using the linear package #3691
This has obviously taken a bit of time, but this is part one of the review. It covers:
- The HNSW class and core algorithm
- The Node and NodeKind classes
Still yet to look at are:
- The StorageAdapter and implementations
- Change sets
- Any of the changes to the linear and RaBitQ packages
- All tests
As hopefully is clear in the review, a lot of what's in it are requests for clarification. Some of these should probably turn into comments.
I also think that it would be good to take another look at the teamscale findings. Most of those are also pretty minor, but it would be good to try to conform a bit more to them. I'm less concerned about things like method length, nesting, or number of parameters (especially for private methods), but it would be nice to take another look at them.
Overall, I think the approach makes sense, though. Nice!
    @SuppressWarnings("checkstyle:AbbreviationAsWordInName")
    @Tag(Tags.RequiresFDB)
    @Tag(Tags.Slow)
    public class HNSWTest {
The division into tests here actually suggests how the HNSW test might be broken up to be more manageable in size, and then unit tested in parts.
Okay, this adds more to the review, in particular focusing on the storage serialization/deserialization. I still have:
- The changes to the other packages
- Tests
- Looking at updates since the last review
    /**
     * Subspace for (mostly) statistical analysis (like finding a centroid, etc.). Contains samples of vectors.
     */
    byte SUBSPACE_PREFIX_SAMPLES = 0x03;
What happened to 0x02? I see you have 0x00, 0x01, and 0x03.
It also might be a bit misleading to make these bytes, as I would assume that with a byte, we'd store things under specific byte prefixes. Something like:

    public Subspace getDataSubspace() {
        byte[] baseSubspace = getSubspace().pack();
        byte[] dataSubspace = ByteArrayUtil.join(baseSubspace, new byte[]{SUBSPACE_PREFIX_DATA});
        return new Subspace(dataSubspace);
    }

That's not what happens, and I think it's not what we'd want to happen either. Instead, we do something like:

    public Subspace getDataSubspace() {
        return getSubspace().subspace(Tuple.from(SUBSPACE_PREFIX_DATA));
    }

That would actually implicitly cast SUBSPACE_PREFIX_DATA as an integer, and then use the tuple encoding for the integer value. That is probably better, as it gives us a bit more flexibility and enforces the invariant that all of our keys are tuple parseable (see: #3566 (comment)), but if we do that, we might as well make these constants longs or ints.
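To make the difference between the two key layouts concrete, here is a minimal stdlib-only sketch. It does not call the FoundationDB API; `SubspacePrefixSketch`, `rawBytePrefix`, and `tupleIntPrefix` are hypothetical names, and the tuple branch is a simplified re-implementation of the tuple layer's integer encoding that only covers values 0 through 255 (type code 0x14 for zero, 0x15 followed by one byte for 1 through 255).

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the PR or of the FoundationDB bindings.
final class SubspacePrefixSketch {
    /** Layout A: the prefix constant is appended to the packed base subspace as one raw byte. */
    static byte[] rawBytePrefix(byte[] base, byte prefix) {
        byte[] out = Arrays.copyOf(base, base.length + 1);
        out[base.length] = prefix;
        return out;
    }

    /**
     * Layout B: the prefix is appended as a tuple-encoded integer.
     * Simplified assumption: only values 0..255 are handled here.
     */
    static byte[] tupleIntPrefix(byte[] base, long prefix) {
        if (prefix < 0 || prefix > 0xFF) {
            throw new IllegalArgumentException("sketch only covers 0..255");
        }
        // Zero has its own type code; small positive integers use 0x15 plus one payload byte.
        byte[] encoded = (prefix == 0)
                ? new byte[] {0x14}
                : new byte[] {0x15, (byte) prefix};
        byte[] out = Arrays.copyOf(base, base.length + encoded.length);
        System.arraycopy(encoded, 0, out, base.length, encoded.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] base = {0x01}; // stand-in for an already packed base subspace
        System.out.println(Arrays.toString(rawBytePrefix(base, (byte) 0x03)));
        System.out.println(Arrays.toString(tupleIntPrefix(base, 3L)));
    }
}
```

In layout B the Java type of the constant does not affect the stored bytes, since the value is encoded as an integer either way; only layout A actually depends on the constant being a `byte`.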
Some people have trouble counting. :-)
Do you want me to make them int or long? I kind of like these things as tight as possible type-wise. So unless we want to at some point use a 0xDEADBEEF or something, I kind of like it better this way.
I was thinking longs. That is what we use in FDBRecordStoreKeyspace, for example, as well as what the KeySpacePath expects for integral tuple fields.
I'm not sure I buy your argument here that it's better if the Java object here has fewer bits. In particular, I think byte is more appropriate if we're somehow leveraging the fact that it's already a byte in the encoding (e.g., appending it to some byte[]), not if it's just some integral value, like it is here. But I guess it doesn't really matter at the end of the day.
Will change it as suggested.
    @Nonnull final NeighborsChangeSet<NodeReference> neighborsChangeSet) {
    final byte[] key = getDataSubspace().pack(Tuple.from(layer, node.getPrimaryKey()));

    final List<Object> nodeItems = Lists.newArrayListWithExpectedSize(3);
This is another serialization form that we'd probably use protobuf for if protobuf were available to us. We don't really gain anything by using tuples here, as we don't need the values to be sorted. There's some amount of overhead to tuples that comes from the fact that their serialization needs to preserve order, but not that much. It's probably fine to continue using tuples unless we wanted to use something like protobuf for some other kind of benefit (like a clearer evolution path).
     * @param maxNumRead the maximum number of nodes to return in this scan
     * @return an {@link Iterable} that provides the nodes found in the specified layer range
     */
    Iterable<AbstractNode<N>> scanLayer(@Nonnull ReadTransaction readTransaction, int layer,
Should this return an AsyncIterable? The implementations seem to, actually.
The implementation in InliningStorageAdapter actually produces a regular Iterable which is tricky to make without creating an asynchronous reduce operator on AsyncIterable.
@normen662 @alecgrieser @MMcM Looking at the coverage report from the actions, I think the test gaps in teamscale are incorrect, but I would trust the findings, at least mostly.
I think I overall like what's being done with Transform. I did leave one comment about a usage pattern that is a bit surprising, if not understandable. I looked at the tests, and they seem like a good set of basic high-level tests. I'm not sure off the top of my head what improvements I'd like to see, but it does seem like we should stress it a bit more. It may also be the kind of thing where if we took the current version and then devised more interesting testing strategies, that would be fine
    /**
     * The vector that is stored with the item in the index. This vector is expressed in the client's coordinate
     * system and should not be of class {@link com.apple.foundationdb.linear.HalfRealVector},
     * {@link com.apple.foundationdb.linear.FloatRealVector}, or {@link com.apple.foundationdb.linear.DoubleRealVector}.
What type should it be if not these types? Encoded real vector?
Sorry, there was a "not" in the sentence that didn't make sense. To your question: It could be something like EncodedVector on paper. However, due to the final inverted transformation you will always end up with a "regular" vector.
     * {@link StorageTransform}.
     */
    @Nullable
    private final RealVector centroid;
One thing that confused me is that this is actually the rotated anti-centroid, not the centroid. That is to say, once we rotate the user vectors, we can add this value to center them over the origin. I see in HNSW that we are, in fact, correctly computing it (multiplying the centroid by negative one and then rotating), but it would be helpful if we noted that in the name of the variable here. We could also go with translationVector or something, though the comments should probably note that this should be the rotated anti-centroid
This being the centroid's evil twin is a direct result of making the AffineOperator congruent with its math definition. I'll name it negatedCentroid to be less dramatic and add a comment explaining why that is.
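As a rough illustration of why the stored translation is the rotated, negated centroid, here is a minimal sketch of an affine map y = R·x + t with t = R·(−c), which is the same as y = R·(x − c). The class and method names are hypothetical and plain `double[]` arrays stand in for the PR's vector types; this is not the AffineOperator from the linear package.

```java
// Hypothetical illustration only; names and types are assumptions, not the PR's API.
final class NegatedCentroidSketch {
    /** Matrix-vector product m * v. */
    static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += m[i][j] * v[j];
            }
            out[i] = sum;
        }
        return out;
    }

    /** Translation of the affine map: rotate the negated centroid, t = R * (-c). */
    static double[] translation(double[][] rotation, double[] centroid) {
        double[] negated = new double[centroid.length];
        for (int i = 0; i < centroid.length; i++) {
            negated[i] = -centroid[i];
        }
        return multiply(rotation, negated);
    }

    /** Affine map y = R * x + t. With t = R * (-c) this centers the rotated data on the origin. */
    static double[] apply(double[][] rotation, double[] translation, double[] x) {
        double[] y = multiply(rotation, x);
        for (int i = 0; i < y.length; i++) {
            y[i] += translation[i];
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] rotation = {{0.0, -1.0}, {1.0, 0.0}}; // 90-degree rotation in 2D
        double[] centroid = {1.0, 2.0};
        double[] t = translation(rotation, centroid);
        // The centroid itself should land on the origin.
        System.out.println(java.util.Arrays.toString(apply(rotation, t, centroid)));
    }
}
```

Storing t directly keeps the operator in the usual "linear part plus translation" form, which is why the field ends up holding the negated, rotated value rather than the centroid itself.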
    @Nonnull final Tuple keyTuple, @Nonnull final Tuple valueTuple) {
    final Tuple neighborPrimaryKey = keyTuple.getNestedTuple(2); // neighbor primary key
    final Transformed<RealVector> neighborVector =
    storageTransform.transform(
It actually confused me a bit why you were calling transform here. I believe it's correct, but it hinges on knowing that the underlying vector is only semantically modified if it is not an EncodedVector. (And also, the storageTransform is always the identity transformation if we are not using RaBitQ.) So it means that if we stored the non-transformed data (for example, we hadn't enabled RaBitQ on the HNSW yet), then this will transform the underlying data at read time. But if we had already done that, then we just keep what we read from storage.
It makes it seem more like ensureTransformed or something. Maybe that's a bit more obvious if this is pushed into the StorageAdapter. It's also possible that this is just tricky and there's not much we can do
Your assessment is exactly right. I don't think I can make this more concise but I will leave a comment in the code explaining this better.
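The "ensureTransformed" behavior described above can be sketched roughly as follows. Everything here is hypothetical: `Vec`, `RawVec`, `EncodedVec`, and `ensureTransformed` are stand-ins invented for illustration, and a simple shift stands in for the real storage transform. The only point is that a vector read in raw client coordinates is transformed at read time, while one that was already stored in transformed form passes through unchanged.

```java
// Hypothetical illustration only; none of these types exist in the PR under these names.
final class EnsureTransformedSketch {
    interface Vec {
        double[] data();
    }

    /** A vector stored in the client's coordinate system (not yet transformed). */
    static final class RawVec implements Vec {
        private final double[] data;
        RawVec(double[] data) { this.data = data.clone(); }
        @Override public double[] data() { return data.clone(); }
    }

    /** A vector that was already transformed before it was written. */
    static final class EncodedVec implements Vec {
        private final double[] data;
        EncodedVec(double[] data) { this.data = data.clone(); }
        @Override public double[] data() { return data.clone(); }
    }

    /**
     * Applies the (stand-in) transform only when the input has not been transformed yet.
     * An already-encoded vector is returned as-is, so calling this twice is harmless.
     */
    static EncodedVec ensureTransformed(Vec v, double[] shift) {
        if (v instanceof EncodedVec) {
            return (EncodedVec) v;
        }
        double[] in = v.data();
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] + shift[i];
        }
        return new EncodedVec(out);
    }

    public static void main(String[] args) {
        double[] shift = {-1.0, -2.0};
        EncodedVec once = ensureTransformed(new RawVec(new double[] {1.0, 2.0}), shift);
        EncodedVec twice = ensureTransformed(once, shift);
        System.out.println(once == twice); // same instance: the second call is a no-op
    }
}
```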
    TimeUnit.NANOSECONDS.toMillis(endTs - beginTs),
    onReadListener.getNodeCountByLayer(), onReadListener.getBytesReadByLayer(),
    String.format(Locale.ROOT, "%.2f", recall * 100.0d));
    Assertions.assertThat(recall).isGreaterThan(0.79);
This feels like a pretty low recall. Is that all that we can get reliably? Or does it change with the value of k in a way that is hard to write assertions about? Or is it because you're using random vectors here?
In this particular test, this assert tolerates at most 2 wrong vectors in the result set.
I increased the number of query vectors to have better fine-grained control. That also allowed me to get to > 0.9.
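To show why a single small result set makes the recall threshold so coarse, here is a minimal sketch of a recall computation. The thread does not state k; the example assumes a result set of 10, where each wrong vector costs 0.1 of recall, so a bound of 0.79 admits 8 hits but not 7. `RecallSketch` and its method are hypothetical names, not the test's actual helper.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the helper used in HNSWTest.
final class RecallSketch {
    /** Fraction of the true nearest neighbors that appear in the returned results. */
    static double recall(List<Integer> results, Set<Integer> groundTruth) {
        int hits = 0;
        for (Integer id : results) {
            if (groundTruth.contains(id)) {
                hits++;
            }
        }
        return (double) hits / groundTruth.size();
    }

    public static void main(String[] args) {
        Set<Integer> truth = Set.copyOf(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        // Two of the ten returned ids (11 and 12) are not true neighbors.
        List<Integer> results = List.of(1, 2, 3, 4, 5, 6, 7, 8, 11, 12);
        System.out.println(recall(results, truth) > 0.79);
    }
}
```

Averaging recall over more query vectors shrinks the step size between achievable values, which is what makes a tighter bound such as 0.9 practical to assert.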
    }

    @Test
    @SuperSlow
Does this have to be done in SuperSlow mode? Is there a variant where we load, say, a subset of the data set and we're able to do that in normal tests?
This test takes exactly 3min10sec to run. It should be fine to not run it under @SuperSlow. If you feel like that's too long, there is not much we can do as this is a real dataset and this is its smallest size. We could filter this dataset to something even smaller but then we would have to host it.
Update, it is not fine. I will put it under @SuperSlow again. I think the existing coverage without this always being enabled should be enough, though.
A few more updates.
This PR implements the HNSW paper using the recently introduced linear package together with RaBitQ.