@normen662 (Contributor) commented Oct 22, 2025:

This PR implements the HNSW paper using the recently introduced linear package together with RaBitQ.

@normen662 normen662 force-pushed the hnsw-on-linear branch 4 times, most recently from 4cfef2a to 3a04055 Compare October 24, 2025 16:49
@normen662 normen662 requested a review from alecgrieser October 28, 2025 19:17
@normen662 normen662 added the enhancement New feature or request label Oct 28, 2025
@normen662 normen662 force-pushed the hnsw-on-linear branch 3 times, most recently from 8fcb3a6 to fc7994f Compare October 29, 2025 13:15
@alecgrieser (Collaborator) left a comment:

This has obviously taken a bit of time, but this is part one of the review. It covers:

  1. The HNSW class and core algorithm
  2. The Node and NodeKind classes

Still yet to look at are:

  1. The StorageAdapter and implementations
  2. Change sets
  3. Any of the changes to the linear and RaBitQ packages
  4. All tests

As is hopefully clear from the review, a lot of what's in it is requests for clarification. Some of these should probably turn into comments.

I also think that it would be good to take another look at the Teamscale findings. Most of those are also pretty minor, but it would be good to try to conform a bit more to them. I'm less concerned about things like method length, nesting, or number of parameters (especially for private methods), but they are worth another look.

Overall, I think the approach makes sense, though. Nice!

@SuppressWarnings("checkstyle:AbbreviationAsWordInName")
@Tag(Tags.RequiresFDB)
@Tag(Tags.Slow)
public class HNSWTest {
Collaborator:

The division into tests here actually suggests how the HNSW test might be broken up to be more manageable in size, and then unit tested in parts.

@normen662 normen662 force-pushed the hnsw-on-linear branch 2 times, most recently from 267e633 to 1963a0a Compare October 30, 2025 15:41
@alecgrieser (Collaborator) left a comment:

Okay, this adds more to the review, in particular focusing on the storage serialization/deserialization. I still have:

  1. The changes to the other packages
  2. Tests
  3. Looking at updates since the last review

/**
* Subspace for (mostly) statistical analysis (like finding a centroid, etc.). Contains samples of vectors.
*/
byte SUBSPACE_PREFIX_SAMPLES = 0x03;
Collaborator:

What happened to 0x02? I see you have 0x00, 0x01, and 0x03.

It also might be a bit misleading to make these bytes, as I would assume that with a byte, we'd store things under specific byte prefixes. Something like:

public Subspace getDataSubspace() {
    byte[] baseSubspace = getSubspace().pack();
    byte[] dataSubspace = ByteArrayUtil.join(baseSubspace, new byte[]{SUBSPACE_PREFIX_DATA});
    return new Subspace(dataSubspace);
}

That's not what happens, and I think it's not what we'd want to happen either. Instead, we do something like:

public Subspace getDataSubspace() {
    return getSubspace().subspace(Tuple.from(SUBSPACE_PREFIX_DATA));
}

That would actually implicitly cast SUBSPACE_PREFIX_DATA to an integer and then use the tuple encoding for the integer value. That is probably better, as it gives us a bit more flexibility and enforces the invariant that all of our keys are tuple-parseable (see: #3566 (comment)). But if we do that, we might as well make these constants longs or ints.
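To illustrate the difference, here is a hand-rolled sketch of the tuple layer's encoding for small integers (type codes 0x14 for zero and 0x15 for a one-byte positive integer, per the FDB tuple spec). This is illustrative only; the real `com.apple.foundationdb.tuple.Tuple` class handles the general case and should be used in practice:

```java
public final class TuplePrefixSketch {
    // Hand-rolled tuple encoding for small non-negative integers only.
    // The key prefix is NOT the raw byte itself: a type code byte precedes it.
    public static byte[] encodeSmallInt(long value) {
        if (value == 0) {
            return new byte[]{0x14}; // IntZero type code
        }
        if (value > 0 && value <= 0xFF) {
            return new byte[]{0x15, (byte) value}; // one-byte positive integer
        }
        throw new IllegalArgumentException("sketch only covers 0..255");
    }

    public static void main(String[] args) {
        // SUBSPACE_PREFIX_SAMPLES (0x03) becomes two bytes under tuple encoding.
        byte[] encoded = encodeSmallInt(0x03);
        System.out.println(java.util.Arrays.toString(encoded)); // [21, 3]
    }
}
```

So the raw-byte-prefix and tuple-encoded layouts produce different keys on disk, which is why the constants being `byte` does not buy anything once they go through `Tuple.from`.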

@normen662 (Contributor Author):

Some people have trouble counting. :-)

@normen662 (Contributor Author):

Do you want me to make them int or long? I kind of like these things as tight as possible, type-wise. So unless we want to use a 0xDEADBEEF or something at some point, I kind of like it better this way.

Collaborator:

I was thinking longs. That is what we use in FDBRecordStoreKeyspace, for example, as well as what the KeySpacePath expects for integral tuple fields.

I'm not sure I buy your argument here that it's better if the Java objects have fewer bits. In particular, I think byte is more appropriate if we're somehow leveraging the fact that it's already a byte in the encoding (e.g., appending it to some byte[]), not if it's just some integral value, like it is here. But I guess it doesn't really matter at the end of the day.

@normen662 (Contributor Author):

Will change it as suggested.

@Nonnull final NeighborsChangeSet<NodeReference> neighborsChangeSet) {
final byte[] key = getDataSubspace().pack(Tuple.from(layer, node.getPrimaryKey()));

final List<Object> nodeItems = Lists.newArrayListWithExpectedSize(3);
Collaborator:

This is another serialization form that we'd probably use protobuf for, if protobuf were available to us. We don't really gain anything by using tuples here, as we don't need the values to be sorted. There's some amount of overhead to tuples that comes from the fact that their serialization needs to preserve order, but not that much. It's probably fine to continue using tuples unless we wanted something like protobuf for some other kind of benefit (like a clearer evolution path).


* @param maxNumRead the maximum number of nodes to return in this scan
* @return an {@link Iterable} that provides the nodes found in the specified layer range
*/
Iterable<AbstractNode<N>> scanLayer(@Nonnull ReadTransaction readTransaction, int layer,
Collaborator:

Should this return an AsyncIterable? The implementation seems to, actually.

@normen662 (Contributor Author):

The implementation in InliningStorageAdapter actually produces a regular Iterable, which is tricky to make without creating an asynchronous reduce operator on AsyncIterable.

@normen662 normen662 force-pushed the hnsw-on-linear branch 6 times, most recently from 0a243ee to bedfd30 Compare November 3, 2025 13:24
@ScottDugas (Collaborator) commented:

@normen662 @alecgrieser @MMcM
Teamscale is currently not reporting back to GitHub:
https://fdb.teamscale.io/activity/merge-requests/foundationdb-fdb-record-layer/FoundationDB%2Ffdb-record-layer%2F3691

Looking at the coverage report from the actions, I think the test gaps in Teamscale are incorrect, but I would trust the findings, at least mostly.
You can see the summaries for changed files, which is pretty helpful for the new ones: https://github.com/FoundationDB/fdb-record-layer/actions/runs/19036192615

@alecgrieser (Collaborator) left a comment:

I think I overall like what's being done with Transform. I did leave one comment about a usage pattern that is a bit surprising, even if understandable. I looked at the tests, and they seem like a good set of basic high-level tests. I'm not sure off the top of my head what improvements I'd like to see, but it does seem like we should stress it a bit more. It may also be the kind of thing where, if we took the current version and then devised more interesting testing strategies later, that would be fine.

/**
* The vector that is stored with the item in the index. This vector is expressed in the client's coordinate
* system and should not be of class {@link com.apple.foundationdb.linear.HalfRealVector},
* {@link com.apple.foundationdb.linear.FloatRealVector}, or {@link com.apple.foundationdb.linear.DoubleRealVector}.
Collaborator:

What type should it be if not these types? Encoded real vector?

@normen662 (Contributor Author) commented Nov 4, 2025:

Sorry, there was a "not" in the sentence that didn't make sense. To your question: it could be something like EncodedVector on paper. However, due to the final inverted transformation, you will always end up with a "regular" vector.

* {@link StorageTransform}.
*/
@Nullable
private final RealVector centroid;
Collaborator:

One thing that confused me is that this is actually the rotated anti-centroid, not the centroid. That is to say, once we rotate the user vectors, we can add this value to center them over the origin. I see in HNSW that we are, in fact, correctly computing it (multiplying the centroid by negative one and then rotating), but it would be helpful if we noted that in the name of the variable here. We could also go with translationVector or something, though the comments should probably note that this should be the rotated anti-centroid

@normen662 (Contributor Author):

This being the centroid's evil twin is a direct result of making the AffineOperator congruent with its mathematical definition. I'll name it negatedCentroid to be less dramatic and add a comment explaining why that is.
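The identity at play can be sketched with a toy 2-D rotation standing in for the real orthonormal transform (names and dimensions here are illustrative only): applying the rotation and then adding the rotated anti-centroid is the same as rotating the centered vector, i.e. R·v + (−R·c) = R·(v − c).

```java
public final class NegatedCentroidSketch {
    // Toy 2-D rotation by angle theta; a stand-in for the real transform.
    public static double[] rotate(double theta, double[] v) {
        return new double[]{
                Math.cos(theta) * v[0] - Math.sin(theta) * v[1],
                Math.sin(theta) * v[0] + Math.cos(theta) * v[1]};
    }

    public static double[] add(double[] a, double[] b) {
        return new double[]{a[0] + b[0], a[1] + b[1]};
    }

    public static void main(String[] args) {
        double theta = 0.7;
        double[] v = {3.0, -1.5};  // user vector
        double[] c = {1.0, 2.0};   // centroid of the sampled vectors
        // Precomputed translation: the rotated anti-centroid, -R * c.
        double[] negatedCentroid = rotate(theta, new double[]{-c[0], -c[1]});
        double[] viaTranslation = add(rotate(theta, v), negatedCentroid);
        double[] viaCentering = rotate(theta, new double[]{v[0] - c[0], v[1] - c[1]});
        // The two results agree (up to floating point): R*v + (-R*c) == R*(v - c).
        System.out.println(Math.abs(viaTranslation[0] - viaCentering[0]) < 1e-12
                && Math.abs(viaTranslation[1] - viaCentering[1]) < 1e-12); // true
    }
}
```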

@Nonnull final Tuple keyTuple, @Nonnull final Tuple valueTuple) {
final Tuple neighborPrimaryKey = keyTuple.getNestedTuple(2); // neighbor primary key
final Transformed<RealVector> neighborVector =
storageTransform.transform(
Collaborator:

It actually confused me a bit why you were calling transform here. I believe it's correct, but it hinges on knowing that the underlying vector is only semantically modified if it is not an EncodedVector. (And also, the storageTransform is always the identity transformation if we are not using RaBitQ.) So it means that if we stored the non-transformed data (for example, we hadn't enabled RaBitQ on the HNSW yet), then this will transform the underlying data at read time. But if we had already done that, then we just keep what we read from storage.

It makes it seem more like ensureTransformed or something. Maybe that's a bit more obvious if this is pushed into the StorageAdapter. It's also possible that this is just tricky and there's not much we can do.

@normen662 (Contributor Author):

Your assessment is exactly right. I don't think I can make this more concise but I will leave a comment in the code explaining this better.
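The pattern under discussion can be sketched with toy stand-in types (RawVector, EncodedVector, and ensureTransformed here are illustrative names, not the real classes): the transform is applied only when the vector read from storage is not already in storage coordinates.

```java
import java.util.function.UnaryOperator;

public final class EnsureTransformedSketch {
    // Toy stand-ins for the real vector types.
    public interface Vector { double[] data(); }
    public record RawVector(double[] data) implements Vector { }
    public record EncodedVector(double[] data) implements Vector { } // already in storage coordinates

    // Applies the storage transform only if the vector read from storage is
    // not already encoded; already-encoded vectors pass through unchanged.
    public static Vector ensureTransformed(Vector fromStorage, UnaryOperator<double[]> transform) {
        if (fromStorage instanceof EncodedVector) {
            return fromStorage; // was transformed when it was written
        }
        return new EncodedVector(transform.apply(fromStorage.data()));
    }
}
```

With an identity transform configured (the no-RaBitQ case), both branches are trivially equivalent, which is why calling transform unconditionally is still correct.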

TimeUnit.NANOSECONDS.toMillis(endTs - beginTs),
onReadListener.getNodeCountByLayer(), onReadListener.getBytesReadByLayer(),
String.format(Locale.ROOT, "%.2f", recall * 100.0d));
Assertions.assertThat(recall).isGreaterThan(0.79);
Collaborator:

This feels like a pretty low recall. Is that all that we can get reliably? Or does it change with the value of k in a way that is hard to write assertions about? Or is it because you're using random vectors here?

@normen662 (Contributor Author):

In this particular test, this assert tolerates at most 2 wrong vectors in the result set.
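The arithmetic, sketched with a toy recall computation (assuming k = 10 nearest neighbors, which is not stated in the excerpt): recall is the fraction of the true top-k found, so two misses give 8/10 = 0.8, which just clears the 0.79 bound.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class RecallSketch {
    // recall = |retrieved ∩ groundTruth| / |groundTruth|
    public static double recall(List<Integer> retrieved, List<Integer> groundTruth) {
        Set<Integer> truth = new HashSet<>(groundTruth);
        long hits = retrieved.stream().filter(truth::contains).count();
        return (double) hits / groundTruth.size();
    }

    public static void main(String[] args) {
        List<Integer> truth = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<Integer> found = List.of(0, 1, 2, 3, 4, 5, 6, 7, 98, 99); // two wrong out of k = 10
        System.out.println(recall(found, truth)); // 0.8, just above the 0.79 bound
    }
}
```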

@normen662 (Contributor Author):

I increased the number of query vectors to have better fine-grained control. That also allowed me to get to > 0.9.

}

@Test
@SuperSlow
Collaborator:

Does this have to be done in SuperSlow mode? Is there a variant where we load, say, a subset of the data set and we're able to do that in normal tests?

@normen662 (Contributor Author):

This test takes exactly 3 min 10 sec to run. It should be fine not to run it under @SuperSlow. If you feel like that's too long, there is not much we can do, as this is a real dataset and this is its smallest size. We could filter this dataset down to something even smaller, but then we would have to host it.

@normen662 (Contributor Author):

Update: it is not fine. I will put it under @SuperSlow again. I think the existing coverage without this always being enabled should be enough, though.

@normen662 normen662 force-pushed the hnsw-on-linear branch 3 times, most recently from d6bc44c to e83340d Compare November 4, 2025 19:26
@MMcM (Collaborator) left a comment:

A few more dates.
