Improved use ScoreTracker to avoid wasteful searching for very large k #387

marianotepper · 2025-01-15T20:45:21Z

This improves upon #384 by making the quantile estimation more lightweight. It models the recent scores as a Normal distribution and incrementally updates the sufficient statistics for its mean and variance. Quantiles are then computed from these statistics.
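To illustrate the idea described above, here is a minimal sketch of Normal-based quantile estimation over a sliding window of recent scores. It is not the code from this PR: the class name `RecentScoreStats`, the ring-buffer window, and the use of a running sum and sum of squares as the sufficient statistics are all assumptions made for illustration.

```java
/**
 * Hypothetical helper (not jvector code): keeps the sum and the sum of squares
 * of the most recent scores so that mean, standard deviation, and a
 * Normal-approximation quantile can each be read in O(1).
 */
final class RecentScoreStats {
    private final double[] window; // ring buffer holding the most recent scores
    private int size;              // number of valid entries in the buffer
    private int next;              // slot that the next score will overwrite
    private double sum;            // sum of the scores currently in the window
    private double sumSq;          // sum of squares of those scores

    RecentScoreStats(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        window = new double[capacity];
    }

    /** Adds a score, evicting the oldest one once the window is full. */
    void add(double score) {
        if (size == window.length) {
            double old = window[next];
            sum -= old;
            sumSq -= old * old;
        } else {
            size++;
        }
        window[next] = score;
        sum += score;
        sumSq += score * score;
        next = (next + 1) % window.length;
    }

    double mean() {
        return size == 0 ? 0.0 : sum / size;
    }

    /** Population standard deviation of the window, clamped against rounding error. */
    double std() {
        if (size == 0) {
            return 0.0;
        }
        double m = mean();
        return Math.sqrt(Math.max(0.0, sumSq / size - m * m));
    }

    /** Quantile under the Normal assumption, expressed as mean + z * std. */
    double quantile(double z) {
        return mean() + z * std();
    }
}
```

Each update costs a constant number of arithmetic operations, which is where the savings over a full streaming quantile estimator would come from. The sum-of-squares form can lose precision when scores are large relative to their spread, so a production version might prefer a more stable update.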


jbellis · 2025-01-16T17:12:19Z

* This implementation does not consider the worstBestScore provided to shouldStop.

I think this comment must be left over from earlier changes?

jbellis · 2025-01-16T17:14:18Z

jvector-base/src/main/java/io/github/jbellis/jvector/graph/ScoreTracker.java

+         * @param bestScoredTracked the number of tracked scores used to estimate if we are unlikely to improve
+         *                          the results anymore. An empirical rule of thumb is bestScoredTracked=rerankK.
+         */
+        RelaxedMonotonicityTracker(int bestScoredTracked) {


Typo: bestScoredTracked should be bestScoresTracked.

jbellis · 2025-01-16T17:15:07Z

jvector-base/src/main/java/io/github/jbellis/jvector/graph/ScoreTracker.java

+            double windowPercentile = this.mean + SIGMA_FACTOR * std;
+            double worstBestScore = sortableIntToFloat((int) bestScores.top());
+            return windowPercentile < worstBestScore;
+//            return false;
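
The stopping rule in the hunk above can be read as: stop once an upper quantile of the recent scores falls below the worst of the best scores kept so far. A minimal sketch under stated assumptions follows. The class `StopRuleSketch`, the `java.util.PriorityQueue` min-heap standing in for `bestScores`, the `sigmaFactor` parameter, and the guard that waits until the heap is full are all hypothetical and not taken from jvector.

```java
import java.util.PriorityQueue;

/** Hypothetical illustration of the stop rule; not the jvector implementation. */
final class StopRuleSketch {
    // Min-heap, so peek() returns the worst of the best scores kept so far.
    private final PriorityQueue<Float> bestScores = new PriorityQueue<>();
    private final int capacity;

    StopRuleSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Keeps only the `capacity` highest scores seen so far. */
    void offer(float score) {
        if (bestScores.size() < capacity) {
            bestScores.add(score);
        } else if (score > bestScores.peek()) {
            bestScores.poll();
            bestScores.add(score);
        }
    }

    /**
     * Returns true when mean + sigmaFactor * std of the recent scores is below
     * the worst tracked best score, i.e. further exploration looks unlikely
     * to improve the results.
     */
    boolean shouldStop(double mean, double std, double sigmaFactor) {
        if (bestScores.size() < capacity) {
            return false; // assumption: never stop before enough scores are tracked
        }
        double windowPercentile = mean + sigmaFactor * std;
        double worstBestScore = bestScores.peek();
        return windowPercentile < worstBestScore;
    }
}
```

A larger `sigmaFactor` makes the rule more conservative, since the recent scores must sit further below the worst tracked best score before the search stops.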


jbellis · 2025-01-16T17:30:45Z

How does accuracy look for much smaller k/rrk? Like 5/10 or 10/20?

marianotepper · 2025-01-16T19:08:54Z

How does accuracy look for much smaller k/rrk? Like 5/10 or 10/20?

Will look into this. We should consider whether in those cases it is worth applying this technique.

marianotepper · 2025-01-17T19:00:44Z

The exploration savings hold for 5/10 and 10/20 but are smaller, on the order of 5-10%.

jbellis and others added 9 commits January 9, 2025 11:17

clarify

978a420

use scoreTracker to short circuit new edge evaluation once we hit a local maximum

829aa92

Streaming quantile estimator.

6dd4bbd

Add more tests

843590b

Tidy up TestStreamingQuantile

9375385

Add new RelaxedMonotonicityTracker. Remove StreamingQuantile and its test.

8de68c4

Use the new ScoreTracker in GraphSearcher

f350800

Fix documentation in ScoreTracker

036590a

Fix a bug in the online computation of the mean. Tracking bestScores in RelaxedMonotonicityTracker. Hyperparameter tuning.

ce0c482

marianotepper requested a review from jbellis January 15, 2025 20:45

marianotepper marked this pull request as ready for review January 15, 2025 20:46

Merge remote-tracking branch 'origin/main' into tracked

8a9a35d

# Conflicts: # jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java

jbellis reviewed Jan 16, 2025


Code cleanup

240c898

jbellis approved these changes Jan 17, 2025


marianotepper merged commit 7cbb2e1 into main Jan 17, 2025
0 of 6 checks passed
