
Conversation

@chenghao-guo
Collaborator

close #66

Depends on this PR: lance-format/lance#5117

  • New Public API: lance_ray.create_index is introduced as the primary entry point for building distributed vector indices. It currently supports distributed IVF_FLAT, IVF_SQ, and IVF_PQ indices.
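
A minimal usage sketch of the new entry point (illustrative only — parameter names such as num_workers are assumptions based on this description, not necessarily the merged API):

```python
import lance_ray

# Hypothetical call shape; consult the lance-ray documentation for the
# actual create_index signature.
lance_ray.create_index(
    uri="s3://bucket/imagenet_train.lance",  # example dataset location
    column="embedding",
    index_type="IVF_SQ",   # also "IVF_FLAT" and "IVF_PQ"
    num_partitions=1000,
    num_workers=4,         # Ray workers building fragment indices in parallel
)
```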

The new create_index function orchestrates a multi-phase workflow:

  1. Global Training: It uses the existing lance.IndicesBuilder to train IVF centroids and, if applicable, PQ codebooks on a sample of the dataset.
  2. Distributed Task Execution: Per-fragment index building tasks are distributed across a pool of Ray workers. Each worker receives the pre-trained models and processes a subset of the data fragments.
  3. Metadata Finalization: After all fragment-level indices are built, the main process merges the metadata and commits the new index to the dataset manifest.
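
For intuition, phase 2 follows Ray's standard fan-out/gather pattern. A self-contained sketch (the task body is a placeholder — the real worker receives the pre-trained IVF/PQ models and writes a partial index for its fragments):

```python
import ray

ray.init()

@ray.remote
def build_fragment_index(fragment_id: int) -> dict:
    # Placeholder: the real task assigns each vector in its fragment to a
    # pre-trained IVF partition and writes partial index data.
    return {"fragment_id": fragment_id, "status": "ok"}

# Fan out one task per fragment (e.g., 253 fragments), then gather the
# per-fragment metadata for the final commit.
metadata = ray.get([build_fragment_index.remote(f) for f in range(253)])
```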

@github-actions github-actions bot added the enhancement New feature or request label Dec 30, 2025
@chenghao-guo chenghao-guo self-assigned this Dec 30, 2025
@chenghao-guo chenghao-guo marked this pull request as ready for review December 30, 2025 07:04
@chenghao-guo
Collaborator Author

chenghao-guo commented Dec 30, 2025

Testing Environment

We conducted tests using the imagenet_train dataset, which consists of 1,281,167 images. The images were embedded with a 2048-dimensional doubao-embedding model, resulting in a dataset of approximately 140GB after conversion to Lance format.

  • Single-machine index building

    • Machine: 4 cores, 16GB RAM
    • Storage: object-store for original dataset
  • Distributed setup

    • Topology: 1 head node + 4 worker nodes
    • Each node: 4 cores, 16GB RAM
    • Purpose: use very cheap machines so the overall runtime is long enough to evaluate the performance improvement of distributed index building on low-spec hardware
  • Index building parameters

    • num_partitions = 1000
    • Data split into 253 fragments
    • Distributed processing: 4 workers in parallel

Testing Results

  • Unit: m = minute

IVF_SQ Group (Tested on S3)

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
| --- | --- | --- | --- |
| Distributed (1 + 4) * 4c-16GB | 11.25m | 25.7m | 36.95m |
| Single machine 4c-16GB | 11.5m | 137.5m | 149.0m |

IVF_FLAT Group

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
| --- | --- | --- | --- |
| Distributed (1 + 4) * 4c-16GB | 10.68m | 25.82m | 36.5m |
| Single machine 4c-16GB | 11.8m | OOM Killed | OOM Killed |

IVF_PQ Group

| Configuration | Global Train IVF Time | Global PQ Training Time | Index Builder Time | Total Time |
| --- | --- | --- | --- | --- |
| Distributed (1 + 4) * 4c-16GB | 10.53m | 180.4m | 48m | 238.9m |
| Single machine 4c-16GB | 11.48m | 171m | 460.5m | 642.98m |

Overall Observation

Because the tests ran against S3, network speed and S3 I/O throughput vary during the process, so absolute timings can shift by hours between runs. The acceleration itself comes mainly from the reasons below.

The distributed index was built with 4-core, 16GB machines (4 workers in total), and we also checked whether performance scales linearly. The main reasons for the acceleration are:

  1. Avoiding I/O backpressure throttling on a single machine.
    We observed that when building an index on a low-performance machine, the lance process kept reporting:

    "backpressure throttle exceeded in IO"

  2. Distributed task assignment uses the CPU resources of all worker nodes to process fragments in parallel.

Conclusion

The test was carried out on 4-core, 16GB machines with an S3 object store. The results clearly show that the distributed setup significantly outperforms the single-machine setup in index-building time, especially on the Index Builder Time metric, where scaling is linear or better.

We can estimate the speedup as single-machine time divided by distributed time. For example, the IVF_SQ total-time speedup is 149.0m / 36.95m ≈ 4.03×, which is nearly linear scaling for 4 workers (see the arithmetic check below).
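A quick arithmetic check against the tables above:

```python
# speedup = single-machine time / distributed time
print(149.0 / 36.95)   # IVF_SQ total time:    ~4.03x
print(137.5 / 25.7)    # IVF_SQ index builder: ~5.35x
print(642.98 / 238.9)  # IVF_PQ total time:    ~2.69x (held back by single-node PQ training)
```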


Limitation

Since PQ (Product Quantization) depends on:

  • global training of the Inverted File (IVF), and
  • PQ codebook training,

both of which are carried out on a single machine. If PQ codebook training takes a long time, the acceleration from distributed IVF_PQ building may be unsatisfactory. However, it is better suited to workloads with a large number of tables.

The most well-balanced method appears to be IVF_SQ (Inverted File with Scalar Quantization), since it generally guarantees good recall while fully supporting distributed parallelism.

Contributor

@jackye1995 jackye1995 left a comment


sorry for making some conflicting changes, I rebased and this looks good to me

@jackye1995
Contributor

@chenghao-guo can you make sure we add this to documentation? Another thing missing is we probably should propagate GPU configs like accelerator="cuda", which would be a key to take advantage of Ray's GPU native support.

@chenghao-guo
Collaborator Author

> @chenghao-guo can you make sure we add this to documentation? Another thing missing is we probably should propagate GPU configs like accelerator="cuda", which would be a key to take advantage of Ray's GPU native support.

Hi Jack, thanks a lot for the review. I’ll update the documentation accordingly in the md file.
For accelerator="cuda", I’ll also verify whether it works as expected before proceeding further.
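
Once verified, Ray's native GPU scheduling should make the propagation straightforward. A hedged sketch (the accelerator plumbing into the task is hypothetical; num_gpus is a real Ray option):

```python
import ray

@ray.remote(num_gpus=1)  # reserve one GPU per worker via Ray's scheduler
def build_fragment_index_gpu(fragment_id: int) -> dict:
    # Hypothetical plumbing: forward accelerator="cuda" to the underlying
    # lance index-building routines inside the task.
    return {"fragment_id": fragment_id}
```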

@chenghao-guo chenghao-guo changed the title feat: support ray distributed IVF index builder feat!: support ray distributed IVF_SQ/PQ/FLAT index builder Jan 16, 2026
@chenghao-guo chenghao-guo merged commit 3f472a3 into lance-format:main Jan 16, 2026
6 checks passed