
Conversation

@chenghao-guo
Collaborator

close #66

Depends on this PR: lance-format/lance#5117

  • New Public API: lance_ray.create_index is introduced as the primary entry point for building distributed vector indices. It currently supports distributed IVF_FLAT, IVF_SQ, and IVF_PQ indices.
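
A minimal usage sketch of the new entry point (illustrative only — parameter names such as num_workers are assumptions based on this description, not necessarily the merged API):

```python
import lance_ray

# Hypothetical call shape; consult the lance-ray documentation for the
# actual create_index signature.
lance_ray.create_index(
    uri="s3://bucket/imagenet_train.lance",  # example dataset location
    column="embedding",
    index_type="IVF_SQ",   # also "IVF_FLAT" and "IVF_PQ"
    num_partitions=1000,
    num_workers=4,         # Ray workers building fragment indices in parallel
)
```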

The new create_index function orchestrates a multi-phase workflow:

  1. Global Training: It uses the existing lance.IndicesBuilder to train IVF centroids and, if applicable, PQ codebooks on a sample of the dataset.
  2. Distributed Task Execution: Per-fragment index building tasks are distributed across a pool of Ray workers. Each worker receives the pre-trained models and processes a subset of the data fragments.
  3. Metadata Finalization: After all fragment-level indices are built, the main process merges the metadata and commits the new index to the dataset manifest.
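
For intuition, phase 2 follows Ray's standard fan-out/gather pattern. A self-contained sketch (the task body is a placeholder — the real worker receives the pre-trained IVF/PQ models and writes a partial index for its fragments):

```python
import ray

ray.init()

@ray.remote
def build_fragment_index(fragment_id: int) -> dict:
    # Placeholder: the real task assigns each vector in its fragment to a
    # pre-trained IVF partition and writes partial index data.
    return {"fragment_id": fragment_id, "status": "ok"}

# Fan out one task per fragment (e.g., 253 fragments), then gather the
# per-fragment metadata for the final commit.
metadata = ray.get([build_fragment_index.remote(f) for f in range(253)])
```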

@github-actions github-actions bot added the enhancement New feature or request label Dec 30, 2025
@chenghao-guo chenghao-guo self-assigned this Dec 30, 2025
@chenghao-guo chenghao-guo marked this pull request as ready for review December 30, 2025 07:04
@chenghao-guo
Collaborator Author

chenghao-guo commented Dec 30, 2025

Testing Environment

We conducted tests using the imagenet_train dataset, which consists of 1,281,167 images. The images were embedded with a 2048-dimensional doubao-embedding model, resulting in a dataset of approximately 140GB after conversion to Lance format.

  • Single-machine index building

    • Machine: 4 cores, 16GB RAM
    • Storage: object-store for original dataset
  • Distributed setup

    • Topology: 1 head node + 4 worker nodes
    • Each node: 4 cores, 16GB RAM
    • Purpose: use very cheap machines so the overall runtime is long enough to evaluate the performance improvement of distributed index building on low-spec hardware
  • Index building parameters

    • num_partitions = 1000
    • Data split into 253 fragments
    • Distributed processing: 4 workers in parallel

Testing Results

  • Unit: m = minute

IVF_SQ Group (Tested on S3)

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
| --- | --- | --- | --- |
| Distributed (1 + 4) * 4c-16GB | 11.25m | 25.7m | 36.95m |
| Single machine 4c-16GB | 11.5m | 137.5m | 149.0m |

IVF_FLAT Group

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
| --- | --- | --- | --- |
| Distributed (1 + 4) * 4c-16GB | 10.68m | 25.82m | 36.5m |
| Single machine 4c-16GB | 11.8m | OOM Killed | OOM Killed |

IVF_PQ Group

| Configuration | Global Train IVF Time | Global PQ Training Time | Index Builder Time | Total Time |
| --- | --- | --- | --- | --- |
| Distributed (1 + 4) * 4c-16GB | 10.53m | 180.4m | 48m | 238.9m |
| Single machine 4c-16GB | 11.48m | 171m | 460.5m | 642.98m |

Overall Observation

Because the tests ran against S3, network speed and S3 I/O throughput vary during the process, so absolute timings can shift by hours between runs. The acceleration itself comes mainly from the reasons below.

The distributed index was built with 4-core, 16GB machines (4 workers in total), and we also checked whether performance scales linearly. The main reasons for the acceleration are:

  1. Avoiding I/O backpressure throttling on a single machine.
    We observed that when building an index on a low-performance machine, the lance process kept reporting:

    "backpressure throttle exceeded in IO"

  2. Distributed task assignment uses the CPU resources of all worker nodes to process fragments in parallel.

Conclusion

The test was carried out on 4-core, 16GB machines with an S3 object store. The results clearly show that the distributed setup significantly outperforms the single-machine setup in index-building time, especially on the Index Builder Time metric, where scaling is linear or better.

We can estimate the speedup as single-machine time divided by distributed time. For example, the IVF_SQ total-time speedup is 149.0m / 36.95m ≈ 4.03×, which is nearly linear scaling for 4 workers (see the arithmetic check below).
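A quick arithmetic check against the tables above:

```python
# speedup = single-machine time / distributed time
print(149.0 / 36.95)   # IVF_SQ total time:    ~4.03x
print(137.5 / 25.7)    # IVF_SQ index builder: ~5.35x
print(642.98 / 238.9)  # IVF_PQ total time:    ~2.69x (held back by single-node PQ training)
```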


Limitation

Since PQ (Product Quantization) depends on:

  • global training of the Inverted File (IVF), and
  • PQ codebook training,

both of which are carried out on a single machine. If PQ codebook training takes a long time, the acceleration from distributed IVF_PQ building may be unsatisfactory. However, it is better suited to workloads with a large number of tables.

The most well-balanced method appears to be IVF_SQ (Inverted File with Scalar Quantization), since it generally guarantees good recall while fully supporting distributed parallelism.

Contributor

@jackye1995 jackye1995 left a comment


sorry for making some conflicting changes, I rebased and this looks good to me

@jackye1995
Contributor

@chenghao-guo can you make sure we add this to documentation? Another thing missing is we probably should propagate GPU configs like accelerator="cuda", which would be a key to take advantage of Ray's GPU native support.

@chenghao-guo
Collaborator Author

> @chenghao-guo can you make sure we add this to documentation? Another thing missing is we probably should propagate GPU configs like accelerator="cuda", which would be a key to take advantage of Ray's GPU native support.

Hi Jack, thanks a lot for the review. I’ll update the documentation accordingly in the md file.
For accelerator="cuda", I’ll also verify whether it works as expected before proceeding further.
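
Once verified, Ray's native GPU scheduling should make the propagation straightforward. A hedged sketch (the accelerator plumbing into the task is hypothetical; num_gpus is a real Ray option):

```python
import ray

@ray.remote(num_gpus=1)  # reserve one GPU per worker via Ray's scheduler
def build_fragment_index_gpu(fragment_id: int) -> dict:
    # Hypothetical plumbing: forward accelerator="cuda" to the underlying
    # lance index-building routines inside the task.
    return {"fragment_id": fragment_id}
```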

@chenghao-guo chenghao-guo changed the title feat: support ray distributed IVF index builder feat!: support ray distributed IVF_SQ/PQ/FLAT index builder Jan 16, 2026
@chenghao-guo chenghao-guo merged commit 3f472a3 into lance-format:main Jan 16, 2026
6 checks passed