Resume building the index when the process crashes #65
Comments
Hi @merokaa. Thanks for the kind words about PyTerrier etc. - we're glad you like it.

MSMARCO v2 passage is quite large for ColBERT. We once had a kind-of distributed fork of pyterrier_colbert that was a bit more suited to this, but we didn't integrate it. There is now ColBERT v2, which uses a much more compact index representation and may be better suited to MSMARCO v2. We haven't ported pyterrier_colbert to that - but assistance would be appreciated.

Sticking with the current ColBERT/pyterrier_colbert implementation, one other suggestion is simply splitting the corpus into chunks and indexing those separately. Subject then to having enough memory (or using the mmap option for loading the indices - but that makes things pretty slow), you could combine the retrieval transformers, roughly as follows:

# indexing, roughly
ColBERTIndexer("index1").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[0:3_000_000] )
ColBERTIndexer("index2").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[3_000_000:6_000_000] )
# etc

# retrieval, roughly
colbert_slice_1 = Factory("index1").end_to_end()
colbert_slice_2 = Factory("index2").end_to_end()
colbert_slice_3 = Factory("index3").end_to_end()
combined = (colbert_slice_1 + colbert_slice_2 + colbert_slice_3) % 1000

This works because, I think, IRDS iterators are sliceable. Note that ColBERT indices can also be pruned - see https://github.com/cmacdonald/colbert_static_pruning - we were able to throw away 55% of token embeddings in MSMARCO v1 without significant impact on effectiveness. @seanmacavaney anything I missed?
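If the corpus iterator returned by get_corpus_iter() turns out not to support slice syntax (ir_datasets' own docs_iter does, but wrappers may not), itertools.islice over a single shared iterator gives the same effect. Below is a minimal sketch of the chunked indexing and retrieval idea; the checkpoint and index paths are placeholders, and the ColBERTIndexer/ColBERTFactory arguments follow the pyterrier_colbert README (checkpoint, index root, index name), which may differ between versions - check against your installation.

# a sketch, not a drop-in recipe: paths are placeholders and the constructor
# arguments follow the pyterrier_colbert README, which may differ by version
import itertools
import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

if not pt.started():
    pt.init()

CHECKPOINT = "/path/to/colbert.dnn"   # placeholder ColBERT checkpoint
INDEX_ROOT = "/path/to/indices"       # placeholder directory holding the slice indices
SLICE_SIZE = 3_000_000

# indexing: consecutive SLICE_SIZE-document slices taken from one shared corpus iterator
corpus_iter = pt.get_dataset('irds:msmarco-passage').get_corpus_iter()
for i in range(1, 4):
    docs = itertools.islice(corpus_iter, SLICE_SIZE)   # the next SLICE_SIZE documents
    ColBERTIndexer(CHECKPOINT, INDEX_ROOT, f"index{i}").index(docs)

# retrieval: one end-to-end pipeline per slice; '+' combines the scored results
# by summing scores document-wise, and '% 1000' keeps the top 1000 per query
slices = [ColBERTFactory(CHECKPOINT, INDEX_ROOT, f"index{i}").end_to_end()
          for i in range(1, 4)]
combined = (slices[0] + slices[1] + slices[2]) % 1000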
Thank you so much @cmacdonald. That is very useful to know. I am now having some faiss issues when the data starts to be added to the index; it works fine on Colab but not on my machine :(. I'll get back to you once I resolve it.

Can I tweet this? It's a wonderful quote!

Sure, please do! :)
Hi,
I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.
I have been going through the pain of building a ColBERT dense index for MSMARCO passages v2, where the process took a long time and crashed halfway through due to some technical issues.
I wonder if there is built-in support for resuming building the index; if not, I would appreciate any tips on doing that while keeping the integrity of the built index. Thank you.
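On resuming specifically: one low-tech option, building on the chunked indexing suggested above, is to index the corpus slice by slice and record a marker once each slice's index has finished, so that a crashed run only repeats the slice that was interrupted. The sketch below is illustrative rather than a pyterrier_colbert feature: the ".done" marker files, the slice count and size, the MSMARCO v2 dataset id, and the README-style ColBERTIndexer arguments are all assumptions to adapt.

# an illustrative slice-level resume scheme, not a built-in pyterrier_colbert
# feature; paths, slice sizes and indexer arguments are placeholders/assumptions
import collections
import itertools
import os
import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

if not pt.started():
    pt.init()

CHECKPOINT = "/path/to/colbert.dnn"    # placeholder ColBERT checkpoint
INDEX_ROOT = "/path/to/indices"        # placeholder directory for the slice indices
SLICE_SIZE = 3_000_000
NUM_SLICES = 50                        # assumed: enough 3M slices to cover MSMARCO v2 passages

corpus_iter = pt.get_dataset('irds:msmarco-passage-v2').get_corpus_iter()

for i in range(1, NUM_SLICES + 1):
    name = f"index{i}"
    marker = os.path.join(INDEX_ROOT, name + ".done")
    docs = itertools.islice(corpus_iter, SLICE_SIZE)
    if os.path.exists(marker):
        # this slice already completed in an earlier run: drain its documents
        # from the shared iterator without re-indexing them
        collections.deque(docs, maxlen=0)
        continue
    # consider deleting any partial index left by a crash for this slice first
    ColBERTIndexer(CHECKPOINT, INDEX_ROOT, name).index(docs)
    # the marker is written only after index() returns, so a crash mid-slice
    # means just that one slice is rebuilt on the next run
    open(marker, "w").close()

Each slice's index is then either complete (its marker exists) or rebuilt from scratch on the next run, which keeps the integrity question simple at the cost of redoing at most one slice of work.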