
Resume building the index when the process crashes #65

Open
merokaa opened this issue Nov 28, 2023 · 4 comments
merokaa commented Nov 28, 2023

Hi,

I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.

I have been going through the pain of building a ColBERT dense index for MSMARCO passages v2 where the process took a long time and crashed due to some technical issues halfway through.

I wonder if there is built-in support for resuming the index build; if not, I would appreciate any tips on doing so while keeping the integrity of the partially built index. Thank you.

cmacdonald (Collaborator) commented

Hi @merokaa

Thanks for the kind words about PyTerrier, we're glad you like it.

MSMARCO v2 passage is quite large for ColBERT. We once had a somewhat distributed fork of pyterrier_colbert that was better suited to this scale, but we didn't integrate it.

There is now ColBERT v2, which uses a much more compact index representation and may be better suited to MSMARCO v2. We haven't ported pyterrier_colbert to it yet, but assistance would be appreciated.

Sticking with the current ColBERT/pyterrier_colbert implementation, another suggestion is simply to split the corpus into chunks and index each chunk separately.

Then, subject to having enough memory (or using the mmap option for loading the indices, though that makes things quite slow), you can combine the retrieval transformers using the + operator:

```python
# indexing, roughly (checkpoint and other constructor arguments elided)
import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

ColBERTIndexer("index1").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[0:3_000_000])
ColBERTIndexer("index2").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[3_000_000:6_000_000])
# etc.

# retrieval, roughly
from pyterrier_colbert.ranking import ColBERTFactory

colbert_slice_1 = ColBERTFactory("index1").end_to_end()
colbert_slice_2 = ColBERTFactory("index2").end_to_end()
colbert_slice_3 = ColBERTFactory("index3").end_to_end()

# + sums the per-slice scores; % 1000 keeps the top 1000 results per query
combined = (colbert_slice_1 + colbert_slice_2 + colbert_slice_3) % 1000
```

This works because, I think, IRDS iterators are sliceable.
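If the corpus iterator turns out not to support `[start:stop]` slicing in your version, `itertools.islice` gives the same chunking for any iterator. A minimal sketch, using a dummy in-memory corpus in place of the real IRDS iterator (the PyTerrier calls mentioned in the comments are an assumption here, not executed):

```python
from itertools import islice

# dummy corpus iterator standing in for pt.get_dataset(...).get_corpus_iter()
corpus_iter = ({"docno": str(i), "text": f"passage {i}"} for i in range(10))

chunk_size = 4  # in practice, millions of documents per slice
chunks = []
while True:
    chunk = list(islice(corpus_iter, chunk_size))
    if not chunk:
        break
    # a ColBERTIndexer for the next slice would consume `chunk` here
    chunks.append([d["docno"] for d in chunk])

print(chunks)  # [['0', '1', '2', '3'], ['4', '5', '6', '7'], ['8', '9']]
```

Because `islice` consumes the underlying iterator, the corpus is read once front to back, with each slice going to its own index.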

Note that ColBERT indices can also be pruned (see https://github.com/cmacdonald/colbert_static_pruning): we were able to discard 55% of the token embeddings for MSMARCO v1 without a significant impact on effectiveness.

@seanmacavaney anything I missed?


merokaa commented Nov 30, 2023

Thank you so much @cmacdonald, that is very useful to know. I am now running into some faiss issues when data starts being added to the index; it works fine on Colab but not on my machine :(. I'll get back to you once I resolve it.

cmacdonald (Collaborator) commented

> I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.

Can I tweet this? It's a wonderful quote!

@merokaa
Copy link
Author

merokaa commented Dec 1, 2023

> > I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.
>
> Can I tweet this? It's a wonderful quote!

Sure, please do! :)
