
Resume building the index when the process crashes #65

Open
merokaa opened this issue Nov 28, 2023 · 4 comments
merokaa commented Nov 28, 2023

Hi,

I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.

I have been going through the pain of building a ColBERT dense index for MSMARCO passages v2 where the process took a long time and crashed due to some technical issues halfway through.

I wonder if there is built-in support for resuming the index build; if not, I would appreciate any tips on doing so while keeping the integrity of the partially built index. Thank you.

cmacdonald (Collaborator) commented

Hi @merokaa

Thanks for the kind words about PyTerrier, we're glad you like it.

MSMARCO v2 passage is quite large for ColBERT. We once had a somewhat distributed fork of pyterrier_colbert that was better suited to this scale, but we didn't integrate it.

There is now ColBERT v2, which uses a much more compact index representation and may be better suited to MSMARCO v2. We haven't ported pyterrier_colbert to it yet, but assistance would be appreciated.

Sticking with the current ColBERT/pyterrier_colbert implementation, another suggestion is simply to split the corpus into chunks and index each chunk separately.

Then, subject to having enough memory (or using the mmap option for loading the indices, though that makes things quite slow), you can combine the retrieval transformers using the + operator:

```python
# indexing, roughly (checkpoint and other constructor arguments elided)
import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

ColBERTIndexer("index1").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[0:3_000_000])
ColBERTIndexer("index2").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[3_000_000:6_000_000])
# etc.

# retrieval, roughly
from pyterrier_colbert.ranking import ColBERTFactory

colbert_slice_1 = ColBERTFactory("index1").end_to_end()
colbert_slice_2 = ColBERTFactory("index2").end_to_end()
colbert_slice_3 = ColBERTFactory("index3").end_to_end()

# + sums the per-slice scores; % 1000 keeps the top 1000 results per query
combined = (colbert_slice_1 + colbert_slice_2 + colbert_slice_3) % 1000
```

This works because, I think, IRDS iterators are sliceable.
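If the corpus iterator turns out not to support `[start:stop]` slicing in your version, `itertools.islice` gives the same chunking for any iterator. A minimal sketch, using a dummy in-memory corpus in place of the real IRDS iterator (the PyTerrier calls mentioned in the comments are an assumption here, not executed):

```python
from itertools import islice

# dummy corpus iterator standing in for pt.get_dataset(...).get_corpus_iter()
corpus_iter = ({"docno": str(i), "text": f"passage {i}"} for i in range(10))

chunk_size = 4  # in practice, millions of documents per slice
chunks = []
while True:
    chunk = list(islice(corpus_iter, chunk_size))
    if not chunk:
        break
    # a ColBERTIndexer for the next slice would consume `chunk` here
    chunks.append([d["docno"] for d in chunk])

print(chunks)  # [['0', '1', '2', '3'], ['4', '5', '6', '7'], ['8', '9']]
```

Because `islice` consumes the underlying iterator, the corpus is read once front to back, with each slice going to its own index.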

Note that ColBERT indices can also be pruned (see https://github.com/cmacdonald/colbert_static_pruning): we were able to discard 55% of the token embeddings for MSMARCO v1 without a significant impact on effectiveness.

@seanmacavaney anything I missed?


merokaa commented Nov 30, 2023

Thank you so much @cmacdonald, that is very useful to know. I am now running into some faiss issues when data starts being added to the index; it works fine on Colab but not on my machine :(. I'll get back to you once I resolve it.

cmacdonald (Collaborator) commented

> I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.

Can I tweet this? It's a wonderful quote!

@merokaa
Copy link
Author

merokaa commented Dec 1, 2023

> > I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.
>
> Can I tweet this? It's a wonderful quote!

Sure, please do! :)
