
Index gets stuck writing ivfpq.100.faiss #70

Open
talk2much opened this issue Apr 24, 2024 · 6 comments

@talk2much

Hello. I have now obtained most of the index files I need, such as doclens.10.json and docnos.pkl.gz, but the last step, writing the ivfpq.100.faiss file, failed. So I want to use the files I already have to produce the ivfpq.100.faiss file. My code (the driver script plus my custom index02 method) is as follows:

indexer = ColBERTIndexer(checkpoint, "/home/yujy/code/Colbert_PRF/index", "robust04_index",
                         skip_empty_docs=True, chunksize=6, ids=True)
# dataset = pt.get_dataset("trec-deep-learning-passages")
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index02(dataset.get_corpus_iter())

    # index02 is a custom method I added to the indexer class
    def index02(self, iterator):
        docnos = []
        docid = 0

        def convert_gen(iterator):
            import pyterrier as pt
            nonlocal docnos
            nonlocal docid
            if self.num_docs is not None:
                iterator = pt.tqdm(iterator, total=self.num_docs, desc="encoding", unit="d")
            for l in iterator:
                l["docid"] = docid
                docnos.append(l['docno'])
                docid += 1
                yield l

        self.args.generator = convert_gen(iterator)
        index_faiss(self.args)
        print("#> Faiss encoding complete")

But it didn't work; it got stuck here:

[ 21:30:28] #> Indexing the vectors...
[ 21:30:28] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...
@talk2much
Author

I have all the files I need for the index except ivfpq.100.faiss

@Xiao0728
Collaborator

Xiao0728 commented Apr 24, 2024

Hi,
It seems the generated files don't include the ivfpq.faiss file either? That means the indexing process didn't succeed. Normally, when creating the Robust04 ColBERT and ColBERT-PRF indices, we split the long Robust04 documents into smaller passages. Maybe you can try the following code:

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

index_root = "/some/path"
index_name = "index_name"

# default window 150 tokens, stride 75
indexer = pt.text.sliding(text_attr="text", prepend_title=False) >> ColBERTIndexer("/path/to/colbert.dnn", index_root, index_name, chunksize=20)
indexer.index(pt.get_dataset("irds:trec-robust04").get_corpus_iter())
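For intuition, the sliding transformer above cuts each long document into overlapping token windows (by default 150 tokens long, moving forward 75 tokens each step). This is only a rough sketch of that idea in plain Python, not pyterrier's actual implementation, and the function name is hypothetical:

```python
def sliding_passages(text, length=150, stride=75):
    """Rough sketch: split whitespace tokens into overlapping windows.

    Each window holds up to `length` tokens and starts `stride` tokens
    after the previous one, so consecutive passages overlap by half.
    (Illustrative only; pt.text.sliding works on dataframes of documents.)
    """
    tokens = text.split()
    passages = []
    # Stop once a window's start would fall inside the final full window.
    for start in range(0, max(len(tokens) - stride, 1), stride):
        passages.append(" ".join(tokens[start:start + length]))
    return passages
```

A 300-token document thus becomes three passages of up to 150 tokens each, with each consecutive pair sharing 75 tokens.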

@talk2much
Author

I get this error when I run the suggested code unchanged:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 92, in _loader_thread
    sub_collection = [load_index_part(filename) for filename in filenames if filename is not None]
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 92, in <listcomp>
    sub_collection = [load_index_part(filename) for filename in filenames if filename is not None]
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/index_manager.py", line 17, in load_index_part
    part = torch.load(filename)
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive

It looks like the thread crashes as soon as it starts, right?

@talk2much
Author

The error occurs here:

[Apr 25, 10:52:31] #> Indexing the vectors...
[Apr 25, 10:52:31] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...
Exception in thread Thread-2:
Traceback (most recent call last):

@talk2much
Author

After debugging, I found that the error happens because none of my .pt files can be loaded by torch, so it reports:
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive
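One quick way to spot truncated or corrupt shards without fully loading them: since PyTorch 1.6, torch.save writes a ZIP archive, so a valid .pt file starts with the ZIP magic bytes "PK". This sketch (the function name and directory layout are illustrative, not part of ColBERT) flags parts whose header is wrong:

```python
import glob
import os


def find_corrupt_parts(index_dir):
    """Return the .pt files in index_dir that lack the ZIP magic bytes.

    torch.save (PyTorch >= 1.6) produces ZIP archives, which begin with
    b"PK"; a file failing this check would trigger torch.load's
    "not a ZIP archive" error. (Heuristic only: a correct header does
    not guarantee the archive is complete.)
    """
    bad = []
    for path in sorted(glob.glob(os.path.join(index_dir, "*.pt"))):
        with open(path, "rb") as f:
            if f.read(2) != b"PK":
                bad.append(path)
    return bad
```

If every shard fails this check, the encoding step most likely never wrote valid tensors, and re-running the encoding pass (rather than only the FAISS step) would be needed.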

@cmacdonald
Collaborator

And what does Google say when you search for this error message?
