Kernel crash #56

Open

gcalabria opened this issue Nov 10, 2022 · 3 comments

@gcalabria

I have tried to run the vaswani.ipynb notebook on two different machines, but in both cases the kernel crashed at the same point. This is the log I can see before the crash:

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Nov 10, 15:37:24] [0] 		 #> Local args.bsize = 128
[Nov 10, 15:37:24] [0] 		 #> args.index_root = ../content
[Nov 10, 15:37:24] [0] 		 #> self.possible_subset_sizes = [69905]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[Nov 10, 15:37:30] #> Loading model checkpoint.
[Nov 10, 15:37:30] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip

/home/s7949670/.local/lib/python3.8/site-packages/torch/hub.py:452: UserWarning: Falling back to the old format < 1.6. This support will be deprecated in favor of default zipfile format introduced in 1.6. Please redo torch.save() to save it in the new zipfile format.
  warnings.warn('Falling back to the old format < 1.6. This support will be '

[Nov 10, 15:37:46] #> checkpoint['epoch'] = 0
[Nov 10, 15:37:46] #> checkpoint['batch'] = 44500

[Nov 10, 15:37:53] #> Note: Output directory ../content already exists

[Nov 10, 15:37:53] #> Creating directory ../content/colbertindex

vaswani documents: 100%|██████████| 11429/11429 [00:28<00:00, 399.40it/s]

[Nov 10, 15:38:04] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 61.2k (overall),  62.0k (this encoding),  12955.9M (this saving)
[Nov 10, 15:38:04] [0] 		 [NOTE] Done with local share.
[Nov 10, 15:38:04] [0] 		 #> Joining saver thread.
[Nov 10, 15:38:04] [0] 		 #> Saved batch #0 to ../content/colbertindex/0.pt 		 Saving Throughput = 3.3M passages per minute.

#> num_embeddings = 581496
[Nov 10, 15:38:04] #> Starting..
[Nov 10, 15:38:04] #> Processing slice #1 of 1 (range 0..1).
[Nov 10, 15:38:04] #> Will write to ../content/colbertindex/ivfpq.100.faiss.
[Nov 10, 15:38:04] #> Loading ../content/colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Nov 10, 15:38:04] Preparing resources for 1 GPUs.
[Nov 10, 15:38:04] #> Training with the vectors...
[Nov 10, 15:38:04] #> Training now (using 1 GPUs)...
0.4895319938659668
23.038629055023193
0.00026726722717285156
[Nov 10, 15:38:28] Done training!

[Nov 10, 15:38:28] #> Indexing the vectors...
[Nov 10, 15:38:28] #> Loading ('../content/colbertindex/0.pt', None, None) (from queue)...
[Nov 10, 15:38:28] #> Processing a sub_collection with shape (581496, 128)
[Nov 10, 15:38:28] Add data with shape (581496, 128) (offset = 0)..
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432

Does anyone have an idea of what the problem might be?

@cmacdonald (Collaborator)

This is almost certainly a FAISS version problem. Make sure you use conda to install it, and try version 1.6.5 or 1.6.3. Conda often doesn't have good matches for particular Python and CUDA versions.
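
For reference, a minimal sketch of the conda commands (the pytorch channel and the exact version pins here are assumptions based on the suggestion above; adjust to your CUDA setup):

    # Hypothetical example: install a GPU build of FAISS from the pytorch channel
    conda install -c pytorch faiss-gpu=1.6.5
    # If no 1.6.5 build matches your Python/CUDA combination, fall back to 1.6.3
    conda install -c pytorch faiss-gpu=1.6.3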

@gcalabria (Author)

Thanks for the quick answer. I will try using conda; I was using pip previously.

@cmacdonald (Collaborator)

Yes, pip is not the solution; see https://github.com/terrierteam/pyterrier_colbert#installation
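
Once installed via conda, a quick sanity check in Python (faiss.__version__ and faiss.get_num_gpus() are standard parts of the faiss package) can confirm the right build is being picked up before re-running the notebook:

    import faiss
    print(faiss.__version__)     # expect 1.6.5 or 1.6.3
    print(faiss.get_num_gpus())  # > 0 means the GPU build is active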
