This repository is structured in the following way:
Code/
: Contains the code for computing scores (entity_score.py), a notebook for the visualisation (Embedding_quality.ipynb), and two scripts for scoring (rankscore.sh and ranklib_to_trec.py). It has been updated with two additional notebooks: one for ranking with the other graph embedding methods and one for ranking with different scenarios.

Data/
: Contains the linked entities and the Wikipedia redirects used, updated with more entity linking methods, and the ground truth annotations.

Runs/
: Contains all the runs used in the paper, updated with additional runs for the other methods.
Running the code requires Python 3.
If you simply want to download the embeddings, they can be accessed here:
Wikipedia2vec embeddings with the graph component, the result of the following command:
wikipedia2vec train --min-entity-count 0 --disambi enwiki-20190701-pages-articles-multistream.xml.bz2 wikipedia2vec_trained
and Wikipedia2vec embeddings without the graph component, the result of the following command:
wikipedia2vec train --min-entity-count 0 --disambi --no-link-graph enwiki-20190701-pages-articles-multistream.xml.bz2 wikipedia2vec_trained
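If you train the embeddings yourself, the resulting model file can also be opened directly with the wikipedia2vec package as a quick sanity check; the entity title below is just a hypothetical example, and loading via Gensim (described next) works as well:

from wikipedia2vec import Wikipedia2Vec

# Load the model file written by "wikipedia2vec train" above.
model = Wikipedia2Vec.load("wikipedia2vec_trained")

# Look up the vector of an entity; the title is a hypothetical example.
print(model.get_entity_vector("Radboud University").shape)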
The embeddings can then be loaded in Python with Gensim:
import gensim
model = gensim.models.KeyedVectors.load("WKN-vectors.bin", mmap='r')
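As a quick check that the vectors loaded correctly, you can query them like any Gensim KeyedVectors object. The exact key format depends on how the embeddings were exported, so the entity key below is only a hypothetical example; inspect the model's vocabulary for the real keys.

# List a few keys from the vocabulary to see the naming scheme
# (use model.index2word instead on older Gensim versions).
print(model.index_to_key[:5])

# Nearest neighbours of a key; "ENTITY/Radboud_University" is hypothetical.
print(model.most_similar("ENTITY/Radboud_University", topn=5))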
Other embeddings can be downloaded here:
ComplEx with pagelinks (KGE graph files for ComplEx)
ComplEx without pagelinks (KGE graph files for ComplEx)
Wikipedia2Vec 500D embeddings trained on a 2015 Wikipedia dump
All these files need to be unzipped in the /src directory.
To download all the auxiliary files (RankLib, DBpedia-Entity v2, and the embeddings), please use the following command:
bash build.sh
To then reproduce the results, first make sure to install all the necessary packages with:
pip install -r requirements.txt
If you want to run the ranking with ComplEx, you first need to install KGE. It is not available via pip, so please follow the installation guide on its official GitHub.
Then run:
bash Code/reproduce.sh
The results will be stored in the /Output folder.
To compute just the embedding-based score, use the following command:
python Code/entity_score.py embeddingfile outputfile [pathtodbpedia]
For example:
python Code/entity_score.py src/WKN-vectors/WKN-vectors.bin output.txt src/DBpedia-Entity/runs/v2/bm25f-ca_v2.run
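For intuition, the embedding-based score compares the embeddings of the entities linked in the query with the embedding of each candidate entity. The sketch below is a simplified illustration of that idea, not the actual entity_score.py implementation: the function names are hypothetical, and it uses a plain average of cosine similarities, ignoring any weighting by annotation confidence the real code may apply.

import numpy as np

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_score(query_entities, candidate, model):
    # Hypothetical sketch: average the similarity between the candidate
    # entity and every entity linked in the query, skipping entities
    # that are missing from the embedding vocabulary.
    if candidate not in model:
        return 0.0
    sims = [cosine(model[e], model[candidate]) for e in query_entities if e in model]
    return sum(sims) / len(sims) if sims else 0.0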
Code for computing the embedding-based score with RDF2Vec, ComplEx, and old versions of Wikipedia2Vec
Open the Jupyter notebook score_multiple-embeddings-types.ipynb in the Code directory.
In the first cell, comment out the lines to select the preferred version of embeddings and annotations, following the instructions written there.
Then simply run all cells to reproduce the experiment.
Open the Jupyter notebook score_multiple-embedding-types-with-scenarios.ipynb in the Code directory.
Simply run all cells to reproduce the experiment.
If you want to run RankLib with 5 folds afterwards, use the following command:
python Code/entity_score_folds.py embeddingfile outputfolder outputfile [pathtodbpedia]
For example:
python Code/entity_score_folds.py src/WKN-vectors/WKN-vectors.bin Outputfolder output.txt src/DBpedia-Entity/runs/v2/bm25f-ca_v2.run
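For intuition, 5-fold cross-validation means the queries are partitioned into five groups; RankLib trains on four and is evaluated on the held-out fifth, rotating through all folds. The toy illustration below uses placeholder query ids and a round-robin split; the repository presumably relies on the standard folds that ship with DBpedia-Entity v2 rather than this split.

# Toy round-robin partition of placeholder query ids into 5 folds.
query_ids = [f"Q{i}" for i in range(25)]
folds = [query_ids[i::5] for i in range(5)]

for k, held_out in enumerate(folds):
    # Fold k is held out for evaluation; the remaining queries train the model.
    train = [q for q in query_ids if q not in held_out]
    print(f"fold {k}: {len(train)} train queries, {len(held_out)} test queries")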
To run the coordinate ascent and rank these files, please run the following scripts with the Outputfolder from the previous step:
bash Code/train_ranklib.sh Outputfolder
bash Code/score_ranklib.sh Outputfolder
The first script trains RankLib, and the second scores with the trained model, producing the final ranking and its trec_eval scores.
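The rankings produced here follow the standard six-column TREC run format (query id, Q0, entity id, rank, score, run tag), so they can be fed to trec_eval or inspected directly. A minimal sketch for reading such a file; the path is a placeholder:

# Minimal sketch: read a TREC-format run file line by line.
with open("Output/example.run") as run_file:
    for line in run_file:
        qid, _, entity, rank, score, tag = line.split()
        print(qid, entity, rank, score)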
If you use this code, please cite our paper:

@inproceedings{Gerritse:2020:GEEER,
  author    = {Gerritse, Emma and Hasibi, Faegheh and De Vries, Arjen},
  title     = {Graph-Embedding Empowered Entity Retrieval},
  booktitle = {European Conference on Information Retrieval},
  series    = {ECIR '20},
  year      = {2020},
  publisher = {Springer},
}
If you have any questions, please contact Emma Gerritse at [email protected]